Here I go again, ranting about Mac OS X bowels. This time I want to talk about particular implementation details of the Mach-O runtime ABI (Application Binary Interface). Before you get too confused, there are two different things under the 'Mach-O' name:
- Mach-O ABI, which defines how every application in the system executes and calls functions (stack conventions, register usage, and more);
- Mach-O File Format, which defines a way for compiled executables to store different parts of them in the same file (compiled code, data, strings, etc).
The latter is not what I want to talk about today; the former is what puzzles me most. I admit I am just a "small programmer" with no relationship with the powers-that-be at Apple at all (that is, no insider contacts who can explain the reasoning behind particular important design decisions to me), so the impressions, judgements and guesses expressed in this article may be slightly or totally off the mark. However, like many other developers who have dug deep into the implementation of such things, I can see obvious drawbacks and oddities in the Mach-O ABI, and this is what I am going to talk about.
Mach-O originates from NeXTstep, an operating system created at NeXT for its NeXTstation machines and later expanded to x86 hardware with OpenStep. NeXTstations were originally based on Motorola 68k CPUs, just like old Macintoshes. Mac OS (classic), on the other hand, used an ABI for PowerPC which followed the ABI principles defined in a document by IBM/Moto for PPC processors. As you may already know, m68k and x86 are CISC architectures; the PowerPC used in all new Macs is RISC.
To make a long story short, Mac OS X uses an ABI designed for CISC processors, mostly ignoring RISC design principles.
What do I mean by that? The Mach-O ABI we now see in Mac OS X is more or less a direct port of NeXT's Mach-O designed for m68k - it relies on the PC (program counter) register to perform various manipulations with data (for the geeks: PC-relative addressing). There's nothing wrong with that, as it's an effective and common practice, except for one little thing: PowerPC, like many RISC processors, has no programmatically accessible PC register. That is not a show-stopper though - Mach-O for PowerPC just takes one of the 32 general purpose registers and turns it into a program counter-style register, to base all offset calculations off it. That works, as you can see, since all Mac OS X applications (except for the ones compiled with Carbon/CFM) use the Mach-O ABI.
That approach works, except for one small thing: global/static data access adds about 7 cycles of overhead per function, and about triple that for cross-context calls (on a G4 class processor), compared to the old Mac OS Classic ABI (excuse me for the geek talk). The Mac OS Classic CFM ABI, in comparison, needed almost 0 cycles for static data access and about 5 for cross-context calls. To rephrase - applications in Mac OS X could be faster if the Mach-O ABI followed the principles set for the PowerPC chip, and not the ones created over a decade ago for CISC ones.
This brings us to the question, "how much faster would the applications be if the ABI was done right?". The answer, according to some tests done by my friends on a Macintosh Development IRC channel, is that the speed gain would be 10-30%, depending on the particular application (how often it calls functions). Realistically, the speed gain would be around 10 to 12 per cent (where these numbers come from is explained below).
So why did Apple use an outdated ABI for a modern operating system? Frankly, I don't know the reason. The best one I have heard is that it saved Apple a few months of Mac OS X development time, since they didn't have to do massive updates to the NeXT-derived tool chain.
There are signs of change though -- the recent update to GCC, the compiler shipped with OSX, allows it to perform so-called -mdynamic-no-pic optimization, which hard-codes the data addresses in the code, so the result is roughly equivalent to the CFM ABI used in Mac OS Classic -- so the GCC itself, compiled with that optimization, is 10% faster. Applications, to take advantage of that, need to be recompiled, so it doesn't affect 80% of the titles already shipped for Mac OS X. Then again, the optimization above only works for executables and not shared libraries.
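To make the difference concrete, here is a minimal sketch of the kind of function that pays the PIC cost: one that touches static data. The names `g_scale` and `scale_sample` are made up for illustration, the build lines in the comments assume an Apple GCC that accepts `-mdynamic-no-pic` as described above, and the description of the generated code is a simplification rather than actual compiler output.

```cpp
// Hypothetical example: a function that reads one piece of static data.
//
// Assumed build lines (Apple GCC on Mac OS X for PowerPC):
//   g++ -O2 -c example.cpp                   // default: position-independent code
//   g++ -O2 -mdynamic-no-pic -c example.cpp  // for executables, not shared libraries
//
// With the default settings the function first has to establish a base
// address in a register and then reach g_scale relative to that base.
// With -mdynamic-no-pic the address of g_scale is encoded directly in the
// instructions, so that set-up work disappears.

static double g_scale = 1.5;  // static data: this access is what PIC makes costly

double scale_sample(double sample) {
    return sample * g_scale;
}
```

Nothing changes at the source level; the whole difference is in how the compiler reaches `g_scale`, which is why existing applications only benefit after a recompile.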
Either way, there is no way to change the ABI now, as it would break all of the existing applications - which is obviously not what Apple (or us) would want.
And after all, who cares about a 10% speed loss? You can always get a faster Mac, right?
Further reading:
Mach-O Runtime Architecture [developer.apple.com]
CFM-Based Runtime Architecture [developer.apple.com]
Update 10/21: Warning, even more technical stuff follows!
Somebody on /. pointed out that you can easily add a new ABI to the system. While this is true, if you do it you run into three problems:
- All current import/exports must go through some kind of mixed mode/thunking glue to map between the ABIs
- You need separate copies of all libraries to link them directly with the ABI native format
- For some plugins, you can't determine what ABI they are compiled with or are expecting to use
The downside is that (1) adds back a lot of the CPU hit that you were trying to get rid of, whereas (2) trades the CPU hit for a lot of extra memory and complexity (i.e., having two copies of the code) - and Mac OS X already uses so much memory that it is a problem.
I've also been told OPENstep runs on RISC processors (non-PowerPC) - however, I have not investigated how the Mach-O ABI works there - quite possible it obeys the PowerPC guidelines, although I am pretty sure it does the same as on PowerPC.
Mac OS X was originally not moved to the PPC ABI because the NeXT Toolchain used the Mach-O ABI and it would have delayed the release of OS X a few months. Now Apple is spending a few months to speed up OS X in ways that may not have been necessary if they had gone with the PPC ABI in the first place.
Thanks to the Macintosh Development oriented channel regulars and an anonymous Apple engineer for helping me with this article.
What impresses me about this article is that I was able to follow you. I couldn't begin to explain it to anyone else, but that was insightful stuff m'man.
Posted by: Josh on October 19, 2002 9:39 PM
I suspect it may have to do with keeping the OS portable across many chip architectures, as NeXTSTEP and OpenStep were.
Posted by: Mason on October 21, 2002 5:53 AM
Nice write-up!
Heh.. you've been slashdotted as well.
Posted by: TigerKR on October 21, 2002 9:46 AM
Your thoughts and insights are valuable.
Why do you post them in nearly unreadable grey type on a white background?
My eyes hurt after just a few seconds reading your blog.
12% - just think of how many hours i could reclaim with that 12% ;)
Posted by: rstevens on October 21, 2002 10:45 AM
With Apple promoting the use of bundles instead of one-file monolithic applications, it seems like you could just beef up the application file sizes so they would be self-contained (i.e. have the proper ABI version compliant libraries) and only run the right code. This just means we're going through another "FAT" binary experience except this time it's even easier to pull out your unwanted code as contextually clicking (control-click) on any bundle lets you open the bundle and manipulate the constituent files. With today's huge hard disk sizes, this is probably the quickest and easiest solution.
Posted by: TM Lutas on October 21, 2002 11:51 AM
I would agree with Josh on this one. I think this aspect of OSX was intended to align with OpenStep's design goals. That is, portability being of greater importance than performance. I would also agree with the point in the article, stating that it was done initially to get OSX out to market quicker. It was the path of least resistance for Apple, at the time.
Posted by: rory on October 21, 2002 12:01 PM
I posted this on slashdot, but figured I'd post it here as well:
It just so happens that a friend of mine has a copy of "PowerPC Microprocessor Family: Programming Environments for 32-bit Microprocessors" sitting on his desk, which I grabbed. Here is how PowerPC processors branch (from section 4.2.4.1 of said dead-tree document):
1. Branch relative addressing mode - the immediate displacement operand is sign extended and added to the current instruction address to produce the branch target address. So, PC relative addressing. There is no need for a programmatically accessible program counter because this is all done by the branch execution unit. Single 32-bit instruction.
2. Branch conditional to relative addressing mode - same as branch relative addressing, except that the branch is only executed if the proper condition codes are set. Single 32-bit instruction.
3. Branch to absolute addressing - the operand address is sign extended and used as the branch target. As the name implies, this is absolute addressing. Only problem is, the operand address is only 23 bits wide in a 32-bit implementation, and with the zero pad, it gives only 25 bits of absolute address (word alignment required). So, if you absolute address anything, you can only absolute address 25 bits worth of the address space.
4. Branch conditional to absolute - same as regular absolute addressing, except that you have to encode condition codes, so the operand address is now only 13 bits if I read the diagrams correctly, meaning that you can only absolutely address 15 bits of address space with the zero pad.
5. Branch conditional to link register - if you clobber the link register, you can branch to a 32-bit address. Of course, you have to clobber the link register, so I would think this would be most helpful in returning from a function call, not going to it, since the link register holds the return address. And if you use it forward instead of returning, you have to load the link register.
6. Branch conditional to count register - same as link register branching as above.
All of that said, the reason that the Mac OS ABI uses PC relative addressing is because the only way to fully address a 32-bit address space is to do PC relative addressing. According to this book, there is no two instruction width branch, eg a branch instruction which encodes an entire 32-bit absolute address in two 32-bit words (one word for branch encoding and condition codes, one word for the whole 32-bit address).
This leads me to believe that there is no way to do all absolute addressing on PowerPC unless you implement new instructions (which will take more time to get to the processor, and to decode) or limit yourself to 15 or 25 bits of the address space.
So, the short version is that there is no way for the Mac OS ABI to do absolute addressing.
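For readers who want to check the arithmetic in the list above, here is a small sketch of how a branch target is formed from a displacement field. It assumes the field widths of the 32-bit PowerPC encoding to be a 24-bit `LI` field for unconditional branches and a 14-bit `BD` field for conditional ones, each padded with two zero bits. That is one bit more than the figures quoted above, so treat the widths as an assumption to verify against the manual. The function names are hypothetical.

```cpp
#include <cstdint>

// Sign-extend the low `bits` bits of `value` (1 <= bits < 32) to 32 bits.
std::int32_t sign_extend(std::uint32_t value, unsigned bits) {
    const std::uint32_t sign_bit = 1u << (bits - 1);
    const std::uint32_t mask = (1u << bits) - 1u;
    const std::uint32_t low = value & mask;
    return static_cast<std::int32_t>((low ^ sign_bit) - sign_bit);
}

// Target of a relative branch: the current instruction address plus the
// displacement field, padded with two zero bits (instructions are word
// aligned) and sign-extended.
std::uint32_t branch_target(std::uint32_t current_address,
                            std::uint32_t field, unsigned field_bits) {
    const std::int32_t displacement = sign_extend(field << 2, field_bits + 2);
    return current_address + static_cast<std::uint32_t>(displacement);
}

// Largest forward reach, in bytes, for a given field width.
std::uint32_t max_forward_reach(unsigned field_bits) {
    return (1u << (field_bits + 1)) - 4u;
}
```

With these assumed widths, `max_forward_reach(24)` is just under 32 MB and `max_forward_reach(14)` is just under 32 KB, which is the practical limit the list is describing for single-instruction branches.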
Posted by: nadador on October 21, 2002 12:28 PM
Sorry to clog the page, but there is this as well. I realize that we're also talking about data addressing, not just branching.
> Its not about branching. Its about data references using PC relative addressing. The PowerPC has no PC relative data addressing modes.
Point taken. Section 4.2.3.1 of the same book is "Integer load and store address generation".
1. Register indirect with immediate index addressing for integer loads and stores - In this case, you get a 16-bit index in the instruction added to the value in a general purpose register, which is used to compute the effective address.
2. Register indirect with index addressing for integer loads and stores - this is the same as above, except that two registers are used and there is no encoded index.
3. Register indirect addressing integer loads and stores - use just one general purpose register as an address for a load or store.
So, the point is that in every case, some form of relative addressing is used. In order to make relocatable code, ie code that can be linked happily with other binary objects, you have to have some sort of reference address, and PC-relative addressing is the only way to do this.
Even though there is no PC-relative addressing mode, the only way to guarantee that the relative addresses used in different object files won't clash is to do PC-relative. The fact that this is not easy on the PowerPC doesn't make it any less necessary.
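The first addressing mode in the list above, a base register plus a signed 16-bit displacement, is the one that matters most for reaching data. A minimal sketch of that calculation, with hypothetical function names, shows why a single base register can only cover a 64 KB window around whatever address it holds:

```cpp
#include <cstdint>

// Effective address for "register indirect with immediate index":
// the base register value plus a sign-extended 16-bit displacement.
std::uint32_t effective_address(std::uint32_t base, std::uint16_t displacement_field) {
    const std::int32_t displacement = static_cast<std::int16_t>(displacement_field);
    return base + static_cast<std::uint32_t>(displacement);
}

// True if `target` can be reached from `base` with one such load or store.
bool reachable_from(std::uint32_t base, std::uint32_t target) {
    const std::int64_t delta =
        static_cast<std::int64_t>(target) - static_cast<std::int64_t>(base);
    return delta >= -32768 && delta <= 32767;
}
```

Whatever the ABI chooses as its base (a faked program counter or a dedicated table pointer), anything outside that window needs an extra instruction or an extra level of indirection.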
Posted by: nadador on October 21, 2002 12:41 PM
Hello,
in response to nadador who said "This leads me to believe that there is no way to do all absolute addressing on PowerPC unless you implement new instructions (which will take more time to get to the processor, and to decode) or limit yourself to 15 or 25 bits of the address space."
I have been programming PowerPC assembly for a few months now, and can assure you that absolute addressing is possible. We do this by first loading a 32-bit value into a register (over two instructions) and then branching to the address in the register. For example, to branch to the label go_here:
lis r4, go_here@h # load top 16 bits into register r4
ori r4, r4, go_here@l # load bottom 16 bits into register r4
b r4 #branch unconditionally to address in r4
ok, so it clobbers a register. life isn't perfect.
-andrew
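The `@h` and `@l` operators in the example above simply split a 32-bit address into two 16-bit halves that fit into instruction immediates. Here is a sketch of that split and of what the `lis`/`ori` pair puts back together on a 32-bit implementation; the helper names are hypothetical.

```cpp
#include <cstdint>

// Upper 16 bits of an address: what `label@h` evaluates to.
std::uint16_t high_half(std::uint32_t address) {
    return static_cast<std::uint16_t>(address >> 16);
}

// Lower 16 bits of an address: what `label@l` evaluates to.
std::uint16_t low_half(std::uint32_t address) {
    return static_cast<std::uint16_t>(address & 0xFFFFu);
}

// Models the register contents after `lis rX, hi` then `ori rX, rX, lo`.
std::uint32_t lis_then_ori(std::uint16_t hi, std::uint16_t lo) {
    std::uint32_t reg = static_cast<std::uint32_t>(hi) << 16;  // lis: hi into the top half
    reg |= lo;                                                 // ori: OR in the bottom half
    return reg;
}
```

Because the full address is baked into the two immediates, code written this way is not position independent, which is the trade-off the rest of this thread is about.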
Posted by: Andrew de los Reyes on October 21, 2002 12:50 PM
Cool, cool. Yeah I think that I both misunderstood the question and my response. I shouldn't post things without drinking more coffee.
What my point should have been is that if you want relocatable code, you have to do some sort of relative addressing, PC-relative being the most common. And since relocatable code is generally preferable because it makes linking more fun, it's a good thing.
Sorry for all the posting, when just that might have done.
Posted by: nadador on October 21, 2002 1:42 PM
Isn't there a solution to this along the lines of the PPC/68K processor switch? Use the native and automatically switch to non-native when necessary? This seems to be what's mentioned in the update, but what's different that this _can't_ be done? Or perhaps, what's different that developers wouldn't want to take advantage of this, as they have for the 68K to PPC switch, or the OS 9 to OS X switch?
Posted by: Bob Terrell on October 21, 2002 1:45 PM
>With today's huge hard disk sizes, this is probably the quickest and easiest solution.
That would be nice if it were true, but the concept is NextStep's.
And I do get the creepy feeling about this architecture. It seems to me that some hacker is going to figure out a good way to exploit this someday.
R.
>I suspect it may have to do with keeping the OS portable across many chip architectures, as NeXTSTEP and OpenStep were.
Portability and addressing the specific architecture your system is going to run on... Yes, OS was for a number of platforms, all CISC I think. But this is not supposed to be OpenStep. Remember Portable Data Objects?
Posted by: Rixstep on October 21, 2002 7:34 PM
Apple should go back to using PEF/CFM.
What's the alternative, creating a 3rd ABI with a separate set of shim libraries?
Crack smokers...
Posted by: strobe on October 21, 2002 7:35 PM
In the past (MacOS
What I just described is the standard PowerPC ABI for handling global data & functions.
Posted by: rincewind on October 21, 2002 7:40 PM
Why create a new ABI at all? Why can't we use the existing PowerPC ABI, which from reading the comments uses a Global Pointer (remember the complaints from x86 Linux users when everything went ELF because it wasted a register for the GP?) to address global data items?
Personally it seems silly to make everything (even data) PC-relative. Would it be possible to generate new code with the PowerPC ABI, and have the linker handle things that are linked with Mach-O objects? Since we're at it, why can't (I know this is dangerous territory) we change to ELF for new objects, executables, and libraries? And force the linker to play tricks when it has to link against old stuff?
Since there'd still be a need to use old Mach-O objects, is there a way to preserve compatibility?
Or is it a losing battle because of the complexity of ABI interactions?
I confess I haven't looked too deep at the Mach-O runtime conventions, but I'll remedy that shortly.
(Note: I am a member of this Macintosh Development IRC Channel. I also have a very good guess who the anonymous Apple engineer is.)
This is an email that was distributed around our channel with some more stuff about MachO:
Relatively similar code is generated by each compiler, but GCC has
subpar code generation compared to CW, which isn't too great itself. MrC
was the best of available PowerPC compilers for the Mac, but is end of
lifed now; however, even that wasn't on par with the optimizer in the
Intel compiler.
Just a quick primer for terminology, just to make sure we're on same
page. PEF and MachO are the binary formats, CFM (Code Fragment
Manager) and DYLD are the respective loaders. PEF/CFM are often used
interchangeably. In PEF, TOC is Table of Contents, RTOC is Register for
Table of Contents (Register 2 is the base pointer). RTOC is a 64k table
of pointers used to get around the lack of a 32 bit load immediate in
PowerPC (also the lack of Program Counter relative addressing). In
MachO, PIC stands for "position-independent code".
PEF has some, but not much, overhead... Omni is NextStep trained, and
seem to be exclusive to that; I don't think they've done much PEF work?
Anyways, on to the numbers.
For a normal macho function call, the PIC can be up to 9 cycles
For a normal CFM function call, the RTOC maintenance is 0 cycles
For a macho cross-module call, you have (on average) an 11 cycle glue
routine which has PIC in it
For a CFM cross-TOC call, you have (on average) a 5 cycle ptr_glue call
Case in point: the _msgSend routine of ObjC is in /usr/lib/libobjc so
every message sent in any piece of ObjC code has that 11 cycle penalty
tacked onto the beginning, and then it gets the 9 cycle PIC penalty once
you actually get there... every object function (method) call in ObjC
goes through that.
Similar penalties apply to global and static data... MachO will impose a
penalty over PEF/CFM for local computational code:
PIC adds 9 cycles to any function which accesses global/static data (like
floating point constant, as might be found in a fast Mac
millisecond/microsecond time reader), whereas CFM adds 0 cycles.
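To see how cycle counts like these could turn into an overall slowdown, here is a back-of-the-envelope model. The overhead figures are the ones quoted in this email; charging them against a fixed number of "useful" cycles per call, and all of the names, are assumptions made purely for illustration. The model ignores caches, pipelining and everything else a real processor does.

```cpp
// Per-call overheads in cycles, as quoted in the email above.
const int kMachOPicCycles = 9;         // PIC set-up in the callee (quoted as "up to 9")
const int kMachOCrossModuleGlue = 11;  // average glue for a cross-module call
const int kCfmTocCycles = 0;           // RTOC maintenance for a normal call
const int kCfmCrossTocGlue = 5;        // average ptr_glue for a cross-TOC call

// Overhead of one cross-module call: glue first, then the callee's own set-up.
int macho_cross_module_overhead() { return kMachOCrossModuleGlue + kMachOPicCycles; }
int cfm_cross_toc_overhead() { return kCfmCrossTocGlue + kCfmTocCycles; }

// Ratio of MachO cost to CFM cost for one cross-module call whose body does
// `useful_cycles` of real work (a hypothetical parameter, not a measurement).
double relative_cost(int useful_cycles) {
    return static_cast<double>(useful_cycles + macho_cross_module_overhead()) /
           static_cast<double>(useful_cycles + cfm_cross_toc_overhead());
}
```

Under this model a call that does 100 cycles of work costs 120 cycles with MachO against 105 with CFM, roughly 14% more, while a 1000-cycle body shrinks the gap to about 1.5%. That spread is consistent with the idea that the penalty depends heavily on how often an application makes calls.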
These numbers come from a friend of mine, who is an Apple engineer. He
also claims that of Carbon PEF, Carbon MachO, and Cocoa MachO, Carbon PEF
apps are usually the fastest, and provided the following extra example:
iTunes recompiled by GCC/MachO was 30% slower at MP3 decompression than with CodeWarrior/PEF.
Lame MP3 Encoder was 50% faster with CodeWarrior PEF than with GCC/MachO, both on OS X.
(both are Carbon apps)
Calling through local function pointers is much faster in MachO.
The code below is 20% faster when compiled to MachO (5.12 sec for CFM, 4 sec for MachO).
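#include <ctime>     // clock(), clock_t, CLOCKS_PER_SEC
#include <iostream>  // std::cout
// NOTE: the header names, the loop bounds and the output expression were lost
// when this comment was posted. The headers follow from the calls used below;
// kOuter, kInner and the printed expression are placeholders, not the
// original values.
const int kOuter = 10000;
const int kInner = 10000;
int TestFunc (int i) { return (i+1); }
int main()
{
    int (*func)(int) = &TestFunc;
    clock_t start = clock();
    for (int i = 0; i < kOuter; i++) {
        for (int j = 0; j < kInner; j++) {
            (*func) (i+j);   // call through the function pointer
        }
    }
    clock_t end = clock ();
    std::cout << double(end - start) / CLOCKS_PER_SEC << " sec" << std::endl;
    return 0;
}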
I think Apple knows this problem very well. But the real answer is portability. If Moto has to close its PowerPC business because it is losing a lot of money, Apple can do nothing but change to an x86 processor.
Posted by: Zoltan on October 22, 2002 3:35 AM
Has anyone checked out the comparative cycle counts (is the info even available yet) for PEF v Mach-O accesses on the IBM 970?
It seems, given trends in CPU design, that this situation would only worsen over time.
I doubt portability was really the primary factor in the decision. Apple _had_ to know that the result of this situation would intrinsically cause porters to favor PEF's performance. Since anything that would direct design towards PEF/CFM is automatically non-portable, they would have been explicitly choosing portability at the expense of virtually ALL of their big ISV's apps.
It seems FAR more likely they just didn't have time/resources to make the release schedules if they'd had to change the basic ABI for so much of Rhapsody/OSX's toolchain/frameworks.
Unfortunately, now they appear utterly stuck. The only solutions readily available (thunking, etc.) would all cause significant slowdowns of existing code. Recompiling with a modified toolchain might solve the Apple-internal switchover, but getting ALL existing Mach-O-using ISV's to recompile their commercial apps (using a new, probably buggy, modified toolchain) is a VERY hard sell.
So...where can Apple go from here? Mach-O's performance penalties will likely worsen over time on new PPC CPUs. What can they do?
To the portability folks: _Apple_ might entertain the notion of hauling everyone over to x86 or such at some future point, but unless they can sell it to Adobe, Macromedia, etc. there'd be no point. I cannot see that as something Apple even _could_ sell, given the pain those ISV's would receive in turn from Microsoft. While that portability is a nice rhetorical concept (and occasional 'big stick' in PPC vendor negotiations), it just doesn't seem practically useful. Punishing ALL existing apps for it does not seem like a "good" strategy for Apple in the long run.
Posted by: JFW on October 22, 2002 10:54 AM
Miklos,
I think you're right, in that calling through a function pointer should be faster on MachO than CFM (it doesn't matter if it's local or not, the compiler will call through the function pointer as if it is a non-local function call).
For a cross-TOC call in CFM, you have to set up the RTOC before you call the function. MachO should just be a jump to the function, where it will set up its reference pointer. Cross-TOC glue in CFM is typically stored out-of-line with the function itself, so there are two branch instructions that you must take to call in CFM, and this glue is probably being flushed from the instruction caches by the function itself. However, the MachO glue is inline with the function, so it is staying cached. So while you're right about the repeated call instance, your test naturally favors MachO for its better cache coherency WRT function calls. Of course, you are generally unlikely to call through a function pointer in a tight loop like you do anyway.
I've seen this discussed openly by Apple engineers on various mailing lists in MachO vs. CFM discussions. Apple engineers are well aware of this problem, but for whatever reason the directive from the top is to move away from CFM.
Posted by: Trillan on October 22, 2002 4:03 PM
Just to clarify the portability of NEXTSTEP: Version 3.3 runs on Intel, m68k, SPARC and PA-RISC hardware (see, for example, http://www.nleymann.de/Nextstep/index.htm). The latter two platforms are, of course, RISC.
Posted by: Toby Thain on October 22, 2002 7:49 PM
The big problem as I see it is the schism between the PEF/CFM and Mach-O development camps.
Apple has been pretty unsuccessful at convincing its major ISVs (Adobe, Macromedia, etc.) to move to Mach-O. One less obvious side effect of this is that all the plug-in markets associated with those big ISVs are thus also stuck in the PEF/CFM camp. It would also seem that any ISV who needs maximum performance has no choice but to go to the PEF/CFM camp. Add those up, and that's a _big_ chunk of the total Mac ISVs.
Meanwhile, Apple is clearly putting the vast majority of its efforts and evangelism towards developing for Mach-O using the PB-GCC toolchains. A toolchain which cannot be used for PEF/CFM development. Apple's old "free" CFM toolchain, MPW, is deprecated, and cannot produce Carbon PEF/CFM, IIRC. This means that the only available toolchain for PEF/CFM development is CodeWarrior. While CW is a rather nice platform (and capable of Mach-O output as well), it's expensive. Which has the effect of inhibiting development from small ISVs and shareware/freeware developers, and in turn inhibiting the growth of the rather important "plug-in" markets of those big ISVs' PEF/CFM apps. It also "single sources" the toolchain for most Mac application development on an _outside_ company, which is never a good thing (and makes ISVs twitchy).
So, where does that leave things?
Apple cannot seem to get its big ISVs to migrate "past" Carbon/PEF/CFM, partly I'm sure on performance grounds. Yet Carbon itself was supposed to be a "transition mechanism", and while Apple's done quite a bit to reduce Carbon apps' "second-class" status in OS X, use of "Cocoa in Carbon" is still a pretty big performance hit.
Carbon/PEF/CFM is likely here to stay. While Apple will undoubtedly continue pushing Mach-O, it seems they also need to resurrect MPW or such so that the Carbon/PEF/CFM toolchain is NOT single-sourced outside Apple. Further, it draws into question the direction Apple should take OS X in the future. Is the answer that Apple needs to refocus on making Carbon the "first-class" OS X API, and start de-emphasizing Cocoa usage?
It certainly would seem very difficult for them to redirect Cocoa towards PEF/CFM. Yet it's equally unlikely that the big ISVs will tolerate their apps being relegated to second-class status when it comes to Apple's tuning of OS X's performance and features. In the short- to mid-term, performance is very much a "critical concern" to Apple marketing, but so is retaining those big ISVs whose apps drive a big percentage of Mac sales.
What to do? Opening PEF and subsidizing adding PEF generation to GCC would certainly be a good start. They also need to figure out how to maximize Carbon/PEF/CFM performance, even if it's at the expense of Cocoa apps (perhaps even moving Cocoa to PEF/CFM, long-term). Will it happen? Probably not. I think there's simply too much Apple exec ego invested in their current direction. And in the end, the users lose.
I just have a hard time seeing how Apple can continue driving OS X tuning and development away from the API and ABI platform used by all its big ISVs. Someone's got to blink here, and whether Avie and Steve like it or not, it's probably got to be them. In the long-term, they need to realize that MacOS is for the ISVs and users, not the other way around. Currently, they seem a wee bit murky on that particular concept. The big ISVs can leave Mac and survive (and the hit they'd take in doing so is shrinking over time). Maybe it's time Apple set egos aside, and examined whether it can survive the big ISVs leaving -- and act accordingly.
"Portability" is meaningless if they lose their critical ISVs.
Posted by: JFW on October 22, 2002 10:57 PM
By the way, it's also very much worth noting that Metrowerks, the producer of the toolchain used by virtually all of the major Mac ISVs, is now owned by Motorola.
Apple migrating off Motorola, even if it's just to IBM, will impact their relationship with Motorola.
I think Apple's need for an in-house Carbon/CFM/PEF toolchain is going to become a critical issue regardless of how they address any CFM/PEF versus Mach-O issues. And it's probably going to become a critical issue sooner as opposed to later.
Posted by: JFW on October 22, 2002 11:11 PM
ARGH!!!! First of all, this couldn't be for strictly portable reasons, because the x86 doesn't have PC relative addressing, nor does it have a program accessible PC! (In x86, though, it's called the IP, or EIP in 32-bit mode)
x86 relies on code being written into explicit locations; if it's PIC code (position independent code), most of it is then done through a faked IP also (same as PPC). Fact is that if you're doing PIC, you need to have IP/PC relative addressing, or at least have a faked register/memory location which holds a base of the program.
The x86 shouldn't need this at all really, when you think about it, because you've all been taught that globals are bad. Well, to tell you the truth, globals _AND_ statics are bad. If you're holding information outside of the function in the heap, then you're doing something wrong. The x86 (and I imagine the PPC, also) is significantly faster using local variables and parameters (single %ebp relative addressing mode) rather than globals (absolute addressing in PDC (position dependent) and worse in PIC).
The solution in both? probably to avoid using globals, and just use locals/parameters... especially since the PPC could easily deal with those with a single register indirection.
Just to make it clear, I'm an x86 assembly guru, and know close to zilch about PPC assembly.
Posted by: Daniel Foesch on October 23, 2002 1:20 AM
Andrew de los Reyes’ example with the “b r4” PowerPC instruction is wrong. The only branch-to-register instructions available in PowerPC are blr (branch to link register) and bctr (branch to counter register), as well as conditional variants of these. Other possibilities are forbidden because of the complications they would cause to the instruction pipeline.
This issue confirms my impression of the whole OS X effort as a triumph of politics over business sense, or even technological sense. The NeXT folks had been trying unsuccessfully for over 10 years to find a market for their technologies, until they managed to take over the running of things at Apple. Now they’re trying to force the Apple market to adopt their way of doing things, even if it kills them.
Posted by: Lawrence D'Oliveiro on October 23, 2002 2:22 AM
JFW says that CodeWarrior is the only tool available for Carbon/PEF/CFM development. This is incorrect. MPW can certainly do it as well.
I agree with JFW’s other comments: Cocoa will have to be phased out at some point. It’s quite clear from developer adoption rates that Carbon is the only OS-X-native API with a long-term future.
Posted by: Lawrence D’Oliveiro on October 23, 2002 2:30 AM
By the way, I never really liked that “official” ptrglue convention for doing cross-TOC calls on PowerPC. My ComponentGluePPC tool generates the following all-inline sequence instead:
lwz r12, TheProc(RTOC)
stw RTOC, 20(sp)
lwz r0, (r12)
lwz RTOC, 4(r12)
mtctr r0
bctrl
I figure this saves a branch over using ptrglue, so would it use less than the 5 cycles that slava and Feanor estimate?
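For readers less fluent in PowerPC assembly, the sequence above can be modelled in C++. This is only a sketch: `TransitionVector`, `g_rtoc` and every other name are invented here, a global variable stands in for the RTOC register, the transition vector is passed in as an argument instead of being loaded from the caller's TOC, and the restore of the caller's TOC after the call (which is not part of the listing above) is added so the model is complete.

```cpp
// A TOC is modelled as an array of pointer-sized slots.
typedef const void* const* Toc;

// Stand-in for the RTOC register.
static Toc g_rtoc = nullptr;

// Two words describing an exported function: where its code lives and
// which TOC it expects to find in RTOC while it runs.
struct TransitionVector {
    int (*code)(int);  // word at offset 0
    Toc toc;           // word at offset 4
};

// Hypothetical callee in another fragment: reads slot 0 of its own TOC.
static const int kLibraryConstant = 40;
static const void* const kLibraryToc[] = { &kLibraryConstant };
int add_library_constant(int x) {
    return x + *static_cast<const int*>(g_rtoc[0]);
}

// Model of the inline cross-TOC call sequence.
int call_cross_toc(const TransitionVector* tv, int argument) {
    Toc saved = g_rtoc;              // stw RTOC, 20(sp)
    int (*target)(int) = tv->code;   // lwz r0, (r12) ; mtctr r0
    g_rtoc = tv->toc;                // lwz RTOC, 4(r12)
    int result = target(argument);   // bctrl
    g_rtoc = saved;                  // restore the caller's TOC (assumed step)
    return result;
}
```

The point of the model is that the callee never computes where its data is: it trusts that RTOC already points at its table, and the cost of a cross-TOC call is the handful of loads and stores needed to make that true.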
Posted by: Lawrence D’Oliveiro on October 23, 2002 2:39 AM
Re: "Apple's need for a in-house Carbon/CFM/PEF toolchain"
They have one! MPW works perfectly, today, for building Carbon and Classic PEF binaries (and, since Metrowerks' pathetic decision to drop 68K in CW7, it's still good for that platform as a bonus).
Posted by: Toby Thain on October 23, 2002 4:26 AM
So Metrowerks dropped 68k - So what? If you're still selling 68k software, then I feel for ya man =).
Additionally, Carbon never was, nor ever will be a transitional technology. Carbon was developed because software vendors demanded a C interface to the GUI, and to tell you the truth, regardless of the advantages of Cocoa, it is good that we have one.
And in the end, that's all Cocoa/Carbon are - GUI builders. Sure they have some additional application support in them both, but for the most part the really interesting stuff in the system is outside of their primary usage range. There is a framework called CarbonCore - a REALLY bad misnomer that I think is only there to make it easy to spot. What CarbonCore really is is a high level interface to the low-level BSD APIs that adds functionality that make all programs better. And on top of it all, CarbonCore is relied on by BOTH Carbon AND Cocoa. So I don't think it's going away anytime soon.
At WWDC Apple stated that currently you can do Cocoa in Carbon (and Carbon in Cocoa). They also stated that in the future they anticipate you being able to mix and match at the widget level. That is, you can take a Cocoa edit text box, and a Carbon popup menu and put them in the same window. Well, the easiest way that I see that being done is if all of the Cocoa widgets are rehosted on top of Carbon widgets (or a 3rd widget library is created that both use). And then Cocoa becomes just like all other application frameworks on MacOS, just implemented in a different language (Obj-C instead of (generally) C++).
Oh, and I wrote an app last night to look at how fast internal/external calls are on CFM vs MachO. Thus far it looks like internal CFM calls are faster than internal MachO calls, however function pointer calls are slower in CFM. Jury's still out on the MachO results however, since some of the data looks a little funny.
Posted by: Rincewind on October 23, 2002 12:41 PM
First of all, my apologies for stating Carbon/PEF/CFM wasn't possible with MPW; that was (kind of) an error. It's possible to do so, but it requires using tools which Apple has declared deprecated and end-of-lifed (notably the in-house compilers, MrC and friends). That clearly is a decision which needs to be revisited, particularly if Apple's about to really tick off Motorola (and by extension, Metrowerks). I'm not sure it's the best approach, however, particularly if the OS itself no longer uses MPW (which is the case, IIRC).
Personally, I think there's a lot to be said for Apple just "buying" (permanent source license, whatever) the CW toolchain, akin to what Be did with their CW environment. I doubt anyone would say that the MPW environment (and particularly, the tools) are substantially better than the tools in, say, CW8.
"Buying" CodeWarrior would give Apple a toolset with longer legs, which most of its developers are using, without leaving Apple at the mercy of Motorola/Metrowerks' future directional changes. It would certainly act to minimize any developer concerns about the single-sourcing of their primary toolchain outside Apple.
Posted by: JFW on October 23, 2002 1:51 PM
Does CarbonCore talk to the BSD layer, or does it talk to the Mach primitives underneath directly? I don't recall seeing much documentation of how it works internally. Also, while CarbonCore may be used by some Cocoa parts, there is still a _substantial_ infrastructure in OS X (including the low-level stuff, such as drivers, etc.) which is living in Obj-C/Mach-O land.
Those are probably the most impacted by inefficiencies in Mach-O, and also (unfortunately) include components of the OS which are among the most performance-critical. The drivers and driver-level APIs are tied up in IOKit (an intrinsically NEXTSTEP/Obj-C/Mach-O construct). The "recommended" user-space client hardware access APIs are also reliant on Obj-C/Mach-O. Even the microkernel itself is designed around and relies upon the Mach-O/dyld environment.
It's very hard to see how Apple could make major changes to those pieces without having _severe_ performance impacts on the system (or incurring massive amounts of work). Having hardware developers rewrite all their drivers, or having Apple do it, is also a huge amount of work. All of this is almost certainly why Apple didn't do it in the first place. Unfortunately, what was convenient before is now (as usual) becoming a problem, functionally and politically.
Posted by: JFW on October 23, 2002 2:44 PM
to post a message accurately explaining the issue:
------
Subject: Re: ABI
From: Dietmar Planitzer
To:
Date: Wed, 23 Oct 2002 19:54:59 +0200
> From: Chris Gehlker
> Date: Tue, 22 Oct 2002 08:52:59 -0700
>
> I'm surprised this hasn't generated a lot of discussion.
> http://www.unsanity.org/archives/000044.php
> Could some subject actually have been talked to death?
Don't think so.
CFM vs. Mach-O the Ultimate Fight
Skimming through the various posts on the site given above, it appears that
a major issue for many people is call overhead of Mach-O vs. CFM.
There are really wild claims flying around, reaching from a 10% up to a 40%
overhead for Mach-O style calls compared to CFM style calls. However,
apparently none of those "wild claimists" has done any actual
comparison of both ABIs nor do I have the impression that they have ever
done any profiling on this issue. Otherwise they wouldn't make such
ridiculous claims.
So, let's compare both ABIs:
CFM is TOC based. This means that every fragment like a shared library comes
with a table which contains so called transition vectors for both the code
and data symbols it exports to the outside world but also the code and data
symbols it itself imports from other fragments.
This TOC is referenced from one of the CPU registers which is specifically
set aside for this task. In the case of the PPC it's GPR2.
A transition vector is a simple data structure which contains both a pointer
to the exported symbol (function or variable) and its associated TOC (so
that the exported function will be able to find its global data).
It's very important to understand that whenever you want to call a function
in another fragment, that you must make sure that GPR2 points to the TOC of
the called fragment. This is so because there is no fixed relationship
between the DATA and TEXT segments. Both can be loaded at any address in
memory by the CFM loader. The only way for a function to find its global
data is to access it through its TOC.
So a cross-fragment or cross-TOC call (i.e. app calls an OS function) must
save the current TOC ('cause we need it again once the func. call returns),
get the transition vector of the callee, establish the callee's TOC and
finally jump to it.
Apple's standard code for this looks like this:
bl moo_glue ;call the cross-fragment glue
lwz R2,R2_save_offs(SP) ;restore the caller's base pointer
...
moo_glue:
lwz R12,tvect_of_moo(R2) ;get pointer to moo's transition vector
stw R2,R2_save_offs(SP) ;save the caller's base pointer
lwz R0,0(R12) ;get moo's entry point
lwz R2,4(R12) ;load moo's base pointer
mtctr R0 ;move entry point to Count Register
bctr ;and jump to moo
As you can see, a cross-TOC call requires 7 instructions overhead whereby 5
of those instructions access memory. Further, note that we access two
different memory areas (stack + transition vector) which may translate into
two cache misses.
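The save/switch/restore dance around GPR2 can be sketched in a few lines of Python. This is only a toy model under stated assumptions: Fragment, CPU, make_tvect and cross_toc_call are hypothetical names invented for illustration, a TOC is modelled as a plain dictionary, and nothing here corresponds to a real Apple API.

```python
# Toy model of a CFM cross-fragment call. All names are illustrative
# (hypothetical), and cpu.r2 stands in for GPR2, the TOC base register.

class Fragment:
    """A loaded fragment with its own TOC, modelled as a dict of globals."""
    def __init__(self, name, globals_):
        self.name = name
        self.toc = dict(globals_)


class CPU:
    def __init__(self, toc):
        self.r2 = toc               # TOC of the code that is currently running


def make_tvect(entry, fragment):
    """A transition vector: (entry point, TOC of the owning fragment)."""
    return (entry, fragment.toc)


def cross_toc_call(cpu, tvect, *args):
    """Mimics the glue: save r2, load the callee's entry and TOC, jump, restore."""
    saved_r2 = cpu.r2               # stw R2,R2_save_offs(SP)
    entry, callee_toc = tvect       # lwz R0,0(R12) and lwz R2,4(R12)
    cpu.r2 = callee_toc
    try:
        return entry(cpu, *args)    # mtctr R0 / bctr
    finally:
        cpu.r2 = saved_r2           # lwz R2,R2_save_offs(SP) after the call


def moo(cpu, x):
    # The callee can only find its global data through r2.
    return x + cpu.r2["moo_bias"]


app = Fragment("app", {"app_counter": 0})
lib = Fragment("lib", {"moo_bias": 100})
cpu = CPU(app.toc)

result = cross_toc_call(cpu, make_tvect(moo, lib), 5)
print(result, cpu.r2 is app.toc)    # 105 True
```

The point of the model is that moo only works because r2 was switched to the library's TOC first, and the caller only keeps working because r2 was switched back afterwards.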
Now, basically the same code sequence is necessary for functions which are
called through a function pointer. In fact a function pointer in the case of
CFM doesn't really point to the function's code, rather it points to the
function's transition vector. Ergo, calling a function through a pointer to
it isn't any faster than calling it directly. Again, the simple reason being
that the function may be called from another fragment, i.e. a callback which
you handed to the OS. Even C++ methods are treated that way.
Accessing exported variables (global vars) is very easy in a TOC based
system. Just get the necessary transition vector and read the pointer to the
variable out from it. This translates into 2 instructions overhead.
Mach-O on the other hand uses a special trick. It simply dictates that the
relationship between the TEXT and DATA segments must be fixed at all times.
This means that the DATA segment will never move relative to the TEXT
segment which again implies that a function in the TEXT segment can access
its global data via a fixed offset. The offset can be computed by the
compiler and is thereafter never changed.
Thus a call from an application function to the OS function looks like this:
bl moo_glue ;call the cross-module glue
...
moo_glue:
lis r11,0 ;load the high 16bits of the symbol
;ptr address
lwz r12, 0x18(r11) ;load the symbol ptr value into r12...
mtspr ctr, r12 ;...and move it into CTR
addi r11, r11, 0x18 ;add the low 16bits, so r11 holds the symbol ptr address
bctr
(__DATA,__la_symbol_ptr) section
0018 00000000 ; lazy symbol pointer
The above code requires 5 instructions and 1 memory access.
This is 2 instructions and 4 memory accesses LESS than in the CFM case.
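The "lazy symbol pointer" in the listing is what makes the stub cheap after the first call. A minimal Python sketch of the idea follows, assuming the usual lazy-binding behaviour: the pointer initially leads to a binder, which looks the symbol up once and overwrites the pointer with the real target. LazyStub, the exports table and the resolve callback are hypothetical names for illustration, not dyld interfaces.

```python
# Toy model of a lazily bound stub. Names are hypothetical.

class LazyStub:
    def __init__(self, symbol, resolve):
        self.symbol = symbol
        self.resolve = resolve          # stands in for the dynamic linker lookup
        self.bind_count = 0
        self.pointer = self._binder     # lazy symbol pointer starts at the binder

    def _binder(self, *args):
        target = self.resolve(self.symbol)
        self.pointer = target           # patch the pointer so later calls skip this
        self.bind_count += 1
        return target(*args)

    def __call__(self, *args):
        # The stub itself: load the pointer and jump through it.
        return self.pointer(*args)


def moo(x):
    return x * 2


exports = {"moo": moo}                  # symbols exported by some other module
moo_stub = LazyStub("moo", exports.__getitem__)

first = moo_stub(3)                     # goes through the binder
second = moo_stub(4)                    # jumps straight to moo
print(first, second, moo_stub.bind_count)   # 6 8 1
```

Only the first call pays for the symbol lookup; every later call is the load-and-jump path that the instruction counts above describe.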
Now, in real life we have actually two important kinds of executables:
applications and frameworks (shared libraries). The technically most
important difference between those is that the former is always loaded at a
fixed address, while the other may be loaded at any address. This is where
PIC (position independent code) comes into play. In order to be able to
physically share the code of frameworks, it's necessary to ensure that they
can be placed at any virtual address without having to change the physical
code.
The following example should make this clear:
Consider framework A. Dyld loads it into memory at physical address 1000.
However, the framework gets mapped into two different application address
spaces at two different virtual addresses: app A sees it at address 4000
while app B sees it at address 7000.
Physical code sharing is only going to work for both apps if the framework's
code can stay the same as it was on disk.
One way to achieve this is to use PC relative addressing for both
exported/imported data and code symbols.
This is why a cross-module function call in the case of a framework or
bundle looks like the following:
bl moo_glue ;call the cross-module glue
...
moo_glue:
mfspr r0,lr ;save LR
bcl 20,31,0xC ;get the address of...
mfspr r11, lr ;...this mfspr instruction into r11
addis r11, r11, 0x0 ; calculate high 16bits of symbol ptr
mtspr lr, r0 ;restore LR
lwz r12, 0x1C(r11) ;move symbol ptr value into r12
mtspr ctr, r12 ;store symbol ptr value into CTR...
bctr ;...branch to CTR
(__DATA,__la_symbol_ptr) section
0028 00000000 ; lazy symbol pointer
In this case we need 8 instructions and 1 memory access.
This is 1 instruction more than the CFM case but still 4 memory accesses
less than in CFM.
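Why PC-relative addressing lets both applications share the same physical code can be shown with a little arithmetic. In the sketch below the load addresses 4000 and 7000 come from the example above and 0x1C is the displacement from the listing; PICBASE_OFFSET is a made-up value, and the whole thing is an illustration rather than a description of what the linker actually emits.

```python
# Hypothetical offset of the instruction after 'bcl' inside the framework.
PICBASE_OFFSET = 0x8
# Displacement used by 'lwz r12, 0x1C(r11)' in the listing above.
PTR_DELTA = 0x1C


def lazy_pointer_address(load_address):
    """Address the stub reads its lazy symbol pointer from, for a given mapping."""
    pic_base = load_address + PICBASE_OFFSET    # the value 'bcl' leaves in LR
    return pic_base + PTR_DELTA                 # same constant in every process


addr_in_app_a = lazy_pointer_address(4000)
addr_in_app_b = lazy_pointer_address(7000)
print(addr_in_app_a, addr_in_app_b)             # 4036 7036
```

The constants baked into the code are identical in both processes; only the value picked up from the program counter differs, so the code pages never need to be patched.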
None of the above glue code is necessary for functions which are called
through a function pointer no matter if the function call crosses a module
boundary or not. Doing such a call is as easy as executing a simple bl
(branch & link) instruction. It's, contrary to CFM, not necessary to muck
about with a transition vector. This is also true for C++ methods and ObjC
methods.
Accessing exported variables (global vars) is quite expensive in Mach-O. It
requires 7 instructions.
This is 5 instructions more than in the CFM case.
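The figures quoted so far are easy to lose track of, so here is a small tally that recomputes the stated differences. The numbers are copied from this message as given (they are not measurements), and the dictionary keys are just labels chosen for the sketch.

```python
# (instructions, memory accesses) for one cross-module call, as quoted above.
CALL_COSTS = {
    "cfm": (7, 5),
    "macho_fixed": (5, 1),      # application loaded at a fixed address
    "macho_pic": (8, 1),        # framework or bundle (position independent)
}

# Instructions needed to reach an exported global variable, as quoted above.
GLOBAL_ACCESS = {"cfm": 2, "macho": 7}


def delta(kind, baseline="cfm"):
    """Difference in (instructions, memory accesses) relative to the baseline."""
    instructions, accesses = CALL_COSTS[kind]
    base_instructions, base_accesses = CALL_COSTS[baseline]
    return (instructions - base_instructions, accesses - base_accesses)


print(delta("macho_fixed"))                             # (-2, -4)
print(delta("macho_pic"))                               # (1, -4)
print(GLOBAL_ACCESS["macho"] - GLOBAL_ACCESS["cfm"])    # 5
```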
What does this all mean?
Mach-O has as its most important advantage, compared to CFM, the fact that
calling a function through a pointer is just as cheap as calling a static
function in the same module as the caller is.
This is very important for OO languages like C++ or ObjC. Let's consider
ObjC:
Every ObjC method call requires a jump to the objc_msgSend() function in the
ObjC runtime library. This function basically looks through the receiver's
class method cache and either instantly jumps to the method implementation,
via a simple ba (branch always), if it was found in the cache or loads the
method implementation's address into the cache and jumps then to it, if it
wasn't there.
Because this function is by far called the most often in an ObjC program it
must first and foremost be as fast as possible. Ideally it shouldn't even
touch any memory at all. This, though, is not possible in reality so we
should at least reduce the memory accesses as much as possible.
Now, in the case of Mach-O things are simple. The method cache contains the
implementation addresses of the various methods and we can simply pick the
required address up from there and branch to it.
In the case of CFM, things would be more complicated. It would be necessary
to store the addresses of the various method transition vectors in the
method cache. After we've picked up the required transition vector we would
have to go through the regular CFM glue code. Thus, objc_msgSend() would
start to access more memory, require more instructions for its work and
naturally become slower than it is today.
Remember, method implementations come from different libraries like
Foundation, AppKit or whatever. So we couldn't take any shortcuts in the CFM
case.
In short: objc_msgSend() requires 1 cross-module call in the case of Mach-O;
it would require 2 cross-TOC calls in the case of CFM.
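The cache-then-jump behaviour described for objc_msgSend() can be sketched roughly as follows. This is a simplified Python model, not the real runtime: ObjCClass, Instance, lookup and msg_send are hypothetical names, the per-class cache is a plain dictionary, and the slow path is assumed to walk the superclass chain.

```python
class ObjCClass:
    def __init__(self, name, superclass=None, methods=None):
        self.name = name
        self.superclass = superclass
        self.methods = dict(methods or {})   # selector -> implementation
        self.cache = {}                      # per-class method cache


class Instance:
    def __init__(self, cls):
        self.cls = cls


def lookup(cls, selector):
    """Slow path: search the class and then its superclasses."""
    while cls is not None:
        if selector in cls.methods:
            return cls.methods[selector]
        cls = cls.superclass
    raise AttributeError(f"unrecognized selector {selector!r}")


def msg_send(receiver, selector, *args):
    cls = receiver.cls
    imp = cls.cache.get(selector)            # fast path: hit in the method cache
    if imp is None:
        imp = lookup(cls, selector)          # miss: find the implementation...
        cls.cache[selector] = imp            # ...and remember it for next time
    return imp(receiver, *args)              # jump straight to the implementation


base = ObjCClass("Base", methods={"describe": lambda self: "an object"})
derived = ObjCClass("Derived", superclass=base)
obj = Instance(derived)

first = msg_send(obj, "describe")            # fills the cache
second = msg_send(obj, "describe")           # served from the cache
print(first, second)
```

In this model the cache holds the implementation itself, which matches the Mach-O case above; a CFM version would have to cache a transition vector and go through the TOC switch on every send.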
The biggest disadvantage of Mach-O is that it is, compared to CFM, more
expensive to access exported variables. This, however, is only true for
global variables which have been declared 'extern' in the source. Variables
which were declared 'static' can be accessed with two instructions just like
in CFM.
What would it have taken to replace Mach-O with CFM:
First of all it would have been necessary to adapt the whole tool chain to
CFM. This includes tools like gcc, gdb, nm, otool, libtool, as, ld. Then
there are the various development apps like Sampler.app which dig around in
the executable file.
Next is the kernel. It would have been necessary to make the Kernel aware of
resource forks and worse, aware of the Resource Manager file format. That's
because PEF stores its "header" in the form of a 'cfrg' resource in the
resource fork (the actual code is stored in the data fork).
Don't think that it would be a good idea to make the Kernel dependent on a
particular file system.
Next in line is the dynamic linking server. "Porting" the existing CFM from
MacOS 9 to MacOS X wouldn't have worked, because of the drastic differences
in the lower layers of both OSs (i.e. memory management). So Apple would
have been required to write a new dynamic linking server from scratch.
But wait we're not finished yet. Next in line is the ObjC runtime which also
depends on the executable file format.
Other differences between Mach-O and CFM:
CFM must relocate all TOC entries after the TOC has been read from disk.
This is because the TOC on disk stores its entries relative to the base
address 0 but the fragment gets loaded at a basically arbitrary address in
memory.
CFM must aggressively resolve all import transition vectors in the TOC at
fragment load time.
Mach-O, because it has no TOC, doesn't have to do these things.
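The two load-time jobs attributed to CFM here (rebasing the TOC entries and eagerly resolving every import) can be sketched like this. It is an illustrative model only: load_fragment, the symbol names and all addresses are made up, and the TOC is represented as a dictionary of offsets relative to base address 0.

```python
def load_fragment(toc_on_disk, load_base, imports, exported_symbols):
    """Rebase every TOC entry and resolve every import up front."""
    # Entries on disk are relative to base address 0, so add the load address.
    relocated = {name: load_base + offset for name, offset in toc_on_disk.items()}
    # Every import is looked up at load time, whether or not it is ever called.
    resolved = {name: exported_symbols[name] for name in imports}
    return relocated, resolved


toc_on_disk = {"moo": 0x40, "counter": 0x200}                # hypothetical entries
os_exports = {"os_draw": 0x9000, "os_unused": 0x9100}        # hypothetical exports

relocated, resolved = load_fragment(toc_on_disk, 0x5000, ["os_draw"], os_exports)
print({name: hex(addr) for name, addr in relocated.items()})
print({name: hex(addr) for name, addr in resolved.items()})
```

Both loops run once per fragment load and scale with the number of TOC entries and imports, which is the up-front cost being contrasted with Mach-O's lazy approach.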
PEF requires in its current incarnation both a data and a resource fork per
executable file. This implies that the dynamic loader has to open and read
from two files (technically forks are just as expensive to maintain as
independent real files). It also ties the executable file format to a
particular FS - may have been cool in the 1960s but not cool in the 21st
century.
Mach-O is strictly a single fork file format. Though, it still manages to be
completely extensible and may hold code for more than one CPU in the same
file.
Some context for comparing CFM to Mach-O:
CFM is deployed on a system where 90% of all APIs are exported by a single
shared library: InterfaceLib.
Take InterfaceLib and QuickTimeLib and you've got 99.9% of all system APIs
covered.
Mach-O is deployed on a system which comes with 125 frameworks and 20 shared
libraries.
The average Cocoa or Carbon application links directly and indirectly
against roughly 30 frameworks.
Regards,
Dietmar Planitzer
_______________________________________________
MacOSX-talk mailing list
MacOSX-talk@omnigroup.com
http://www.omnigroup.com/mailman/listinfo/macosx-talk
--------------------
Above and beyond that, those who have been testing the performance have been cheating!
As stated in the message above, CFM takes a hit at load time (before anything is executed) for EVERY imported function, so that it can translate each TOC address into its real address.
Simply put, that means an additional performance hit for every program you load, each time you load it.
Dietmar Planitzer seems to think that having a kernel that knows about resource forks means tying it to one particular file system.
This is not true. Not counting HFS and HFS Plus, there is already a widely-used filesystem that has integrated support for MacOS resource forks and Finder info, indeed it is fully extensible to cater for any conceivable kind of metadata. Soon it will probably be the most widely-used filesystem in the world. I’m talking about Microsoft NTFS.
The idea that all files should be “flat”—just consist of a data fork, without support for extensible metadata—that is a concept more suited to the 1960s than the 21st century.
Posted by: Lawrence D’Oliveiro on October 24, 2002 6:29 AM
Actually, outside of the Cocoa Frameworks nothing in the system depends on Cocoa. IOKit at its lowest level is C and the drivers themselves are C++. I can't give you a direct implementation of CarbonCore (haven't disassembled THAT much of it =p) but I can tell you that for the most part it is implemented on the BSD APIs (after all, anything above them would cause infinite recursion =p). To add behavior parity between Cocoa & Carbon, Cocoa is moving towards using CarbonCore where it makes sense (e.g. in 10.0 -> 10.1 Cocoa's document handling moved from file paths to FSRefs & Aliases).
Getting back to the CFM/MachO bit. The performance hit at startup from CFM having to resolve all of its function/data imports is small in a program that actually uses those imports a lot. As an example, just imagine traversing a relatively large directory structure. You may call into the library a few hundred thousand times just for that, with each call paying its overhead. Depending on that overhead it may make sense to amortize it up front if you can.
Posted by: rincewind on October 24, 2002 8:12 AM
Wow, this article sucks. I've noticed lots of articles lately from people trying to be the first to uncover some nasty plot, or flaw, or side effects in OS X. Please do research first. Or just keep your mouth shut when you don't know what you are talking about.
How in the world could the MachO ABI carry over CISC principles from the x86 when the x86 doesn't even support PC-relative addressing? The x86 has PIC overhead as well, as you must execute extra instructions to obtain the current PC. One could say that the processors are at fault for poor PIC support, but superscalar and superpipelined processors can't easily maintain PC information for every in-flight instruction.
I can't help laughing at your naive insinuations that the ABI ignores RISC principles. You completely ignore the fact that the ABI specifies register usage and other components which are CPU specific. The PowerPC has 32 registers, and MachO allocates 8 of them for parameter passing, one for the stack, some as callee-saved, some as caller-saved, etc.
Before criticizing an ABI, first look into the issues. Study the different ABIs and their uses. They always make compromises. The PowerPC has a variety of ABI standards: SVR4, embedded SVR4, AIX, MachO, and whatever OS 9 was using. But making code into a shared library is always difficult and has overhead. Also consider ABIs from other platforms to help put the issues into perspective. They may take different approaches that help clarify the trade-offs and compromises.
But really, just keep quiet. You don't know what you are talking about.
Posted by: wapentake on October 24, 2002 8:16 AM
> Cocoa will have to be phased out at some point. It’s quite clear from developer adoption rates that Carbon is the only OS-X-native API with a long-term future.
My friend, that is simply not going to happen. Read what you yourself wrote about the NS people. If they can bulldozer a CISC ABI on Apple with the latter's RISC CPU, do you really think someone else is going to budge them on Cocoa? They're not going to let that happen, Sente is not going to let that happen, Misc and Omni are not going to let that happen, etc etc etc. Besides - toss aside the possible politics here and give the learning curve some time to take effect, and you have the most powerful development environment on the planet.
Carbon is waaaaay too confused with all the quiltwork legacy APIs. There is just no way they can hold all that together. I can see them scratching Carbon, but never Cocoa. It's a mess of a mess whatever way you look at it, but no one is ever going to sacrifice the love child.
R.
Posted by: Rixster on October 26, 2002 8:06 PM
Again, neither API is going away. Carbon is not a quiltwork of legacy APIs. Yes, there is legacy in Carbon, but that part is quite deprecated. Anyone developing a new App based on legacy APIs is setting themselves up for failure. But if Apple hadn't left in that legacy, then MacOS X would have been a HUGE failure.
Mark my words, Carbon in 2 or 3 years will be as different from the MacOS 9 toolbox as the MacOS 9 toolbox is different from the System 1 toolbox.
Posted by: Rincewind on October 27, 2002 5:47 PM
"And after all, who cares about a 10% speed loss? You can always get a faster Mac, right?"
That 10% is significant. Ten percent of 1 GHz is 100 MHz, which is more clock speed than entire computers had only a few years ago. That 100 MHz is useful for multitasking. Without it, less gets done.
Posted by: Golem on November 19, 2002 12:40 PM
Decisions are not made in a vacuum.
The next-generation Mac OS had been tried several times, and all had failed up to this point. I can completely understand them wanting to get something working out the door now, rather than spend 6 months fixing things that were already working fine.
Especially when you consider at that point, the G4 was all they had. One of the goals of Mac OS X, AFAICT, was to make them less dependent on the PPC. The G4 wasn't the hottest chip on the block any more, and Mac OS X isn't terribly G4-specific (rumor has it there's an internal project called "Marklar" that has an x86 Mac OS X port).
You can mock them for saving a few months early on at the cost of a few months of optimizing later, but if they had needed to wait a few months more to ship anything at all, would there even be a Mac OS X today? Would some companies have simply given up if Apple had said "Yeah, *this* next-generation Mac OS is going to be really great, and we're really going to ship it this time. It's done, but we're reworking the ABI to be completely incompatible with what's running today, so you can't have it for another 6 months -- but trust us"?
For each part of any design, you can find something that's not optimal. If they waited to make everything completely optimal before shipping, it never would have shipped. I, for one, would rather have a good Mac OS X today than a perfect one never.
Posted by: on November 11, 2004 4:02 PM
Hindsight certainly seems to be 20/20 with OS X on X86 coming out. :)
Posted by: Mikey on July 9, 2005 10:27 AM
Hindsight certainly seems to be 20/20 with OS X on X86 coming out. :
Yes, and many of these comments look positively stupid considering where Apple is with OS X as of 2006. It's amazing the kind of zealotry shown here, with people advocating RISC over CISC, Carbon over Cocoa, platform-specific resource forks and PEF over Mach-O just because it was stuff from NeXT and not "Apple".
Just, Amazing.
Posted by: Ken on May 7, 2006 1:13 AM
