Green Living & Real Estate Marketing

May 30, 2008

Memory Model

One of the suggestions for a blog entry was the managed memory model.  This is timely, because we’ve just been revising our overall approach to this confusing topic.  For the most part, I write about product decisions that have already been made and shipped.  In this note, I’m discussing future directions.  Be skeptical.


So what is a memory model?  It’s the abstraction that makes the reality of today’s exotic hardware comprehensible to software developers.

 

The reality of hardware is that CPUs are renaming registers, performing speculative and out-of-order execution, and fixing up the world during retirement.  Memory state is cached at various levels in the system (L0 thru L3 on modern X86 boxes, presumably with more levels on the way).  Some levels of cache are shared between particular CPUs but not others.  For example, L0 is typically per-CPU but a hyper-threaded CPU may share L0 between the logical CPUs of a single physical CPU.  Or an 8-way box may split the system into two hemispheres with cache controllers performing an elaborate coherency protocol between these separate hemispheres.  If you consider caching effects, at some level all MP (multi-processor) computers are NUMA (non-uniform memory access).  But there’s enough magic going on that even a Unisys 32-way can generally be considered as UMA by developers.


 

It’s reasonable for the CLR to know as much as possible about the cache architecture of your hardware so that it can exploit any imbalances.  For example, the developers on our performance team have experimented with a scalable rendezvous for phases of the GC.  The idea was that each CPU establishes a rendezvous with the CPU that is “closest” to it in distance in the cache hierarchy, and then one of this pair cascades up a tree to its closest neighbor until we reach a single root CPU.  At that point, the rendezvous is complete.  I think the jury is still out on this particular technique, but they have found some other techniques that really pay off on the larger systems.

 

Of course, it’s absolutely unreasonable for any managed developer (or 99.99% of unmanaged developers) to ever concern themselves with these imbalances.  Instead, software developers want to treat all computers as equivalent.  For managed developers, the CLR is the computer and it better work consistently regardless of the underlying machine.

 

Although managed developers shouldn’t know the difference between a 4-way AMD server and an Intel P4 hyper-threaded dual proc, they still need to face the realities of today’s hardware.  Today, I think the penalty of a CPU cache miss that goes all the way to main memory is about 1/10th the penalty of a memory miss that goes all the way to disk.  And the trend is clear.

 

If you wanted good performance on a virtual memory system, you’ve always been responsible for relieving the paging system by getting good page density and locality in your data structures and access patterns.

 

In a similar vein, if you want good performance on today’s hardware, where accessing main memory is a small disaster, you must pack your data into cache lines and limit indirections.  If you are building shared data structures, consider separating any data that’s subject to false sharing.
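As a rough illustration of separating data that is subject to false sharing, here is a minimal sketch in C#.  The names PaddedCounter and PerThreadCounters are hypothetical, and the 64-byte cache line size is an assumption that holds on many machines but not all.  Each worker writes only to its own slot, and the padding keeps neighboring slots off the same cache line.

```csharp
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical type for illustration. Assumes 64-byte cache lines:
// each Value sits 128 bytes from its neighbors in an array, so two
// slots should never share a line.
[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct PaddedCounter
{
    [FieldOffset(64)]
    public long Value;
}

public static class PerThreadCounters
{
    // One slot per worker; each worker increments only its own slot,
    // so no locking is needed and no cache line bounces between CPUs.
    public static long[] CountInParallel(int workers, int incrementsPerWorker)
    {
        var slots = new PaddedCounter[workers];
        var threads = new Thread[workers];
        for (int i = 0; i < workers; i++)
        {
            int index = i; // stable copy for the closure
            threads[i] = new Thread(() =>
            {
                for (int n = 0; n < incrementsPerWorker; n++)
                    slots[index].Value++;
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join(); // Join makes the writes visible here

        var totals = new long[workers];
        for (int i = 0; i < workers; i++) totals[i] = slots[i].Value;
        return totals;
    }
}
```

Whether the padding pays off depends on the hardware and the access pattern, so measure before committing to a layout like this.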

 

To some extent, the CLR can help you here.  On MP machines, we use lock-free allocators which (statistically) guarantee locality for each thread’s allocations.  Any compaction will (statistically) preserve that locality.  Looking into the very far future (perhaps after our sun explodes) you could imagine a CLR that can reorganize your data structures to achieve even better performance.

 

This means that if you are writing single-threaded managed code to process a server request, and if you can avoid writing to any shared state, you are probably going to be pretty scalable without even trying.

 

Getting back to memory models, what is the abstraction that will make sense of current hardware?  It’s a simplifying model where all the cache levels disappear.  We pretend that all the CPUs are attached to a single shared memory.  Now we just need to know whether all the CPUs see the same state in that memory, or if it’s possible for some of them to see reordering in the loads and stores that occur on other CPUs.

 

At one extreme, we have a world where all the CPUs see a single consistent memory.  All the loads and stores expressed in programs are performed in a serialized manner and nobody perceives a particular thread’s loads or stores being reordered.  That’s a wonderfully sane model which is easy for software developers to comprehend and program to.  Unfortunately, it is far too slow and non-scalable.  Nobody builds this.

 

At the other extreme, we have a world where CPUs operate almost entirely out of private cache.  If another CPU ever sees anything my CPU is doing, it’s a total accident of timing.  Because loads and stores can propagate to other CPUs in any random order, performance and scaling are great.  But it is impossible for humans to program to this model.

 

In between those extremes are a lot of different possibilities.  Those possibilities are explained in terms of acquire and release semantics:

 

  • A normal load or store can be freely reordered with respect to other normal load or store operations.
  • A load with acquire semantics creates a downwards fence.  This means that normal loads and stores can be moved down past the load.acquire, but nothing can be moved to above the load.acquire.
  • A store with release semantics creates an upwards fence.  This means that normal loads and stores can be moved above the store.release, but nothing can be moved to below the store.release.
  • A full fence is effectively an upwards and downwards fence.  Nothing can move in either direction across a full fence.

 

A super-strong extreme model puts a full fence after every load or store.  A super-weak extreme model uses normal loads and stores everywhere, with no fencing.

 

The most familiar model is X86.  It’s a relatively strong model.  Stores are never reordered with respect to other stores.  But, in the absence of data dependence, loads can be reordered with respect to other loads and stores.  Many X86 developers don’t realize that this reordering is possible, though it can lead to some nasty failures under stress on big MP machines.

 

In terms of the above, the memory model for X86 can be described as:

 

  1. All stores are actually store.release.
  2. All loads are normal loads.
  3. Any use of the LOCK prefix (e.g. ‘LOCK CMPXCHG’ or ‘LOCK INC’) creates a full fence.

 

Historically, Windows NT has supported Alpha and MIPS computers.

 

Looking forward, Microsoft has announced that Windows will support Intel’s IA64 and AMD’s AMD64 processors.  Eventually, we need to port the CLR to wherever Windows runs.  You can draw an obvious conclusion from these facts.

 

AMD64 has the same memory model as X86.

 

IA64 specifies a weaker memory model than X86.  Specifically, all loads and stores are normal loads and stores.  The application must use special ld.acq and st.rel instructions to achieve acquire and release semantics.  There’s also a full fence instruction, though I can’t remember the opcode (mf?).

 

Be especially skeptical when you read the next paragraph:

 

There’s some reason to believe that current IA64 hardware actually implements a stronger model than is specified.  Based on informed hearsay and lots of experimental evidence, it looks like normal store instructions on current IA64 hardware are retired in order with release semantics.

 

If this is indeed the case, why would Intel specify something weaker than what they have built?  Presumably they would do this to leave the door open for a weaker (i.e. faster and more scalable) implementation in the future.

 

In fact, the CLR has done exactly the same thing.  Section 12.6 of Partition I of the ECMA CLI specification explains our memory model.  This explains the alignment rules, byte ordering, the atomicity of loads and stores, volatile semantics, locking behavior, etc.  According to that specification, an application must use volatile loads and volatile stores to achieve acquire and release semantics.  Normal loads and stores can be freely reordered, as seen by other CPUs.

 

What is the practical implication of this?  Consider the standard double-checked locking protocol:

 

if (a == null)
{
  lock(obj)
  {
    if (a == null) a = new A();
  }
}

 

This is a common technique for avoiding a lock on the read of ‘a’ in the typical case.  It works just fine on X86.  But it would be broken by a legal but weak implementation of the ECMA CLI spec.  It’s true that, according to the ECMA spec, acquiring a lock has acquire semantics and releasing a lock has release semantics.

 

However, we have to assume that a series of stores have taken place during construction of ‘a’.  Those stores can be arbitrarily reordered, including the possibility of delaying them until after the publishing store which assigns the new object to ‘a’.  At that point, there is a small window before the store.release implied by leaving the lock.  Inside that window, other CPUs can navigate through the reference ‘a’ and see a partially constructed instance.

 

We could fix this code in various ways.  For example, we could insert a memory barrier of some sort after construction and before assignment to ‘a’.  Or, if construction of ‘a’ has no side effects, we could move the assignment outside the lock, and use an Interlocked.CompareExchange to ensure that assignment only happens once.  The GC would collect any extra ‘A’ instances created by this race.
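Those two fixes might look roughly like the sketch below.  The class A here is a trivial stand-in, and LazyA, GetWithBarrier and GetWithCompareExchange are hypothetical names chosen for illustration.  Thread.MemoryBarrier and Interlocked.CompareExchange are the real framework calls; everything else is an assumption.

```csharp
using System.Threading;

// Trivial stand-in for the 'A' in the text; its constructor has no side effects.
public class A
{
    public int Value = 42;
}

public static class LazyA
{
    private static A a;
    private static A b;
    private static readonly object obj = new object();

    // Fix 1: a full barrier after construction and before the publishing store,
    // so the constructor's stores cannot be delayed past the assignment to 'a'.
    public static A GetWithBarrier()
    {
        if (a == null)
        {
            lock (obj)
            {
                if (a == null)
                {
                    A temp = new A();
                    Thread.MemoryBarrier();
                    a = temp;
                }
            }
        }
        return a;
    }

    // Fix 2: no lock at all. Racing threads may each build a candidate, but
    // CompareExchange (a full fence) publishes only the first one; the losers'
    // instances become garbage for the GC to collect.
    public static A GetWithCompareExchange()
    {
        if (b == null)
        {
            A candidate = new A();
            Interlocked.CompareExchange(ref b, candidate, null);
        }
        return b;
    }
}
```

Both variants only address the writing side described above.  A stricter version aimed at the weakest legal implementations would probably also read the field with a volatile load.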

 

I hope that this example has convinced you that you don’t want to try writing reliable code against the documented CLI model.

 

I wrote a fair amount of “clever” lock-free thread-safe code in version 1 of the CLR.  This included techniques like lock-free synchronization between the class loader, the prestub (which traps first calls on methods so it can generate code for them), and AppDomain unloading so that I could back-patch MethodTable slots efficiently.  But I have no desire to write any kind of code on a system that’s as weak as the ECMA CLI spec.

 

Even if I tried to write code that is robust under that memory model, I have no hardware that I could test it on.  X86, AMD64 and (presumably) IA64 are stronger than what we specified.

 

In my opinion, we screwed up when we specified the ECMA memory model.  That model is unreasonable because:

 

  • All stores to shared memory really require a volatile prefix.
  • This is not a productive way to code.
  • Developers will often make mistakes as they follow this onerous discipline.
  • These mistakes cannot be discovered through testing, because the hardware is too strong.

 

So what would make a sensible memory model for the CLR?

 

Well, first of all we would want to have a consistent model across all CLI implementations.  This would include the CLR, Rotor, the Compact Frameworks, SPOT, and (ideally) non-Microsoft implementations like Mono.  So putting a common memory model into an ECMA spec was definitely a good idea.

 

It goes without saying that this model should be consistent across all possible CPUs.  We’re in big trouble if everyone is testing on X86 but then deploying on Alpha (which had a notoriously weak model).

 

We would also want to have a consistent model between the native code generator (JIT or NGEN) and the CPU.  It doesn’t make sense to constrain the JIT or NGEN to order stores, but then allow the CPU to reorder those stores.  Or vice versa.

 

Ideally, the IL generator would also follow the same model.  In other words, your C# compiler should be allowed to reorder whatever the native code generator and CPU are allowed to reorder.  There’s some debate whether the converse is true.  Arguably, it is okay for an IL generator to apply more aggressive optimizations than the native code generator and CPU are permitted, because IL generation occurs on the developer’s box and is subject to testing.

 

Ultimately, that last point is a language decision rather than a CLR decision.  Some IL generators, like ILASM, will rigorously emit IL in the sequence specified by the source code.  Other IL generators, like Managed C++, might pursue aggressive reordering based on their own language rules and compiler optimization switches.  If I had to guess, IL generators like the Microsoft compilers for C# and VB.NET would decide to respect the CLR’s memory model.

 

We’ve spent a lot of time thinking about what the correct memory model for the CLR should be.  If I had to guess, we’re going to switch from the ECMA model to the following model.  I think that we will try to persuade other CLI implementations to adopt this same model, and that we will try to change the ECMA specification to reflect this.

 

  1. Memory ordering only applies to locations which can be globally visible or locations that are marked volatile.  Any locals that are not address exposed can be optimized without using memory ordering as a constraint since these locations cannot be touched by multiple threads in parallel.
  2. Non-volatile loads can be reordered freely.
  3. Every store (regardless of volatile marking) is considered a release.
  4. Volatile loads are considered acquire.
  5. Device oriented software may need special programmer care.  Volatile stores are still required for any access of device memory.  This is typically not a concern for the managed developer.
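To make rules 3 and 4 concrete, here is a minimal sketch of a one-shot publication pattern.  Mailbox, Publish and TryRead are hypothetical names.  The flag is marked volatile so that the read is an acquire (rule 4); under the model above the store to the flag would already be a release (rule 3), and the volatile marking also makes it one under the current ECMA rules.

```csharp
public class Mailbox
{
    private int payload;          // ordinary field
    private volatile bool ready;  // volatile: loads are acquire, stores are release

    // Writer: store the data first, then set the flag. Because the flag store
    // is a release, the payload store cannot be moved below it.
    public void Publish(int value)
    {
        payload = value;
        ready = true;
    }

    // Reader: the volatile load of the flag is an acquire, so the payload load
    // cannot be moved above it. If the flag is seen, the payload is seen too.
    public bool TryRead(out int value)
    {
        if (ready)
        {
            value = payload;
            return true;
        }
        value = 0;
        return false;
    }
}
```

Note that the reader needs the volatile load even under the stronger model, because non-volatile loads remain free to reorder (rule 2).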

 

If you’re thinking this looks an awful lot like X86, AMD64 and (presumably) IA64, you are right.  We also think it hits the sweet spots for compilers.  Reordering loads is much more important for enabling optimizations than reordering stores.

 

So what happens in 10 years when these architectures are gone and we’re all using futuristic Starbucks computers with an ultra-weak model?  Well, hopefully I’ll be living the good life in retirement on Maui.  But the CLR’s native code generators will generate whatever instructions are necessary to keep stores ordered when executing your existing programs.  Obviously this will sacrifice some performance.

 

The trade-off between developer productivity and computer performance is really an economic one.  If there’s sufficient incentive to write code to a weak memory model so it can execute efficiently on future computers, then developers will do so.  At that point, we will allow them to mark their assemblies (or individual methods) to indicate that they are “weak model clean”.  This will permit the native code generator to emit normal stores rather than store.release instructions.  You’ll be able to achieve high performance on weak machines, but this will always be “opt in”.  And we won’t build this capability until there’s a real demand for it.

 

I personally believe that for mainstream computing, weak memory models will never catch on with human developers.  Human productivity and software reliability are more important than the increment of performance and scaling these models provide.

 

Finally, I think the person asking about memory models was really interested in where he should use volatile and fences in his code.  Here’s my advice:

 

  • Use managed locks like Monitor.Enter (C# lock / VB.NET synclock) for synchronization, except where performance really requires you to be “clever”.
  • When you’re being “clever”, assume the relatively strong model I described above.  Only loads are subject to re-ordering.
  • If you have more than a few places that you are using volatile, you’re probably being too clever.  Consider backing off and using managed locks instead.
  • Realize that synchronization is expensive.  The full fence implied by Interlocked.Increment can be many 100’s of cycles on modern hardware.  That penalty may continue to grow, in relative terms.
  • Consider locality and caching effects like hot spots due to false sharing.
  • Stress test for days with the biggest MP box you can get your hands on.
  • Take everything I said with a grain of salt.
