Christoph Koppe, 05/14/96
Large caches in scalable shared-memory architectures can hide the high memory access time only if data is referenced within the address scope of the cache. Consequently, locality is the key issue for multiprocessor performance. One goal of software development is therefore a high degree of reference locality from the system level up to the application level. Even if application designers develop code with high reference locality, the benefit of caches is reduced when scheduling policies ignore locality information. A scheduler that disregards locality will initiate switches into uncached process contexts. The consequences are cache and TLB misses on the processor of interest and cache line invalidations in the caches of other processors.
NUMA architectures like KSR or Convex SPP already provide locality information gathered by special processor monitors or by the cache coherence hardware. Upcoming processor generations (e.g. HPPA 8000 or SGI R10000) include a monitoring unit. A processor monitor can count events such as read/write cache misses and processor stall cycles due to load and store operations. Locality information about each process or thread, such as the cache miss rate, the processor stall time and the processor of last execution, is used to calculate a priority value.
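The text does not state how the priority value is computed. As a minimal sketch only, the following Python fragment shows one way the three pieces of locality information named above could be combined. The names `ThreadStats`, `locality_priority` and `pick_next`, the weights, and the linear formula itself are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThreadStats:
    """Per-thread locality information, as a processor monitor might report it."""
    name: str
    cache_miss_rate: float  # misses per memory reference, 0.0 .. 1.0
    stall_fraction: float   # fraction of cycles stalled on loads/stores, 0.0 .. 1.0
    last_cpu: int           # processor on which the thread last executed


# Hypothetical weights; the source does not specify any.
AFFINITY_BONUS = 1.0
MISS_WEIGHT = 0.5
STALL_WEIGHT = 0.5


def locality_priority(thread: ThreadStats, cpu: int) -> float:
    """Return an illustrative priority of `thread` for processor `cpu`.

    Assumption: a thread that last ran on `cpu` probably still has cached
    state there and gets a bonus; a high miss rate or stall time lowers
    the priority.
    """
    affinity = AFFINITY_BONUS if thread.last_cpu == cpu else 0.0
    penalty = (MISS_WEIGHT * thread.cache_miss_rate
               + STALL_WEIGHT * thread.stall_fraction)
    return affinity - penalty


def pick_next(ready: list[ThreadStats], cpu: int) -> ThreadStats:
    """Select the ready thread with the highest locality priority for `cpu`."""
    return max(ready, key=lambda t: locality_priority(t, cpu))
```

Under these assumed weights, a scheduler running on a given processor prefers threads that last executed there, and among those it prefers the ones with fewer misses and stalls. A real policy would also have to weigh fairness and starvation, which this sketch ignores.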
The parallelism expressed using "UNIX-like" heavy-weight processes and shared memory segments is coarse-grained and too inefficient for general-purpose parallel programming, because all operations on processes, such as creation, deletion and context switching, invoke complex kernel activities and incur costs from cache and TLB misses due to address space changes. Contemporary operating systems (like Sun's Solaris or Mach) offer middle-weight kernel-level threads that decouple address spaces from execution entities. Multiple kernel threads mapped to multiple processors can speed up a parallel application. But kernel threads offer only a medium-grained programming model, because thread management implies expensive protected system calls. The potential benefit of using locality information increases with the frequency of scheduling decisions. Consequently, the benefit of using locality information in the kernel is limited by the low frequency of scheduling decisions.
By moving thread management and synchronization to user level, the cost of thread management operations can be reduced drastically, to one order of magnitude more than the cost of a procedure call. Some advantages of user-level threads are: