Last time: * Think Parallel * Disclaimer * Overview Today: * HW basics (hardware, not homework) * Reminder on single-processor machines * Symmetric multiprocessing (SMP, UMA) - Multiprocessing - Multithreading * Cache consistency * Asymmetric multiprocessors - Shared (NUMA, non-uniform memory access) - Discrete (NORMA, no remove memory access) Most of our semester will be spent with other things than discrete memory multiprocessors. NORMA is Google space, however. Re: question about readings, the first chapter or two of TBB, and we're not quite ready for the beginning chapter of the patterns book (I thought it was fine). The 470 text does the best job of talking about this. There will be some other readings posted somewhere (CTools) as well. This is going to be a CS version of the hardware coverage, not the EE version. Single processor has registers and cache, bus connects it to I/O devices and physical memory. I/O devices are typically active -- they can act on their own (see: DMA transfers). The bus manages access to memory; it's the point of arbitration between everything else and memory. In some sense, the CPU is the master of the bus (insert song here) and the I/O devices are the slaves. That is, the CPU gets to tell the disk what piece of physical memory to dump into for its DMA transfers. Cache design: The cache has some number of blocks, and each block has "words" of data and metadata describing the cache block (the tag, dirty bit, and maybe some more stuff we're not going to worry about). The 3 Cs of cache misses: * Compulsory misses - first time you use data, it HAS to be a miss * Capacity misses - the cache is smaller than physical memory, if our working set is too big we have to kick something. * Conflict misses - if cache not fully associative, blocks kick each other out of just their set. Multiprogramming: sharing one CPU across many processes. (KIND OF IMPORTANT, LOLZ) 1) One process is "active", or running on the CPU. Address space: ========== | stack | +--------+ | } | | | | +--------+ | heap | +--------+ | globals| +--------+ | text | ========== Stack + registers describes the "point in time" of a thread, the entire address space is the process state. 2) There are some processes ready to run. To switch from running to ready, we pack up the state of the running process, load the state of the ready process, and kick it to make it go. By the way, the cache is totally hosed after a context switch because different processes share nothing. So, the simplest picture of a multiprocessor just adds more CPUs onto the bus. (see bnoble scribbles for more pictures; ASCII art is so limiting) This picture is a symmetric multiprocessor, because each CPU has equivalent access to each item on the bus. In particular, they have equivalent access to memory -- cache miss penalty for each CPU is the same. Both CPUs are active; the CS word is that they are "peers". SMPs are also sometimes called UMA machines. SMPs are kinda awesome for multiprogramming because they have more cache - one cache per CPU. There are clever OS tricks for dealing with this: gang/affinity scheduling (try to keep scheduling a process on the same CPU repeatedly just in case something gets left in the cache). This is a win because there's effectively no sharing. Multithreading is more "interesting" than multiprogramming. If you have a compute-bound process that happens to be amenable to parallelization, then we can apply multithreading. ========== | stack1 | +--------+ | stack2 | +--------+ | | +--------+ | heap | +--------+ | globals| +--------+ | text | ========== Need some padding between the stacks, whatever. First real serious problem: what happens if we have a global variable x that is cached by both processors? (x is *replicated* from the memory to each processor) If CPU1 changes x and CPU2 later reads x, it would be nice if CPU2 got the most recent value of x. Some machines don't guarantee this, but we'll talk about machines that do. (This is called sequential consistency.) Intuition behind solving the problem: arbitrate "ownership". At most one processor owns a value at once. Must own to *write*, not to read. To implement this, each cache block is going to get some extra bits that describe a state machine. Two halves: changes in state in response to local actions, and changes in state in response to remote actions. On an UMA machine, serialization still happens on the bus. These actions will be atomic. Start state: INVALID. When INVALID: On local read --> bring this line in, and transition to SHARED. On local write --> bring this line in, and record line as MODIFIED. MODIFIED means 2 things: I own it, and I have most recent copy. When SHARED: On local write ("upgrade" from shared to exclusive, uses the bus) --> MODIFIED. This is a fourth C for cache misses: contention or consistency miss. (It's a miss because you have to wait for the bus.) This is conservative consistency control -- only one processor can update the line at a time. It's also called a "read for write". When MODIFIED: On local writeback (this *could* happen without evicting the line from the cache) --> OWNED. When OWNED: On local write --> MODIFIED. No bus traffic. Sometimes, MODIFIED and OWNED are combined into EXCLUSIVE because an external observer can't tell the difference. State machine description for remote actions: (same states) When INVALID: When SHARED: On remote write, evict and --> INVALID. When MODIFIED: On remote write, we transition to --> INVALID. ***and spit the value out on the bus because of false sharing possibility*** On remote read, fill in value and --> MODIFIED. When OWNED: On remote write, evict and --> INVALID. On remote write, we downgrade to SHARED. Write sharing *requires* bus traffic. Consider the following: int c1 = 0; //Thread 1 global int c2 = 0; //Thread 2 global These aren't shared, but they're contiguous in memory, and so the processors might be writing to different parts of the same cache line. (This is called "false sharing".) This can result in a lot of bus traffic -- each write to this cache line generates traffic! The end result is that this cache consistency protocol puts additional pressure on the bus over the memory pressure. Worse, memory pressure is increasing faster than bus capacity, so UMA is only going to take us so far. Eventually, we end up with NUMA: In addition to CPU, I/O, & memory on the bus, we also throw a "director" in there. All "collections" are connected to a really big switch (in the networking sense). The "CPU" in a collection might itself be parallel. If there's shared memory, we do ccNUMA (cache-coherent NUMA). It's still consistent, but latency depends how "close" processors are on the bus. We've got to worry about data placement. There's also NORMA, which amounts to giving up on this consistency crap. Programmer must explicitly get data; this is aka message passing. Expectation is that we'll focus on shared memory. Notion of speedup: If you have a program that's 80/20 parallelizable (80% can be parallelized), how much faster do things get when you add processors. 1->2 CPUS: .2 + .8/2 = .6 2->4: .2 + .8/4 = .4 4->8: .2 + .8/8 = .3 Given a fixed problem size, we get diminishing returns on additional processors, even if the parallelizable part is embarassingly parallel. What we worry about in terms of making things better: 1) Raising the "parallel ratio" 2) Reduce the grain size. Grain size: smallest unit of work we can get out of the problem. 3) Reduce communication between parallel parts. Sadly, this is "better, faster, cheaper: pick 2". That is, these ways to improve things are in conflict with each other. "Package plan analogy" can go to a restaurant and pay for things a la carte or just give a pile of money up front and do all-inclusive.