Last Time:
* Body splitting revisited (applies to scan and reduce)
  - Takeaways: bodies aren't always split when ranges are split; they tend to
    split during work stealing. A body is guaranteed to see subranges in
    consecutive order from left to right at most once.
* Reader's Digest of OpenMP

Today:
* NORMA
* MapReduce

SMP = UMA, worse is NUMA, worse still is NORMA.
NORMA = no remote memory access.

So far, all of our programming models have assumed SMP: one global shared
memory with uniform costs (modulo caching). We've mostly ignored placement
thus far. On a NORMA architecture, we have to send messages to get data from
other nodes, which is kind of a bummer compared to just sharing an address
space.

With UMA, we worried about:
* How to parallelize
* Load balancing
* Reducing dependencies whenever possible

With NORMA, we add:
* Data distribution
* Result collection
* Fault tolerance (partial failure)
* Everything from before is a lot harder

Specialized vs. commodity

Commodity = Best Buy. Solve your problems by buying things off the shelf
(Internet, PCs, IDE disks, etc.)
Advantages:
* Commodity price curve (nobody actually reads machine specs, so we just want
  to minimize price! Commodity markets are really efficient!)
Disadvantages:
* Extremely large scale.
* Failure is constant. (Some fraction of your nodes will be dead at any given
  time, and some fraction will die during a given computation.)

Specialized = clever, focus on the interconnect. All kinds of fancy stuff in
this space. Almost all of the work here is on increasing bisection bandwidth
(bandwidth across an arbitrary division of the machine). The clever
specialized world wins when there are lots of cross-node dependencies.

MapReduce model:
Input is a lot of (key, val) pairs. (Pretty much any input can be viewed
this way.)

  input -> MAP -> REDUCE

MAP: for each (k, v), produce a *list* of new (k', v') pairs.
REDUCE: given one k' and all associated v', produce a (potentially smaller)
list of values.
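
To make the MAP and REDUCE signatures concrete, here is a minimal
single-process sketch using word count as the example. The names map_fn,
reduce_fn, and run_mapreduce are made up for illustration, and nothing here is
distributed; it only shows the shape of the two functions and the grouping by
k' that happens between them.

```python
from collections import defaultdict

def map_fn(key, value):
    """MAP: one (k, v) pair in, a list of (k', v') pairs out."""
    # key is a document name (unused here); value is the document text
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """REDUCE: one k' and all of its v' in, a (smaller) list of values out."""
    return [sum(values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Collect every intermediate v' under its k'
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    # Reduce each key independently, visiting keys in sorted order
    return {k2: reduce_fn(k2, groups[k2]) for k2 in sorted(groups)}

docs = [("a.txt", "the cat sat"), ("b.txt", "the dog sat down")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Each call to map_fn and each call to reduce_fn depends only on its own
arguments, which is what lets a real system spread them across nodes.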

To specify a MapReduce computation, you specify:
* Input file(s)
* A subtype of Mapper (Partitioner)
* A subtype of Reducer
* Output files

Global flow
* Start with M map tasks and R reduce tasks (M and R can be specified by the
  user)
1) Input files are chunked into M "splits" - each split is one map task
2) Replicate the program onto 1 master and W worker nodes. (M >> W) (R > W)
3) Each map task reads its tuples and maps them (embarrassingly parallel)
4) Results are appended to 1 of R "partitions"
   Obvious partitioner function: Hash(key) % R
   - has to guarantee each v' for a given k' ends up in the same bucket
5) Mapper informs Reducer that it has results ready.
6) Reducers acquire their full set of data from each Mapper.
   - sort on the key
   - reduce

Meanwhile, the Master watches the whole thing. It knows whether each task is
IDLE, RUNNING, or COMPLETE, and if RUNNING or COMPLETE, which worker had it.
It also holds and reports the temporary file locations.

We have four main challenges here (in a slightly different order from the
paper):
1) File/computation locality
2) Grain size
3) Stragglers (a consequence of commodity boxen)
4) Fault tolerance

GFS = GoogleFS in this context
GFS is based around large append-only files
* Large file -> chunks
  - each chunk is replicated N ways.

This replication is strictly for availability. It does cool things like
replicating onto different physical racks, because machines in a rack share
power supplies. The replica placement is technically below the filesystem
abstraction, but there's a way to query it. It's exposed because the master
wants to assign map tasks to workers that already hold that task's file
chunks. If we can't find such a node, we'd like the worker to be near the
chunks, which depends on network topology.
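
The locality preference described above can be sketched as a three-tier
choice. This is only an illustration under assumptions: pick_worker, the
rack_of table, and the "local" / "same-rack" / "remote" labels are
hypothetical, and "near" is approximated as "same rack", which is just one
possible reading of network topology.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Choose an idle worker for a map task, preferring data locality.

    replica_hosts: set of hosts holding a replica of the task's input chunk
    idle_workers:  non-empty list of hosts free to run a task
    rack_of:       dict host -> rack id (hypothetical stand-in for topology)
    """
    # 1) Best case: an idle worker already stores the chunk
    for w in idle_workers:
        if w in replica_hosts:
            return w, "local"
    # 2) Next best: an idle worker on the same rack as some replica
    replica_racks = {rack_of[h] for h in replica_hosts}
    for w in idle_workers:
        if rack_of[w] in replica_racks:
            return w, "same-rack"
    # 3) Otherwise take any idle worker; the chunk must cross racks
    return idle_workers[0], "remote"

rack_of = {"h1": "r1", "h2": "r1", "h3": "r2", "h4": "r3"}
replicas = {"h1", "h3"}
```

With these tables, pick_worker(replicas, ["h4", "h2"], rack_of) picks h2,
because it shares rack r1 with the replica on h1.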

Onto #2: Granularity
We want M >> W because:
a) |map task| < |GFS block size|
b) it allows load balancing and scattering of work during failure
We want R > W because (Think I have this wrong, need to check notes)
* Runtime O(M+R) (scheduling decisions made by the master)
* Space O(MR) (task state kept by the master)

3) Stragglers
- Workers could be slow. For example, the worker might be doing other
  computations. The machine might also be starting to fail but not yet
  failed.
- The last few M and R tasks take a long time because their worker is "slow"
- Solution: at the end of a phase, replicate the last few Ms and Rs. We'll
  deal with what happens when a duplicate also succeeds under fault
  tolerance.

4) Fault tolerance
- Each worker "heartbeats" to the master
- running/completed -> new worker (???)

Each mapper creates a temp file? The reducer ignores "late map completions".
Somehow, first writer wins here.

R's output file is written atomically to GFS. If there are two instances of
the reducer, the last writer wins.

So we have a disparity: with the mapper, first writer wins; with the reducer,
last writer wins.

If M and R are deterministic and functional, then it doesn't matter which
order the writes happen in, because they produce the same results. If M and R
are nondeterministic, then each reduce task's output still equals the output
of some serial execution, though different reduce tasks may correspond to
different serial executions.

By the way, if the master fails, they just abort the whole thing and restart.
It's hard and expensive to keep 2 masters synchronized.
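
The "last writer wins, and it doesn't matter" argument for reducer output can
be sketched on a local filesystem. This is an analogy, not GFS: commit_output
and reduce_counts are hypothetical names, and os.replace stands in for the
atomic commit (Python documents it as atomic on POSIX when source and
destination are on the same filesystem).

```python
import os
import tempfile

def reduce_counts(values):
    # Deterministic and functional: same input always gives the same output
    return str(sum(values))

def commit_output(final_path, contents):
    """Write to a private temp file, then atomically rename it into place."""
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(contents)
    # Readers see either the old file or the new file, never a partial write
    os.replace(tmp_path, final_path)

workdir = tempfile.mkdtemp()
out = os.path.join(workdir, "part-00000")
values = [3, 1, 4]

# The original reducer and its backup copy both finish and both commit
commit_output(out, reduce_counts(values))
commit_output(out, reduce_counts(values))

with open(out) as f:
    result = f.read()
```

Because both instances compute the same contents, the second rename simply
overwrites the file with identical data, so which writer is "last" is
unobservable.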