Last Time
 * Safe vs. Scalable vs. Simplicity
 * Techniques

Coming up: (This is a good 4-6 lectures)
 * Other concurrency systems
   - Transactional memory (hasn't been built -- software is borked, hardware needs vendors)
   - Automatic mutual exclusion
   - OpenMP
 * Tools for correctness
   - RacerX
   - MUVI

Problems with locks:
 - Not composable
 - Coarse vs. fine problem

TBB (OpenMP)
 - Express data parallelism for independent activities
 - Limited in scope; sometimes we don't have independence!
   * locks
   * serialize
   * etc.

Both of these approaches are examples of "conservative concurrency control".

In the past we saw an idiom for caching an atomic value and redoing the operation only if someone interfered.

    atomic<T> foo;
    T _foo, t;
    do {
        _foo = foo;               // cache the current value
        t = do_something(_foo);   // compute from the cached copy, not from foo
    } while (foo.compare_and_swap(t, _foo) != _foo);   // retry if foo changed

This is called optimistic concurrency control. Compare the two:

Conservative
============
 * High overhead in low contention situations

Optimistic
==========
 * High overhead in high contention situations

The observation is that if you've written a parallel program that has high contention, you're an idiot, because it's not going to parallelize effectively. Happily, transactional memory is optimistic.

Transactions come from the database folks, and classically have four "ACID" properties:

 Atomic: either a whole transaction is completed or none of its parts are.
 Consistent: nebulous, almost always application-defined.
 Isolation: no other transaction can see partial updates. (feels a bit like atomicity, is different)
 Durability: we don't care because we're talking about in-memory computation.

Thus, in the context of transactional memory, all we care about is atomicity and isolation. Here's what it looks like:

    begin_transaction; {
        do stuff;
        read some words;
        write some others;
        do more stuff;
        write some more;
    } end_transaction;

 1. start
    - every subsequent store must save the value the location held before the transaction AND the last written value. (*logging*)
    - every access is recorded in the read set or the write set
 2.
    at end either commit (keep only the new values) or abort (restore the old values)

Decision: If all the intersections of (union of my read set and write set) with the write sets of other in-flight transactions are empty, I can commit. Otherwise, we roll back. This is called the *conflict check*.

Optimizations:
 * We can perform the conflict check on every access if it's particularly cheap and abort the transaction early.

Composition is a subtlety: what happens if we have nested transactions?
 * Closed nesting: on inner commit, "flatten" into outer transaction.

There are two main advantages of the transactional memory model.
 * Appealing programming model; composition is easy and there are no lock sets.
 * "fine-grained" locking "for free".

Implementation questions

1) Update in place?
   - Can either stash the old value of a location and update in place OR just stash the new value somewhere and leave the location alone until commit
   bnoble is pretty convinced there's no reason not to update in place

2) Software or hardware? grain size?

If you're implementing in software, you only have the hardware virtual memory mechanisms to rely on. Recall how this works: the hardware uses the page table, and each entry has protection bits. Whenever you start a transaction, you flip the bits to "protected". That way, you'll incur page faults on accesses, and you can have the STM implementation handle the faults.

What's the granularity of access? If we track every word, every access incurs a page fault, which is just AWFUL. We just can't do that, so tracking must be per-page. On reads, we add the page to the read set and leave it write-protected. On writes, we add the page to the write set and set it to unprotected. The problem with page granularity is the false sharing: two transactions that touch unrelated words on the same page look like a conflict.

We can get finer granularity with a hardware implementation, but that's tricky, because hardware is finite. The old/new value set can be very large; we can't dedicate hardware to it, so we have to spill onto the bus. In addition, the read/write *sets* are also large.
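
The commit-time conflict check described above can be sketched in C++. This is only an illustration, assuming the read and write sets are tracked exactly as sets of addresses; the names Transaction, intersects, and can_commit are made up here and do not come from any real TM implementation.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical bookkeeping for one in-flight transaction.
struct Transaction {
    std::set<std::uintptr_t> read_set;   // addresses read so far
    std::set<std::uintptr_t> write_set;  // addresses written so far
};

// True if the two (sorted) sets share at least one address.
inline bool intersects(const std::set<std::uintptr_t>& a,
                       const std::set<std::uintptr_t>& b) {
    auto i = a.begin();
    auto j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (*i < *j) ++i;
        else if (*j < *i) ++j;
        else return true;  // common element found
    }
    return false;
}

// Conflict check: `me` may commit only if (read set union write set)
// is disjoint from the write set of every other in-flight transaction.
inline bool can_commit(const Transaction& me,
                       const std::vector<const Transaction*>& others) {
    for (const Transaction* other : others) {
        if (other == &me) continue;  // never conflict with ourselves
        if (intersects(me.read_set, other->write_set) ||
            intersects(me.write_set, other->write_set)) {
            return false;  // conflict: caller must roll back
        }
    }
    return true;
}
```

Exact sets make the check precise but not cheap; the cost grows with the size of the sets, which is the motivation for the compact summaries discussed next.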
(at this point, we're talking specifically about the Wisconsin implementation, which bnoble *thinks* is the winner)

In Wisconsin, we're spilling the old values into a virtual memory log. Originally, the read/write sets were tracked exactly and spilled if they got too big. The Wisconsin folks decided that this was wrong and decided to use Bloom filters.

A Bloom filter is like a hash table: an N-bit structure represents an "abstract" of the set.

 Rule: if A is in S, then Bit(A) = 1. Converse is NOT necessarily true.
 Number of bits N < total cache lines in the set (cache lines are the elements in STM)

We have to use a hash function to map from inputs to bits; can't just mod by N because strided accesses will be BAD BAD BAD.

So, what's the grain size for these sets? Our goal is to make conflict detection almost free; if we make N the size of a register, then we get a total win because we only need two extra registers per core (one for the read set, one for the write set).

Effectively, with Bloom filters, we have N locks randomly assigned to cache lines. Because programs tend to have local accesses, the accessed lines will get mapped to different bits. So far, the simulations have borne out the theory that fast checking is better than precise checking.

Another characteristic: long transactions are "easy", but short transactions are fast.

The real problem with this is that external actions aren't undoable. For example, printing to the screen and writing to the network would be bad. However, Speculator (by Prof. Flinn and possibly Prof. Chen too, I can't recall) might be applicable here.

If any transaction updates OS shared state, it's a lose. For example, updating the page table base register is a huge conflict.
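
A register-sized Bloom-filter set summary along the lines described above can be sketched in C++. Everything here is an assumption made for illustration, not the Wisconsin design: N = 64 bits, 64-byte cache lines, a single multiplicative hash in place of "mod N", and the hypothetical names Signature, bit_for, insert, may_contain, and may_intersect.

```cpp
#include <cstdint>

// Hypothetical register-sized set summary: N = 64 bits, one hash function.
struct Signature {
    std::uint64_t bits = 0;

    static constexpr unsigned kLineShift = 6;  // assumption: 64-byte cache lines

    // Map an address to one of 64 bit positions via its cache-line number.
    // Multiplicative hashing is an assumption chosen for this sketch so that
    // strided line numbers do not all collapse onto the same few bits, as
    // they would with a plain "line mod 64".
    static unsigned bit_for(std::uintptr_t addr) {
        std::uint64_t line = static_cast<std::uint64_t>(addr) >> kLineShift;
        return static_cast<unsigned>((line * 0x9E3779B97F4A7C15ull) >> 58);
    }

    // Record an access: set the bit for the address's cache line.
    void insert(std::uintptr_t addr) {
        bits |= std::uint64_t{1} << bit_for(addr);
    }

    // If addr's line was inserted, this returns true. A true result may
    // also be a false positive; a false result is always exact.
    bool may_contain(std::uintptr_t addr) const {
        return ((bits >> bit_for(addr)) & 1u) != 0;
    }

    // Conservative conflict check between two summaries: one AND and a
    // compare. False means the underlying sets are definitely disjoint.
    bool may_intersect(const Signature& other) const {
        return (bits & other.bits) != 0;
    }
};
```

With one Signature for the read set and one for the write set, the conflict check becomes a couple of bitwise operations per transaction pair, at the price of occasional spurious aborts when unrelated lines hash to the same bit.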