Last time:
  * Redid tree version of parallel_while
  * Pipeline (SW: really? I don't recall)

This time:
  * Re: proposals, difference between thread-safe and scalable
  * Revisit tree traversal

Projects due last day of class: Tuesday, April 15
There will be a final! We'll have quick demos of projects scheduled during
finals week.

Anyway, the difference between thread-safe and scalable:

Consider random() / seed(val v)
  1) When you call seed, it sets some static or global state.
  2) Subsequent calls to random manipulate the static/global state.

There are 2 solutions to this from the perspective of correctness:
  1) Wrap all calls with a lock
     This is thread-safe, but not "scalable": every caller serializes on
     the one lock.
  2) Scalable safety
     Two approaches:
     * Decouple entirely
       - srand48_r(long int seed, struct drand48_data *buf);
         //create a "rand state" in "buf"
         Each thread gets its own drand48_data.
       - Memory allocation uses a single pool of free space.
         Clearly not re-entrant. You can wrap it with locks so it's safe
         but not scalable, but there's another problem: if you're
         allocating things that are smaller than cache blocks, you can
         easily create false sharing.

TBB provides a scalable allocator with per-thread pools. It *may* still have
false sharing.
  example: false sharing could matter in pipelining.

TBB also provides a cache-aligned allocator, which guarantees that any two
things you've allocated will never experience false sharing. The downside is
higher memory pressure. This is accomplished by making the minimum allocation
N cache lines, where N is a small integer.

In the book, the conventional wisdom is to start with the scalable allocator
and see if switching to the cache-aligned allocator speeds things up.

A few words about TBB containers:

One thing interesting about TBB is the tradeoffs they made. There's a
question of interface purity vs. safety and scalability. For example, in
linked structures it's very painful to have accessors to the first or last
element.
For example, TBB's queues just don't have front() and back(). The iterators
are both painful and not thread-safe. "If you sneeze sideways, these
iterators are garbage, so they didn't even bother making them fast!"

Regarding the queue:
  * pop() in the STL always returns. pop() in TBB blocks if empty (like SafeQ).
  * size() in the queue is now signed -- size of -n means that there are
    n waiters.
  * pop_if_present() is a non-blocking pop.
  * concurrent_queue may be bounded or unbounded.
    When unbounded: push always succeeds (but memory pressure if push
    threads are faster than pop threads).

concurrent_vector:
  An element's location in memory WILL NOT change (not guaranteed in
  std::vector). The TBB version is an unrolled linked list -- links of
  blocks. The disadvantage is that serial operations are slower.

concurrent_hash_map:
  Writing a safe hash map is actually really tricky. Typical use is to write:

      T & v = m.find(k);
      //...
      v = foo(/*...*/);

  In this example, making the hash map a monitor isn't even correct! (The
  monitor's lock is released when find() returns, but v is still used
  afterwards.) You could lock the table at the beginning of the example and
  unlock at the end, but that's really bad for scalability.

  Instead, the concurrent_hash_map provides smart references. It has two
  types called an accessor and a const_accessor.
  (this is grossly oversimplified; read the chapter for actual usage)

      concurrent_hash_map::accessor v = m.find(k);
      //obtains a write lock on the ELEMENT m[k]

  In effect, we've got per-element locking. const_accessor gives a reader
  lock, whereas accessor gives a writer lock.

A problem:

    struct tree {
        elt * elp;
        tree * left;
        tree * right;
    };

    void apply(tree * t, elt(*fn)(elt)) {
        //REQ: t != 0
        *t->elp = fn(*t->elp);   //fn maps elt to elt, so go through the pointer
        if(t->left)
            apply(t->left, fn);
        if(t->right)
            apply(t->right, fn);
    }

Recall we did this with parallel_while, which is mostly useful for things
that aren't recursively divisible in constant time. Let's think about
dividing the tree in constant time.

Reminder: Range was a passive entity that described the limits of work.
Body was an active container that
  1) knew how to apply work to any subset of the container
  2) did all the work.

Well, we can apply fn() during Range::split -- actually doing work!

If we want the Range to be able to tell when to stop splitting, the tree has
to carry its size.

    struct tree {
        elt e;
        size_t cnt;    //# nodes in this tree, incl. this
        tree *l, *r;
    };

    class TRange {
    public:
        tree *t;
        size_t lim;
        elt (*fn)(elt);

        TRange(tree * t, size_t grain, elt(*fp)(elt))
            : t(t), lim(grain), fn(fp) {}

        //a split may leave t == 0 (missing child), so check before using it
        bool is_divisible() const { return t && t->cnt > lim; }

        //splitting constructor: start from rhs's state, do the root's work,
        //then rhs keeps the right subtree and this range takes the left
        TRange(TRange & rhs, split)
            : t(rhs.t), lim(rhs.lim), fn(rhs.fn) {
            t->e = fn(t->e);
            rhs.t = t->r;
            t = t->l;
        }

        //the TBB Range concept names this empty(), not is_empty()
        bool empty() const { return !t; }
    };

    class TBody {
    public:
        void operator()(TRange & r) const {
            //serial apply from before (rewritten for the e/l/r fields);
            //it requires t != 0, so skip empty ranges
            if(!r.empty())
                apply(r.t, r.fn);
        }
    };

    void parApply(tree * t, elt (*fp)(elt)) {
        TRange r(t, 1, fp);
        TBody b;
        parallel_for(r, b, auto_partitioner());
    }
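
To see that splitting-with-work visits every node exactly once, here is a
minimal sketch that can be run without TBB. It is not the library's
scheduler: drive() is a hypothetical serial loop that only imitates
"split while divisible, then run the body", and split_tag, TRangeSketch and
the typedef of elt to int are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

typedef int elt;                  // assumption: the notes never define elt

struct tree {
    elt e;
    std::size_t cnt;              // # nodes in this subtree, incl. this one
    tree *l, *r;
};

struct split_tag {};              // hypothetical stand-in for TBB's split type

struct TRangeSketch {
    tree *t;
    std::size_t lim;
    elt (*fn)(elt);

    TRangeSketch(tree *t_, std::size_t grain, elt (*fp)(elt))
        : t(t_), lim(grain), fn(fp) {}

    bool empty() const { return !t; }
    bool is_divisible() const { return t && t->cnt > lim; }

    // Splitting constructor: copy rhs's state, do the root's work, then
    // rhs keeps the right subtree and the new range takes the left one.
    TRangeSketch(TRangeSketch &rhs, split_tag)
        : t(rhs.t), lim(rhs.lim), fn(rhs.fn) {
        t->e = fn(t->e);
        rhs.t = t->r;
        t = t->l;
    }
};

// Serial apply, written to tolerate a null subtree.
void apply(tree *t, elt (*fn)(elt)) {
    if (!t) return;
    t->e = fn(t->e);
    apply(t->l, fn);
    apply(t->r, fn);
}

// Serial imitation of the split loop: pieces that a real scheduler could
// hand to other threads are simply parked on a work list here.
void drive(TRangeSketch root) {
    std::vector<TRangeSketch> work;
    work.push_back(root);
    while (!work.empty()) {
        TRangeSketch r = work.back();
        work.pop_back();
        while (r.is_divisible()) {
            TRangeSketch left(r, split_tag{});
            work.push_back(left);
        }
        if (!r.empty()) apply(r.t, r.fn);   // the "body" on an indivisible piece
    }
}
```

Each split consumes exactly one node (the root of the range being split), and
every remaining piece is handed to the serial apply once, so no element is
processed twice. The grain size only decides how much work is left for the
serial body.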