So far:
  I.  parallel_for
  II. parallel_reduce

The math:
  I.   For all i in S, compute f(a[i]).
  II.  R = the values f(a[i]), for all i in S, combined with OP.
  III. x[] -> y[]:
         y[0] = x[0]
         y[i] = y[i-1] OP x[i]

III is the new one: parallel_scan. It looks serial, but it isn't, because OP is associative. bnoble thinks the book's example of this is wrong, so we'll go through a different one.

  IV.  Quantity *not* known in advance, and *not* recursively divisible in constant time (e.g., a linked list).

This is parallel_while. It's useful for both linear and "exponential" (e.g., trees) structures.

First, parallel_scan. Here's the serial version:

    y[0] = x[0];
    for (int i = 1; i < n; i++)
        y[i] = y[i-1] + x[i];

It looks serial, but because + is associative, we can do it in parallel using two passes. The book does a bad job of explaining this, and you might want to check bnoble's slides for pictures here.

1. Partition the array.

    +---------------+---------------+---------------+---------------+
    |               |               |               |               |
    +---------------+---------------+---------------+---------------+

2. Compute the partial reduction for each subarray.

3. Propagate to the right. For the first partition, propagate the ID (the identity of OP). In the picture, each partition shows the value propagated into it, followed by its own reduction.

      ID        R1    R1        R2    R1+R2     R3    R1+R2+R3  R4
    +---------------+---------------+---------------+---------------+
    |               |               |               |               |
    +---------------+---------------+---------------+---------------+

   Need to use the auto-partitioner for scheduling, because the number of partitions has to be proportional to the number of *threads*. #3 is a serial step.

4. Compute the prefix scans of the partitions in parallel (i.e., the result array). The value propagated into each partition in step 3 is its starter "y[i-1]".

5. Combine RHS -> LHS.

Total expected speedup is a factor of # processors / 2, because the data is traversed in two passes.

Sigh... I need this with silly things like "x" and "y" in it, because I don't see the math that corresponds to the original.

parallel_scan is a templated function of 3 arguments: Range, Scan Body, Partitioner.
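To put "x" and "y" back into steps 1 through 4 above, here is a minimal sketch of the two-pass scheme for + on doubles. It is written serially on purpose and is not TBB code: the helper name two_pass_scan and its parts argument are made up for illustration. The loops over partitions in steps 2 and 4 have independent iterations, so those are the ones a scheduler could hand to different threads; step 3 is the serial one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a TBB API): inclusive prefix sum of x, computed
// with `parts` contiguous partitions and the two-pass scheme from the notes.
std::vector<double> two_pass_scan(const std::vector<double>& x, std::size_t parts) {
    assert(parts > 0);
    const std::size_t n = x.size();
    std::vector<double> y(n);
    if (n == 0) return y;
    parts = std::min(parts, n);  // keep every partition non-empty

    // Step 1: partition p covers the index range [lo(p), lo(p + 1)).
    auto lo = [n, parts](std::size_t p) { return p * n / parts; };

    // Step 2 (independent per partition): partial reduction R_p.
    std::vector<double> r(parts, 0.0);
    for (std::size_t p = 0; p < parts; ++p)
        for (std::size_t i = lo(p); i < lo(p + 1); ++i)
            r[p] += x[i];

    // Step 3 (serial): propagate to the right. Partition 0 receives the
    // identity (0 for +); partition p receives R_0 + ... + R_{p-1}.
    std::vector<double> start(parts, 0.0);
    for (std::size_t p = 1; p < parts; ++p)
        start[p] = start[p - 1] + r[p - 1];

    // Step 4 (independent per partition): scan each partition, seeded with
    // start[p], which plays the role of "y[i-1]" for its first element.
    for (std::size_t p = 0; p < parts; ++p) {
        double so_far = start[p];
        for (std::size_t i = lo(p); i < lo(p + 1); ++i) {
            so_far += x[i];
            y[i] = so_far;
        }
    }
    return y;
}
```

Every element of x is read twice (once in step 2, once in step 4), which is where the "/ 2" in the expected speedup comes from.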
Body:
  1)  Split operator (must respect polarity)
  2a) Pre-process (1st pass) functor
  2b) Post-process (2nd pass) functor
  3)  Reverse join (propagate L -> R) (after first pass, serial)
  4)  Assign (propagate R -> L) (after second pass, parallel)

    struct pre_scan_tag  { static bool is_final_pass() { return false; } };
    struct post_scan_tag { static bool is_final_pass() { return true; } };

Scan Body Concept
  1) Body::Body(Body & b, split)   //b is the new LHS, we are the new RHS
  2) void Body::operator()(Range & r, pre_scan_tag)
  3) void Body::operator()(Range & r, post_scan_tag)
  4) void Body::reverse_join(Body & lhs)
  5) void Body::assign(Body & rhs)

    class SPBody { //sum prefix scan
        double myReduction;
        double preReduction;
        double * resultArr;
        double * srcArr;
    public:
        //arrays don't need sizes b/c range will tell you later
        SPBody(double src[], double dest[])
            : myReduction(0), preReduction(0), resultArr(dest), srcArr(src) {}

        //initializers are listed in member declaration order
        SPBody(SPBody & b, split)
            : myReduction(0), preReduction(0),
              resultArr(b.resultArr), srcArr(b.srcArr) {}

        template <typename Tag>
        void operator()(const blocked_range<size_t> & r, Tag) {
            double soFar = preReduction; //0 on first pass, resultArr[i-1] on second
            for (size_t i = r.begin(); i != r.end(); ++i) {
                soFar += srcArr[i];
                if (Tag::is_final_pass()) {
                    resultArr[i] = soFar;
                }
            }
            //could accumulate straight into myReduction and drop soFar,
            //but there's no need: the compiler will get rid of the local
            //
            //first pass: preReduction is 0, so this is the block's own sum
            //second pass: this is the running total through the end of the block
            myReduction = soFar;
        }

        //lhs's individual reduction was computed in the first pass, and
        //then its preReduction has been computed
        //lhs.preReduction = sum of all blocks to left of LHS
        //lhs.myReduction  = sum of blocks in LHS
        void reverse_join(SPBody & lhs) {
            preReduction = lhs.preReduction + lhs.myReduction;
            //the book goes wrong because it only has one variable for
            //both preReduction and myReduction
        }

        //bnoble would've called it "combine" or "join"
        void assign(SPBody & rhs) {
            myReduction = rhs.myReduction;
            //we can now throw out rhs
        }
    };

Example of use: data stream of observations of some particle detector (particles / unit time vs.
time). Compute a time series of when the particles were detected (particles detected vs. time); this is effectively the integral of the function.

This is the first example where a static grainsize is TOTALLY not portable.

On to the concept of parallel_while. Here's another serial function:

    void apply(list * l, void (*fn)(item * i)) {
        while (l) {
            fn(l->elt);    //process the current element
            l = l->next;   //advance down the list
        }
    }

This is totally different from what we've seen so far, because the list isn't recursively divisible in constant time. (It's divisible in linear time.)

There's a small ray of sunshine if fn is *expensive*, because we can fire off executions of fn in parallel while walking the list serially. (This smells like some of the multicore stuff that happened at the end of 583.)

The problem is that scale is limited by T(fn(elt)) / (T(advance list) + depth). (Depth is the number of fn invocations we can get to happen in parallel.)

Enter parallel_while: dispatch until done. It is a templated class with a run method.

This abstraction requires 3 things:
  1) The type of the items in the list
  2) The function to process 1 item
  3) A way to advance down the list

2 concepts:
  1. Body: arg type & function
  2. Stream: returns elt if there, else false
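The two concepts can be sketched as follows. This is a serial stand-in, not the TBB class itself: Node, ListStream, DoubleElt, and run_until_done are all hypothetical names, and the member names pop_if_present and argument_type are assumed to match what parallel_while expects, so check them against the library headers before relying on them.

```cpp
#include <cassert>
#include <cstddef>

// Plain singly linked list, standing in for the `list` in the notes.
struct Node {
    int elt;
    Node* next;
};

// Stream concept: hands out the next item if there is one, else returns false.
// (pop_if_present is an assumed name; see the lead-in.)
class ListStream {
    Node* cur_;
public:
    explicit ListStream(Node* head) : cur_(head) {}
    bool pop_if_present(Node*& item) {
        if (!cur_) return false;
        item = cur_;
        cur_ = cur_->next;  // the serial "advance down the list" step
        return true;
    }
};

// Body concept: names the item type and supplies the per-item function.
struct DoubleElt {
    typedef Node* argument_type;
    void operator()(Node* n) const { n->elt *= 2; }
};

// Hypothetical serial driver illustrating "dispatch until done".
// Returns the number of items dispatched.
template <typename Stream, typename Body>
std::size_t run_until_done(Stream& s, const Body& body) {
    std::size_t dispatched = 0;
    typename Body::argument_type item{};
    while (s.pop_if_present(item)) {
        body(item);  // a parallel runtime would hand this call to a worker
        ++dispatched;
    }
    return dispatched;
}
```

Only the body(item) calls are candidates for parallel execution; pulling items from the stream stays serial, which is why the approach pays off only when the per-item function is expensive relative to advancing the list.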