Last time:
* Generic Programming via templates -> complex dependencies
* Parallel For Concepts:
  Body
  - Copy constructor
  - Destructor
  - void Body::operator()(Range & r) const

Btw, 35%-40% of the class has taken 381 -- last lecture was repetitive for
that portion.

Range concept:
* Copy ctor, dtor
* bool empty() const
* bool is_divisible() const
* Range::Range(Range & r, split)

Applying the splitting constructor to a range modifies the range to be its
"left half" and creates a new range that's the right half.

parallel_for(const Range &, const Body &)

parallel_for recursively divides Ranges in what is probably the obvious
manner (split until CPUs are saturated). I recommend groveling through the
relevant chapter of the TBB book if you didn't show up for this, as I can't
be arsed to draw a pretty tree for you.

Example time!

    void apply(double a[], int size, double (*fn)(double)) {
        for(int ii = 0; ii < size; ii++)
            a[ii] = fn(a[ii]);
    }

Let's make this parallel_for-able.
1. Define a Range.
2. Define a Body.
3. Rewrite apply to use parallel_for.

Yay.

Instead of using blocked_range, we'll write a simple Range today.

    #include "tbb/parallel_for.h"  // tbb::parallel_for and tbb::split

    class MyRange {
    public:
        int start;
        int limit; //half-open interval [start, limit), so limit >= start

        MyRange(int s, int l) : start(s), limit(l) {}
        //default copy ctor and dtor will be fine.

        bool empty() const { return start == limit; }

        // bool is_divisible() const
        // { return limit - start > 1; }
        // we can write is_divisible to do something clever like restrict
        // ranges to bigger than 1000 elements
        bool is_divisible() const { return limit - start > 1000; }

        //splitting ctor: the new range becomes the right half,
        //other shrinks to the left half.
        MyRange(MyRange & other, tbb::split) //REQ: other.is_divisible()
            : start(other.start + (other.limit - other.start) / 2),
              limit(other.limit)
        { other.limit = start; }

        //bnoble impl:
        //  int m = (other.start + other.limit) / 2;
        //  start = m;
        //  limit = other.limit;
        //  other.limit = m;
        //NOTE: my midpoint formulation is the necessary one if start and
        //limit are pointers (you can subtract two pointers, but you
        //can't add them).
    };

This wasn't implemented inline in class, dunno why.
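To make the "left half / right half" business concrete, here's a minimal sketch of the splitting constructor that compiles without TBB. DemoRange and split_tag are hypothetical names made up for this sketch; split_tag just stands in for tbb::split, and the assert spells out the REQ comment from MyRange.

```cpp
#include <cassert>

// Hypothetical stand-in for tbb::split, so the sketch builds without TBB.
struct split_tag {};

// Hypothetical copy of MyRange, using split_tag instead of tbb::split.
class DemoRange {
public:
    int start;
    int limit;  // half-open interval [start, limit)

    DemoRange(int s, int l) : start(s), limit(l) {}

    bool empty() const { return start == limit; }
    bool is_divisible() const { return limit - start > 1000; }

    // Splitting constructor: the new range takes the right half,
    // and `other` is shrunk in place to the left half.
    DemoRange(DemoRange & other, split_tag)
        : start(other.start + (other.limit - other.start) / 2),
          limit(other.limit)
    {
        // `other` has not been modified yet, so this checks the
        // precondition on the range we were asked to split.
        assert(other.is_divisible());
        other.limit = start;
    }
};

// Number of elements in a range.
int size_of(const DemoRange & r) { return r.limit - r.start; }
```

Splitting DemoRange(0, 4000) leaves the original as [0, 2000) and makes the new range [2000, 4000). With an odd size the extra element lands in the right half: [0, 1001) splits into [0, 500) and [500, 1001).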
    class MyBody {
        double * a; //non-owning pointer to array
        double (*fn)(double);
    public:
        MyBody(double * arr, double (*func)(double)) : a(arr), fn(func) {}
        //default copy ctor and dtor are OK

        //apply fn to every element in [range.start, range.limit)
        void operator()(const MyRange & range) const {
            for(int ii = range.start; ii < range.limit; ++ii) {
                a[ii] = fn(a[ii]);
            }
        }
    };
    //NOTE: if you're feeling the "Range ought to hold iterators,
    //especially if they're Random Access Iterators" design jumping out at
    //you, tbb::blocked_range totally can handle that.

    void apply(double a[], int size, double (*fn)(double)) {
        tbb::parallel_for(MyRange(0,size), MyBody(a,fn));
    }

Pseudocode parallel_for implementation:

    parallel_for {
        while(!done) {
            split off chunks to maintain N live chunks of work
                (constructing ranges)
            assign chunks to tasks
            wait for a task to complete
        }
    }

The "obvious" question: why does the range have to be bigger than 1000
elements to split?

Grain size: minimum chunk of work (sequential).

How do you pick a grain size in practice?

TBB provides a class that's a "linear range": blocked_range.
ctor: blocked_range(start,limit,grain). start and limit must be Values.

Value concept:
    copy ctor/dtor
    bool operator<(V i, V j);
    size_t operator-(V i, V j);    //# elts in [i,j).
    Value operator+(V i, size_t k); //kth elt after i

btw blocked_range takes a grain size. What grain size do we pick?

Bad news: there's never just one number.
- Too large -- not enough parallelism. Cores don't do work.
- Too small -- maybe too much parallelism? Swamped by overhead.

Depends on # CPUs: more --> smaller grain size.
Also depends on # of cycles Body::operator() takes per element:
faster --> larger grain size.

Good news: usually, there's a large "sweet spot" of grain sizes that all
work about equally well.

Better news:

    template <typename Range, typename Body, typename Partitioner>
    void parallel_for(const Range & r, const Body & b, Partitioner & p)
    // not sure if the last arg is a reference

The partitioner answers "*should* you split the range?" (is_divisible only
says whether you *can*). bnoble is pretty sure we'll never implement a
partitioner.
The library provides two partitioners: simple (if you can divide, you
should) and auto (divide only "far enough" to keep the cores busy). If you
read the book from front to back, the section on the auto partitioner talks
about tasks and work stealing and trying to balance load, and it doesn't
make sense until chapter 9 or 10.
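To get a feel for how grain size drives the number of chunks under the "if you can divide, you should" rule, here's a sequential sketch. It is not TBB's implementation: Chunk, split_all, and chunks_for are hypothetical names, nothing runs in parallel, and it only records the leaf ranges that repeated halving would produce, assuming a range is divisible when it holds more than `grain` elements.

```cpp
#include <vector>

// Hypothetical half-open index range [start, limit).
struct Chunk {
    int start;
    int limit;
};

// Keep halving [start, limit) while it holds more than `grain` elements,
// mimicking the "if you can divide, you should" rule. Leaves are appended
// to `out` in left-to-right order. Requires grain >= 1.
void split_all(int start, int limit, int grain, std::vector<Chunk> & out) {
    if (limit - start > grain) {
        int mid = start + (limit - start) / 2;
        split_all(start, mid, grain, out);
        split_all(mid, limit, grain, out);
    } else {
        out.push_back(Chunk{start, limit});
    }
}

// Leaf chunks for the range [0, size) with the given grain size.
std::vector<Chunk> chunks_for(int size, int grain) {
    if (grain < 1) grain = 1;  // guard: a grain below 1 would never stop
    std::vector<Chunk> out;
    split_all(0, size, grain, out);
    return out;
}
```

For 8000 elements, a grain of 1000 gives 8 chunks of 1000 elements each, a grain of 1 gives 8000 one-element chunks (lots of overhead per element), and a grain of 8000 gives a single chunk (no parallelism at all). An auto-style partitioner would stop splitting earlier than this once there's enough work to go around.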