Last time:
* Hardware basics
  - UMA/SMP
  - NUMA, ccNUMA
  - Both of the above provide a "consistent" memory image across processors
    * Sequential consistency: all processors agree on write order
  - NORMA - no single global address space; this is message passing

On Friday:
* Talked about kinds of projects shooting for in the course
  - 2 or 3 self-contained exploratory projects, while in parallel we decide on longer term projects in the course.
  - No group formation until the third week
  - Short projects: 1 with threads, 1 with TBB
  - Long projects: missed first point, second point is shared memory parallelism

Today:
* Intro threads

Recall picture from last time: moving from one thread to multiple threads (each thread gets its own CPU/registers/stack).

Parallelizing matrix multiplication (A * B = C w/ A,B matrices)

c_ij is the dot product of the ith row of A and the jth column of B.

Program under consideration (paradigm similar to MapReduce from Google):

    Read A,B from disk.
    multiply(A,B,C)
    Write C to disk.

Serial impl of multiply:

    for i from 0 to n-1
        for j from 0 to n-1
            C[i,j] = vMult(A,B,i,j)

Add extra args to make it just do a chunk (rows Up..Lo, columns Left..Right):

    multSub(Up, Lo, Left, Right, A, B, C)
        for i from Up to Lo
            for j from Left to Right
                C[i,j] = vMult(A,B,i,j)

We parallelize the computation the obvious way, taking advantage of the fact that no element of C shares anything with any other element. Each thread calculates a chunk of the necessary vector products.

HOMEWORK FOR ME: compare this with the 477 carving I have in mind in Haskell or TBB.

thread_create(func, arg) is sort of like a parallel call to func(arg).

    typedef void *(*body_t)(void *);

So we have to get multSub to look like a body_t:

    struct MultSubArgs {
        int upper, lower, left, right;
        Matrix A, B, C;
    };

    void * mSub(void * vp)
    //REQUIRES: vp points to a MultSubArgs
    {
        MultSubArgs * ap = (MultSubArgs *)vp;
        multSub(ap->upper, ap->lower, ap->left, ap->right, ap->A, ap->B, ap->C);
        return NULL;
    }

Oddity: threads can return values, but there's probably no one waiting for that value. We'll get there.

Carving the array up to parallelize is pretty easy to imagine.
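
A minimal sketch of the chunked multiply and its body_t wrapper. The row-major Matrix type, the half-open index ranges, the use of pointers inside MultSubArgs, and the helper multiplyInBands are all my assumptions, not something given in lecture. The bands are run one after another through the body_t pointer just to show the shape of the carving; no threads are created here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical square, row-major matrix; the lecture never defined Matrix.
struct Matrix {
    int n;
    std::vector<double> data;
    explicit Matrix(int n_) : n(n_), data(static_cast<std::size_t>(n_) * n_, 0.0) {}
    double& at(int i, int j) { return data[static_cast<std::size_t>(i) * n + j]; }
    double at(int i, int j) const { return data[static_cast<std::size_t>(i) * n + j]; }
};

// Dot product of row i of A with column j of B.
double vMult(const Matrix& A, const Matrix& B, int i, int j) {
    double sum = 0.0;
    for (int k = 0; k < A.n; ++k) sum += A.at(i, k) * B.at(k, j);
    return sum;
}

// Fill rows [upper, lower) and columns [left, right) of C.
void multSub(int upper, int lower, int left, int right,
             const Matrix& A, const Matrix& B, Matrix& C) {
    for (int i = upper; i < lower; ++i)
        for (int j = left; j < right; ++j)
            C.at(i, j) = vMult(A, B, i, j);
}

// Pointers instead of copies, so every chunk writes into the same C.
struct MultSubArgs {
    int upper, lower, left, right;
    const Matrix* A;
    const Matrix* B;
    Matrix* C;
};

typedef void* (*body_t)(void*);

// REQUIRES: vp points to a MultSubArgs
void* mSub(void* vp) {
    assert(vp != nullptr);
    MultSubArgs* ap = static_cast<MultSubArgs*>(vp);
    multSub(ap->upper, ap->lower, ap->left, ap->right, *ap->A, *ap->B, *ap->C);
    return nullptr;
}

// Carve C into `count` horizontal bands and hand each band to `body`.
// A threaded version would pass `body` and `&args` to thread_create instead.
void multiplyInBands(const Matrix& A, const Matrix& B, Matrix& C,
                     int count, body_t body) {
    for (int t = 0; t < count; ++t) {
        MultSubArgs args = { t * A.n / count, (t + 1) * A.n / count,
                             0, A.n, &A, &B, &C };
        body(&args);
    }
}
```

Carving by rows only keeps the arithmetic simple: band t covers rows t*n/count up to (t+1)*n/count, so the bands tile C with no overlap even when count does not divide n.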
The main thread can't write C to disk right after spawning the threads, because it has no way of knowing when the child threads are done.

Before we get THERE, there are some practical issues with creating threads, because there are two ways of implementing them: user and kernel threads.

* User threads are very lightweight - they can be created and destroyed without making system calls. However, user threads only keep one CPU busy, and they ALL block when one of them makes a blocking syscall.
* Kernel threads need syscalls to create, but will keep multiple processors busy and don't all block when one makes a syscall.

In pthreads, the distinction is "system scope" versus "process scope" (PTHREAD_SCOPE_SYSTEM vs. PTHREAD_SCOPE_PROCESS). We'll mostly want system threads in this course.

What we know...
 - Within 1 thread, events happen in program order.

That's about it. You can't say *anything* about relative execution order between threads.

(You can create threads with more assumptions: non-preemptive threads get to keep the CPU away from other threads in the process once scheduled, but they "only encourage laziness and sloppiness on the part of programmers".)

carve() creates "int count;" threads. (it fills in a global, boo) (it also initializes "int done;" to zero)

At the end of multSub, done is incremented. When all threads calling multSub are done, done == count.

Before we write to disk in the main thread:

    while(done < count) sleep(1);

This doesn't work because "done++" probably ain't atomic. It's really a load, an add, and a store, so two threads can read the same old value and one of the increments gets lost. Then done never reaches count and the main thread waits forever.

We need to enforce an ordering constraint: don't write until all threads are done. In 482, we would do this with a condition variable, but the sane way to do it here is with join().

join(tid,result) blocks the caller until the thread with ID tid exits or returns. result is the thread return value.

I think I missed something about "joinability" of a thread function and the necessity of indicating that in RME clauses.

Instead of our old model, carve needs to keep track of an array of count thread ids.

    carve()
    ...
    for thread in threads:
        join(thread)
    write_to_disk()

A thread that expects to be joined holds resources until it's actually joined. To prevent this, we can create threads in the detached state by passing an attribute argument (detach state set to PTHREAD_CREATE_DETACHED) to pthread_create(), or we can detach them later with pthread_detach(). A detached thread can no longer be joined. (the opposite of DETACHED is JOINABLE)

Let's suppose that Scott, being quite clever, has convinced his compiler to do "done++" atomically. An action is "atomic" iff an external viewer can never see "part" of the action completed.

Now consider:

    while(done < count) sleep(1);

Well, this still might not work if the compiler is overly ambitious and puts done in a register, so the loop never sees the other threads' updates. (yay C++)

To fix, make done volatile, which forces a real load from memory on every pass through the loop, but I'm pretty sure There Be Dragons here. Volatile only stops the compiler from reordering volatile accesses relative to each other. It doesn't make "done++" atomic, it doesn't order the surrounding non-volatile accesses, and it's not a hardware memory barrier.
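
A minimal sketch of the join-based wait using pthreads. The work here is a stand-in (each thread sums a slice of integers instead of multiplying matrices), and the names SliceArgs, sumSlice, and carveAndJoin are hypothetical. The std::atomic counter is an assumption on my part (C++11), included only for contrast with the volatile discussion: its increment really is indivisible, but the main thread still relies on join rather than polling the counter.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>
#include <vector>

// Per-thread arguments and result slot (hypothetical stand-in for MultSubArgs).
struct SliceArgs {
    const int* data;
    int begin, end;   // half-open range [begin, end)
    long sum;         // written only by the thread that owns this slot
};

// Atomic replacement for the "int done;" global: fetch_add cannot lose updates.
std::atomic<int> done(0);

// REQUIRES: vp points to a SliceArgs; the thread was created joinable.
void* sumSlice(void* vp) {
    SliceArgs* ap = static_cast<SliceArgs*>(vp);
    ap->sum = 0;
    for (int i = ap->begin; i < ap->end; ++i) ap->sum += ap->data[i];
    done.fetch_add(1);
    return nullptr;
}

// Spawn `count` threads, remember their ids, and join every one of them
// before touching the results.
long carveAndJoin(const std::vector<int>& data, int count) {
    std::vector<pthread_t> tids(count);
    std::vector<SliceArgs> args(count);   // sized up front so addresses stay valid
    const int n = static_cast<int>(data.size());

    for (int t = 0; t < count; ++t) {
        args[t] = { data.data(), t * n / count, (t + 1) * n / count, 0 };
        int rc = pthread_create(&tids[t], nullptr, sumSlice, &args[t]);
        assert(rc == 0);
        (void)rc;
    }

    long total = 0;
    for (int t = 0; t < count; ++t) {
        // Blocks until thread t has returned; its writes are visible afterwards.
        int rc = pthread_join(tids[t], nullptr);
        assert(rc == 0);
        (void)rc;
        total += args[t].sum;
    }
    return total;   // the point where write_to_disk() would be safe
}
```

Each thread writes only its own SliceArgs slot, so the results need no locking; the join is what makes reading them afterwards safe.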