EECS 498 - Concurrent & Parallel Systems WARNING: "I am making this up as I go." Best possible outcome: will know at the end how not to teach this course. "Think Parallel" - whenever you're doing something and there are multiple people around, try to keep them all busy. Sorting 280 exams: Do radix sort on first "digit", we've already divided the exam space mostly by uniqname. we also divide the letter space. Even stronger: Think Parallel AT SCALE. Scaling: as we increase processor count, how does performance improve? | / | / | / | / |/ ====== Ideally linear, but we don't get there because some work is serial & coordination between processors is overhead. Grading 280 exams: A few ways to divide this up. 1) Task Parallel: give each GSI 1 question to grade. Every GSI touches every exam BUT does something different! Usually, task parallelism shares a bunch of shared data. Works well if we actually have shared memory. We eventually set up pipelines: everyone grades a portion, people are the "question X" unit of the pipeline. We can assign multiple GSIs to one question (multiple processors to one task) if that question is the hardest. 2) Data parallelism: divide exams N ways, where N is the number of CPUs, and each person grades all of 1 pile. Data is totally partitioned and the tasks are identical. Also not necessarily optimal. This can cause cache misses: GSIs can hold an entire question in cache, but it's possible that the exam doesn't fit entirely in cache. Even if it DOES fit in cache, you incur the extra misses to fault in the larger pieces of data. So, a lot of what we're doing this term is combining the two above approaches. Again, RIGHT NOW TODAY, whenever thinking about doing something with one other person, think about how to keep both people maximally busy. Also think about whether you're data parallel or task parallel. Think about cache effects in your life. Fortunately, there are some easy ways of exploiting parallelism ("low-hanging fruit"): [partially because we now have more than one processor all the time] * The easiest way to do it is run more than one process at the OS level. Presto: parallelism! (Btw, a process is the running incarnation of a program, and it's the unit of sharing and of resource allocation. That is, within a process, everything is shared.) * Embarassingly parallel problems: (TBB book talks about this) don't be embarassed about these, it's a misnomer. It's a problem with little or no serial work and little or no communication overhead, but it does have shared state. Rant: computer scientists, for some reason, think that things that are really complicated must be really good. Evidence: see most of the artifacts we build. This is just stupid. As an example of a problem that's neither completely partitioned nor embarassingly parallel, consider producing a written document as part of a group. You can try partitioning the document, but you have to make sure it's coherent afterward. One way to solve that in technical writing is to agree on the interface between the parts. Worse, if you're doing this well, you have to read the document and make sure all the sections agree. Incidentally, if you throw everyone in a room to do this, you're thinking about putting all the processors on the same die, which DOES help. You could use a transaction model (see: shared codebase, change tracking). Human-level locking: one person says "I own the document, I will make changes and redistribute them." Sometimes people fail at locking. Example: tenure casebooks have very baroque rules about formatting to make the readers' jobs easy. One criterion we use about faculty members is the degree to which they are training graduate students, which is measured by publication. In the casebook, names of grad students are formatted so as to make everyone's status (YOURS, supervised jointly, graduated, undergrad) obvious, but it's hard to get right. Brian sent out a doc to the admin and said not to change it and she did anyway. [yay digressions?] Fundamental idea we'll be developing over time: we have some collection of processing units which have processing power & onboard storage (registers and hopefully cache). The processors are connected in some way to some shared state, and it could even be partitioned so that they only have direct access to some of the shared state. The intuition: for a running process, there are a small number of pieces of state that describe it (the registers and the stack). By contrast, in a concurrent system, each processor has its own copy of registers and the stack. Asynchronous I/O doesn't really keep multiple processors busy! We're working mostly with UMA (Uniform Memory Access) machines, we'll do a bit with cache-coherent NUMA as well. To make all this work, we need to use some amount of rigor. (Incidentally, as we get more interesting in the curriculum, students' code starts to look better. The longer it takes to realize you've coded yourself into a corner, the worse off you are. The same applies to realizing you're not in the top 10%.) Anyway, let's appeal to some material you probably forgot from 280: When building an ADT, the first thing you do is define the set of operations over abstract state. Second, choose a representation and identify representational invariants. (Yes, he brought up IntSet and SortedIntSet.) Finally, write the actual methods. Analagously, when thinking in parallel, we first think about where to find concurrency (can we divide the work up?). We have in mind abstract algorithms that we've got in libraries (TBB) which we'll be stealing as much as possible. Then we choose concrete implementations of the abstract hocus pocus, and finally implement and test. ---End intellectual introduction to the course--- So, what are we actually going to do? [Excellent question. Warning again: this is completely an experiment. Think graduate-level class.] Tools: * Pthreads * Intel TBB - Use as many shiny Intel tools as you like - VTune - Thread analyzer (can't tell 482 people, will find races, deadlocks, etc.)