*Most of these are works in progress
Concurrency and Multithreading Book (only following the course sections)
Chapter 1 - Introduction
(largely introduces terms and presents the alice and bob, pets in the yard analogy)
Computability:
- Figuring out what can be computed in an asynchronous concurrent environment
Concurrent Program/Concurrent Algorithm:
- The sequence of the thread operations on the object
Program Correctness:
- The specification and verification of what a given program actually does
Safety Property:
- States that some "bad thing" never happens
Liveness Property:
- States that a particular "good thing" will happen
Atomic:
- i.e. Indivisible
- An atomic instruction is, from beginning to end, indivisible
- i.e. A single hardware step
Mutual Exclusion:
- No two entities at once can use a resource
Deadlock-Freedom:
- If one entity wants to use resource, it will eventually succeed.
- If two entities want to use a resource, at least one will eventually succeed.
Starvation-Freedom (lockout-freedom):
- No entity is ever indefinitely prevented from using a resource
Transient vs Persistent Communication:
- Transient: both parties must participate at the same time
- Ex: A conversation. If one party is missing, everything said is lost
- Much less fault tolerant, simpler
- Persistent: Parties need not both be present at the same time
- Ex: An exchange of letters. If one party is out, their mail will be waiting for them when they return
- Fairly fault tolerant, more complex
Producer-Consumer (situation, property):
- Consumer (entity) will never access a "depleted" resource
- Producer (entity) will never add to a "non-depleted" (not yet empty) resource
- The situation exhibited by processor filling communication buffers to be read/transmitted
Amdahl's Law:
- The extent to which any complex job can be sped up is limited by how much of that job must be executed sequentially.
- S : Maximum Speedup
- n : number of processors
- p : fraction of the job that can be executed in parallel
- S is then the ratio of sequential time (1 proc.) and parallel time (n procs)
- S = 1 / ( 1 - p + ( p / n ))
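- Worked example: with p = 0.9 and n = 10, S = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.3, so ten processors yield barely a 5x speedup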
The old "veering to avoid someone in a hallway (and failing) issue" is provably unavoidable claim the authors!
Chapter 2 - Mutual Exclusion
Threads, Events:
- A thread is a "state machine", and its transitions are "events."
- Events are "instantaneous":
- They occur at a single instant of time
- A thread A produces a sequence of events a0, a1, etc.
- An event a, preceding (in time) an event b, is denoted by:
- a -> b "a precedes b"
- An "interval" (a0, a1) is the duration between a0 and a1
- Intervals unrelated by -> (precedence) are considered "concurrent"
Critical Sections:
- A block of code that can be executed by only one thread at a time.
- Usually achieved by means of locks
- Lock before entering critical section, unlock on exit
- Inherently, must be mutually exclusive
Well-Formed Thread:
- A thread using lock/unlock semantics must adhere to "well-formed" format:
- Each critical section is associated with a unique Lock object
- The thread calls lock on entering, and unlock on leaving the crit. section
Good Lock Algo Requirements:
- Let CS(j,A) be the interval during which A executes the crit. section for the jth time.
- Mutual Exclusion:
- Crit. sections of different threads do not overlap.
- EX: For 2 threads A, B, either of the following must be true:
- [CS(k, A) -> CS(j,B)] or,
- [CS(j, B) -> CS(k,A)]
- Freedom from Deadlock (see above for def):
- If thread A calls lock but never gets it, then other threads MUST be completing an infinite number of crit. sections.
- Freedom from Starvation (see above for def):
- Every lock() call eventually succeeds
*NB: Starvation freedom IMPLIES deadlock freedom.
The Filter Lock:
- Designed for n threads where n > 2
- Creates n-1 "waiting rooms", called "levels"
- A thread trying to acquire the lock must traverse each level
- Levels satisfy two properties:
- At least one thread trying to enter level l succeeds
- If more than 1 thread is trying to enter l, at least 1 is blocked
- Deadlock Free!
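A minimal Java sketch of the Filter lock described above. ThreadID.get() is an assumed helper returning a thread index in 0..n-1, and (as in the book's own version) memory-model/visibility subtleties are glossed over:

    class Filter {
      private final int[] level;   // level[i] = highest level thread i is trying to enter
      private final int[] victim;  // victim[L] = most recent thread to enter level L

      public Filter(int n) {
        level = new int[n];
        victim = new int[n];
      }

      public void lock() {
        int me = ThreadID.get();                    // assumed helper: index in 0..n-1
        for (int L = 1; L < level.length; L++) {    // traverse the n-1 waiting rooms
          level[me] = L;
          victim[L] = me;
          for (int k = 0; k < level.length; k++) {  // spin while a conflict exists and I'm the victim
            while (k != me && level[k] >= L && victim[L] == me) {}
          }
        }
      }

      public void unlock() {
        level[ThreadID.get()] = 0;
      }
    }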
Fairness:
- Starvation-Freedom makes no guarantees about how long a thread might wait
- In order to build toward fairness, divide Lock() method into two sections:
- A "doorway" section w/ execution interval D(A) consists of a bounded # of instructions/steps
- A "waiting" section w/ " " W(A) consists of an unbounded number of instructions/steps
- Doorway's boundedness is called "Bounded Wait-Free" progress property
Fairness: A lock is "first-come first-served" if, whenever A finishes its doorway before B starts its doorway, A cannot be overtaken by B.
Lamport's Bakery Algo:
- Maintains the first-come first-serve fairness property
- Emulates the numbered ticketing systems at delis
- Each requestor takes a number, and the lowest number enters if able
- Deadlock Free!
- Satisfies mutual exclusion
- Round Robin label (ticket #) checking would scale very badly...
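A minimal Java sketch of the Bakery lock described above, under the same assumptions (ThreadID.get() helper, visibility details ignored):

    class Bakery {
      private final boolean[] flag;  // flag[i]: thread i wants the lock
      private final int[] label;     // label[i]: thread i's ticket number

      public Bakery(int n) {
        flag = new boolean[n];
        label = new int[n];
      }

      public void lock() {
        int me = ThreadID.get();                   // assumed helper: index in 0..n-1
        flag[me] = true;
        label[me] = maxLabel() + 1;                // doorway: take the next ticket
        for (int k = 0; k < flag.length; k++) {    // waiting: defer to smaller (label, id) pairs
          while (k != me && flag[k] &&
                 (label[k] < label[me] || (label[k] == label[me] && k < me))) {}
        }
      }

      public void unlock() {
        flag[ThreadID.get()] = false;
      }

      private int maxLabel() {
        int max = 0;
        for (int l : label) max = Math.max(max, l);
        return max;
      }
    }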
Lower Bounds on the Number of Rs/Ws given Bakery type lock:
- Can't improve on n (number of threads)!
- Any deadlock-free algo requires reading/writing at least n distinct memory locations in the worst case.
Types of State:
- Object: state of its fields
- Thread: state of its program counters and local vars
- Global: state of all objects + local states of threads
In terms of Lock Objects:
- Inconsistent: if any thread is in crit. section, but the lock's state is compatible with a global state where no thread is in a crit. section. (obvious, and bad)
- No deadlock free algo can ever produce this
- Covering: at least one thread is about to write to a shared mem location, but the lock object suggests that the crit. sections are empty.
- Result: Any lock algo that uses memory R/Wing to solve deadlock-free mutual exclusion of n threads must use n separate locations
Chapter 3 - Concurrent Objects
Brief: "The behavior of concurrent objects is best described though their safety and liveness properties, often referred to as 'correctness' and 'progress'."
Some terminology related to objects:
- "Precondition": an object's state before invoking a method
- "Postcondition": an object's state upon return of method, ret. value
- "Side Effect": alternative for a change in state, suggests unintended
- "Sequential Specification": if state x, and meth. y is invoked, z will happen, etc etc
- "Total Method": behavior defined for all possible object states
- "Partial Method": all methods not Total (~Total Method)
Operational def. of method call:
- "..the interval that starts with an 'invocation' event and ends with a 'response' event."
Quiescent Consistency:
- "Pending": method call state after call event, but before response.
- Requirement: Meth. calls should appear to happen sequentially
- "Quiescent": State of an object that has no pending meth. calls
- Meth calls separated by a period of quiescence should appear to take effect in their real-time order.
- In sum: "..any time an object becomes quiescent, execution up to that point has been sequential."
- Not covered in class
Sequential Consistency:
- "Program Order": order in which a single thread issues meth. calls
- Requirement: Meth. calls should appear to happen sequentially
- Requirement: Meth. calls should appear to happen in program order
- Together, these requirements define a "correctness property" called sequential consistency
Linearizability:
- Problem: "..the result of composing sequentially consistent components is not itself necessarily sequentially consistent."
- Solution: Replace the requirement that method calls appear to happen in program order w/:
- "Each method call should appear to take effect instantaneously at some moment between its invocation and response."
- Linearizability Property: real-time behavior of method calls must be preserved.
Linearization Point (LP):
- A way to show that an object's impl. is linearizable
- The "point" is the place at which the method takes effect
- W/ Locks: each method's critical section can serve as LP
- W/O Locks: LP typically a single step where a method call's effects can be seen by other method calls
Linearizability & Consistency:
- Sequential Consistency:
- A good way to describe standalone systems, such as hardware memories.
- Non blocking
- NOT compositional
- Linearizability:
- A good way to describe components of large systems where independent components must be impl. verified independently
- Non blocking
- Compositional: the result of composing Linearizable objs is Linearizable
- The appropriate correctness condition for systems where concurrency and real-time response are important
Informal Definition of Linearizability:
- Linearizable if "..each method call appears to take effect instantaneously at some moment between the method's invocation and return events."
The Java Memory Model (JMM):
- Does not guarantee linearizability, or even seq. cons. when r/w on some shared object's fields.
- Guaranteeing the above would outlaw many common compiler optimizations
- Does satisfy the "Fundamental Property" of relaxed memory models:
- "If a program's sequentially consistent executions follow certain rules, then every execution of that program in the relaxed model will still be seq. consistent."
- In the JMM:
- Objects reside in shared memory.
- Each thread has a private working memory w/ cached copies of fields it has read/written.
- W/O synchronization, a thread that writes to a field may not propagate that write to memory right away. Similar for reads.
- No requirement to keep thread's cached/shared memory consistent
- Guarantees only that a thread's private read/writes appear in order to the thread.
- Synchronization Events:
- In Java, implies synchronizing thread's priv. memory w/ main/shared memory.
- Thread's cached fields are invalidated, requiring re-reads, or cached changes are "written through"
Locks & Synchronization Blocks:
- A thread can achieve mutual exclusion:
- Implicitly, by entering a synchronized block/method
- Explicitly, by acquiring a lock (eg. ReentrantLock)
- If all R/Ws to a particular field are protected with the same lock, then R/Ws to that field are linearizable:
- Cond. 1: On release, modified fields in private memory are written back to shared.
- Cond. 2: At acquisition, a thread invalidates its working memory, to ensure current values are read from shared mem.
Volatile Fields:
- Volatile fields are linearizable.
- Reading Volatile field similar to acquiring a lock:
- Working memory invalidated & field's value reread from shared mem
- R/W on V. field has same effect on mem. consistency as acquiring lock, but:
- Multiple R/W on same field are not atomic
- Ex: x++ on V. var. x, is not thread safe if multiple threads call
- Some form of mutual exclusion is also required
- Linearizable memory s.a. AtomicInteger:
- compareAndSet() ==> volatile write
- get() ==> volatile read
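A small illustration of the point above: x++ on a volatile field is not atomic (read, add, write are separate steps), while AtomicInteger's RMW methods are. Class and field names here are made up for the example:

    import java.util.concurrent.atomic.AtomicInteger;

    class Counters {
      volatile int unsafeCount = 0;                        // volatile: visible, but ++ is not atomic
      final AtomicInteger safeCount = new AtomicInteger(0);

      void unsafeIncrement() { unsafeCount++; }                // read-add-write: lost updates possible
      void safeIncrement()   { safeCount.getAndIncrement(); }  // atomic read-modify-write
    }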
Final Fields:
- final Type k = V;
- k cannot be modified after initialization
- By following simple rule(s), final fields can be read by other threads w/o synchronization:
- The "this" reference cannot be released from the constructor until after the constructor returns
NB. "R/Ws to fields are linearizable if either the field is volatile, or the field is protected by a unique lock acquired by all readers/writers."
Chapter 4 - Foundations of Shared Memory
The Space of Registers:
- "Read-Write" Register (register):
- an object that encapsulates a value
- value can be observed with read() (ie. load)
- value can be set with write() (ie. store)
- Because an object's impl. is "wait-free":
- "wait-free": each method finishes in finite steps
- It guarantees independent progress
- We need only require that register impls. be wait-free
- Would then also be lock-free (wait-free > lock-free)
- "Atomic" Register: A linearizable impl. of a seq. register (R-W)
- SRSW: "Single-reader, single-writer"
- MRSW: "Multi-reader, single writer"
- MRMW: "Multi-reader, multi-writer"
Chapter Addresses:
"Can any data structure implemented from the most powerful registers we define also be implemented from the weakest?"
Strengths of Synchronization:
- Weakest form of persistent sync:
- (Arguably) Ability to set a single persistent bit in memory
- Weakest form of sync:
- (Unarguably) No sync. at all
- The "Power" of a register comes from the strength of the guarantees it provides.
- A MRSW reg impl. is "safe" if:
- A read() call that doesn't overlap a write() call returns value written by most recently called write()
- Otherwise, if a read() overlaps a write(), the read() may return any value in the register's allowed range (eg. <Bool>: {T,F})
- Here, the term "safe" is relative. They are "safe" in a sense
- A type of register "safer" than safe, less "safe" than atomic:
- A "regular" register: MRSW reg. where writes do no happen automatically.
- Regular reg's values "flicker" between old and new value during write() call:
- A regular reg. is safe: reads not overlapping writes return most recently written value
- A read() overlapping a write() may return any value between the previously written value, and the last value written by the overlapping write().
- A regular register, precisely:
- (4.1.1) No read call returns a value from the future:
- NEVER true that: Ri -> Wi
- (4.1.2) No read call returns a value from the distant past:
- NEVER true that, for some j: Wi -> Wj -> Ri
- An atomic register, proceeding from regular definition:
- (4.1.3) Atomic if (in addition to 4.1.1/4.1.2):
- if Ri -> Rj, then i <= j
An Atomic SRSW Register:
- First, we construct an atomic SRSW reg from a regular SRSW reg:
- Need to satisfy 4.1.3.
- Since SR (single reader) implies no overlapping reads:
- 4.1.3 only violated if two reads overlapping a write read the values Vi then Vj, where j < i.
- Add a timestamp to each value.
- Each read remembers the latest (highest) timestamp/value pair read, making it available to future reads.
- This information can prevent 4.1.3 violations
- Adding a timestamp requires that a value and a timestamp can be read/written as a single unit.
- In C, this could be done by packing both components in a single binary value
- In Java, can use StampedValue<T>:
- Structure holds a timestamp/value pair
- A reference to this structure can then be stored in the register
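A minimal sketch of a StampedValue<T> pair as used above; this mirrors the book's helper class, but the exact fields/methods here are assumptions:

    class StampedValue<T> {
      final long stamp;   // timestamp taken by the writer
      final T value;      // the value written

      StampedValue(long stamp, T value) {
        this.stamp = stamp;
        this.value = value;
      }

      // return whichever pair carries the larger (later) timestamp
      static <T> StampedValue<T> max(StampedValue<T> x, StampedValue<T> y) {
        return (x.stamp > y.stamp) ? x : y;
      }
    }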
An Atomic MRSW Register:
- Issue: Suppose we have an array of atomic SRSW registers
- 4.1.3 holds for any single reader, but not for multiple distinct readers.
- A write sets ar[0], but is then delayed before writing ar[n-1]
- A read by thread 0 returns the new value
- A subsequent read by thread 1 reads and returns the old value
- This can be addressed by threads "helping out" later threads
- Solution: single array becomes an n x n array for n threads
- ar[0,n-1][0,n-1] of stamped values
- a write writes along the diagonal, eg. ar[x][x] for 0 <= x < n
- ar[A][B] where A == B are actual values
- ar[A][B] where A != B are for thread communication
- Each reader A, after reading ar[A][A] checks all other columns (eg. ar[B][A] for all B) to see if it contains a value w/ higher timestamp.
- The reader then writes its own value (including new timestamp) into its row (eg. ar[A][B] for all B), to let future threads know when its last read was.
- This allows for the detection/prevention of reads overlapping writes returning out of order values (4.1.3 holds)
An Atomic MRMW Register:
- To write to a register, A reads all array elements, chooses a timestamp > any observed, and writes that stamp + value to element A
- To read from a register, a thread reads all elements, and returns the one with the highest timestamp.
- Like in the bakery algo, ties are resolved in favor of lower threadID
Take Away:
- These constructions show that you can build a wait-free MRMW multi-valued atomic register from SRSW safe Boolean registers.
- They show that any algo using atomic registers can be implemented on architecture that supports only safe registers.
Chapter 5 - The Relative Power of Primitive Sync. Operations
Point: Need to develop a set of sync. primitives that can ensure sync. for all needs in the system. Can compare these primitives based on their "power".
Brief "If one thinks of primitive synchronization instructions as objects whose exported methods are the instructions themselves (these objects often called 'Synchronization Primitives'), one can show that there is an infinite hierarchy of synchronization primitives, s.t. no primitive at one level can be used for wait-free or lock-free implementation of any primitives at higher levels."
Point: Each class in this hierarchy has a 'consensus number', or the maximum number of threads for which it can guarantee 'consensus'
Consensus Numbers:
- A consensus object provides a "decide" method:
- Accepts a thread's input (v)
- Returns value which is both:
- Consistent: All threads decide the same value
- Valid: the common decision value is some thread's input
- Chooses from among callers
- Returns chosen caller's input (v)
- "..a concurrent consensus object is linearizable to a sequential consensus object in which the thread whose value was chosen completes its decide() first."
- Wait-free solutions to consensus problems desired
- Because decide() exec. only once per thread:
- Any lock free impl. will also be wait free
- "Consensus Protocol": (in book) any class that impls. wait free "consensus"
Definition 5.1:
- A class C solves n-thread consensus if there exists a consensus protocol using any number of Cs and any number of atomic registers
Definition 5.2:
- The consensus number of C is the largest n for which C solves n-thread consensus.
- If no largest n exists, C's consensus number is infinite.
Corollary 5.1:
- Suppose: class C implemented from one or more D (other class) + some atomic registers.
- If class C solves n-thread consensus, then so does D.
States & Valence - Definitions & Simplest Case:
- Simple case of two threads w/ inputs of 1/0
- Each thread "moves" (calls methods of shared objects) until it decides on a final value
- "Protocol State": states of threads + shared objects
- "Initial State": protocol state before any thread has moved
- "Final State": protocol state after all threads have finished
- "Decision Value" (of any final state): value decided by all threads in that state
States & Valence - State Tree
- A wait-free protocol's set of possible states form a tree where:
- Node: possible protocol state
- Edge: possible moves by some thread
- NB. Any wait free protocol's tree is finite.
States & Valence - Valence:
- Bivalent: a protocol state where the decision value is not yet fixed
- NOT ALL descendants in tree result in same decision
- Univalent: a protocol state where the decision value is fixed
- Every execution reaching this state will decide on same value
- ALL descendants in tree result in same decision value
- Lemma: "Every 2-thread consensus protocol has a bivalent initial state."
- Simple proof: consider the run where A is so slow it never calls decide(), vs. the run where B is so slow it never calls decide(). Both are plausible, and would result in different decisions.
- "Critical" State requires:
- The state is bivalent
- If any thread moves, the protocol state becomes univalent
Atomic Registers:
- No way to solve consensus using atomic registers
- Useful scenario:
- "Solo": Thread A runs completely by itself until it finishes.
- Theorem 5.2.1: Atomic registers have consensus number 1.
- Some more investigation here... (0-valent state?)
- Corollary 5.2.1: It is impossible to construct a wait-free implementation of any object with consensus number > than 1 using atomic registers.
Consensus Protocols:
- Authors consider a variety of object classes, considering how well each can solve the consensus problem.
- The object has an array of atomic registers in which each decide() method proposes its input value, and then goes on to execute a sequence of steps in order to decide on one of the proposed values.
FIFO Queues:
- In chapter 3, we saw a wait-free FIFO queue impl. using atomic registers, where only 1 thread could enqueue/dequeue.
- Can we provide a wait-free impl. of a two-dequeuer FIFO using atomic registers?
- Theorem 5.4.1: The two-dequeuer FIFO queue has consensus number >= 2.
- In the two-dequeuer FIFO algo:
- queue is initialized by enqueuing WIN, followed by LOSE
- decide():
- calls propose(v): stores v in proposed[]
- proposed[]: shared array of proposed input values
- dequeues the next item from queue.
- if dequeued item == WIN:
- calling thread was first, can decide its own value
- if dequeued item == LOSE:
- the other thread was first
- caller returns other thread's input, from proposed[]
- Wait-Free: Algo contains no loops (eg. wait for ... )
- Validity: condition holds as the dequeue of WIN stored its value in proposed[] before anything was dequeued.
- Can be varied trivially to yield protocols for stack, list, etc.
- Corollary 5.4.1: It is impossible to construct a wait-free impl. of a queue, stack, prio-queue, set or list from a set of atomic registers.
- The FIFO queue described above solves 2-thread consensus, but it cannot solve 3+-thread consensus
- Theorem 5.4.1: FIFO queues have consensus number 2.
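A sketch of the two-dequeuer queue consensus protocol described above. The WaitFreeQueue interface and ThreadID.get() (returning 0 or 1) are stand-ins for the chapter's own classes, and the queue is assumed to be pre-filled with WIN then LOSE:

    interface WaitFreeQueue<T> {      // assumed stand-in for the chapter's two-dequeuer queue
      void enq(T x);
      T deq();
    }

    class QueueConsensus<T> {
      enum Marker { WIN, LOSE }

      private final WaitFreeQueue<Marker> queue;            // pre-filled: enq(WIN); enq(LOSE);
      @SuppressWarnings("unchecked")
      private final T[] proposed = (T[]) new Object[2];     // shared array of proposed values

      QueueConsensus(WaitFreeQueue<Marker> queue) { this.queue = queue; }

      public T decide(T value) {
        int me = ThreadID.get();                            // assumed helper: 0 or 1
        proposed[me] = value;                               // propose(): publish my input
        if (queue.deq() == Marker.WIN) {
          return proposed[me];                              // I was first: my value is decided
        } else {
          return proposed[1 - me];                          // other thread won: decide its value
        }
      }
    }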
Read-Modify-Write Operations:
- "Many, if not all, of the classical synchronization operations provided by multiprocessors in hardware can be expressed as 'read-modify-write' (RMW) operations, or, as they are called in their object form, 'read-modify-write registers'."
- Consider a RMW register encapsulating integer values where F is a set of functions from integers -> integers.
- A method is an RMW for the function set F if it atomically replaces the current register value v with f(v), for some f in F, and returns the original value v.
- By Java convention:
- An RMW method that applies f in F "mumble": getAndMumble()
- Java's atomic integer provides rich method set (as example)
NB. "..RMW methods are interesting precicisely because they are potential hardware primitives, engraved not in stone, but in silicon."
- A RMW method is "nontrivial" if its set of functions includes at least one function that is not the identity function.
- Theorem 5.6.1: Any nontrivial RMW register has consensus number >= 2
- Corollary 5.6.1: It is impossible to construct a wait-free impl. of any nontrivial RMW method from atomic registers for 2 or more threads.
Common2 RMW Operations:
- Authors now consider the "Common2" class of RMW registers.
- These correspond to many common sync. primitives on modern procs.
- Though they're more powerful than atomic registers, they still have consensus number of exactly 2, indicating their limited sync. power.
- Definition 5.7.1: A set of functions F belongs to Common2 if for all values v and for all Fi, either:
- Fi and Fj commute: Fi(Fj(v)) = Fj(Fi(v)), or
- one function overwrites the other, eg. Fi(Fj(v)) = Fi(v)
- Definition 5.7.2: A RMW register belongs to Common2 if its set of functions F belongs to Common2
- Informally, Common2 RMW registers cannot solve 3-thread consensus, because though a winner can be determined, and losers can be determined, distinguishing the losers from each other isn't possible.
- Theorem 5.7.1: Any RMW register in Common2 has consensus number exactly 2.
The compareAndSet() Operation:
- compareAndSet() method supported on many archs:
- CMPXCHG on Intel's Pentium arch
- AKA compare-and-swap
- Takes 2 args:
- An Expected value
- An Update value
- If Expected == Actual:
- Write Update Value to field
- Return Boolean:
- True if field was updated (Expected == Actual)
- False otherwise
- Theorem 5.8.1: A register providing compareAndSet() and get() methods has an infinite consensus number
- Corollary 5.8.1: A register providing only compareAndSet() has an infinite consensus number.
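A sketch of why compareAndSet() gives an infinite consensus number: the first successful CAS fixes the decision for every caller, no matter how many threads there are. (Assumes proposed values are non-null so null can mark "undecided".)

    import java.util.concurrent.atomic.AtomicReference;

    class CASConsensus<T> {
      private final AtomicReference<T> decision = new AtomicReference<>(null);

      public T decide(T value) {
        decision.compareAndSet(null, value);   // only the very first CAS can succeed
        return decision.get();                 // everyone returns the winner's value
      }
    }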
Chapter 6 - Universality of Consensus:
Universality of Consensus:
- A class of objects
- Defined:
- Given sufficiently many of them, one can construct a wait-free linearizable implementation of ANY concurrent object.
- A machine architecture or programming language is computationally powerful enough to support arbitrary wait-free sync. if and only if it provides objects of a universal class as primitives.
- Eg. those that provide compareAndSet() are universal for any number of threads.
Universality:
- Again defined: "A class C is universal if one can construct a wait-free impl. of any object from some number of objects of C and some number of read-write registers."
- Rationals:
- Allowed to use many read-write regs as memory is plentiful on modern archs.
- Doesn't take into account memory waste, but trivial from an abstract perspective (implied by authors...)
A Lock-Free Universal Construction:
- Assume sequential objects are deterministic
- We represent each object as a combination of the object in its initial state, and a "log"
- Log: a linked list of nodes representing the sequence of method calls applied to the object (hence, the object's sequence of state transitions)
- Building Blocks:
- Invocation: describes the method being called + args
- Response: describes a method's termination conditions, and ret value
- Eg. invocation{push(), 42}, response{normal, void}
- eg. push(42) => normal (success/fail, not exception), null
- No reasonable return value for push()
- A thread executes a method call by adding the new call to the head of the list.
- The thread then traverses the list, tail to head, applying the method calls to a private copy of the object
- The thread finally returns the result of applying its own operation
What about concurrent safety?
- A thread attempting to call apply() creates a node to hold its invocation.
- The threads then compete to append their respective nodes to the head of the log, using an n-thread consensus protocol to agree on which node has been appended to the log.
- The inputs to this consensus are refs to the threads' nodes
- The result is the unique winning node
- The winner proceeds to compute its response by:
- Creates a local copy of the sequential object
- Traverses the log from tail to head, applying each operation to its copy
- Returns the response associated with its own invocation
- Works even apply() called by multiple threads because:
- The prefix of the log up to the thread's own node never changes
- The losing threads must try to set the node currently at the head, which changes between attempts, to point to themselves.
Further Caveats, Notes:
- Initially, the log consists of a unique sentinel node w/ seq# 1
- How to keep track of the log's head?
- Use a per-thread structure like the bakery algo's.
- Use an n-entry array head[], where head[i] is the last node in the list that thread i has observed.
- Initially, all entries refer to sentinel node
- The head is the node with greatest seq# in the head[] array
- Call max() on head[] to get head
- Each apply() call can be linearized at the point of the consensus call adding the node to the log.
A Wait-Free Universal Construction:
- How do we make the lock-free algo wait-free?
- We must ensure that every thread completes apply() in a finite number of steps, ie. no thread starves.
- To ensure this, threads must "help" each other (slower threads)
- Helping:
- Each thread must share with other threads the apply() it's trying to complete.
- Impl. w/ the addition of an n-entry announce[] array, where announce[i] is the node thread i is currently trying to append
- Initially all announce entries refer to sentinel node, w/ seq# 1
- A thread "Announces" a node when it stores the node in announce[]
- Execution of apply():
- Thread announces its new node.
- Ensures that even if the original thread fails to append, another can try on its behalf
- Now, in contrast to the lock-free version, the thread checks the announce[] to see if there is a node needing help ahead of it
- Seq# % n == thread to try to help
- if announce[seq# % n] doesn't need help (its seq# != 0, ie. it has already been appended), append the thread's own node; otherwise, help append the announced node
- This is sufficient to prevent any "wait forever" scenario
- From here on, execution is the same as if it were just lock-free
- Caveats:
- How to deal with the potential for two threads to append the same node and inc the seq# at the same time?
- Handled by the order in which a thread reads max(head[]) and then reads the seq# of the node in announce[]
- Before any node is added to head[], its seq# is set != 0
- This alone allows the identification of "double" append
- Linearizability:
- Due to the above (no node is added twice) it follows that:
- the order of the log is compatible w/ the natural-partial order of the corresponding method calls.
Chapter 7 - Spin Locks and Contention:
"Any mututal exclusion protocol poses the question: What do you do if you can't acquire the lock?"
- Two alternatives:
- Spin Lock: Keep trying to acquire the lock ie. spinning, busy-waiting
- Filter and Bakery algos are Spin Lock based
- Generally: Used if wait time is assumed to be very small
- Blocking: Suspend execution, and let scheduler schedule another thread on the proc.
- Generally: Used if wait time is assumed to be long
- Spinning wastes cycles directly, but context switches are expensive
- Many archs do both
Welcome to the Real World:
- Authors present example of a simple program with 2 threads that each acquire a Peterson lock, increment a shared counter, and unlock.
- We assume that when this process finishes, the shared counter will be 2*number of increments.
- Our assumptions of Peterson Lock correctness require that any two memory accesses by the same thread will occur in program order
- However, this isn't necessarily true!
- Why not?
- Compiler optimizations that reorder instructions
- Most languages preserve program order for individual variables, but not across multiple variables
- Microprocessor hardware:
- Many (if not most) programs don't need writes to immediately be reflected in memory.
- Many vendors therefore aggressively cache writes, and only write them "through" occasionally.
- As a result, one thread's writes may not yet be visible to another thread, even though each thread's own operations appear to it in program order.
- What to do?
- Memory Barriers (Fences):
- Special instructions (eg. mfence) that force outstanding operations to take effect.
- Expensive. Nearly as costly as atomic compareAndSet()
- Ex. Peterson's lock can be fixed by calling mfence before each read
- Since Fence cost is so high anyway, and since compareAndSet() and its cousins have higher consensus numbers than reads and writes, they can (should) be used to reach consensus.
Test-And-Set Locks:
- testAndSet() (consensus # == 2) was the principal sync. instruction provided on many early multiprocessor archs.
- Works on a single memory word (byte)
- That word holds a binary value (true, false)
- Atomically stores true in the word
- Returns the word's previous value: swapping true for actual
- Lock() then applies testAndSet() until it returns false
- Unlock() just writes false to the word
- testAndTestAndSet()
- replaces while(state.getAndSet(true)){} w/:
    while (true) {
        while (state.get()) {}           // spin on the cached value until the lock looks free
        if (!state.getAndSet(true))      // looked free: try to grab it with one getAndSet()
            return;
    }
- Performance significantly better than TAS!
- Notes on Multiprocessor architecture (general, modern):
- Processors communicate by way of a shared broadcast medium (bus)
- "Like a tiny ethernet"
- Both processors and the memory controller can broadcast on bus
- Only one entity (proc/memory controller) can broadcast at a time
- All can listen (snoop)
- Each proc has a cache
- On proc cache miss, proc broadcasts addr on bus
- Procs snooping with the addr in their cache broadcast the addr/value
- If no proc broadcasts response, memory controller responds
TAS-Based Spin Locks Revisited:
- TAS vs TTAS on shared-bus architecture
- TAS:
- Each getAndSet() is broadcast on bus
- Each thread needs the bus => each getAndSet() call slows all threads
- Each getAndSet() also forces all procs to flush cached versions of the lock:
- So spinning threads cause cache misses almost every time
- Causing their proc to need to broadcast on bus to get the new, yet unchanged, value
- Release of lock is also slowed by all the bus traffic, further degrading performance
- TTAS:
- Situation: Thread A holds lock, Thread B tries to acquire
- 1st attempt results in cache miss, and resulting fetch
- Subsequent attempts will hit cache, and not cause bus traffic
- This also reduces the TAS unlock overhead, due to less traffic
- Situation: A thread releases the lock
- Lock holder writes false to lock var, invalidating spinner's cached copies
- Each Spinner incurs a cache miss to reread lock value
- Each Spinner then calls getAndSet() to try and acquire
- The first to succeed invalidates the other, causing more misses
- This bus "storm" of traffic, per release, degrades performance
Exponential Backoff:
- Some Terminology:
- Contention: when multiple threads try and acquire a lock at the same time
- High/Low Contention: many threads/few threads vying for a lock
- Back off: time during which a thread won't try to acquire a lock
- Generally:
- Global MIN/MAX_delay set (smallest allowed rand, greatest wait time)
- Choose a random initial delay (to prevent threads from retrying in lockstep)
- Every time a thread sees the lock as available, but then fails to acquire it:
- Indicates contention (not just another thread holding)
- Thread doubles its delay (up to MAX_delay), and backs off for that duration
- Notes:
- Authors recount experience suggesting that the MIN/MAX_delay optimal values are highly dependent on the number and speed of the processors in a system.
- This makes it somewhat hard to make a portable version
- Problems:
- Cache-Coherence Traffic (CCT):
- All threads spin on same shared location, causing cache-coherence traffic on every lock access
- Critical Section Underutilization (CSU):
- Threads delay longer than necessary, causing critical section to be used less than it could be
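A minimal sketch of a test-and-test-and-set lock with exponential backoff as described above; the MIN/MAX delay constants are placeholders that would need per-platform tuning:

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.atomic.AtomicBoolean;

    class BackoffLock {
      private static final int MIN_DELAY = 1, MAX_DELAY = 1024;    // ms; placeholder values
      private final AtomicBoolean state = new AtomicBoolean(false);

      public void lock() throws InterruptedException {
        int limit = MIN_DELAY;
        while (true) {
          while (state.get()) {}                  // spin on the cached value until it looks free
          if (!state.getAndSet(true)) return;     // won the race: lock acquired
          // saw it free but lost the race => contention: back off for a random, doubling delay
          Thread.sleep(ThreadLocalRandom.current().nextInt(limit) + 1);
          limit = Math.min(MAX_DELAY, 2 * limit);
        }
      }

      public void unlock() {
        state.set(false);
      }
    }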
Queue Locks:
- Direct response to the problems with Exponential Backoff noted above
- Idea: Threads form a queue
- Queue:
- Each thread can learn if it is its turn by checking if predecessor is finished
- CCT reduced by threads spinning on different locations
- CSU improved as there is no guessing. Threads are notified directly when they can enter a critical section
- Achieves fairness via first-come, first-serve paradigm
- Array-Based Locks:
- Threads all share an AtomicInteger tail field
- To acquire lock:
- Each thread atomically increments tail
- Increment result is the thread's "slot"
- Slot used as index into Boolean flag array
- if flag[i] == True, then thread w/ slot == i can get lock
- flag[0] initially set to true
- A thread therefore spins until its slot in flag[] == true
- To release lock:
- A thread holding the lock sets its own slot in flag[] = false
- It then sets its own slot + 1 in flag[] == true
- Allowing its successor to enter
- Math all done modulo n (#threads)
- Thread-Local Variable:
- Each thread has its own, independently initialized copy
- Need not be stored in shared memory
- Do not require synchronization
- Do not generate any CCT as they are accessed by only 1 thread
- Accessed by get()/set()
- False Sharing:
- Contention caused by adjacent data items (s.a. array elements) sharing the same cache line.
- A write to one invalidates a cache line which may contain adjacent elements, causing invalidation traffic
- Addressable by:
- Padding arrays so that no two elements will fall in same line
- eg. ar = [1,x,x,x,2,x,x,x,3,x,...,10]
- Problems:
- Not space efficient
- Requires known bound n on the max. number of threads, and allocates an array of that size per lock
- Therefore, synchronizing L distinct objects requires O(Ln) space, even if a thread accesses only one lock at a time
- The CLH Queue Lock:
- Records each node's status in a QNode object
- Has Boolean locked field
- True: that thread has acquired, or is waiting for, the lock
- False: that thread has released the lock
- Lock itself represented as a virtual linked list
- Each node refers to its predecessor via pred field
- Public tail field holds a reference to the most recently added node
- Operation:
- A thread calls getAndSet() on tail field to make its own node the list's tail + sets pred correctly
- It then spins on pred's locked field
- To release a lock, a thread sets its locked field = false
- It then reuses its predecessor's QNode as its new node for future accesses. It can do this because:
- "At this point, the thread's predecessor's QNode is no longer used by the predecessor, and the thread's old QNode could be reffed by both the thread's successor, and by tail"
- Benefits:
- If there are L locks, and each threads accesses at most one at a time, then the CLHLock class needs only O(L+n) space vs. O(Ln) space required by ALock
- Doesn't require preknowledge on the number of threads
- Problems:
- Performs poorly on cacheless NUMA archs.
- If each thread spins on its predecessor's locked field, and that memory location is remote, performance will suffer.
- The MCS Queue Lock:
- Similar setup to CLH:
- QNodes are arranged in a linked list
- QNode's locked field indicates locker holder, or waiting for
- Unlike CLH:
- List is explicit:
- Embodied in globally accessible QNode objects via their next field
- To Acquire Lock:
- A thread appends its own QNode at the tail of the list
- If queue was not empty, it sets predecessor's QNode's next field to refer to itself (update list..)
- It then spins on a (local) locked field in its own QNode, waiting until its predecessor sets this field to false
- To Unlock:
- First checks if its node's next field == null
- If null, either:
- No other thread is contending for lock, or
- There is another thread, but it's slow
- The following steps deal with these cases
- Unlock() calls compareAndSet(this.QNode, null) on tail
- If compareAndSet() succeeds, no other thread is trying to acquire the lock, else there is one and our thread spins, waiting for it to finish
- Once a successor has appeared:
- Unlock() sets its successor's locked field = false
- We now have no refs to this QNode, and it can be reused
- Benefits:
- Shares the benefits of CLH:
- Each lock release invalidates only the successor's cache entry.
- MCS is better suited to cacheless NUMA archs as each thread controls the location it spins on
- Like CLH, nodes can be recycled:
- Results in O(L + n)
- Problems:
- Releasing a lock requires spinning.
- There are more reads/writes/compareAndSet() calls than CLH algo
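A minimal sketch of the CLH queue lock described above; the class and field names follow the book's conventions, but the ThreadLocal plumbing and other details are assumptions:

    import java.util.concurrent.atomic.AtomicReference;

    class CLHLock {
      static class QNode { volatile boolean locked; }

      private final AtomicReference<QNode> tail = new AtomicReference<>(new QNode());
      private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
      private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

      public void lock() {
        QNode node = myNode.get();
        node.locked = true;                   // announce intent to acquire
        QNode pred = tail.getAndSet(node);    // splice onto the virtual list
        myPred.set(pred);
        while (pred.locked) {}                // spin on the predecessor's field
      }

      public void unlock() {
        QNode node = myNode.get();
        node.locked = false;                  // successor stops spinning
        myNode.set(myPred.get());             // recycle the predecessor's node
      }
    }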
A Queue Lock with Timeouts:
- Java's Lock interface includes 'tryLock(timeout)'
- TOLock - Java's CLHLock + timeout functionality
- Allows callers to attempt to lock for a period, timeout
- Or to "abandon" the lock
- Abandonment:
- BackoffLock:
- Trivial, Wait-Free
- Queue-based:
- Non-Trivial, Not Wait-Free, thread behind one which abandons will starve.
- Queue + timeouts (a lazy approach):
- A thread that abandons marks its node as such, allowing any successor nodes to stop spinning on it, and instead spin on the abandoned node's predecessor.
- TOLock Advantages:
- Local spinning on cached location (CLHLock)
- Quickly detects lock free (CLHLock)
- Wait-Free timeout property (BackoffLock)
- TOLock Disadvantages:
- New node allocation each time lock is accessed
- A thread spinning may need to spin on several locations before a non-abandoned/AVAIL node is reached and it can proceed.
Chapter 8 - Monitors and Blocking Synchronization
"Monitors are a structured way of combining synchronization and data."
Monitor Locks and Conditions:
- Terminology:
- Only one thread at a time can 'hold' a lock
- A thread 'acquires' a lock when it first starts to hold the lock
- A thread 'releases' a lock when it stops holding it
- If a thread can't immediately acquire a lock, it can:
- Spin: repeatedly test for a particular state
- Block: give up the processor, yield for a while
- When to do what:
- Block: If the thread is likely to wait for a long time
- Spin: Multiprocessor + likely short wait time
- Both: Often best option, eg. spin for a short while, then block
- Conditions:
- Why?
- A thread waiting for, say, another thread to enqueue something in a shared queue, needs to give up the lock on the queue in order for some other thread to actually enqueue something, and so must somehow be notified when such an item has in fact been enqueued.
- In Java, Pthreads, etc, this is resolved with 'conditions', or the ability to release locks temporarily
- Issue: No guarantee that the condition will hold when thread is awakened (after calling await() for instance)
- A waking thread must test for the condition
- Java 'Condition' flavors:
- Specification of a max. time the caller can be suspended
- Specification of whether or not the thread can be interrupted
- Reentrant: "..a thread that is holding the lock can acquire it again without blocking."
- Monitor (more precisely): The combination of methods, mutual exclusion locks and condition objects.
- The Lost-Wakeup Problem:
- The deadlock of condition objects: a thread or threads which continue waiting even after the condition they're waiting for has become true.
- Anti Lost-Wakeup best practices:
- Always signal ALL processes waiting on a condition, not just one
- Specify a timeout when waiting
- Synchronized blocks are Java's built-in support for monitors
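A small illustration of the condition pattern described above (class/field names are made up): re-test the condition in a loop after waking, and signal all waiters to avoid lost wakeups:

    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    class CountingMonitor {
      private final ReentrantLock lock = new ReentrantLock();
      private final Condition notZero = lock.newCondition();
      private int count = 0;

      public void increment() {
        lock.lock();
        try {
          count++;
          notZero.signalAll();            // wake every waiter, not just one
        } finally {
          lock.unlock();
        }
      }

      public void decrement() throws InterruptedException {
        lock.lock();
        try {
          while (count == 0) {            // condition may not hold on wakeup: re-test in a loop
            notZero.await();
          }
          count--;
        } finally {
          lock.unlock();
        }
      }
    }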
Readers-Writers Locks:
- Observation (paraphrased): Many shared objects have the property that most method calls (readers) return information about the object's state w/o modifying the object, while a comparatively few number of calls (writers) actually modify the object.
- Remark: Readers need not lock the object to read, but writers must lock out both other writers and readers.
- A 'readers-writers' lock allows this
- A ReadWriteLock{} interface consists of a read lock, and a write lock
- Such an interface would satisfy the following safety properties:
- To acquire the write lock, no other thread can have either a read, or write lock
- To acquire the read lock, no other thread can have a write lock
- Some Caveats:
- Supposing that reads are more likely than writes, it might be the case that successive reads cause writes to take a long time to complete, if they ever do.
- Writer's attempt to acquire lock prevents any new reads until the writer has completed its write.
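A small usage sketch of Java's built-in ReentrantReadWriteLock for the readers-writers pattern above (class/field names are made up):

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class CachedValue {
      private final ReadWriteLock rw = new ReentrantReadWriteLock();
      private int value = 0;

      public int read() {
        rw.readLock().lock();          // many readers may hold the read lock at once
        try { return value; } finally { rw.readLock().unlock(); }
      }

      public void write(int v) {
        rw.writeLock().lock();         // excludes all readers and other writers
        try { value = v; } finally { rw.writeLock().unlock(); }
      }
    }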
Our Own Reentrant Lock:
- Locks discussed in chaps {2,7} will deadlock with themselves if they attempt to reacquire their own lock.
- Ex. Method acquires lock, makes nested call to another method that acquires the same lock
- A lock is 'reentrant' if it can be acquired multiple times by the same thread.
Chapter 9 - Linked Lists: The Role of Locking
"This chapter introduces several useful techniques that go beyond coarse-grained locking to allow multiple threads to access a single object at the same time."
Fine-grained synchronization:
- Split an object into individually lockable components
Optimistic synchronization:
- Objects like a tree are often searched. Optimistic sync on such an object might mean that the tree is not locked for the search; a lock is taken only if/when the desired node is found.
- Only works if this succeeds more than half the time, ie. 'optimistic'.
Lazy synchronization:
- "Sometimes it makes sense to postpone hard work."
- Ex. removing a component can be split into two parts:
- Logically removing the component (set a bit)
- Physically removing the component (unlink from rest of structure)
Nonblocking synchronization:
- Eliminate locks completely by using built-in atomic ops ie. getAndSet()
Authors note that reasoning about concurrent data structures can seem "impossibly difficult", but a good place to start is to look at a structure's 'invariants'.
- Invariant (a property that always holds):
- Holds when the object is created
- No thread may take a step that makes the property false
Some terminology (all w.r.t. list algorithms):
- Freedom from Interference:
- The assumption that the methods in question are the only ones which will modify nodes (no external interference)
- Abstract Value:
- A set of items (given we're talking about lists...)
- Concrete Representation:
- A list of nodes
- Representation Invariant:
- Characterization of what representations make sense as abstract values
- "A contract among the object's methods"
- Abstraction Map:
- Maps lists that satisfy the representation invariant to sets
- Ex. W/ Lists: an item is in the set if and only if it's reachable from HEAD.
Useful invariants w.r.t. lists (per chapter examples):
- Sentinel nodes are neither added nor removed
- Nodes are sorted by key, and keys are unique
NB. W.r.t. these list (set) algos:
- The desired safety property is linearizability.
- They make many different progress guarantees
Coarse-Grained Synchronization:
- Set with add/remove methods, each of which uses the try/finally lock paradigm.
- A single reentrant lock used
- Even w/ good lock, contention will cause delays
Fine-Grained Synchronization:
- Add a lock to each node.
- When a thread traverses the list, it locks each node when it first visits it, and later releases it.
- This allows for multiple threads to traverse list at same time, in a kind of 'pipelined' fashion
- Acquiring the locks must be done in "hand-over-hand" fashion:
- Because removal/addition involves not just a single node, but also the predecessor's next field, a thread must not release the lock it holds on one node until it has acquired the lock on the next node.
- Otherwise, another thread could remove a node in the window between releasing one lock and acquiring the next.
- Sometimes called "Lock Coupling"
- Maintaining a constant order in which each thread acquires locks ensures no deadlock.
Optimistic Synchronization:
- Fine-grained locking improved over coarse, but still potential for long sequence of lock acquisitions and releases.
- Ex. A thread removing node 2 blocks all threads from concurrently searching past node 2
- Take a chance: search without acquiring locks:
- Lock the nodes if found
- Confirm locked nodes are correct
- If nodes are wrong, unlock and try again
- The optimism here is the assumption that the nodes will be right more often than they are wrong.
- Clearly, Freedom from Interference is crucial to this process, and it, in turn, "relies on garbage collection to ensure that no node is recycled while it is being traversed."
- As the most common set method, having contains() w/ locks isn't ideal
Lazy Synchronization:
- Optimistic synchronization pays off when the cost of traversing the list twice without locking (search + validate) is significantly less than the cost of one traversal with locking.
- To improve on optimistic, and remove locks from contains():
- Add a 'marked' field to each node, indicating whether it has been logically removed from the set
- No need to validate reachability of a node, or lock the node
- Contains() needs only one wait-free traversal.
- Remove() now removes 'lazily':
- Logically removing the node by marking it
- Physically removing the node by redirecting its predecessor's next field.
- Advantages:
- Separates unobtrusive logical steps, like flag setting, from obtrusive, expensive, physical steps like changing node references.
- Physical steps can be batched and scheduled at off-peak times
- Principal Disadvantage:
- Add() and remove() are still blocking, causing potential delays
Non-Blocking Synchronization:
- Need some way to remove the locks completely:
- Treat a node's NEXT and MARKED fields as a single atomic unit
- After presenting their lock-free list implementation, the authors point out that there are costs associated with the strong progress guarantee:
- Atomic modification of ref AND mark has a performance cost
- add() and remove() may contend as they each traverse the list and attempt to clean up removed nodes.
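A sketch of how the NEXT/MARKED pair can be treated as one atomic unit in Java via AtomicMarkableReference, as the non-blocking list requires (node fields here are assumptions):

    import java.util.concurrent.atomic.AtomicMarkableReference;

    class MarkedNode<T> {
      final T item;
      final AtomicMarkableReference<MarkedNode<T>> next =
          new AtomicMarkableReference<>(null, false);   // reference + mark updated as one unit

      MarkedNode(T item) { this.item = item; }

      // Logical removal: atomically set the mark while leaving the successor unchanged
      boolean logicallyRemove() {
        MarkedNode<T> succ = next.getReference();
        return next.compareAndSet(succ, succ, false, true);
      }
    }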
Properties affecting the choice of progress strength include:
- Potential for arbitrary delays
- Relative frequency of add() and remove() calls
- Overhead of implementing an atomic op on ref AND mark
Chapter 10 - Concurrent Queues and the ABA Problem
Pools:
- What: Similar to Set (like those used in chap 9) except:
- Doesn't necessarily provide contains()
- There may be more than one of the same item
- Commonly used as producer-consumer buffers
- Bounded vs. Unbounded:
- A pool can be bounded (limited) to a certain capacity
- Bounded pools may be easier to write
- Bounded pools lend themselves to loose prod/consumer sync
- Unbounded pools may be more useful when prods can out-pace consumers by any margin
- Pool Methods:
- Total: if calls don't wait for a certain condition to be true
- Ex: Consumer's get() returns immediately (exception, error, success, etc) regardless of presence of item to consume
- Interface useful when prod or consumer threads have something better to do than wait for a meth call to take effect.
- Partial: if calls may wait for conditions to hold
- Ex: Consumer's get() blocks until there is something to consume
- Interface useful when prod or consumer threads have nothing better to do than wait for the pool to become non-empty/non-full
- Synchronous: if calls wait for another method to overlap their call interval
- Ex: In a synchronous pool, a producer's put() is blocked until that item is removed by another method call
- Synchronous pools are often used for communication in languages in which threads 'rendezvous' (eg. CSP, Ada)
- Pool Fairness:
- Varies: Could be FIFO, LIFO, or weaker fairness property
Queues:
- Authors here consider a pool that provides FIFO fairness:
- Queue<T>: an ordered sequence of items
- Enq(x) implements put() -> puts element x at the tail of queue
- Deq() implements get() -> gets element at the head of queue
A Bounded Partial Queue:
- Authors describe their implementation w/ a linked list
- Fields:
- HEAD -> refers to first
- TAIL -> refers to last
- size -> AtomicInteger that tracks # of empty slots
- Locks (ensure at most 1 consumer/producer):
- enqLock /w condition notFullCondition
- deqLock w/ condition notEmptyCondition
- Drawbacks:
- All calls involve getAnd{Inc,Dec}rement() on the size field
- Because these are expensive, they may cause sequential bottleneck
- Improvement to address potential size field bottleneck:
- Separate size into a deqSize and an enqSize, which can be checked separately against capacity
An Unbounded Total Queue (author's construction):
- No need to count the number of items (unbounded)
- Enq(x) always enqueues x
- Deq() throws an EmptyException if nothing to dequeue
- Enq(), deq() both lock, but no conditions required
- Otherwise same as bounded partial (though much simpler)
An Unbounded Lock-Free Queue (author's construction):
- Prevents starvation by having fast threads help slower threads
- The lazy lock-free enq(x):
- 1) compareAndSet() changes the next field of the node reffed by TAIL to refer to x
- ## Appends new node
- 2) compareAndSet() advances TAIL to refer to x
- ## Update queue's TAIL to newest
- NB: As these two steps are NOT atomic, half-done enq() calls could be encountered
- NB: Since both deq() and enq() attempt to perform step 2, a deq() call encountering a half-done enq() call will in fact 'help' finish the enq() call by updating TAIL
Memory Reclamation and the ABA Problem:
- The queues discussed so far in the chapter have relied on Java doing garbage collection. What if there is no automatic garbage collection?
- Each thread could maintain a free list of unused queue entries
- If a thread's list is empty it inits a new node
- Else it takes from the end of the free list
- On node removal, the node is linked back to free list
- ** Works well when # enqs/deqs are roughly equal
- Given thread-local free lists, the lock-free queue doesn't work out of the box! For example:
- Thread 1 calls deq(), sees sentinel == A, and NEXT == B
- Thread 1 prepares to update HEAD w/ compareAndSet(A, B)
- Thread 2 calls deq(), dequeues A and B, placing them in free list
- A is recycled s.t. eventually sentinel == A
- Thread 1 wakes up, and calls compareAndSet(A, B)
- Thread 1's compareAndSet() succeeds, but it has in fact set HEAD to a recycled node
- ** the ABA problem
- ABA Problem:
- The final state of the sequence A->A and the sequence A->B->A is identical, but the logical effect of the operations may not be.
- Classic Fix:
- Stamp each reference
- Java's AtomicStampedReference<T> encapsulates a ref to <T> and an integer stamp which can be updated together atomically
- Per compareAndSet() attempt in the above classes, the stamp can be incremented to provide knowledge of state history
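A sketch of the stamped-reference fix using Java's AtomicStampedReference: the CAS succeeds only if both the reference and the stamp are unchanged (class/field names are made up):

    import java.util.concurrent.atomic.AtomicStampedReference;

    class StampedHead<T> {
      private final AtomicStampedReference<T> head;

      StampedHead(T initial) { head = new AtomicStampedReference<>(initial, 0); }

      // CAS the head only if both the reference and its stamp are unchanged,
      // bumping the stamp so a recycled node cannot masquerade as the old one
      boolean casHead(T expected, T update) {
        int[] stampHolder = new int[1];
        T current = head.get(stampHolder);      // read reference and stamp together
        return current == expected &&
               head.compareAndSet(expected, update, stampHolder[0], stampHolder[0] + 1);
      }
    }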
Chapter 11 - Concurrent Stacks and Elimination
"Surprisingly, perhaps, stacks are not inherently sequential."
An Unbounded Lock-Free Stack:
- Authors implement it over a linked list w/ TOP set to first node
- A push() call creates a new node, then calls tryPush() to try to swing TOP pointer to new node
- If tryPush() succeeds, push() returns, else tryPush() repeats attempt after backoff
- A pop() uses same paradigm as push()
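A minimal sketch of the lock-free stack described above (backoff omitted; class/field names are assumptions):

    import java.util.concurrent.atomic.AtomicReference;

    class LockFreeStack<T> {
      private static class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
      }

      private final AtomicReference<Node<T>> top = new AtomicReference<>(null);

      public void push(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
          Node<T> oldTop = top.get();
          node.next = oldTop;
          if (top.compareAndSet(oldTop, node)) return;   // tryPush succeeded: TOP swung to new node
          // else: lost the race; back off (omitted here) and retry
        }
      }

      public T pop() {
        while (true) {
          Node<T> oldTop = top.get();
          if (oldTop == null) throw new IllegalStateException("empty stack");
          if (top.compareAndSet(oldTop, oldTop.next)) return oldTop.value;
        }
      }
    }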
Elimination:
- The aforementioned Lock-Free Stack scales poorly:
- There is contention on TOP (though not too bad with backoff)
- There is a sequential bottleneck, calls must proceed one by one
- Observation: a push() followed by a pop() 'eliminate' each other
- The state before and after is identical
- The authors present an EliminationBackoffStack<T>. Briefly:
- Each thread selects a random array location
- If thread A's pop() and B's push() arrive at the same location at (about) the same time, they exchange values, and the Stack isn't touched
The Elimination Backoff Stack:
- Authors recount the following story:
- Two people from different parties are trying to persuade the other to vote for their candidate on election day. They both fail to move the other, and so one suggests that since their votes will just cancel out anyway, why don't they both just not vote. They agree and part. A friend of the person who suggested not voting remarks on what a 'sporting' offer they had made, to which the suggester replies "not really, 3rd one I've made today!"
- In the elimination array, how to ensure a thread only makes a 'sporting' offer to at most one other thread?
- "Coordination objects called exchangers, objects that allow exactly two threads (and no more) to rendezvous and exchange values."
- A Lock-Free Exchanger:
- Thread A calls the exchanger's exchange() method w/ arg a
- Thread B calls the same exchanger's exchange() method w/ arg b
- A will be returned b, and B will be returned a
- First thread spins waiting for second, or exits after duration
- Exchanger states: EMPTY, WAITING, BUSY
- Exchanger States (when a thread arrives):
- EMPTY:
- The thread places its item in the slot, sets state = WAITING
- If the compareAndSet() on state fails, another thread must have succeeded, so try again.
- If it succeeded, it spins, waiting for another thread.
- WAITING:
- There is already a thread and an item waiting
- The thread takes the item, and tries to replace it with its own using compareAndSet() on state
- If it succeeded replacing the item, it sets state to BUSY
- If it fails it keeps retrying
- BUSY:
- Two threads are engaged in an exchange
- Thread retries
- Linearization Point of exchange():
- When the second thread to arrive changes state: WAITING -> BUSY
- The Elimination Array:
- Implemented as an array of Exchanger objects of max size c
- Threads attempting exchange pick a slot at random, and call that exchanger's exchange() method
- Aforementioned tryPush(), tryPop() now, instead of just backing off on failure, try to use the elimination array to exchange values
- Naively, retry try{Push,Pop}() on failure to exchange
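A small usage sketch of java.util.concurrent.Exchanger, a built-in rendezvous object like the one described above: exactly two threads meet and swap values, or give up after a timeout:

    import java.util.concurrent.Exchanger;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    class ExchangeDemo {
      static final Exchanger<Integer> slot = new Exchanger<>();

      // e.g. a pushing thread offers its value and hopes to meet a popping thread
      static Integer tryExchange(Integer mine) throws InterruptedException {
        try {
          return slot.exchange(mine, 100, TimeUnit.MICROSECONDS);
        } catch (TimeoutException e) {
          return null;    // no partner arrived in time: fall back to the stack itself
        }
      }
    }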
Chapter 16 - Futures, Scheduling, and Work Distribution
Introduction:
- Matrix Multiplication:
- If A[i,j] is the value at (i,j) in matrix A then,
- the product C of two n x n matrices A and B is:
- C[i,j] = SUM[from k=0 -> n-1](A[i,k] * B[k,j])
- Naive multi-threading of matrix multiplication:
- Spawn n x n threads to compute each C[i,j]
- Very inefficient, as thread creation/destruction is expensive
- Less Naive:
- Create a pool of long-lived threads
- Each thread in pool waits until it is assigned a task
- Task: short-lived unit of computation
- A thread assigned a task completes the task, and returns to pool
- Advantage: eliminate some CRUD task expense
- Advantage: abstraction allows app programmer to not have to worry about platform-specific details like max number of threads.
- Advantage: abstraction allows the same program to run well on single, few or multiprocessor platforms
- In Java thread pool == executor service:
- Provides for task submission
- Provides ability to wait for tasks to complete
- Can cancel uncompleted tasks
- Task w/o result represented as 'Runnable' object w/ run() meth
- Task w/ return of type T: 'Callable<T>' w/ call() meth to get T
- Callable<T> submission results in 'Future<T>' promise object
- Submission of a Runnable task also returns future w/ null return:
- Future<?>
- NB. Futures don't guarantee that any computation actually happens in parallel; they are instead advisory, telling the executor service that it may execute them in parallel.
- Matrix math decomposition:
- Addition:
- 4 additions can be done in parallel + constant split time
- Multiplication:
- 1) 8 products can be done in parallel
- 2) 4 adds can be done in parallel, after 8 products
- Some constant split time
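Below is a minimal sketch of the executor-service/future machinery described above, applied to the naive one-task-per-entry decomposition of matrix multiplication. The class and method names (MatrixMultiplyWithFutures, multiply) are illustrative, not the book's code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MatrixMultiplyWithFutures {
    public static double[][] multiply(double[][] a, double[][] b) throws Exception {
        int n = a.length;
        // thread pool: long-lived worker threads that pick up short-lived tasks
        ExecutorService exec = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        Future<Double>[][] futures = new Future[n][n];
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                final int row = i, col = j;
                // submitting a Callable<Double> returns a Future<Double> promise
                futures[i][j] = exec.submit(() -> {
                    double sum = 0.0;
                    for (int k = 0; k < n; k++)
                        sum += a[row][k] * b[k][col];
                    return sum;
                });
            }
        }
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[i][j] = futures[i][j].get();  // block until that entry's task is done
        exec.shutdown();
        return c;
    }
}
```

As the notes say, this spawns n^2 tasks; the pool reuses its threads so thread creation cost is avoided, but a divide-and-conquer decomposition (as sketched in the matrix-math notes above) generally has less per-task overhead.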
Analyzing Parallelism:
- Assumption: all individual computation steps take equal time
- Tp = Minimum time (in steps) to run program on p dedicated procs
- The program's latency, the time it takes to run start -> finish
- Idealistic
- Lower bound on parallelism extractable via multithreading
- T1 = # steps needed to execute the prog on one proc
- The program's work
- Tp >= T1/P
- T* (properly T-infinity) = # steps needed to execute the prog on unlimited procs
- The program's critical path length
- Tp >= T*
- Speedup of p procs: T1/Tp
- Computation has 'linear speedup' if T1/Tp = Θ(P)
- Parallelism: The maximum possible speedup T1/T*
- Average amount of work available at each step on critical path
- Provides good estimate for # of procs to use (see the worked instance after the matrix analyses below)
- Analyzing Matrix Addition:
- Working on n x n matrices
- Requires 4 half-sized additions + constant time to split
- Work A1(n) is therefore given by the recurrence:
- A1(n) = 4A1(n/2) + O(1)
- = O(n^2)
- Critical Path Length:
- A*(n) = A*(n/2) + O(1) |substitute n = 2^k
- A*(2^k) = A*(2^(k-1)) + O(1)
- A*(2^k) = A*(2^(k-2)) + 2*O(1)
- A*(2^k) = A*(1) + k*O(1) |unroll k times
- A*(2^k) = O(k) |k = log2 n
- A*(n) = O(log2 n)
- Parallelism:
- A1/A* = (n^2) / (log2 n)
- Analyzing Matrix Multiplication:
- Working on n x n matrices
- Requires 8 half-sized multiplications + 4 additions
- Work M1(n) is therefore given by the recurrence:
- M1(n) = 8M1(n/2) + 4A1(n/2) |substitute n = 2^k
- M1(2^k) = 8M1(2^(k-1)) + 4A1(2^(k-1))
- M1(2^k) = 8^k*M1(1) + O(8^k) |unroll k times; the 8^k term dominates
- M1(2^k) = O(8^k) = O((2^k)^3) |k = log2 n
- M1(n) = O(n^3)
- Critical Path Length:
- M*(n) = M*(n/2) + A*(n/2) + O(1) |each level also waits on one half-sized addition
- M*(n) = M*(n/2) + O(log2 n) |A*(n/2) = O(log2 n); substitute n = 2^k
- M*(2^k) = M*(2^(k-1)) + O(k)
- M*(2^k) = M*(1) + O(k + (k-1) + ... + 1) |unroll k times
- M*(2^k) = O(k^2) |k = log2 n
- M*(n) = O((log2 n)^2)
- Parallelism:
- M1/M* = O(n^3 / (log2 n)^2)
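As a quick worked instance of the Tp bounds defined at the start of this section (the step counts below are made up purely for illustration):

```latex
% Bounds from the notes above: T_P >= T_1 / P  and  T_P >= T_\infty.
% Illustrative numbers: T_1 = 10^6 steps, T_\infty = 10^3 steps.
\[
  \text{Parallelism} = \frac{T_1}{T_\infty} = 10^3
\]
\[
  P = 10:\quad   T_P \ge \max\!\left(10^6/10,\ 10^3\right) = 10^5 \quad\text{(work-limited)}
\]
\[
  P = 10^4:\quad T_P \ge \max\!\left(10^6/10^4,\ 10^3\right) = 10^3 \quad\text{(critical-path-limited)}
\]
```

Past roughly T1/T* processors the critical path dominates, which is why the parallelism figure is a good estimate of how many procs are worth using.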
Work Distribution:
"We now understand that the key to achieving a good speedup is to keep user-level threads supplied with tasks, so that the resulting schedule is as greedy as possible."
Work Distribution Algorithm:
- Assigns tasks to idle threads as efficiently as possible
- Work Dealing:
- Overloaded threads try to offload some of their tasks to other, less loaded threads
- Caveat: if most threads are overloaded, they waste effort trying to offload work onto peers that are just as busy
- Work Stealing:
- Threads that run out of work try to 'steal' work from others
- Advantage: if all threads are busy, they waste no cycles trying to offload
- Work Stealing Specifics:
- Each thread has 'double-ended queue' with its waiting tasks
- On task creation, the new task is pushed onto the bottom of queue
- When a thread needs a task to work on, it pops one from the bottom of its own queue
- If no tasks in its own queue, it becomes a 'thief':
- Randomly chooses victim thread, and pops the top of victim's queue to steal a task for itself
- Authors present a linearizable double-ended queue (DEQueue) implementation; the work-stealing executor keeps an array of DEQueues, one per thread (see the deque sketch at the end of this section)
- Authors note that to stop endlessly trying to steal non-existent work, a termination barrier could be used
- Yielding and Multiprogramming:
- Three-level computational model:
- Short-lived 'tasks' executed by system-level 'threads', which are scheduled by the OS on a fixed # of 'procs'.
- Multiprogrammed Environment: #threads > #procs
- Implies that not all threads can run at once
- Implies that any thread can be preempted at any time
- To guarantee progress, threads that have work shouldn't be delayed by threads trying to thieve
- To prevent this, before trying to thieve, a thread calls yield
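Below is a minimal sketch of the bounded work-stealing deque described above, in the spirit of the book's BDEQueue. One assumption to flag: the book pairs the top index with a stamp in an AtomicStampedReference, whereas here the (stamp, index) pair is packed into a single AtomicLong purely to keep the sketch self-contained; the class name and capacity handling are also illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

public class BoundedWorkStealingDeque {
    private final Runnable[] tasks;
    private volatile int bottom = 0;                   // written only by the owner
    private final AtomicLong top = new AtomicLong(0);  // packs (stamp << 32) | index

    public BoundedWorkStealingDeque(int capacity) { tasks = new Runnable[capacity]; }

    private static int index(long t) { return (int) t; }
    private static int stamp(long t) { return (int) (t >>> 32); }
    private static long pack(int stamp, int index) {
        return ((long) stamp << 32) | (index & 0xffffffffL);
    }

    // Owner pushes new work on the bottom; no synchronization needed.
    public void pushBottom(Runnable r) {
        tasks[bottom] = r;
        bottom++;                                      // assumes capacity is not exceeded
    }

    // Thieves steal from the top, racing with each other (and the owner) via CAS.
    public Runnable popTop() {
        long oldTop = top.get();
        int oldIndex = index(oldTop), oldStamp = stamp(oldTop);
        if (bottom <= oldIndex) return null;           // looks empty
        Runnable r = tasks[oldIndex];
        if (top.compareAndSet(oldTop, pack(oldStamp + 1, oldIndex + 1)))
            return r;                                  // won the race
        return null;                                   // lost to another thief
    }

    // Owner pops from the bottom; it only synchronizes when one task is left.
    public Runnable popBottom() {
        if (bottom == 0) return null;
        bottom--;
        Runnable r = tasks[bottom];
        long oldTop = top.get();
        int oldIndex = index(oldTop), oldStamp = stamp(oldTop);
        if (bottom > oldIndex) return r;               // no possible conflict with a thief
        if (bottom == oldIndex) {                      // last task: race a thief for it
            bottom = 0;
            if (top.compareAndSet(oldTop, pack(oldStamp + 1, 0)))
                return r;
        }
        top.set(pack(oldStamp + 1, 0));                // lost the race: reset to empty
        bottom = 0;
        return null;
    }
}
```

The owner works the bottom without synchronization and only falls back to a CAS when a single task remains, which is what makes steal-from-the-top / pop-from-the-bottom cheap in the common case.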
Chapter 17 - Barriers
Why Barriers?
- Sometimes you need to ensure all work in a certain 'phase' of an execution has completed before moving on to the next phase. The authors give the example of preparing frames when coding a computer game. A single thread could prepare the entire frame, or multiple threads could each prepare some portion of the entire frame, parallelizing the work. However, if multiple threads were preparing a single frame, you'd need some way to ensure that before displaying the frame, each thread has finished its work. Enter barriers.
- This type of situation called "soft real-time"
- Real-time in that it's being presented at 'realistic' fps (eg. 35)
- Soft in that a few errors aren't catastrophic
- A thread's 'Notification Time':
- The "interval between when some thread has detected that all threads have reached the barrier, and when that specific thread leaves the barrier."
- Uniform Notification Time important for soft real-time
- Ex. Picture quality is improved if frame sections are all updated (more or less) at the same instant.
Sense-Reversing Barrier:
- Addresses the issue of reusing barriers
- The barrier has a sense, the threads initially have the opposite (true, false)
- When the threads all complete the barrier, the barrier's and all the individual senses are flipped, allowing immediate reuse.
- A simple Sense-Reversing barrier (sketched in code at the end of this section):
- All threads have a local sense, false
- Barrier has count == # threads
- Barrier has sense true
- Threads arriving at barrier decrement its count, and spin on its sense
- Thread arrives causing count == 0:
- Flip barrier's sense
- All threads pass
- All threads flip their local sense
- Barrier's count reset to # threads
- Barrier is ready to be reused
- Advantages:
- Threads spin on locally cached copies of the sense fields (in the presence of cache-coherence), which are only flipped when all threads are ready to leave.
- Keeps bus traffic low(er)
- Improves the uniformity of notification times
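Below is a minimal sketch of the sense-reversing barrier just described, following the structure of the book's SenseBarrier and matching the notes' initial values (barrier sense true, thread-local senses false). The class name and field names are illustrative.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SenseBarrier {
    private final int size;                    // total number of threads
    private final AtomicInteger count;         // threads yet to arrive this round
    private volatile boolean sense = true;     // the barrier's sense (starts true)
    // each thread's local sense starts out opposite to the barrier's (false)
    private final ThreadLocal<Boolean> threadSense = ThreadLocal.withInitial(() -> false);

    public SenseBarrier(int n) {
        size = n;
        count = new AtomicInteger(n);
    }

    public void await() {
        boolean mySense = threadSense.get();
        if (count.getAndDecrement() == 1) {
            // last thread to arrive: reset the count, then flip the barrier's
            // sense, releasing every thread spinning below
            count.set(size);
            sense = mySense;
        } else {
            // spin on the (locally cached) sense field until the last arrival flips it
            while (sense != mySense) { /* spin */ }
        }
        // flip the local sense so the barrier can be reused immediately
        threadSense.set(!mySense);
    }
}
```

Each waiter spins only on the sense field, which stays in its local cache until the last arrival flips it; that is what keeps bus traffic low and notification times relatively uniform.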
Combining Tree Barrier:
- Uses the tree combining paradigm to break down a large barrier into a tree of smaller barriers, up which threads climb as they arrive.
- Tree Barrier:
- n: size, total number of threads
- r: radix, each node's # of children
- d: depth
- s.t. n = r^d
- Each node has a count and a sense field
- A thread which completes any of the individual nodes (count == 0, all children have arrived), visits the node's parent before waking any thread.
- The thread which sets the root node's count to 0 causes all threads to be awakened, as each thread then moves back down the tree, flipping each node's sense and resetting its count field.
- Performance Considerations:
- Reduces memory contention by spreading memory accesses across multiple smaller barriers. Whether it also reduces latency depends on which is cheaper: decrementing a single shared counter, or visiting the several node counters on the way up the tree.
- The percolation could cause nonuniform notification times, as threads visit an unpredictable series of locations.
- May cause trouble on cacheless NUMA archs.
Static Tree Barrier:
- Simple barrier w/ static layout and low contention
- Operation:
- Threads are assigned to nodes in a tree
- A thread waits until all threads below it in the tree have finished, then informs the parent
- A thread which decides to inform its parent then spins on the global sense
- When root learns that all children have finished, it flips global sense bit.
- Performance:
- On a cache-coherent multiproc, barrier completion takes log(n) steps moving up tree
- Cache-coherence takes care of propagating global sense
- Without cache-coherence, threads must propagate notification down tree as they did in the combining tree barrier.
Chapter 18 - Transactional Memory
Authors on current synchronization primitives: "..the tools are flawed."
What's Wrong with Locking:
- Priority Inversion: low-prio thread is preempted while holding a lock needed by higher-prio thread.
- Convoying: a thread holding a lock is descheduled, causing other threads to queue up waiting for it; even after the lock is freed, it can take a while for the queued threads to get going again.
- Deadlock: two threads separately holding a lock the other needs
- Heart of Issue: no one really knows how to organize and maintain large systems that rely on locking.
What's Wrong with compareAndSet():
- compareAndSet() algos often difficult to devise
- Can have high overhead
- Main Issue: nearly all operate on single words
- Authors imagine a multiCompareAndSet() which takes an array of expected values and an array of values to set (signature sketched below)
- Would greatly simplify logic that otherwise needs intertwined compareAndSet()s
- No obvious way to implement on conventional archs
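To pin down what the authors are imagining, a signature might look like the following. This is hypothetical: no such operation exists in java.util.concurrent.atomic, and, as noted above, there is no obvious way to implement it on conventional architectures.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical interface only; the names and types here are illustrative.
interface MultiCompareAndSet<T> {
    /**
     * Atomically: if items.get(i) == expected[i] for every i, replace each
     * with update[i] and return true; otherwise change nothing and return false.
     */
    boolean multiCompareAndSet(AtomicReferenceArray<T> items,
                               T[] expected,
                               T[] update);
}
```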
What's Wrong with Compositionality:
- None of the sync mechanisms we've looked at can be easily composed
Transactions and Atomicity:
- Transaction: a sequence of steps taken by a single thread
- Transaction must be serializable:
- They appear to execute sequentially in one-at-a-time order
- Properly implemented, transactions do not deadlock or livelock
- Atomic Blocks:
- Keyword Atomic: delimits a transaction (like synchronized does a critical section)
- Atomic blocks are atomic w.r.t. all other atomic blocks
- Synchronized blocks only atomic w.r.t. other sync blocks that share the same lock
- Atomic blocks may be nested w/o causing deadlock, unlike sync
- On Transaction Implementation:
- Transactions are executed speculatively
- 1) T makes tentative changes to objects
- 2) If T completes 1) w/o sync conflicts, it commits the changes
- 3) Else, T aborts, and no changes are made
- Transactions can be nested
- Could include retry() method, which rolls back tentative changes on enclosing transaction and restarts it
Software Transactional Memory
H 18.1-18.3.7, 18.4
Get on with it...
From Here on I've just read, not taken notes...