*Most of these are works in progress
Distributed Systems Topics_and_Synopses
Transparency:
- Types:
- Access: Hide differences in data reps. and how objects are accessed.
- Location: Hide where an object is located (ie. geographically).
- Relocation: Hide potential relocation of an object while in use.
- Migration: Hide the potential change in an object's location (assumed to happen between uses, unlike relocation)
- Replication: Hide that the object may be replicated.
- Concurrency: Hide that the object may be being used by several different users.
- Failure: Hide the failure and subsequent recovery.
- Degree:
- How much transparency do we want? How much transparency can we actually achieve?
- Perfect transparency is intractable.
- From Thilo: "Full transparency is a nice goal, but so difficult to achieve that it often shouldn't even be aimed for."
Policy vs. Mechanism: separating the two is key to implementing openness
- Policy:
1. What level of consistency do we require for client-cached data?
2. Which operations do we allow downloaded code to perform?
3. Which QoS requirements do we adjust in the face of varying bandwidth?
4. What level of secrecy do we require for communication?
- Mechanisms:
1. Allow the dynamic setting of caching policies
2. Support different levels of trust for mobile code
3. Provide adjustable QoS parameters per data stream
4. Offer different encryption algorithms
- HTTP/2 (RFC 7540) Example:
- Protocol provides a mechanism to prioritize different streams
- Protocol leaves the policy implementations - actual assignment of these priorities, ie. JS vs HTML - up to the server.
- Separating Policy from Mechanism:
- Stricter separation between policy and mechanism tends to require more complex configurations and more parameters.
- NB: Hardcoding policies often simplifies management & reduces complexity at the price of flexibility. No silver bullet.
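- Sketch (illustrative only; class and function names are made up): the cache below is the mechanism, the eviction rule passed in is the policy, so the deployer can swap policies without touching the mechanism.
```python
from collections import OrderedDict

class Cache:
    """Mechanism: a bounded cache that defers the eviction decision to a supplied policy."""
    def __init__(self, capacity, eviction_policy):
        self.capacity = capacity
        self.evict = eviction_policy          # policy supplied by the deployer
        self.items = OrderedDict()

    def put(self, key, value):
        if len(self.items) >= self.capacity:
            victim = self.evict(self.items)   # mechanism asks the policy which entry to drop
            del self.items[victim]
        self.items[key] = value

# Two interchangeable policies for the same mechanism:
evict_oldest = lambda items: next(iter(items))        # FIFO-like
evict_newest = lambda items: next(reversed(items))    # keep old entries around

cache = Cache(capacity=2, eviction_policy=evict_oldest)
```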
Scalability:
- Components & Points:
- Size: number of users and/or processes
- Limits on individual-node storage, computation, and inter-component bandwidth. Can server Y handle X reqs per min? Can it handle those reqs w/ large payloads?
- Geographic: Maximum distance between nodes
- Imagine problems moving systems designed for LAN to WAN.
- Latency, data loss, inconsistent naming, increased difficulty in detecting failure.
- Administrative: Number of administrative domains
- Conflicting polices to do with the usage, management and security in different administrative zones.
- Imagine the difficulty in effectively sharing resources between two universities with different naming conventions/rules/legal frameworks.
- Often less tractable than the other forms of scalability: pulling on one thread likely tugs at 100s of others.
- Techniques:
- Hide communication latencies: separate handling of I & O, use of async communication.
- Partition work: ie. move form validation from server to client (php -> js).
- Replication & Caching: data available at multiple points to offload/balance (also provides some fault tolerance) work.
- Problem w/ Replication:
- Applying replication is easy, but addressing the resulting inconsistencies is NOT.
- Addressing inconsistencies requires some sort of global synchronization & "precludes large-scale solutions."
- Typical Design Pitfalls - Inappropriate Assumptions:
- Reliability/Security/Homogeneity of the network
- Consistency of network topology
- Infinite bandwidth, very low/consistent latency
- Cost of bandwidth, number of administrative zones (assuming these are low can lead to unpleasant realizations, or costly post-deployment patching).
Architecture:
- Styles (formulated in terms of)
- Replaceable components w/ well-defined interfaces
- Method of connection
- Data exchanged
- How components are jointly configured
- "Connector": mediates communication/exchange between components
- Layered Architecture:
- The middleware can be configured into different levels, which may interact with each other in different ways depending on the goal
- Simple Client-Server:
- Processes offering services: Servers.
- Processes using services: Clients.
- Request/Reply communication between Clients and Servers.
- Multi-Tiered:
- Single: Mainframe w/ dumb terminal
- Two: Client/Server
- Enumerate all layers, and determine at which layer to delineate Server from Client
- Three: Each layer on a separate machine
- Structured P2P:
- Semantic-free index: each data item is uniquely associated with a key, which is then used as an index.
- P2P system is then responsible for storing (key,val)
- CHORD:
- Nodes logically organized in ring, each with an m-bit identifier.
- Each data item is hashed to an m-bit key.
- Data item with key k is stored at node with smallest identifier id >= k, called k's successor.
- The ring also uses shortcut links.
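- Sketch of the (key -> node) mapping above (ring membership and the m-bit truncation are made up for illustration, not Chord's real protocol):
```python
import hashlib
from bisect import bisect_left

m = 6                                                    # identifier space: 0 .. 2^m - 1
nodes = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])   # hypothetical node ids on the ring

def key_of(data):
    """Hash a data item to an m-bit key."""
    return int(hashlib.sha1(data.encode()).hexdigest(), 16) % (2 ** m)

def successor(k):
    """Smallest node id >= k, wrapping around the ring."""
    i = bisect_left(nodes, k)
    return nodes[i] if i < len(nodes) else nodes[0]

k = key_of("some-file.txt")
print(f"key {k} is stored at node {successor(k)}")       # k's successor holds (key, val)
```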
- Unstructured P2P:
- Each node maintains an ad hoc list of neighbors.
- Overlay looks like a random graph, ie. edge <node_x,node_y> only exists w/ some probability.
- Searching:
- Flooding: the asking node passes the request to all neighbors. If a node has already seen the request, it ignores it; else it passes the request on to all of its neighbors, resulting in recursion.
- Random Walk: the asking node passes the request to one random neighbor. If that node can't fulfil the request, it chooses a random neighbor and passes it on, and so on.
- NB: Random Walk is more communication efficient, but slower.
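- Toy sketch of both search strategies over a random-graph overlay (the adjacency dict and target are invented for illustration):
```python
import random

# Hypothetical overlay: node -> ad hoc list of neighbors
overlay = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2], 4: [2]}

def flood(node, has_item, seen=None):
    """Ask every neighbor; nodes that have already seen the request ignore it."""
    seen = seen if seen is not None else set()
    if node in seen:
        return None
    seen.add(node)
    if has_item(node):
        return node
    for n in overlay[node]:
        hit = flood(n, has_item, seen)
        if hit is not None:
            return hit
    return None

def random_walk(node, has_item, max_hops=50):
    """Pass the request to one random neighbor at a time."""
    for _ in range(max_hops):
        if has_item(node):
            return node
        node = random.choice(overlay[node])
    return None

print(flood(0, lambda n: n == 4))
print(random_walk(0, lambda n: n == 4))
```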
Processes:
- Multi-threaded web client:
- Hide network latency (ie Browser): upon resource discovery, dispatch a thread to fetch
- Multi-threaded server:
- Server able to partition processing, db interaction, I/O
- Multithreaded organization overview:
- Multithreading -> Parallelism, blocking system calls
- Single-Thread -> No parallelism, blocking system calls
- Finite-state machine -> Parallelism, nonblocking system calls
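- Minimal thread-per-connection echo server (the "multithreading + blocking system calls" row above); port and payload handling are arbitrary choices:
```python
import socket
import threading

def handle(conn):
    """Worker thread: blocking recv/send on one connection."""
    with conn:
        while data := conn.recv(1024):     # blocking system call
            conn.sendall(data)             # echo the data back

def serve(port=9000):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("", port))
        srv.listen()
        while True:
            conn, _ = srv.accept()         # dispatcher thread blocks here
            threading.Thread(target=handle, args=(conn,), daemon=True).start()
```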
Communication:
- Generally for distributed systems, the lowest level interface of interest is the network layer.
- Middleware layer is intended to provide common services/protocols that can be used by different applications.
- Middleware may implement:
- Communication protocols
- Marshalling of data
- Naming protocols to facilitate sharing of resources
- Security protocols
- Scaling mechanisms, such as replication and caching
- Classically, middleware exists:
- Hardware < O.S. < Middleware < Application
- Types of Communication:
- Transient vs. Persistent:
- Transient: Message is discarded if it cannot be delivered at the next hop or to the receiver.
- Persistent: Message stored on communication server until delivery.
- Async &(vs) Sync
- Synchronization may occur at different stages:
- Request submission
- Request delivery
- After request processing
- Remote Procedure Call (RPC):
- Goal: make inter-host procedure calls transparent, ie. look like local calls
- Parameter passing must handle:
- Different data representations (ie. endianness)
- Serialization
- Encoding (basic to fancy, ie. bits to JSON)
- Due to design, all data must be passed as parameters
NB: Under these conditions, full transparency is NOT possible
- Multicast RPC: Sends rpcs to a group of servers
- Message-Oriented Communication:
- Transient Messaging: Sockets
- Berkeley Sockets: OPS{socket, bind, listen, accept, connect, send, receive, close}.
- ZeroMQ:
- Based on the idea that much socket communication is patterned
- Provides a higher level of expression by pairing sockets:
- A socket for sending messages from P (process)
- A socket for receiving messages at Q (process)
- All async
- Three Patterns: {Request-Reply, Publish-Subscribe, Pipeline}
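- Minimal request-reply pairing with pyzmq, as a sketch of the pattern (endpoint string and payloads are arbitrary; both ends live in one script here purely for brevity):
```python
import zmq

ctx = zmq.Context()

# "Process Q": socket for receiving requests
rep = ctx.socket(zmq.REP)
rep.bind("tcp://*:5555")

# "Process P": socket for sending requests
req = ctx.socket(zmq.REQ)
req.connect("tcp://localhost:5555")

req.send(b"ping")            # P -> Q
print(rep.recv())            # Q receives b"ping"
rep.send(b"pong")            # Q -> P
print(req.recv())            # P receives b"pong"
```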
- Results in Message-Oriented Middleware:
- Asynchronous, persistent communication through support of middleware-level queues.
- Queues correspond to buffers at communication servers.
- Ops{put, get, poll, notify} correspond to:
- Defs{append msg to a queue, get msg from a non-empty queue, check for queued msgs and take first, install handler to be called on put at specific queue}
- Multicasting:
- Idea: Organize distributed nodes into an overlay network used to disseminate data.
- Often organized in a tree, providing unique paths
- Sometimes organized as mesh network, requiring some form of routing
- App-level multicast (ALM) in chord:
1. Initiator generates a multicast identifier, "mid".
2. Lookup successor(mid), ie. the node responsible for mid.
3. Request is routed to succ(mid), which will become root.
4. If P (node) wants to join, it sends a join request to the root.
5. When request arrives at Q:
- If Q has not seen a join before, it becomes a forwarder. P becomes a child of Q. The join request is forwarded on.
- If Q knows about tree, then P becomes child of Q, and forwarding is stopped.
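- Sketch of the join handling in steps 4-5 (the state dicts, node names, and route are invented; a real implementation routes the join via Chord lookups of succ(mid)):
```python
root = "succ(mid)"           # the node responsible for mid (step 3)
children = {}                # node -> set of children in the multicast tree
in_tree = {root}             # nodes that already forward for this tree

def handle_join(q, p, next_hop):
    """Node q handles a join request from p."""
    children.setdefault(q, set()).add(p)       # p becomes a child of q
    if q in in_tree:
        return                                 # q already knows the tree: stop forwarding
    in_tree.add(q)                             # q becomes a forwarder itself...
    handle_join(next_hop(q), q, next_hop)      # ...and forwards its own join towards the root

route = {"P": "Q", "Q": "R", "R": root}        # made-up route towards the root
hop = lambda node: route[node]
handle_join(hop("P"), "P", hop)                # P's join travels P -> Q -> R -> root
print(children)                                # {'Q': {'P'}, 'R': {'Q'}, 'succ(mid)': {'R'}}
```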
- Costs:
- Link Stress: How often does an ALM message cross the same physical link?
- Stretch: Ratio in delay between ALM-level path and network-level path.
- Overlay network unlikely to perfectly (or perhaps even remotely) reflect underlying network. Why overlay if it did?
- Flooding:
- Process P sends a message to each of its neighbors, who each send to each of their neighbors, etc.
- Expense is a direct function of #edges
- Epidemic Protocols:
- First assume there are no "write-write conflicts"
- ww conflict: multiple writes of "uncommitted" data, leading to consistency problems
- Anti-Entropy:
- Each replica regularly chooses another random replica and shares diffs -> both are equal afterward
- PERF:
- Rumor Spreading:
- A replica which has just been updated (ie. contaminated) tells a # of other replicas about its update (contaminating them too).
- PERF: Cannot ensure that all servers are updated, so removal must be dealt with specially, or propagation will "undelete" over time.
- This problem is resolved by inserting a special "death cert." record which prevents the "undeleting."
- Detecting complete deletion:
- Run a global algo to determine deletion status, and then remove all the death certs.
- Assume death certs propagate in finite time, and set a TTL on them.
- Example Epidemic Apps:
- Data dissemination
- Aggregation: let each node maintain a variable (Vi), and then gossip to compute an average:
- Vi, Vj <- (Vi + Vj) / 2
- In the end each node will hold SUM(Vi)/N, ie. the average over the whole network
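- Quick simulation of the averaging rule above (random pairings stand in for "choose a random replica"; values are arbitrary):
```python
import random

values = [10.0, 0.0, 4.0, 6.0]              # Vi for N = 4 hypothetical nodes

for _ in range(1000):                        # rounds of pairwise gossip
    i, j = random.sample(range(len(values)), 2)
    avg = (values[i] + values[j]) / 2        # Vi, Vj <- (Vi + Vj) / 2
    values[i] = values[j] = avg

print(values)                                # every node converges to SUM(Vi)/N = 5.0
```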
Naming:
- Name != (necessarily) Address
- ID:
- 1-to-1 mapping ID<-->Entity
- No Reuse
- Could be some combination of properties, a hash of something unique to it, etc.
- Flat Naming:
- Broadcasting:
- Broadcast an ID, requesting that it return its address, basically ARP
- Requires all entities to be listening, so not scalable past LAN
- Forwarding Pointers, or how to deal with moves:
- Leave behind a pointer to the new location, but long strings of these would lead to terrible latency
- Home-base Approach:
- Entity has a registered home address, which is aware of the entity's foreign address
- Client contacts the entity's home address, then uses the returned foreign address from then on
- Mobile IP uses this model
- Issues: Home address must always be reachable; home/foreign locations which are proximate result in poor performance
- Distributed Hash Tables: Chord is in this category
- Finger tables keep track of the successors of m entities
- Finger-Table(Process P)[i] = succ(P + 2^(i-1))
- If the overlay isn't very similar to the underlying network, performance can be affected
- Node assignment can be made aware of underlying topology, but this isn't always possible
- Nodes can be aware of multiple successors based on proximity:
- Ex: Finger-Table(Process P)[i] points to the first node in [P + 2^(i-1), P + 2^i - 1]
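- Sketch of the finger-table rule Finger-Table(P)[i] = succ(P + 2^(i-1)); the ring membership and m are made up, and succ() is the same helper idea as in the Structured P2P sketch above:
```python
from bisect import bisect_left

m = 5
nodes = sorted([1, 4, 9, 11, 14, 18, 20, 21, 28])   # hypothetical Chord ring (ids mod 2^m)

def succ(k):
    k %= 2 ** m
    i = bisect_left(nodes, k)
    return nodes[i] if i < len(nodes) else nodes[0]

def finger_table(p):
    # entry i (1-based) points at succ(p + 2^(i-1))
    return [succ(p + 2 ** (i - 1)) for i in range(1, m + 1)]

print(finger_table(4))    # [9, 9, 9, 14, 20] -- shortcuts across the ring
```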
- Structured Naming (think File System w/ inodes):
- Namespaces - akin to directory w/ nodes
- Can contain attributed nodes
- Requires some starting point, which must be known beforehand, ie. by convention
- DNS, for example
- Iterative Lookup: Client keeps asking name servers until success.
- Recursive Lookup: Client asks 1st server, which asks 2nd etc, until success which is returned along same chain.
- In case of a very long client to server1 hop, recursive would be much faster than iterative (imagine sat link from ship).
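- Toy contrast of the two resolution styles over a hypothetical hierarchical namespace (zone names and the address are invented; each "server" only knows its own zone):
```python
zones = {
    "root":  {"nl": "ns-nl"},
    "ns-nl": {"vu": "ns-vu"},
    "ns-vu": {"cs": "130.37.0.1"},   # leaf: an address
}

def iterative(path):
    """Client follows referrals itself, asking one server after another."""
    server = "root"
    for label in path:
        server = zones[server][label]
    return server

def recursive(server, path):
    """Each server asks the next one; the answer is returned along the same chain."""
    if not path:
        return server
    return recursive(zones[server][path[0]], path[1:])

print(iterative(["nl", "vu", "cs"]))            # 130.37.0.1
print(recursive("root", ["nl", "vu", "cs"]))    # 130.37.0.1
```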
- Attribute-based Naming:
- Potentially expensive to lookup based on attribute
- LDAP uses this, to lookup <key,val> pairs
Coordination:
- Sometimes we need exact time, not just ordering. How?
- Clock Synchronization:
- Precision: the deviation between two clocks on any two machines.
- Internal Sync. keeps clocks precise
- Accuracy: the deviation of a single clock from a standard
- External Sync. keeps clocks accurate
- Clocks cannot be set back, as that would break too many applications
- Instead, alter the speed: slow or speed up the clock to adjust to "good" time.
- Logical Clocks: a clock designed to keep track of order
- Happened-before Relationship:
- If A and B are events, and A happened BEFORE B, then A->B
- Transitivity: If A->B and B->C then A->C
- Results in a 'partial ordering of events'
- Addition of Clocks:
- If A and B are events, A->B, and C(x) is the timestamp of x, then we demand that C(A)<C(B)
- C(x) timestamp is per process, no global clock, hence a problem
- These timestamps are generated by a per-process counter, incremented at any/all events
- These timestamps are sent along with each message
- Upon receipt of a message, a process sets its local counter to MAX{C(received_message), C(local)}, then increments it before delivering the message
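- Minimal Lamport clock per the rules above (increment on every local event, piggyback the counter on sends, take the max then increment on receives):
```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time                 # timestamp piggybacked on the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time)
        self.time += 1                   # adjust, then count the receive event
        return self.time

p, q = LamportClock(), LamportClock()
ts = p.send()                            # p: 1
q.local_event()                          # q: 1
print(q.receive(ts))                     # q: max(1, 1) + 1 = 2
```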
- Total-ordered Multicast:
- A process sends a timestamped message to all processes, and queues it locally
- A process receiving a message queues it locally according to its timestamp, and sends an ACK to all other processes
- This is effectively an "are we all on the same page?" primitive
- A process will pass message M to its application if message M is first in its queue AND it has received an ACK (a message with a timestamp > M's) from every other process.
- This requires that communication is reliable, and that queues are FIFOs
- Problem: Lamport's logical clocks do NOT guarantee causality.
- Solution: Vector Clocks:
- Each process maintains a vector
- VectorClock[i] = the logical clock of Process i
- Therefore if a process's VC[j] = k, the process knows that k events have occurred at Process j.
- ALGO:
1. Before executing an event, Process i executes VC[i] = VC[i]+1
2. When Process i sends a message to Process j, it sets the message's time stamp to Process i's VC, ie. messages are passed w/ the sender's VC.
3. When Process j receives a message from Process i, for each local VC entry k, local VC[k] = MAX{local_VC[k], received_VC[k]}, executes step 1, and delivers the message.
- We can now say that A may causally depend on B if time_stamp(B) < time_stamp(A), ie. if:
1. for all k, time_stamp(B)[k] <= time_stamp(A)[k].
2. there exists at least a single k s.t. time_stamp(B)[k] < time_stamp(A)[k].
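- Sketch of steps 1-3 plus the causality test from the two conditions above (the process count N is fixed up front purely for simplicity):
```python
N = 3   # number of processes, fixed for this sketch

class VectorClock:
    def __init__(self, pid):
        self.pid = pid
        self.vc = [0] * N

    def event(self):                       # step 1: count a local event
        self.vc[self.pid] += 1

    def send(self):                        # step 2: message carries the sender's VC
        self.event()
        return list(self.vc)

    def receive(self, msg_vc):             # step 3: component-wise max, then step 1
        self.vc = [max(a, b) for a, b in zip(self.vc, msg_vc)]
        self.event()

def may_causally_precede(ts_b, ts_a):
    """ts(B) < ts(A): B may causally precede A."""
    return all(b <= a for b, a in zip(ts_b, ts_a)) and any(b < a for b, a in zip(ts_b, ts_a))
```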
- Causally Ordered Multicast:
- When multicasting, we can simplify this process and maintain causality assuming the only events are multicasts:
- A process only increments its own VC when sending a message
- Upon receipt of a message, a process "adjusts" its own VC, but doesn't then increment (step 1 from above)
- A process P postpones delivery of message m (ie. passing it to the application) until:
1. received_VC[i] = local_VC[i]+1
2. received_VC[k] <= local_VC[k] for all k excluding i (ie. k != i)
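- The two delivery conditions as a predicate (queuing and retrying postponed messages is left out; i is the sender's index):
```python
def can_deliver(received_vc, local_vc, sender):
    """Deliver m from `sender` only if it is the next message from that sender
    and everything m causally depends on has already been seen locally."""
    next_from_sender = received_vc[sender] == local_vc[sender] + 1
    no_missing_deps = all(received_vc[k] <= local_vc[k]
                          for k in range(len(local_vc)) if k != sender)
    return next_from_sender and no_missing_deps
```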
- Mutual Exclusion:
- Problem: Multiple processes want exclusive access to the same shared resource.
- Basic Solutions:
- Permission: The process needs permission from each other process
- Tokens: A token is passed among all processes. If the process wants to use the resource, it must have the token, else it will pass the token on.
- Centralized, Permission-based:
- Coordinator permission is required to access a resource.
- If no process is using the resource, a process is granted access, else a process waits for the current user to finish.
- Lamport Clocks for ME:
- Total-ordered multicast ensures messages are delivered in the same order everywhere.
- ME needs to ensure that processes either agree, or are forced to access exclusive resources in a particular order
- A literal application of Lamport Total Order Multicasting can be used
- Ricart & Agrawala:
- The same as Lamport's Total Ordered Multicast, except that ACKs are not sent to everyone unconditionally.
- Instead, a process will only respond (ACK) to a request if:
1. It has no interest in the resource
2. It is waiting for the resource but has a lower priority (priority determined by time_stamp comparison)
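- Sketch of the receiver's decision (state labels are invented; in the full algorithm a reply that is not sent now is deferred until the holder releases the resource):
```python
def on_request(my_state, my_request_ts, my_pid, req_ts, req_pid):
    """Return True to reply OK now, False to defer the reply until release.
    my_state is one of "RELEASED", "WANTED", "HELD" (hypothetical labels)."""
    if my_state == "RELEASED":
        return True                                   # no interest in the resource
    if my_state == "WANTED":
        # lower (timestamp, pid) wins; reply only if the requester has priority over us
        return (req_ts, req_pid) < (my_request_ts, my_pid)
    return False                                      # HELD: always defer
```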
- Token Ring:
- Network (overlay) is organized into a logical ring
- A single token allowing exclusive access to some resource is passed around the ring
- The process with the token will either use the resource and then pass the token, or simply pass the token on immediately.
- Decentralized ME:
- Assume each resource is replicated N times, w/ each replica having its own coordinator.
- Access requires a majority vote from > N/2 coordinators.
- Coordinators respond immediately to requests.
- Comparison of the above approaches ("messages per entry/exit" / "delay before entry", in message times):
- Centralized: 3 / 2
- Distributed: 3*(N-1) / 2*(N-1)
- Token Ring: 1,2,...,Infinity / 0,1,...,N-1
- Decentralized: 2*m*r+m for r=1,2,... / 2*m*r
- Election Algorithms:
- Dynamic selection of a special process, for instance, a coordinator.
- Basic requirements:
- All processes have a unique identifier
- All processes know about all other processes (but not their status)
- Election: identify a single process that is up, using some scheme (ie. highest ID)
Bully Algorithm:
- If a process notices that a coordinator is down, it initiates an election by sending an election packet to all processes with higher ID
- If no process responds, initator becomes coordinator
- If processes respond (they must have higher IDs), the initiator drops out, and the responders continue holding elections until the highest ID is found
In a Ring:
- Assumes a ring topology where each node is aware of its successor
- Any process noticing a non-functioning coordinator sends <ID,"ELECTION"> to its successor
- Each process appends its own ID to the message and forwards.
- Upon receiving an election message w/ its own ID in it, a process then sends <ID(highest),"COORDINATOR"> around the ring, setting new coordinator.
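- Small simulation of the ring algorithm (ids and the crashed node are arbitrary; the successor function simply skips crashed nodes):
```python
def ring_election(ids, alive, initiator):
    """ids: ring order; alive: set of working ids; initiator: index of the node starting the election."""
    def next_alive(i):
        j = (i + 1) % len(ids)
        while ids[j] not in alive:
            j = (j + 1) % len(ids)
        return j

    election = [ids[initiator]]              # <ID, "ELECTION"> starts circulating
    i = next_alive(initiator)
    while ids[i] != election[0]:             # stop once the message returns to the initiator
        election.append(ids[i])              # each node appends its own ID and forwards
        i = next_alive(i)
    return max(election)                     # <ID(highest), "COORDINATOR"> then circles the ring

print(ring_election([3, 7, 1, 9, 4], alive={3, 7, 1, 4}, initiator=0))   # 9 is down -> 7 wins
```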
Replication & Consistency:
- Replication:
- Improve fault tolerance
- Improve performance by distributing load locally, geographically etc.
- However the replicas need to be kept consistent...
- "Conflicting" operations must be done in the same order everywhere:
- Read-Write conflict: a read operation and a write operation act concurrently
- Write-Write conflict: two concurrent writes
- In order to achieve consistency AND avoid costly global scale synchronization, consistency models must be weakened...
- Consistency Models:
- A contract between distributed processes and data-stores.
- Specifies what the results of reads/writes are in the presence of concurrency.
- Continuous Consistency:
- "Degree of consistency"
- Replicas may differ in their numerical value
- Replicas may differ in their relative staleness
- Replicas may differ in their number and order of performed update operations.
- A "Conit":
- Unit over which consistency is to be measured.
- Sequential Consistency:
- The result of any execution is the same as if the operations of all processes were executed in some sequential order.
- The operations of each individual process appear in this sequence in the order specified by the program.
- Causal Consistency:
- Writes that are potentially causally related must be seen by all processes in the same order.
- Concurrent writes may be seen in a different order by different processes.
- Grouping Operations:
- Applications may want to perform "critical" sections of multiple reads/writes.
- These critical sections can be entered under mutual exclusion by way of a mutex/lock/sync variable.
- Essence: Only the overall effect of a critical section needs to be known.
- Eventual Consistency:
- In systems that can tolerate a large amount of inconsistency, updates can be performed in a "best-effort" manner.
- These systems typically consist of a single writer and many readers, such as DNS and www.
- Specific Types:
- Monotonic Read: If a process reads some data's value, any successive read will return the same, or a more recent value.
- Monotonic Write: Any process's write operation is completed before any successive write operation on the same data.
- Read-your-writes: The effect of a write operation by some process will always be seen by a successive read operation by the same process.
- Write follows read: A write operation by a process on data previously read by that process is guaranteed to operate on the value that was read, or a more recent value.
- Replica Management:
- Replica Placement Schemes:
- Find least average distance to clients, place server.
- Computationally expensive
- Find the largest "autonomous system" and place server near well connected host, ie. near an interconnect.
- Computationally expensive.
- Map latency to geographic distances, and place servers in all high density areas.
- Computationally cheap.
- Replication Types:
- Permanent: always present replicas
- Server-initiated: dynamically replicated per system rules
- Client-initiated: dynamically replicated per client request ie. client hosting its own cache
- Content Distribution:
- Propagate only on update
- Passive Replication -> transfer data
- Active Replication -> propagate the update operation
- Pushing Updates: server initiated regardless of client requests
- Pulling Updates: client initiated, asks for data/update
NB. Available bandwidth & the R/W ratio are the most important factors in determining the most efficient method
- Leases can be used to combine distribution methods:
- Server promises to push updates for a specific period of time to a client via a grant, or "lease".
- Ex: Objects that change infrequently shouldn't get polled frequently, so provide a long lease
- Consistency Protocols:
- Describe the interactions between replicas
- Primary-based:
- Each data item has a "primary" process that coordinates writes on that item.
- W/O Local Writes: the primary is not moved based on the client's connection, updates performed at primary
- W/ Local Writes: the primary is moved to the node the client connected to, updated there, and then backed up
- Replicated-write:
- Active Replication: each replica has its own process to perform updates, ie. multicast RPC transferring just ops, not data.
- Requires totally-ordered multicasting
- Quorum-based Solution:
- Each operation requires a majority.
- Given N replicas, N_write > N/2 AND N_write + N_read > N
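- The two quorum constraints as a quick check (N, N_read, N_write are whatever a deployment picks; the values below are arbitrary):
```python
def valid_quorum(n, n_read, n_write):
    """Write quorums must overlap each other and overlap every read quorum."""
    return n_write > n / 2 and n_read + n_write > n

print(valid_quorum(12, 4, 9))   # True
print(valid_quorum(12, 3, 8))   # False: 3 + 8 <= 12, so a read may miss the latest write
```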
Fault Tolerance:
- Dependability: "A component provides services to clients."
- Availability: readiness for usage (% of time)
- Reliability: continuity of service delivery
- Safety: probability of catastrophe
- Maintainability: ease of repairing a failed system
- Reliability vs Availability:
- Mean time to Failure (MTTF): Average time until a component fails.
- Mean time to repair (MTTR): Average time needed to repair a component.
- Mean time between failures (MTBF): MTTF + MTTR
- Availability of a Component: Average fraction of time C is up and running.
- A = MTTF/(MTTF+MTTR) = MTTF/MTBF
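- Ex (hypothetical numbers): MTTF = 999 h and MTTR = 1 h give MTBF = 1000 h, so A = 999/1000 = 99.9%.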
- Definitions:
- Failure: not living up to spec, ie. crashed program
- Error: part of component that can lead to failure, ie. programming bug
- Fault: cause of an error, ie. sloppy programming
- NB: "A faulty programmer may introduces errors (bugs), which may in turn lead to failures in the execution of the program."
- Types of Failures:
- Crash: halts, after running correctly.
- Omission (send/recv): fails to respond to, receive or send messages.
- Timing: response outside allowed time interval
- Response (value/state transition): the response value is wrong, or the flow of control/state transition is wrong, ie. login doesn't transition to the "authed" state correctly.
- Arbitrary: production of arbitrary responses at arbitrary times.
- Synchronicity & Fault Tolerance:
- Async system: no assumption about process exec. speeds, or message delivery times -> cannot reliably detect crash failures.
- Sync system: process execution speeds and message delivery times are bounded -> we can reliably detect crash failures.
- Partially-sync system (typical): behaves synchronously most of the time, but the asynchronous periods are not bounded -> can normally reliably detect crash failures.
- Halting Problem: A component C no longer perceives any activity from C*. Can C determine whether C* has crashed or if there are omission/timing failures to blame?
- Halting Failure Types (Fail-*):
- Stop: crash failures, but reliably detectable
- Noisy: Crash failures, eventually reliably detectable
- Silent: Omission or crash failures: clients cannot tell what went wrong
- Safe: Arbitrary, yet benign
- Arbitrary: Arbitrary w/ malicious failures
- Failure Masking:
- Information Redundancy: ie. add check bits to message
- Time Redundancy: build system s.t. actions that go awry can be repeated
- Physical Redundancy: add hardware or processes to allow one or more components to fail w/o disrupting the system.
- Process Resilience:
- Essence: protect against malfunctioning processes via process replication, organizing them into process groups.
- Flat Group: Processes are connected to all other processes in their group
- Hierarchical Group: Processes are connected in some sort of hierarchy, likely with a coordinator & worker roles
- Failure Masking & Groups:
- k-fault tolerant group: a group's ability to mask any k concurrent member failures
- k is "degree of fault tolerance"
- How big does a k-fault group need to be (given the following two assumptions)?
- Assumption 1: All members are identical
- Assumption 2: All members process commands in the same order
- Halting failures (crash, omission, timing): k+1 because so long as a single member functions correctly, we will get correct result.
- Arbitrary failures: 2k+1 because erroneous results are possible, the group requires establishing a majority to ensure correct result.
- Consensus:
- To ensure Assumption 2 (above), non-faulty group members need to reach consensus on which command to execute next
- Achieving this w/ Flooding:
1. In a "round" r, each group member multicasts its known set of commands to all other members.
2. At the end of r, each member merges all received commands into a new set, commands_merged.
3. The next command to execute is then selected from commands_merged based on some globally shared, deterministic function.
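- Sketch of one round of the flooding approach (assumes no failures in this round, so every member receives every set; min() stands in for the globally shared deterministic choice):
```python
def flooding_round(known_commands):
    """known_commands: list of per-member command sets at the start of round r."""
    # 1. every member multicasts its set; 2. each member merges everything it received
    merged = set().union(*known_commands)
    # 3. all members apply the same deterministic rule to the same merged set
    next_command = min(merged)
    return next_command, [set(merged) for _ in known_commands]

cmd, new_state = flooding_round([{"b", "c"}, {"a", "c"}, {"c"}])
print(cmd)    # "a" -- every non-faulty member picks the same next command
```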
- Paxos:
-