Reading
Textbook: 11.4 (Multicast)
Review of Lamport Logical Time
Recall our discussion of Lamport time from last class? Here are some key things to remember:
- The global logical time is a measure of the total number of events that have occurred on the system. It is incremented with each event, such as the sending or receiving of a message.
- Each system maintains its own sense of the global logical time, the local time.
- The local time is sent as a timestamp with each message.
- If a message arrives with a higher logical time than the local time, the receiver increases its local time to match the timestamp of the message. This is because the local system's event counter "fell behind" -- the local system didn't see events occurring on or between other systems. (The local time should still be incremented to ensure that the message is received after it is sent.)
- If two systems never exchange messages (directly or indirectly), events on these systems are said to be concurrent -- their order cannot be determined.
- Since it is possible for concurrent events to have the same timestamp, we need a way of "breaking ties." This can be done using the host ID. We can represent the resulting timestamp as LamportInteger.HostId. These timestamps can then be directly compared. Although the tie-break is arbitrary, it does provide a total ordering and is deterministic, which may help to avoid deadlock.
The example below shows several messages transmitted among the hosts of a distributed system. It illustrates the systems' understanding of local time and the timestamps placed onto each message.
![]()
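The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `LamportClock`, its method names, and the use of a `(time, host_id)` tuple to stand for LamportInteger.HostId are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    host_id: int   # used only to break ties between equal counters
    time: int = 0  # this host's view of the global logical time

    def local_event(self):
        """Count an event on this host and return its (time, host_id) stamp."""
        self.time += 1
        return (self.time, self.host_id)

    def send(self):
        """Sending is an event; the returned stamp travels with the message."""
        return self.local_event()

    def receive(self, stamp):
        """Catch up to the sender if we fell behind, then count the receive."""
        msg_time, _sender = stamp
        self.time = max(self.time, msg_time)
        return self.local_event()


p0, p1 = LamportClock(host_id=0), LamportClock(host_id=1)
sent = p0.send()               # (1, 0)
tied = p1.local_event()        # (1, 1): same counter as `sent`, concurrent with it
received = p1.receive(sent)    # (2, 1): the receive is ordered after the send
```

Because Python compares tuples element by element, `sent < tied` holds purely because of the host ID, which is exactly the arbitrary but deterministic tie-break described above.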
Causality and Causality Violations
Let's take another look at the diagram above. Notice that the first message sent, from P0 to P1, is the last message received. Notice also that a message is sent from P0 to P2 and another from P2 to P3. The important thing about this situation is that the message from P2 arrives at P1 before the earlier message from P0. This timing problem might prove to be critical.
Consider the following situation that might occur if an object migrated from P0 to P1:
![]()
P0 gives Obj to P1 and tells P2. In response, P2 sends P1 a request to use that object. Unfortunately, P1 has not yet received the message from P0 -- perhaps there was an error and the message needed to be resent or, perhaps, the communication channel is just slower. But, independent of the cause, "Bang!" P2's request to use the object fails.
This example illustrates a causality violation. A causality violation occurs when a message ordering problem results in one host taking an action based on information that another host has not yet received. In this case P2 is trying to invoke a method on P1, because P2 thinks that P1 has Obj.
In designing systems, we assume that any action a host takes may be affected by any message it has previously received. As a result, we would consider the situation above to be a potential causality violation, even if the message from P2 to P1 turned out to be completely independent of the messages that it received. Colloquially, we don't distinguish between potential causality violations and causality violations that have real consequences. Instead we call them both causality violations -- even if the messages turn out to be independent.
The bottom line is that a causality violation occurs if the sender of a message knows something that the recipient of that message should know (has been sent), but does not know (has not received), by the time that the message is received.
Student Question: Couldn't we just make it the responsibility of the invoking process to check with the target process first, before trying to invoke the method on the remote object?
Answer: I'm glad you asked. This is an outstanding opportunity to mention the time-honored distributed systems principle, "It is easier to move a problem in a distributed system than it is to actually fix it."
This principle is practiced particularly frequently by contractors who use it to chase bugs around the system, billing multiple times, instead of fixing them right the first time. It does wonders for the economy -- it keeps capital fluid and in motion.
It can also be usefully employed to shift the blame among contractors. In loosely specified systems, it is a good technique to "pin" your problems onto another donkey.
Given your question, the real question becomes, "How can P2 (or underlying mechanisms) be designed to recover if a remote method invocation fails?" The answer to this question may well prove to be more complex than preventing causality violations -- "The devil is in the details."
This is also a good opportunity to mention another famous principle in the design of distributed systems, the ostrich principle. The ostrich is famous for burying its head in the sand. This technique is also frequently used in distributed systems. Sometimes particularly unlikely or obscure problems are allowed to remain in a design. This is particularly true if the cost of the resulting failure is low enough. The bottom line is that implementing a perfect distributed system may involve additional overhead that, in the aggregate, will result in a greater loss in productivity than the rare occurrence of an unlikely error state.
Very shortly, we'll talk about designing a communication mechanism that avoids causality violations. But for the moment, let's ask ourselves, "How can we detect (after the fact) that a causality violation has occurred?"
Lamport time is not sufficient to do this, because it tracks only the total number of events in the system. This isn't helpful -- instead, we need a way of determining if messages were sent and received in the same order. In other words, if we receive M2 before M1, but M1 was sent before M2, a (potential) causality violation has occurred. The same is true if one or both of the messages arrived indirectly via other hosts. This is one of the areas where vector time becomes particularly useful.
Review of Vector Timestamps
Remember vector time from last class? By comparing vector timestamps, we can detect (after the fact) causality violations. We'll learn how to do this shortly. But, before we do, let's review what we know about vector time:
- Vector time is represented as a vector, with one entry corresponding to each host.
- Each entry, or component, of the time vector indicates the total number of events on the corresponding system.
- A host increases its component of its time vector each time it sends or receives a message (or another interesting event occurs).
- When a message is sent, the sender timestamps it with the sender's current vector time.
- Upon the receipt of a message, a host tries to learn of events on other systems by comparing its vector clock with the timestamp in the message. If it discovers that one or more components of the timestamp are higher than the corresponding components in its local vector, it changes its local vector to include the higher values -- in other words, it learns of the occurrence of additional events on other hosts from the sender. (In any case, it increases its own component of the time vector.)
Let's take another look at the example above -- but this time, let's label it using vector time:
![]()
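As a rough sketch of the update rules reviewed above, the following Python class keeps one counter per host, merges component by component on receive, and always counts its own event. The class and method names are invented for this illustration, and hosts are assumed to be numbered from 0.

```python
class VectorClock:
    def __init__(self, host, n_hosts):
        self.host = host              # index of this host's own component
        self.time = [0] * n_hosts     # one component per host

    def tick(self):
        """Count a local event (including a send or a receive)."""
        self.time[self.host] += 1

    def send(self):
        """Count the send and return a copy of the vector as the timestamp."""
        self.tick()
        return list(self.time)

    def receive(self, stamp):
        """Merge in anything the sender knew, then count the receive."""
        self.time = [max(a, b) for a, b in zip(self.time, stamp)]
        self.tick()


p0, p1, p2 = (VectorClock(i, 3) for i in range(3))
m1 = p0.send()      # [1, 0, 0]
p1.receive(m1)      # p1 is now [1, 1, 0]
m2 = p1.send()      # [1, 2, 0]
p2.receive(m2)      # p2 is now [1, 2, 1]: it learned of p0's event via p1
```

Note that p2 never heard from p0 directly, yet its first component is 1. That indirect knowledge is what makes vector time useful for reasoning about causality.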
Comparing Vector Timestamps
"Vector timestamps are equal if, and only if, all corresponding elements are the same."
VT1=VT2 iff VT1[i] = VT2[i], for every i = 1, ..., N.
"Vector timestamp VT1 is less than or equal to vector timestamp VT2, if and only if, no element of VT1 is greater than the corresponding element in VT2."
VT1<=VT2 iff VT1[i] <= VT2[i], for every i = 1, ..., N.
"Vector timestamp VT1 is strictly less than vector timestamp VT2, if and only if, vector timestamp VT1 <= VT2 (see above), and VT1 is not equal to VT2 (see above)."
VT1<VT2 iff VT1<=VT2 and VT1!=VT2
"VT1 and VT2 represent concurrent events, if and only if, VT1 is neither greater than, less than, nor equal to VT2."
VT1 and VT2 are concurrent iff VT1!<VT2 and VT1!>VT2 and VT1!=VT2
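These comparisons translate almost directly into code. The sketch below assumes that timestamps are equal-length sequences of integers; the function names are made up for this example.

```python
def vt_equal(a, b):
    """VT1 = VT2: every pair of corresponding elements matches."""
    return all(x == y for x, y in zip(a, b))


def vt_leq(a, b):
    """VT1 <= VT2: no element of a exceeds the corresponding element of b."""
    return all(x <= y for x, y in zip(a, b))


def vt_less(a, b):
    """VT1 < VT2: less than or equal, but not equal."""
    return vt_leq(a, b) and not vt_equal(a, b)


def vt_concurrent(a, b):
    """Neither timestamp is less than, greater than, or equal to the other."""
    return not vt_less(a, b) and not vt_less(b, a) and not vt_equal(a, b)


ordered = vt_less((1, 0, 0), (2, 0, 2))          # True
unordered = vt_concurrent((1, 0, 0), (0, 1, 0))  # True: each is ahead somewhere
```

Unlike Lamport timestamps, two vector timestamps may be incomparable, which is why a separate "concurrent" test is needed.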
Detecting Causality Violations Using Vector Timestamps
We can detect a causality violation using vector timestamps by comparing the timestamp of a newly received message to the local time. If the message's timestamp is less than the local time vector, a (potential) causality violation has occurred. Why? For the local time to have advanced such that it is ahead of the timestamp of the newly received message, a prior message must have advanced the local time. The sender of that prior message must have known, directly or indirectly, about the newly arrived message before it sent its prior message to us. Thus a (potential) causality violation occurred.
Admittedly, this doesn't fix the problem -- but at least we have a way of detecting and logging the problem. This will make it much easier to isolate and debug our system -- or at least to take mitigating action to ensure that the output from the system is correct.
Now, let's consider this familiar example again:
![]()
M1's timestamp is (1,0,0). The local time on P2 is (2,0,2). (1,0,0) is less than (2,0,2). This indicates that a causality violation has occurred -- someone who had already seen M1 sent P2 a message, before P2 received M1.
If the timestamps are concurrent, this does not represent a problem -- the messages are unrelated.
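The check described in this section is small enough to sketch directly. The function names below are invented for the illustration, and the check is assumed to run when a message arrives, before its timestamp is merged into the local vector.

```python
def strictly_less(a, b):
    """a < b for vector timestamps: a <= b in every component, and a != b."""
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


def is_causality_violation(msg_stamp, local_time):
    """True when the receiver already knows of events after the message's send."""
    return strictly_less(msg_stamp, local_time)


violation = is_causality_violation((1, 0, 0), (2, 0, 2))   # True: the case above
unrelated = is_causality_violation((0, 1, 0), (2, 0, 2))   # False: concurrent stamps
```

A real system would probably log the two timestamps when the check fires, since that is the evidence needed to track the problem down later.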
Matrix Logical Clocks
Before we leave time to discuss communication, let me mention one more detail. There is actually another type of logical clock that is one step more encompassing than a vector logical clock -- the matrix logical clock. Much like a vector clock maintains the simple logical time for each host, a matrix clock maintains a vector of the vector clocks for each host.
Every time a message is exchanged, the sending host tells us not only what it knows about the global state of time, but what other hosts have told it that they know about the global state of time -- reliable gossip.
This is useful in applications such as checkpointing and recovery, and garbage collection. In these cases, having a lower bound on what another host knows can prove useful by enabling the disposal of unusable objects. In the case of garbage collection, these are objects that no other object can reference. In the case of recovery, these are logs and/or checkpoints that are no longer needed.
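As a rough preview of the idea, the sketch below shows one common way to formulate a matrix clock. The update rule, the class name, and the `known_by_all` helper are assumptions made for this illustration, not something taken from these notes.

```python
class MatrixClock:
    def __init__(self, host, n_hosts):
        self.host = host
        # rows[k] is this host's best knowledge of host k's vector clock
        self.rows = [[0] * n_hosts for _ in range(n_hosts)]

    def send(self):
        """Count the send and ship a copy of the whole matrix."""
        self.rows[self.host][self.host] += 1
        return [list(r) for r in self.rows]

    def receive(self, sender, their_rows):
        # Gossip: keep the freshest view of what every host knows.
        for k, row in enumerate(their_rows):
            self.rows[k] = [max(a, b) for a, b in zip(self.rows[k], row)]
        # Everything the sender knew, we now know too.
        mine = self.rows[self.host]
        self.rows[self.host] = [max(a, b) for a, b in zip(mine, their_rows[sender])]
        self.rows[self.host][self.host] += 1

    def known_by_all(self, k):
        """Lower bound: every host has seen at least this many of host k's events."""
        return min(row[k] for row in self.rows)


a, b = MatrixClock(0, 2), MatrixClock(1, 2)
b.receive(0, a.send())
before = a.known_by_all(0)   # 0: a cannot yet tell whether b saw its event
a.receive(1, b.send())
after = a.known_by_all(0)    # 1: b's reply shows that b has seen a's first event
```

The lower bound returned by `known_by_all` is the kind of information that lets a host decide that a log entry or object is no longer needed by anyone.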
We'll discuss matrix time in more detail when we discuss checkpointing and recovery -- it is much easier to understand with a clear application.
Basic Communication Services
There are three basic modes of communication used in distributed systems:
- Unicast -- Unicast messages are sent from exactly one host to exactly one host. Unicasts can be best effort or reliable. Best-effort messages are sent, but much like the federal post office, the system makes no guarantees about if or when they will arrive. For our purposes best effort delivery does guarantee that a message will arrive intact, or not at all, but not damaged. We're not going to talk about how this is accomplished -- we'll leave that for a networking class. For those who are familiar with transport-level issues, you can imagine that unicast messages (15-612 style) are implemented above UDP using ACKs and retransmits.
- Broadcast -- Broadcast messages are sent from exactly one host to all other hosts on the same network. Reliable broadcast protocols are not practical. This is because it is difficult to know which hosts exist, and also because when interacting with an entire network, some hosts will inevitably be down or unreachable. Broadcast protocols are also the domain of a networking course.
- Multicast -- A multicast message is a form of a one-to-many message. It is much like a broadcast, but it is directed to a much smaller collection of hosts. These hosts may be on the same network, or they may be on another network. For our purposes, multicasts will be implemented above a reliable unicast. Multicast will be the subject of the rest of today's discussion and some of next class's discussion.
What do we call a many-to-one communication? A denial of service attack. This is also the domain of a networking course. (Read with heavy sarcasm and a big smile.)
Ordering Guarantees
When we are multicasting from one host to many hosts, the message may arrive at each of the hosts at a different time. As a result, if a single host dispatches several multicast messages, they may get "crossed in the mail". The situation can become further tangled if several hosts are multicasting. Do you remember our discussion of causality? As we used to say in CEDA debate, "cross-apply it here".
Depending on the nature of the interaction of the hosts of the distributed system, we may or may not be concerned with the ordering of the messages. For example, if we know that every message will be completely independent of every other message, a simple reliable multicast will do -- we don't need to do anything special to ensure that the messages arrive in any particular order.
But what if our system isn't quite so relaxed? It may be the case that each host expects its own messages to be received in the order in which they were sent, but that it doesn't matter how they are interleaved with messages from other hosts. This is known as FIFO ordering.
A stricter ordering requirement is to ensure that all causally related messages, independent of the host, are received in the order in which they were sent. This is known as causal ordering. Earlier we spoke about detecting causality violations. Now we are discussing the prevention of these violations by enqueuing messages and delivering them to the application in the proper order.
The strictest ordering requirement is total ordering. Total ordering requires that the messages be delivered in the same order they would be if communication were instantaneous. In other words, the messages should be received in the same order they would be if messages were received at exactly the same time that they were sent. A reliable, total ordering multicast is known as an atomic multicast. By assuming that the unicast is reliable, we will be constructing an atomic multicast.
FIFO Multicast Protocol
We can ensure FIFO ordering in our multicast protocol by using a per-source sequence number. Each host maintains a counter of messages sent and sends this count, a sequence number, with each multicast message.
Each potential receiver maintains a queue for each potential sender (or at least the ability to create such a queue). Each potential receiver also maintains the "expected sequence number" associated with each possible sender. Since the host should receive all multicasts, this number should be incremented by exactly one with each multicast message from a particular host.
When a multicast message is received, the sequence number is compared to the expected sequence number. If the sequence number is as expected, the message is passed up to the application and the "expected sequence number" associated with the sender on the receiver is incremented.
If the sequence number of the message is lower than the expected sequence number, the message is thrown away -- it is a duplicate of a message that has already been received.
If the sequence number of the message is higher than the expected sequence number, the message is queued -- it is not yet passed up to the application. The reason is that one or more earlier messages from the same sender have yet to arrive.
Once the expected message has arrived, the queue is checked. This queue is probably maintained as a priority queue sorted by sequence number. Messages are dequeued and passed up to the application until the queue is empty, or the next message in the queue is not the "expected message".
Below is an example of this protocol at work:
![]()
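The receiver side of this protocol can be sketched as follows. A few things here are assumptions made for the illustration: sequence numbers start at 0, "passing a message up to the application" is modeled by appending to a list, and a dictionary keyed by sequence number stands in for the priority queue mentioned above (the effect is the same, since we only ever look for the next expected number).

```python
from collections import defaultdict


class FifoReceiver:
    def __init__(self):
        self.expected = defaultdict(int)   # next sequence number per sender
        self.pending = defaultdict(dict)   # per sender: seq -> queued payload
        self.delivered = []                # stand-in for the application

    def on_receive(self, sender, seq, payload):
        if seq < self.expected[sender]:
            return                         # duplicate of a delivered message
        queue = self.pending[sender]
        queue.setdefault(seq, payload)     # queue it (ignore a queued duplicate)
        # Deliver for as long as the next expected message is available.
        while self.expected[sender] in queue:
            self.delivered.append((sender, queue.pop(self.expected[sender])))
            self.expected[sender] += 1


r = FifoReceiver()
r.on_receive("A", 1, "second")   # arrives early, so it is queued
r.on_receive("A", 0, "first")    # fills the gap; both are delivered in order
r.on_receive("A", 0, "first")    # late duplicate, thrown away
```

After these three calls, `r.delivered` holds the two messages from A in sending order, and nothing is left in the queue.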
Causal Ordering Multicast Protocol
We ensure that messages are delivered without causality violations as we did before -- by buffering messages that arrive too early. We determine if a message has arrived too early using a vector timestamp similar to the one we used to detect causality violations.
The key observation is that with a multicast protocol, all hosts within the group should (eventually) see the same messages. As a consequence, each host should see the same number of messages from each other host.
So, our vector contains one entry for each host. This entry counts the total number of messages received from the corresponding host. The entry for a host that corresponds to itself is used to count the messages it has sent.
Each host sends a copy of its vector with each message, and each receiver compares the sender's vector with its own on receipt:
- If any entry in the sender's vector (other than the sender's own entry), which was sent as a "timestamp" with the multicast message, is greater than the corresponding entry in the receiver's local copy of the vector, the receiver buffers the message. This is because the sender has received a message, which is potentially causally related to the message it subsequently sent, that the receiver has not yet received. If the incoming message were passed up to the application, a causality violation might result.
- If the sender's entry in the message's timestamp is more than one greater than the sender's entry in the local time vector, the message is also buffered. This ensures that the protocol preserves FIFO ordering.
- If the sender's entry in the message's timestamp is not greater than the sender's entry in the local vector (that is, it is less than or equal to it), the message is rejected -- it is a duplicate.
- If none of the above are true, the message is accepted. Accepting a message offers the opportunity to dequeue previously enqueued messages, if they can now be accepted.
Below is an example of the causal ordered multicast protocol:
![]()
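The acceptance rules can be sketched in Python as shown below. The sketch makes some assumptions that are not spelled out above: a host increments its own entry just before it multicasts, the cross-host comparison skips the sender's own entry (which the FIFO rule handles), and a message whose sender entry is not greater than the local entry is treated as a duplicate. The class and method names are invented for the example.

```python
class CausalHost:
    def __init__(self, host, n_hosts):
        self.host = host
        # vec[k] counts messages delivered from host k;
        # vec[host] counts the messages this host has sent.
        self.vec = [0] * n_hosts
        self.buffer = []       # messages that arrived too early
        self.delivered = []    # stand-in for the application

    def multicast(self, payload):
        self.vec[self.host] += 1
        return (self.host, list(self.vec), payload)

    def _status(self, sender, stamp):
        if stamp[sender] <= self.vec[sender]:
            return "duplicate"
        if stamp[sender] > self.vec[sender] + 1:
            return "early"     # an earlier message from this sender is missing
        if any(stamp[k] > self.vec[k] for k in range(len(stamp)) if k != sender):
            return "early"     # the sender had seen something we have not
        return "ready"

    def on_receive(self, message):
        if self._status(message[0], message[1]) == "duplicate":
            return
        self.buffer.append(message)
        progress = True
        while progress:        # each delivery may unblock buffered messages
            progress = False
            for m in list(self.buffer):
                status = self._status(m[0], m[1])
                if status == "early":
                    continue
                self.buffer.remove(m)
                if status == "ready":
                    self.vec[m[0]] += 1
                    self.delivered.append(m[2])
                progress = True


p0, p1, p2 = (CausalHost(i, 3) for i in range(3))
m1 = p0.multicast("Obj moved to P1")
p2.on_receive(m1)
m2 = p2.multicast("request to use Obj")   # causally follows m1
p1.on_receive(m2)                         # too early: buffered, not delivered
held = list(p1.delivered)                 # [] at this point
p1.on_receive(m1)                         # m1 is delivered, then m2 is released
```

This replays the object-migration scenario from earlier in these notes: P1 now sees the request only after it has seen the message that the request depends on.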
Total Order Multicast
Total ordering requires that all messages are seen by all hosts in the same order. This could be easily achieved if we had a global clock or counter that could place serial numbers on messages. Then multicasts would just be accepted in order of serial number, and buffering could be used to handle missing messages. Some systems emulate this approach using a central sequence number server.
For now, we'll consider a distributed approach that can function in light of differing local times (serial numbers), called the two-phase multicast. In this approach, local times are used. The local time is incremented any time an operation is performed. Any time a system discovers that another system has a greater time, it resets its own time to the greater time.
Here's how it works:
- The sender increments its local time and multicasts the message, which carries that local time.
- The receiver buffers the message. It then sets its local time to the time of the sender, if the sender's time is higher, increments its local time, and sends an ACK that contains the local time.
- The sender waits until it has received all of the ACKs. It then determines the highest local time among the ACKs and resets its clock, if necessary. It increments its clock and sends a message to all of the original recipients containing the "commit time." This is the time at which each recipient considers the message to have been received.
- Applications on the recipient host can see the message only after all messages received between the "acknowledgement" and "commitment" of the message have been committed. This ensures that they won't receive earlier commitment times.
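The steps above can be sketched as a small single-process simulation. Several things here are assumptions made for the illustration: the sender is treated as one of the recipients of its own multicast, ties between equal commit times are broken by message ID, and the last step is realized with the usual head-of-queue rule (a committed message is shown to the application once no buffered message carries a smaller time). The class and method names are invented.

```python
class Host:
    def __init__(self, name):
        self.name = name
        self.clock = 0
        self.buffer = {}       # msg_id -> [time, committed?, payload]
        self.delivered = []    # stand-in for the application

    def start_multicast(self):
        """Phase 1, sender side: tick and stamp the outgoing message."""
        self.clock += 1
        return self.clock

    def on_message(self, msg_id, sent_time, payload):
        """Phase 1, receiver side: buffer the message and ACK with our time."""
        self.clock = max(self.clock, sent_time) + 1
        self.buffer[msg_id] = [self.clock, False, payload]
        return self.clock      # this value travels back in the ACK

    def choose_commit_time(self, ack_times):
        """Phase 2, sender side: take the highest ACK time, then tick."""
        self.clock = max([self.clock, *ack_times]) + 1
        return self.clock

    def on_commit(self, msg_id, commit_time):
        """Phase 2, receiver side: record the commit time, deliver what we can."""
        self.clock = max(self.clock, commit_time)
        self.buffer[msg_id][0:2] = [commit_time, True]
        while self.buffer:
            first = min(self.buffer, key=lambda m: (self.buffer[m][0], m))
            if not self.buffer[first][1]:
                break          # an uncommitted message might still sort earlier
            self.delivered.append(self.buffer.pop(first)[2])


a, b = Host("A"), Host("B")
t1 = a.start_multicast()                     # A multicasts m1 ...
t2 = b.start_multicast()                     # ... while B multicasts m2
acks1 = [a.on_message("m1", t1, "x")]        # A sees m1 first, then m2
acks2 = [b.on_message("m2", t2, "y")]        # B sees m2 first, then m1
acks2.append(a.on_message("m2", t2, "y"))
acks1.append(b.on_message("m1", t1, "x"))
c1 = a.choose_commit_time(acks1)
c2 = b.choose_commit_time(acks2)
a.on_commit("m2", c2)                        # commits arrive in different
a.on_commit("m1", c1)                        # orders at the two hosts
b.on_commit("m1", c1)
b.on_commit("m2", c2)
```

Even though the two hosts receive both the messages and the commits in different orders, `a.delivered` and `b.delivered` end up identical, which is the property total ordering asks for.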