January 23, 2001 (Lecture 3)

Reading

Textbook: 11.4 (Multicast)

Review of Lamport Logical Time

Recall our discussion of Lamport time from last class? Here are some key things to remember:

The example below shows several messages transmitted among the hosts of a distributed system. It illustrates the systems' understanding of local time and the timestamps placed on each message.

Causality and Causality Violations

Let's take another look at the diagram above. Notice that the first message sent, from P0 to P1, is the last message received. Notice also that a message is sent from P0 to P2 and another from P2 to P1.

The important thing about this situation is that the message from P2 arrives at P1 before the earlier message from P0. This timing problem might prove to be critical.

Consider the following situation that might occur if an object migrated from P0 to P1:

P0 gives Obj to P1 and tells P2. In response P2 sends a request to use that object to P1. Unfortunately, P1 has not yet received the message -- perhaps there was an error and the message needed to be resent or, perhaps, the communication channel is just slower. But, independent of the cause, "Bang!" P2's request to use the object fails.

This example illustrates a causality violation. A causality violation occurs when a message ordering problem results in one host taking an action based on information that another host has not yet received. In this case P2 is trying to invoke a method on P1, because P2 thinks that P1 has Obj.

In designing systems, we assume that any action a host takes may be affected by any message it has previously received. As a result, we would consider the situation above to be a potential causality violation, even if the message from P2 to P1 turned out to be completely independent of the messages that it received. Colloquially, we don't distinguish between potential causality violations and causality violations that have real consequences. Instead we call them both causality violations -- even if the messages turn out to be independent.

The bottom line is that a causality violation occurs if the sender of a message knows something that the recipient of that message should know (has been sent), but does not know (has not received), by the time that the message is received.


Student Question: Couldn't we just make it the responsibility of the invoking process to check with the target process first, before trying to invoke the method on the remote object?

Answer: I'm glad you asked. This is an outstanding opportunity to mention the time-honored distributed systems principle, "It is easier to move a problem in a distributed system than it is to actually fix it."

This principle is practiced particularly frequently by contractors who use it to chase bugs around the system, billing multiple times, instead of fixing them right the first time. It does wonders for the economy -- it keeps capital fluid and in motion.

It can also be usefully employed to shift the blame among contractors. In loosely specified systems, it is a good technique to "pin" your problems onto another donkey.

Given your question, the real question becomes, "How can P2 (or underlying mechanisms) be designed to recover if a remote method invocation fails?" The answer to this question may well prove to be more complex than preventing causality violations -- "The devil is in the details."

This is also a good opportunity to mention another famous principle in the design of distributed systems, the ostrich principle. The ostrich is famous for burying its head in the sand. This technique is also frequently used in distributed systems. Sometimes particularly unlikely or obscure problems are allowed to remain in a design. This is particularly true if the cost of the resulting failure is low enough. The bottom line is that implementing a perfect distributed system may involve additional overhead that, in the aggregate, will result in a greater loss in productivity than the rare occurrence of an unlikely error state.


Very shortly, we'll talk about designing a communication mechanism that avoids causality violations. But for the moment, let's ask ourselves, "How can we detect (after the fact) that a causality violation has occurred?"

Lamport time is not sufficient to do this, because it tracks the total number of events in the system. This isn't helpful -- instead, we need a way of determining if messages were sent and received in the same order. In other words, if we receive M2 before M1, but M1 was sent before M2, a (potential) causality violation has occurred. The same is true if one or both of the messages arrived indirectly via other hosts. This is one of the areas where vector time becomes particularly useful.

Review of Vector Timestamps

Remember vector time from last class? By comparing vector timestamps, we can detect (after the fact) causality violations. We'll learn how to do this shortly. But, before we do, let's review what we know about vector time:

Let's take another look at the example above -- but this time, let's label it using vector time:

Comparing Vector Timestamps

"Vector timestamps are equal if, and only if, all corresponding elements are the same."

VT1=VT2 iff VT1[i] = VT2[i], for every i = 1, ..., N.

"Vector timestamp VT1 is less than or equal to vector timestamp VT2, if and only if, no element of VT1 is greater than the corresponding element in VT2."

VT1<=VT2 iff VT1[i] <= VT2[i], for every i = 1, ..., N.

"Vector timestamp VT1 is strictly less than vector timestamp VT2, if and only if, vector timestamp VT1 <= VT2 (see above), and VT1 is not equal to VT2 (see above)."

VT1<VT2 iff VT1<=VT2 and VT1!=VT2

"VT1 and VT2 represent concurrent events, if and only if, VT1 is neither greater than, less than, nor equal to VT2."

VT1 and VT2 are concurrent, iff, VT1!<VT2 and VT1!>VT2 and VT1!=VT2
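The four comparison rules above translate almost directly into code. What follows is only a minimal sketch in Python: the function names (vt_equal, vt_leq, vt_less, vt_concurrent) are made up for illustration, and timestamps are assumed to be equal-length sequences of integers, one entry per host.

```python
from typing import Sequence


def vt_equal(vt1: Sequence[int], vt2: Sequence[int]) -> bool:
    """VT1 = VT2 iff all corresponding elements are the same."""
    return len(vt1) == len(vt2) and all(a == b for a, b in zip(vt1, vt2))


def vt_leq(vt1: Sequence[int], vt2: Sequence[int]) -> bool:
    """VT1 <= VT2 iff no element of VT1 exceeds the corresponding element of VT2."""
    return len(vt1) == len(vt2) and all(a <= b for a, b in zip(vt1, vt2))


def vt_less(vt1: Sequence[int], vt2: Sequence[int]) -> bool:
    """VT1 < VT2 iff VT1 <= VT2 and VT1 != VT2."""
    return vt_leq(vt1, vt2) and not vt_equal(vt1, vt2)


def vt_concurrent(vt1: Sequence[int], vt2: Sequence[int]) -> bool:
    """Concurrent iff VT1 is neither less than, greater than, nor equal to VT2."""
    return (not vt_less(vt1, vt2)
            and not vt_less(vt2, vt1)
            and not vt_equal(vt1, vt2))
```

Note that, unlike ordinary integers, two vector timestamps can fail all three of "less than", "greater than", and "equal"; for example, (1,0,0) and (0,1,0) are concurrent.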

Detecting Causality Violations Using Vector Timestamps

We can detect a causality violation using vector timestamps by comparing the timestamp of a newly received message to the local time. If the message's timestamp is less than the local time vector, a (potential) causality violation has occurred.

Why? For the local time to have advanced such that it is ahead of the timestamp of the newly received message, a prior message must have advanced the local time. The sender of that prior message must have gotten the newly arrived message before it sent its prior message to us. Thus a (potential) causality violation occurred.

Admittedly, this doesn't fix the problem -- but at least we have a way of detecting and logging the problem. This will make it much easier to isolate and debug our system -- or at least to take mitigating action to ensure that the output from the system is correct.

Now, let's consider this familiar example again:

M1's timestamp is (1,0,0). The local time on P2 is (2,0,2). (1,0,0) is less than (2,0,2). This indicates that a causality violation has occurred -- someone who had already seen M1 sent P2 a message, before P2 received M1.

If the timestamps are concurrent, this does not represent a problem -- the messages are unrelated.
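As a rough sketch of the detection rule, the check can be written as a single comparison made when a message arrives. The function names here are hypothetical, and the sketch assumes the comparison is made against the receiver's local time as it stood before the new message's timestamp is merged in.

```python
def vt_less(vt1, vt2):
    """Strictly less: no element of vt1 exceeds vt2's, and the vectors differ."""
    return all(a <= b for a, b in zip(vt1, vt2)) and tuple(vt1) != tuple(vt2)


def is_causality_violation(message_ts, local_time):
    """Flag a (potential) causality violation on receipt of a message.

    A violation is reported when the message's timestamp is strictly less
    than the receiver's local time vector, i.e. the receiver has already
    heard (directly or indirectly) about events that followed the send.
    """
    return vt_less(message_ts, local_time)


# The example from the notes: M1 carries (1,0,0), the receiver is at (2,0,2).
late_message = is_causality_violation((1, 0, 0), (2, 0, 2))

# A concurrent timestamp is not flagged: (0,1,0) is not less than (2,0,2).
unrelated_message = is_causality_violation((0, 1, 0), (2, 0, 2))
```

Here late_message is True and unrelated_message is False, matching the two cases discussed above.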

Matrix Logical Clocks

Before we leave time to discuss communication, let me mention one more detail. There is actually another type of logical clock that is one step more encompassing than a vector logical clock -- the matrix logical clock. Much as a vector clock maintains the simple logical time for each host, a matrix clock maintains a vector of the vector clocks for each host.

Every time a message is exchanged, the sending host tells us not only what it knows about the global state of time, but what other hosts have told it that they know about the global state of time -- reliable gossip.

This is useful in applications such as checkpointing and recovery, and garbage collection. In these cases, having a lower bound on what another host knows can prove useful by enabling the disposal of unusable objects. In the case of garbage collection -- objects that no other object can reference. In the case of recovery -- logs and/or checkpoints that are no longer needed.

We'll discuss matrix time in more detail when we discuss checkpointing and recovery -- it is much easier to understand with a clear application.

Basic Communication Services

There are three basic modes of communication used in distributed systems:

What do we call a many-to-one communication? A denial of service attack. This is also the domain of a networking course. (Read with heavy sarcasm and a big smile.)

Ordering Guarantees

When we are multicasting from one host to many hosts, the message may arrive at each of the hosts at a different time. As a result, if a single host dispatches several multicast messages, they may get "crossed in the mail". The situation can become further tangled if several hosts are multicasting. Do you remember our discussion of causality? As we used to say in CEDA debate, "cross-apply it here".

Depending on the nature of the interaction of the hosts of the distributed system, we may or may not be concerned with the ordering of the messages. For example, if we know that every message will be completely independent of every other message, a simple reliable multicast will do -- we don't need to do anything special to ensure that the messages arrive in any particular order.

But what if our system isn't quite so relaxed? It may be the case that each host expects its own messages to be received in the order in which they were sent, but that it doesn't matter how they are interleaved with messages from other hosts. This is known as FIFO ordering.

A stricter ordering requirement is to ensure that all causally related messages, independent of the host, are received in the order in which they were sent. This is known as causal ordering. Earlier we spoke about detecting causality violations. Now we are discussing the prevention of these violations by enqueuing messages and delivering them to the application in the proper order.

The strictest ordering requirement is total ordering. Total ordering requires that the messages be delivered in the same order they would be if communication were instantaneous. In other words, the messages should be received in the same order they would be if messages were received at exactly the same time that they were sent. A reliable, totally ordered multicast is known as an atomic multicast. By assuming that the unicast is reliable, we will be constructing an atomic multicast.

FIFO Multicast Protocol

We can ensure FIFO ordering in our multicast protocol by using a per-source sequence number. Each host maintains a counter of messages sent and sends this count, a sequence number, with each multicast message.

Each potential receiver maintains a queue for each potential sender (or at least the ability to create such a queue). Each potential receiver also maintains the "expected sequence number" associated with each possible sender. Since the host should receive all multicasts, this number should be incremented by exactly one with each multicast message from a particular host.

When a multicast message is received, the sequence number is compared to the expected sequence number. If the sequence number is as expected, the message is passed up to the application and the "expected sequence number" associated with the sender on the receiver is incremented.

If the sequence number of the message is lower than the expected sequence number, the message is thrown away -- it is a duplicate of a message that has already been received.

If the sequence number of the message is higher than the expected sequence number, the message is queued -- it is not yet passed up to the application. The reason is that one or more earlier messages from the same sender have yet to arrive.

Once the expected message has arrived, the queue is checked. This queue is probably maintained as a priority queue sorted by sequence number. Messages are dequeued and passed up to the application until the queue is empty, or the next message in the queue is not the "expected message".
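The receive-side rules just described can be sketched in a few lines. This is an illustration only, under a few assumptions not stated in the notes: sequence numbers start at 0, a dictionary keyed by sequence number stands in for the priority queue, and appending to a delivered list stands in for passing a message up to the application. The class name FifoReceiver is made up.

```python
from collections import defaultdict


class FifoReceiver:
    """Receive side of a FIFO multicast using per-source sequence numbers."""

    def __init__(self):
        # Next sequence number expected from each sender (assumed to start at 0).
        self.expected = defaultdict(int)
        # Early messages held back, per sender, keyed by sequence number.
        self.pending = defaultdict(dict)
        # Stand-in for "passed up to the application".
        self.delivered = []

    def receive(self, sender, seq, payload):
        if seq < self.expected[sender]:
            return  # lower than expected: a duplicate, throw it away
        # Hold the message; if it is the expected one, the loop below
        # delivers it immediately and then drains any queued successors.
        self.pending[sender][seq] = payload
        while self.expected[sender] in self.pending[sender]:
            message = self.pending[sender].pop(self.expected[sender])
            self.delivered.append((sender, message))
            self.expected[sender] += 1


receiver = FifoReceiver()
receiver.receive("A", 1, "second")   # early: queued, nothing delivered yet
receiver.receive("A", 0, "first")    # expected: delivers "first", then "second"
receiver.receive("A", 0, "first")    # duplicate: discarded
```

After these three calls, receiver.delivered holds the two messages from A in the order they were sent. Messages from different senders are tracked independently, so their interleaving is not constrained.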

Below is an example of this protocol at work:

Causal Ordering Multicast Protocol

We ensure that messages are delivered without causality violations as we did before -- by buffering messages that arrive too early. We determine if a message has arrived too early using a vector timestamp similar to the one we used to detect causality violations.

The key observation is that with a multicast protocol, all hosts within the group should (eventually) see the same messages. As a consequence, each host should see the same number of messages from each other host.

So, our vector contains one entry for each host. This entry counts the total number of messages received from the corresponding host. The entry for a host that corresponds to itself is used to count the messages it has sent.

Each host sends a copy of its vector with each message and compares the sender's vector with its own on receipt:
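A minimal sketch of that comparison follows, assuming the commonly used causal delivery rule: a message from sender j carrying vector V is delivered at a host with local vector L only when V[j] = L[j] + 1 (it is the next message from j) and V[k] <= L[k] for every other k (the receiver has already seen everything the sender had seen). The class name CausalHost and its methods are hypothetical, hosts are numbered from 0, and a host is assumed not to feed its own multicasts back into receive.

```python
class CausalHost:
    """One group member in a causally ordered multicast (illustrative sketch)."""

    def __init__(self, host_id, group_size):
        self.host_id = host_id
        # vector[k] counts messages delivered from host k; the host's own
        # entry counts the messages it has sent.
        self.vector = [0] * group_size
        self.buffer = []      # messages that arrived too early
        self.delivered = []   # stand-in for passing up to the application

    def send(self, payload):
        """Count the send and return the message to be multicast."""
        self.vector[self.host_id] += 1
        return (self.host_id, list(self.vector), payload)

    def _deliverable(self, sender, msg_vector):
        if msg_vector[sender] != self.vector[sender] + 1:
            return False  # not the next message from this sender
        return all(msg_vector[k] <= self.vector[k]
                   for k in range(len(self.vector)) if k != sender)

    def receive(self, message):
        """Buffer the message, then deliver everything that has become deliverable."""
        self.buffer.append(message)
        progress = True
        while progress:
            progress = False
            for buffered in list(self.buffer):
                sender, msg_vector, payload = buffered
                if self._deliverable(sender, msg_vector):
                    self.buffer.remove(buffered)
                    self.vector[sender] += 1
                    self.delivered.append(payload)
                    progress = True


p0, p1, p2 = (CausalHost(i, 3) for i in range(3))
m1 = p0.send("P0 gives Obj to P1")
p1.receive(m1)
m2 = p1.send("P1 reacts to m1")   # causally after m1
p2.receive(m2)                    # arrives first: buffered, m1 not yet seen
p2.receive(m1)                    # m1 delivered, which then releases m2
```

In this run P2 receives m2 before m1, but delivers them to the application as m1 then m2.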

Below is an example of the causally ordered multicast protocol:

Total Order Multicast

Total ordering requires that all messages are seen by all hosts in the same order. This could be easily achieved if we had a global clock or counter that could place serial numbers on messages. Then multicasts would just be accepted in order of serial number, and buffering could be used to handle missing messages. Some systems emulate this approach using a central sequence number server.
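To make the central-server idea concrete, here is a small sketch. It is not a description of any particular system: the Sequencer and TotalOrderReceiver classes are invented for illustration, serial numbers are assumed to start at 0, and the sequencer is modeled as a local object rather than a remote service.

```python
import itertools


class Sequencer:
    """Stand-in for a central sequence number server."""

    def __init__(self):
        self._counter = itertools.count()  # 0, 1, 2, ...

    def stamp(self, payload):
        """Attach the next global serial number to a message."""
        return (next(self._counter), payload)


class TotalOrderReceiver:
    """Accepts multicasts strictly in serial-number order, buffering gaps."""

    def __init__(self):
        self.next_serial = 0
        self.pending = {}     # serial number -> payload, for early arrivals
        self.delivered = []   # stand-in for passing up to the application

    def receive(self, serial, payload):
        if serial < self.next_serial:
            return  # already delivered: a duplicate
        self.pending[serial] = payload
        while self.next_serial in self.pending:
            self.delivered.append(self.pending.pop(self.next_serial))
            self.next_serial += 1


sequencer = Sequencer()
first = sequencer.stamp("from host A")
second = sequencer.stamp("from host B")

r1, r2 = TotalOrderReceiver(), TotalOrderReceiver()
r1.receive(*second)   # arrives out of order at r1: buffered
r1.receive(*first)
r2.receive(*first)
r2.receive(*second)
```

Both receivers end up with the same delivery order even though the messages arrived in different orders. The obvious cost of this design is that the sequencer is a single point of contention and failure.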

For now, we'll consider a distributed approach that can function in light of differing local times (serial numbers), called the two-phase multicast. In this approach, local times are used. The local time is incremented any time an operation is performed. Any time a system discovers that another system has a greater time, it resets its own time to the greater time.

Here's how it works: