Q. Consider a distributed system with two sites A and B. Consider whether site A can distinguish among the following:
a. B goes down.
b. The link between A and B goes down.
c. B is extremely overloaded and its response time is 100 times longer than normal.
What implications does your answer have for recovery in distributed systems?
Answer: One technique is for B to send an I-am-up message to A at regular intervals, indicating that it is still alive. If A does not receive an I-am-up message within the expected interval, it can assume that either B or the network link is down.
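The I-am-up mechanism can be sketched as a timeout check on A's side. This is a minimal illustration, not a definitive implementation: the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the injectable `clock` are assumptions introduced here for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks when site A last received an I-am-up message from B.

    Hypothetical helper: names and parameters are illustrative only.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # how long A waits before suspecting a failure
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = clock()        # treat startup as the first "heartbeat"

    def record_heartbeat(self):
        """Call whenever an I-am-up message from B arrives."""
        self.last_seen = self.clock()

    def suspects_failure(self):
        """True if no I-am-up message arrived within the timeout.

        A cannot tell from this alone whether B or the link is down.
        """
        return self.clock() - self.last_seen > self.timeout_s
```

In practice the timeout would likely be set to a few heartbeat intervals, so that a single delayed message does not trigger a false suspicion.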
Note that the I-am-up message alone does not let A distinguish between these types of failure. One technique that lets A better determine whether the network is down is to send an Are-you-up message to B over an alternate route. If A receives a reply, it can conclude that the direct network link is down and that B is up.
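The reasoning above amounts to a small decision table over two observations: whether the direct I-am-up message arrived, and whether an Are-you-up probe over another route was answered. The sketch below encodes that table; the names `Diagnosis` and `diagnose` are hypothetical, and the final case is deliberately left ambiguous because a missing reply cannot separate a crashed B from a B that is unreachable or too slow to answer.

```python
from enum import Enum


class Diagnosis(Enum):
    B_REACHABLE = "B is up and the direct link works"
    LINK_DOWN = "direct link is down, but B is up"
    UNKNOWN = "B is down, unreachable on every route, or too slow to answer"


def diagnose(heartbeat_seen, alternate_reply):
    """Classify B's state from A's two observations (illustrative only).

    heartbeat_seen  -- an I-am-up message arrived over the direct link in time
    alternate_reply -- an Are-you-up probe over another route got a reply
    """
    if heartbeat_seen:
        return Diagnosis.B_REACHABLE
    if alternate_reply:
        # B answered over the other route, so the direct link is the problem.
        return Diagnosis.LINK_DOWN
    # No evidence either way: A cannot tell these failures apart.
    return Diagnosis.UNKNOWN
```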
If we assume that A knows B is up and reachable (via the I-am-up mechanism) and that A has some value N indicating a normal response time, A can monitor B's response time and compare the measured values to N, allowing A to determine whether B is overloaded.
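The comparison against N could look like the following sketch. The function name `is_overloaded`, the `factor` threshold, and the choice of the median over recent samples are all assumptions made for illustration; the text only says that A compares measured response times to N.

```python
from statistics import median


def is_overloaded(recent_response_times, normal_n, factor=10.0):
    """Return True if B's typical recent response time exceeds factor * N.

    recent_response_times -- response times A measured for B, in seconds
    normal_n              -- the normal response time N, in seconds
    factor                -- hypothetical threshold multiplier (an assumption)
    """
    if not recent_response_times:
        raise ValueError("need at least one measured response time")
    # The median keeps one slow outlier from flagging B as overloaded.
    return median(recent_response_times) > factor * normal_n
```

With N = 0.05 s, samples around 5 s (100 times normal, as in case c) would be flagged, while a single slow reply among otherwise normal ones would not.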
The implication of these techniques is that A could choose another host, say C, in the system if B is down, unreachable, or overloaded.