Wednesday, February 1, 2012

Disconnected Procedure Calls

A few years ago I had to write a remote procedure call (RPC) system to talk between two systems that had to cope with very spotty availability and high latency. At the time I was also playing online Go at night. Go is a perfect information game, similar in that respect to Chess. I had started a correspondence game, in which each side usually makes a move only once every few days, but might have a burst of activity and make dozens of moves in one sitting. I never actually finished a game, because the latency is absurdly high: you have to wait for the other side to get around to making a move, and it can take a month to get through a game that runs to 300 moves or so. A major flaw in this sort of system is that you cannot opportunistically upgrade to playing in real time on the few occasions when both players happen to be online at the same time. Chess has far fewer moves on average, but the problem is much the same. Making communications robust in the face of constant outages and absurd latencies in a network protocol struck me as being very similar to a correspondence game. It is precisely what happens when everything moves to mobile devices (including some with no cellular access at all, like a WiFi-only iPad).

I also work at an international company where this phenomenon can really kill us when cooperating on projects. Due to time zones and different weekend schedules, it is quite normal to come into work to an ambiguous email requesting a code change. Often the best response is to implement *both* interpretations in a pair of branches and wait for the reaction to a smoke test in the morning. When I get a reaction back on the two approaches, I can commit one of them to the repository immediately. The benefit is that getting clarification and getting the fix written can happen in either order, which avoids the artificial latency of waiting for clarification before starting.

A similar scenario is getting driving directions from one location to another and pulling down as much of the expected data as possible before we find ourselves halfway through the trip with no connectivity. If we stay on course and were able to retrieve all of the data, we can complete the trip without any more connectivity. If we stray off course, then we will have to sync with the server for more data.

This is all about designing for a minimum number of contacts with the remote system, and designing the procedure call mechanism so that this is the simple, default case rather than something that has to be painfully hand-crafted into each application.

Question and Answer Weaving To Speed Up Protocols

If I am to play Chess with a remote opponent, I will often have a high-level strategy in mind. It is not strictly necessary for both sides to take turns making moves; this is analogous to network packets going back and forth between host machines. When the game begins, both sides will have an opening plan in mind. The first player can submit his first move, committed, along with a tree that covers his intended replies to each of the opponent's expected responses. When the second player gets this message, he will only see the first move. He will come up with his response, but also send a tree of his own expected replies to the first player's likely responses. They are exchanging partial game trees. When a player gets a game tree from his opponent and makes a move against it, another move comes back at him without any further contact with the remote player, and this continues until a point in the tree is reached that the remote player did not foresee. So, to both players, it may be hard to tell whether the opponent is actually online at the moment. They simply play a game of Chess with a remote opponent, and while they are thinking, they can game out the whole tree of expected moves if the opponent is taking too long to respond.

At the point where an unforeseen move is made, the game stops progressing until the next exchange, and both sides keep building their game tree responses at their leisure. This process continues until the game ends.
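
As a concrete sketch of the idea (the node shape and function names below are my own invention, not any existing library): every move we make that the opponent predicted yields an instant canned reply with no network contact, and the first unpredicted move simply waits until the next time the two sides sync.

// A node in a partial game tree: the move itself plus the subtree of
// predicted continuations, keyed by the move we might answer with.
function makeNode(move, replies) {
  return { move: move, replies: replies || {} };
}

// Walk the opponent's predicted tree with our move. If the move was
// foreseen, their canned reply comes back instantly and we stay inside
// the tree; if not, we fall off the tree and the move waits for the
// next sync between the two sides.
function playAgainstTree(tree, ourMove) {
  const predicted = tree.replies[ourMove];
  if (predicted) {
    return { reply: predicted.move, remainingTree: predicted, foreseen: true };
  }
  return { reply: null, remainingTree: null, foreseen: false };
}

// Example: the opponent committed e4 and shipped replies for our c5 or e5.
const received = makeNode("e4", {
  "c5": makeNode("Nf3"),
  "e5": makeNode("Nc3")
});

console.log(playAgainstTree(received, "c5")); // instant reply: Nf3
console.log(playAgainstTree(received, "d5")); // unforeseen: wait for sync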

Remember that we assume the two opponents are almost never online at the same time. The beauty of this is that the game can proceed at a much faster rate than it would if play only progressed one move at a time. If player one responds at breakfast and player two responds at lunch, then there would normally never be more than 2 moves in a 24hr period. But if both sides can predict responses an average of 4 moves out, then the game proceeds at 4x the maximum rate imposed by moving one move at a time.

In a communications protocol, exactly this sort of thing is going on. A packet of data is sent to the other machine and we wait for a response. If we knew all of the possible responses we could get back, then we could send our replies to those responses as well. Under this scenario, we assume that the latency between the machines is so high that sending whole trees of responses (or some function that generates the tree) is trivial in cost.
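
At the protocol level, a request envelope could carry its predicted continuations directly. The field names below are purely illustrative rather than any real wire format; the point is that a single message bundles the call itself with pre-committed replies to the responses the sender can foresee.

// A hypothetical request envelope that weaves predicted exchanges into one
// message: along with the call itself, it carries the sender's pre-committed
// replies to each response it can foresee, possibly several levels deep.
const request = {
  id: "req-41",
  call: { method: "getRoute", args: { from: "home", to: "airport" } },
  expectedResponses: {
    "route-ok": {
      reply: { method: "prefetchTiles", args: { corridorKm: 5 } },
      expectedResponses: {}
    },
    "route-ambiguous": {
      reply: { method: "getRoute", args: { from: "home", to: "airport", avoid: "tolls" } },
      expectedResponses: {}
    }
  }
};

// After answering the top-level call, the receiver checks whether its answer
// was foreseen; if so it can keep executing the pre-committed replies with
// no further round trips.
function consumePrecommitted(envelope, responseTag) {
  return envelope.expectedResponses[responseTag] || null; // null: wait for next sync
}

console.log(consumePrecommitted(request, "route-ok"));      // keep going offline
console.log(consumePrecommitted(request, "route-blocked")); // unforeseen; queue it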

Now Imagine Two Chess Engines Playing Each Other

If we had two humans cheating with Chess engines (or simply taking a long time to come up with their game trees), then they could exchange very deep trees with each other and make a lot of progress. Boring sequences (ie: long, predictable chains that aren't really avoidable) would be exchanged in large chunks, and the game would only have to wait when something surprising happens. If both sides had the same chess engine, then the trees of responses might be very easy to compress into small messages.

Now Imagine That The Network Is Down 99% Of The Time

The main motivation for this style of communication is to mask network outages or the unavailability of the opponent. Take a cell phone in airplane mode, but with WiFi enabled: one of the opponents is traveling in a foreign country where normal connectivity is turned off to avoid high phone bills, but there will be bursts of connectivity (open WiFi) available for short periods of time. Presuming there is some predictability to the conversation that will be had, boring sequences can be skipped over to speed things along.

If the network is almost always down, then the remote procedure call must be designed to work asynchronously. Furthermore, both sides must write to a local message queue. When the network is down, we work normally, assuming that everything we send that asks a question of the other side will eventually get a response. When connectivity is found, the two sides reconcile the streams of messages they have for each other and make progress.
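
Here is a minimal sketch of the queue side, assuming a browser-style environment with localStorage for persistence; the outbox/reconcile names and the transport.exchange stand-in are invented for illustration. Calls are appended locally whether or not the network is up, and a reconcile pass drains them whenever connectivity shows up.

// Persistent outbox: every outgoing call is appended here first, so the
// application behaves the same whether the network is up or down.
function loadOutbox() {
  return JSON.parse(localStorage.getItem("dpc.outbox") || "[]");
}

function saveOutbox(queue) {
  localStorage.setItem("dpc.outbox", JSON.stringify(queue));
}

// Enqueue a call; nothing here touches the network.
function enqueueCall(method, args, deadline) {
  const queue = loadOutbox();
  queue.push({ id: Date.now() + ":" + Math.random(), method: method, args: args, deadline: deadline });
  saveOutbox(queue);
}

// Reconcile: when connectivity appears, push everything we have queued and
// collect whatever the other side queued for us. transport.exchange is a
// stand-in for whatever actually moves the bytes.
async function reconcile(transport) {
  const queue = loadOutbox();
  if (queue.length === 0) return [];
  const responses = await transport.exchange(queue); // throws if still offline
  saveOutbox([]);                                    // drained successfully
  return responses;                                  // answers to earlier calls
}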

Every Remote Call Must Have A Timeout

When an asynchronous call is made, a timeout is mandatory, even if it is very long. The timeout must fall within the window in which we are guaranteed to remember that we made the request and are waiting on an answer. When a timeout happens, we get an exception from our local queue, with effects similar to what would happen if the remote end had actually received the message and had to report an error.
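
One way to enforce this locally, again only as a sketch with invented names: a periodic sweep over the table of pending requests delivers a synthetic error to the waiting callback once its deadline passes, which looks to the application exactly like a failure reported by the remote end.

// Requests we are still waiting on, each with an absolute deadline.
const pending = new Map(); // id -> { deadline, onDone }

function invokeAsync(id, deadline, onDone) {
  pending.set(id, { deadline: deadline, onDone: onDone });
}

// Periodic sweep: anything past its deadline gets a local timeout error,
// which looks to the application just like a failure reported remotely.
function sweepTimeouts(now) {
  for (const [id, req] of pending) {
    if (now >= req.deadline) {
      pending.delete(id);
      req.onDone(new Error("timed out waiting for remote response"), null);
    }
  }
}

// The same callback fires when a real answer finally arrives; answers that
// show up after the timeout are simply ignored.
function deliverResponse(id, response) {
  const req = pending.get(id);
  if (!req) return;
  pending.delete(id);
  req.onDone(null, response);
}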

Servers Reboot Constantly and Take Forever To Respond

If all progress is written into a persistent queue, then we can treat the persistent queues as the primary state of the system. As long as all parties sync with each other before any timeout thresholds pass, no errors are observed and the system makes progress as designed. It should not matter how often the remote computer is turned on or talking to the network; it could be a hand-held device that is turned off whenever it is not actively being used.

What RPC Needs To Look Like In This World

Synchronous calls that fail because something remote is currently unavailable just produce lots of "errors" that aren't really errors. You can't design protocols for mobile systems this way without implicitly forcing them all to drain their batteries and run up phone bills polling the network. I am saying that the current design of the web is deeply flawed now that we aren't all sitting at desktops. If you visit a website from a mobile device, and the website states that it doesn't expect the data to change for another 24hrs, then it is reasonable to expect that for the next 24hrs you can continue to view that data without a network connection. Currently, almost no web site and browser combination in existence behaves like this.

These systems must base their communications on persistent local queues that drain data to (and reconcile with) the remote systems when the opportunity arises. It is similar to a queue-based replication scheme. Data already retrieved needs to be smartly cached, and the exchanges of a conversation need to be smartly predicted.
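
A rough sketch of the caching half, assuming the server can attach a "good for N hours" hint to each response (the expiresAt field, the storage keys, and the 24hr default below are invented for illustration): anything still inside its window is served locally without ever touching the network.

// Store a response together with the server's own statement of how long it
// expects the data to stay valid.
function cachePut(url, body, maxAgeHours) {
  const entry = { body: body, expiresAt: Date.now() + maxAgeHours * 3600 * 1000 };
  localStorage.setItem("cache:" + url, JSON.stringify(entry));
}

// Serve from the local cache while the entry is still fresh; only fall back
// to the network (and possible failure) once it has expired.
async function cachedFetch(url) {
  const raw = localStorage.getItem("cache:" + url);
  if (raw) {
    const entry = JSON.parse(raw);
    if (Date.now() < entry.expiresAt) return entry.body; // no network needed
  }
  const response = await fetch(url);   // may fail while offline
  const body = await response.text();
  cachePut(url, body, 24);             // assume a 24hr hint from the server
  return body;
}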

Communications API

I would expect that an in-browser JavaScript version of this sort of library would be highly useful, especially when designed for mobile apps. Web browsers talking via an async API mediated by queues. (Ajax, except it's not an error if the remote end isn't available. It's only an error if the remote end can't respond in time. And the message queue and implicit app state need to survive reboots(!!), because it can take days to get a response in some cases.)

// Tell the remote end that we would like to run its makeMove with the given
// args, invoke makeMoveDone with its response, and consider the game over if
// we don't get a response within two days.
dpc.invoke( opponent, game0, makeMoveSequence("(e1 (c6 (e2)))"), makeMoveDone, TwoDaysFromNow() );
// When makeMoveDone is invoked, we may have made multiple moves of progress,
// all committed by the remote end at his leisure at different times
// (ie: an hour apart for each response).

makeMoveDone might get an exception argument back, either because the server could not handle the request or simply because the time limit was missed. Besides guaranteeing that an exception or an answer is eventually delivered, the weaving together of sequences of moves to minimize latency is the important part.
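
To make that concrete, a hypothetical makeMoveDone could look like the following; the error-versus-result convention mirrors the timeout sketch above rather than any real dpc library, and the board/game helpers are stubs.

// Stand-ins for whatever the real game UI would do.
function endGame(reason) { console.log(reason); }
function applyMoveToBoard(move) { console.log("opponent played", move); }

// Callback for dpc.invoke: either we get an error (a server failure, or the
// two-day limit passed with no answer), or we get the batch of moves the
// opponent committed, possibly several at once.
function makeMoveDone(err, moves) {
  if (err) {
    // A missed deadline is treated the same way as a remote-side failure.
    endGame("Game abandoned: " + err.message);
    return;
  }
  // The opponent may have walked several levels of our predicted tree, so
  // apply every move that came back in this one response.
  for (const move of moves) {
    applyMoveToBoard(move);
  }
}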

Summary

I know that existing message queue libraries address some of these requirements, but I am not so sure that anything exists to weave together questions and answers in a general way. Very few internet systems are designed from the ground up to assume that neither user is reliably available outside of rare periods of connectivity, while still letting applications be built as if the other user were actually available (but very slow to respond). The benefit of doing it this way is that the periods in which both users are online suddenly allow rapid progress, yet things still proceed smoothly when that is not the case.