Websocket transport reliability (Socket.io data loss during reconnection)

As already written in another answer, I also believe you should treat realtime as a bonus: the system should also be able to work without it.

I’m developing an enterprise chat for a large company (iOS, Android, web frontend and a .NET Core + PostgreSQL backend). After having developed a way for the websocket to re-establish its connection (through a socket UUID) and fetch undelivered messages (stored in a queue), I realized there was a better solution: resync via a REST API.

Basically I ended up using the websocket just for realtime, with an integer tag on each realtime message (user online, typing indicators, chat message and so on) for detecting lost messages.

When the client gets an id that is not sequential (previous id + 1), it knows it is out of sync, so it drops all the socket messages and asks for a resync of all its observers through the REST API.

This way we can handle many variations in the state of the application during the offline period without having to parse tons of websocket messages in a row on reconnection and we are sure to be synced (because the last sync date is set just by the REST api, not from the socket).

The only tricky part is handling the realtime messages that arrive between the moment you call the REST API and the moment the server replies: what is read from the db takes time to get back to the client, and changes can happen in the meantime, so they need to be cached and taken into account.
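A minimal sketch of that gap detection and buffered resync, under stated assumptions: `SequenceTracker`, the `seq` field, and the `fetchSnapshot` callback (standing in for the REST call, assumed to return the state plus the last sequence number it covers) are all hypothetical names, not part of any library.

```javascript
// Hypothetical client-side tracker. `fetchSnapshot(callback)` stands in for the
// REST resync call and is assumed to yield { lastSeq, state }.
class SequenceTracker {
  constructor({ fetchSnapshot, applySnapshot, applyMessage }) {
    this.fetchSnapshot = fetchSnapshot;
    this.applySnapshot = applySnapshot;
    this.applyMessage = applyMessage;
    this.lastSeq = 0;       // last sequence number applied
    this.resyncing = false; // true while the REST call is in flight
    this.buffer = [];       // realtime messages held back during a resync
  }

  onRealtimeMessage(msg) {
    if (this.resyncing) {
      this.buffer.push(msg); // cache until the snapshot arrives
      return;
    }
    if (msg.seq <= this.lastSeq) return; // already covered: drop
    if (msg.seq === this.lastSeq + 1) {
      this.lastSeq = msg.seq;
      this.applyMessage(msg);
      return;
    }
    // Gap detected: keep this message and fall back to a full resync.
    this.buffer.push(msg);
    this.resync();
  }

  resync() {
    this.resyncing = true;
    this.fetchSnapshot((err, snapshot) => {
      this.resyncing = false;
      if (err) return; // still out of sync; the next message triggers a retry
      this.applySnapshot(snapshot.state);
      this.lastSeq = snapshot.lastSeq;
      // Replay what arrived during the call; stale entries are dropped above.
      const buffered = this.buffer.sort((a, b) => a.seq - b.seq);
      this.buffer = [];
      buffered.forEach((msg) => this.onRealtimeMessage(msg));
    });
  }
}
```

The point of the design is that the snapshot, not the socket, decides where the client stands: anything buffered with a sequence number at or below `snapshot.lastSeq` is simply discarded.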

We are going into production in a couple of months; I hope to get back to sleeping by then :)


It seems that you already have a user account system. You know which accounts are online/offline, and you can handle the connect/disconnect events.

So the solution is to track online/offline status and store offline messages in the database for each user:

chatApp.onLogin(function (user) {
   // deliver anything queued while the user was offline
   user.readOfflineMessage(function (msgs) {
       user.sendOfflineMessage(msgs, function (err) {
           // only clear the queue once delivery succeeded
           if (!err) user.clearOfflineMessage();
       });
   });
});

chatApp.onMessage(function (fromUser, toUser, msg) {
   if (toUser.isOnline()) {
      toUser.sendMessage(msg, function (err) {
          // alert CAN NOT SEND, RETRY?
      });
   } else {
      toUser.addToOfflineQueue(msg);
   }
});

Others have hinted at this in other answers and comments, but the root problem is that Socket.IO is just a delivery mechanism, and you cannot depend on it alone for reliable delivery. The only person who knows for sure that a message has been successfully delivered to the client is the client itself. For this kind of system, I would recommend making the following assertions:

  1. Messages aren't sent directly to clients; instead, they get sent to the server and stored in some kind of data store.
  2. Clients are responsible for asking "what did I miss" when they reconnect, and will query the stored messages in the data store to update their state.
  3. If a message is sent to the server while the recipient client is connected, that message will be sent in real time to the client.

Of course, depending on your application's needs, you can tune pieces of this--for example, you can use, say, a Redis list or sorted set for the messages, and clear them out if you know for a fact a client is up to date.


Here are a couple of examples:

Happy path:

  • U1 and U2 are both connected to the system.
  • U2 sends a message to the server that U1 should receive.
  • The server stores the message in some kind of persistent store, marking it for U1 with some kind of timestamp or sequential ID.
  • The server sends the message to U1 via Socket.IO.
  • U1's client confirms (perhaps via a Socket.IO callback) that it received the message.
  • The server deletes the persisted message from the data store.

Offline path:

  • U1 loses internet connectivity.
  • U2 sends a message to the server that U1 should receive.
  • The server stores the message in some kind of persistent store, marking it for U1 with some kind of timestamp or sequential ID.
  • The server attempts to send the message to U1 via Socket.IO.
  • U1's client does not confirm receipt, because they are offline.
  • Perhaps U2 sends U1 a few more messages; they all get stored in the data store in the same fashion.
  • When U1 reconnects, it asks the server "The last message I saw was X / I have state X, what did I miss."
  • The server sends U1 all the messages it missed from the data store based on U1's request
  • U1's client confirms receipt and the server removes those messages from the data store.
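Both paths above can be sketched with an in-memory stand-in for the data store. This is only an illustration under assumptions: `MessageStore`, `deliver`, and the `connections` map are hypothetical names, and the object in `connections` only mimics the `emit(event, payload, ackCallback)` shape of a Socket.IO socket.

```javascript
// Hypothetical in-memory stand-in for the persistent store.
class MessageStore {
  constructor() {
    this.nextId = 1;
    this.byRecipient = new Map(); // recipient -> [{ id, from, body }]
  }

  save(recipient, from, body) {
    const message = { id: this.nextId++, from, body };
    if (!this.byRecipient.has(recipient)) this.byRecipient.set(recipient, []);
    this.byRecipient.get(recipient).push(message);
    return message;
  }

  // "What did I miss?": everything stored for `recipient` after `lastSeenId`.
  missedSince(recipient, lastSeenId) {
    return (this.byRecipient.get(recipient) || []).filter((m) => m.id > lastSeenId);
  }

  // Called once the client has confirmed receipt of message `id`.
  acknowledge(recipient, id) {
    const remaining = (this.byRecipient.get(recipient) || []).filter((m) => m.id !== id);
    this.byRecipient.set(recipient, remaining);
  }
}

// Persist first, then attempt realtime delivery if the recipient is connected.
function deliver(store, connections, recipient, from, body) {
  const message = store.save(recipient, from, body);
  const socket = connections.get(recipient);
  if (socket) {
    socket.emit('message', message, () => store.acknowledge(recipient, message.id));
  }
  return message;
}

// Usage: U1 is offline, so the message just waits in the store.
const exampleStore = new MessageStore();
const exampleConnections = new Map(); // recipient -> socket-like object
deliver(exampleStore, exampleConnections, 'U1', 'U2', 'hello');
const missedByU1 = exampleStore.missedSince('U1', 0); // what U1 asks for on reconnect
missedByU1.forEach((m) => exampleStore.acknowledge('U1', m.id));
```

Because the message is stored before any delivery attempt, a dropped connection costs nothing: the message just stays in the store until the recipient asks for it.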

If you absolutely want guaranteed delivery, then it's important to design your system in such a way that being connected doesn't actually matter, and that realtime delivery is simply a bonus; this almost always involves a data store of some kind. As user568109 mentioned in a comment, there are messaging systems that abstract away the storage and delivery of said messages, and it may be worth looking into such a prebuilt solution. (You will likely still have to write the Socket.IO integration yourself.)

If you're not interested in storing the messages in the database, you may be able to get away with storing them in a local array; the server tries to send U1 the message, and stores it in a list of "pending messages" until U1's client confirms that it received it. If the client is offline, then when it comes back it can tell the server "Hey I was disconnected, please send me anything I missed" and the server can iterate through those messages.

Luckily, Socket.IO provides a mechanism (acknowledgements) that lets a client "respond" to a message; it looks like a native JS callback. Here is a sketch:

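// server
// Sketch: `socket` is this client's Socket.IO socket, and every message is
// assumed to carry a sequential numeric `id`.
const pendingMessagesForSocket = [];

function removePending(message) {
  const index = pendingMessagesForSocket.indexOf(message);
  if (index !== -1) pendingMessagesForSocket.splice(index, 1);
}

function sendMessage(message) {
  pendingMessagesForSocket.push(message);
  // the callback runs once the client acknowledges the message
  socket.emit('message', message, function () {
    removePending(message);
  });
}

socket.on('reconnection', function (lastKnownMessage) {
  // you may want to make sure you resend them in order, or one at a time, etc.
  const lastKnownId = lastKnownMessage ? lastKnownMessage.id : -Infinity;
  pendingMessagesForSocket
    .filter(function (message) { return message.id > lastKnownId; })
    .forEach(function (message) {
      socket.emit('message', message, function () {
        removePending(message);
      });
    });
});

// client
let previouslyConnected = false;
let lastKnownMessage = null;

// on the client, Socket.IO fires 'connect' (not 'connection')
socket.on('connect', function () {
  if (previouslyConnected) {
    socket.emit('reconnection', lastKnownMessage);
  } else {
    // first connection; any further connections mean we disconnected
    previouslyConnected = true;
  }
});

socket.on('message', function (data, callback) {
  // Do something with `data`
  lastKnownMessage = data;
  callback(); // confirm we received the message
});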

This is quite similar to the last suggestion, simply without a persistent data store.


You may also be interested in the concept of event sourcing.


Michelle's answer is pretty much on point, but there are a few other important things to consider. The main question to ask yourself is: "Is there a difference between a user and a socket in my app?" Another way to ask that is "Can each logged in user have more than 1 socket connection at one time?"

In the web world it is probably always a possibility that a single user has multiple socket connections, unless you have specifically put something in place that prevents this. The simplest example of this is a user with two tabs of the same page open. In these cases you don't care about sending a message/event to the human user just once... you need to send it to each socket instance for that user so that each tab can run its callbacks to update the UI state. Maybe this isn't a concern for certain applications, but my gut says it would be for most. If this is a concern for you, read on....

To solve this (assuming you are using a database as your persistent storage) you would need 3 tables.

  1. users - which is a 1 to 1 with real people
  2. clients - which represents a "tab" that could have a single connection to a socket server. (any 'user' may have multiple)
  3. messages - a message that needs to be sent to a client (not a message that needs to be sent to a user or to a socket)

The users table is optional if your app doesn't require it, but the OP said they have one.

The other thing that needs to be properly defined is: what is a socket connection? When is a socket connection created? When is a socket connection reused? Michelle's pseudocode makes it seem like a socket connection can be reused. With Socket.IO, they CANNOT be reused. I've seen this be the source of a lot of confusion. There are real life scenarios where Michelle's example does make sense, but I have to imagine those scenarios are rare. What really happens is that when a socket connection is lost, that connection, ID, etc. will never be reused. So any messages marked for that socket specifically will never be delivered to anyone, because when the client that had originally connected reconnects, it gets a completely brand new connection and a new ID. This means it's up to you to do something to track clients (rather than sockets or users) across multiple socket connections.

So for a web based example here would be the set of steps I'd recommend:

  • When a user loads a client (typically a single webpage) that has the potential for creating a socket connection, add a row to the clients database which is linked to their user ID.
  • When the user actually does connect to the socket server, pass the client ID to the server with the connection request.
  • The server should validate the user is allowed to connect and the client row in the clients table is available for connection and allow/deny accordingly.
  • Update the client row with the socket ID generated by Socket.IO.
  • Send any items in the messages table connected to the client ID. There wouldn't be any on initial connection, but if this was from the client trying to reconnect, there may be some.
  • Any time a message needs to be sent to that socket, add a row in the messages table which is linked to the client ID you generated (not the socket ID).
  • Attempt to emit the message and listen for the client's acknowledgement.
  • When you get the acknowledgement, delete that item from the messages table.
  • You may wish to create some logic on the client side that discards duplicate messages sent from the server since this is technically a possibility as some have pointed out.
  • Then when a client disconnects from the socket server (purposefully or via error), DO NOT delete the client row, just clear out the socket ID at most. This is because that same client could try to reconnect.
  • When a client tries to reconnect, send the same client ID it sent with the original connection attempt. The server will view this just like an initial connection.
  • When the client is destroyed (user closes the tab or navigates away), this is when you delete the client row and all messages for this client. This step may be a bit tricky.
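The steps above can be sketched with in-memory maps standing in for the clients and messages tables. Everything here is an assumption made for illustration: `ClientRegistry` and its method names are hypothetical, user validation is left out, and the socket objects only need an `emit(event, payload, ackCallback)` method shaped like Socket.IO's.

```javascript
// Hypothetical in-memory stand-ins for the `clients` and `messages` tables.
class ClientRegistry {
  constructor() {
    this.clients = new Map();  // clientId -> { userId, socket }
    this.messages = new Map(); // messageId -> { clientId, payload }
    this.nextMessageId = 1;
  }

  // The page loaded: create the client row before any socket exists.
  registerClient(clientId, userId) {
    this.clients.set(clientId, { userId, socket: null });
  }

  // Handles the first connection and every reconnection the same way.
  attachSocket(clientId, socket) {
    const client = this.clients.get(clientId);
    if (!client) return false; // no client row: deny the connection
    client.socket = socket;
    for (const [id, message] of this.messages) {
      if (message.clientId === clientId) this.tryDeliver(id); // flush the backlog
    }
    return true;
  }

  // Connection lost: keep the client row, only forget the socket.
  detachSocket(clientId) {
    const client = this.clients.get(clientId);
    if (client) client.socket = null;
  }

  // One message row per client (tab) of the user, not one per user.
  sendToUser(userId, payload) {
    for (const [clientId, client] of this.clients) {
      if (client.userId !== userId) continue;
      const id = this.nextMessageId++;
      this.messages.set(id, { clientId, payload });
      this.tryDeliver(id);
    }
  }

  // Emit if connected; the row is deleted only when the client acknowledges.
  tryDeliver(id) {
    const message = this.messages.get(id);
    const client = this.clients.get(message.clientId);
    if (!client || !client.socket) return;
    client.socket.emit('message', { id, payload: message.payload }, () => {
      this.messages.delete(id);
    });
  }

  // Tab closed for good: drop the client row and its undelivered messages.
  destroyClient(clientId) {
    this.clients.delete(clientId);
    for (const [id, message] of this.messages) {
      if (message.clientId === clientId) this.messages.delete(id);
    }
  }
}
```

Keying message rows by client ID rather than socket ID is what lets a reconnecting tab, which arrives with a new socket, pick up its backlog.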

Because the last step is tricky (at least it used to be, I haven't done anything like that in a long time), and because there are cases like power loss where the client disconnects without cleaning up its client row and never tries to reconnect with that same row, you probably want something that runs periodically to clean up any stale client and message rows. Or, you can just permanently store all clients and messages forever and mark their state appropriately.
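A periodic cleanup along those lines might look like the sketch below. The row shapes (`connected`, `disconnectedAt`), the function name `sweepStaleClients`, and the idle threshold are assumptions chosen for illustration, again with maps standing in for the tables.

```javascript
// Hypothetical row shapes:
//   clients:  clientId  -> { connected: boolean, disconnectedAt: number|null }
//   messages: messageId -> { clientId, ... }
// Removes clients that have been disconnected for at least `maxIdleMs`,
// together with their undelivered messages. Returns the removed client IDs.
function sweepStaleClients(clients, messages, now, maxIdleMs) {
  const removed = [];
  for (const [clientId, client] of clients) {
    if (client.connected) continue;                          // still live
    if (now - client.disconnectedAt < maxIdleMs) continue;   // may still reconnect
    clients.delete(clientId);
    removed.push(clientId);
  }
  for (const [messageId, message] of messages) {
    if (removed.includes(message.clientId)) messages.delete(messageId);
  }
  return removed;
}
```

On a server this could be scheduled with something like `setInterval(() => sweepStaleClients(clients, messages, Date.now(), 24 * 60 * 60 * 1000), 60 * 1000)`; the right threshold depends on how long a client can plausibly stay offline and still come back.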

So just to be clear, in cases where one user has two tabs open, you will be adding two identical messages to the messages table, each marked for a different client, because your server needs to know whether each client received them, not just each user.
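Since a message row is only deleted after its acknowledgement arrives, a client can receive the same row twice (for example when the acknowledgement is lost just before a reconnect), which is the duplicate case mentioned in the steps above. A minimal sketch of client-side duplicate discarding, assuming each message carries the `id` of its row in the messages table; `createDeduplicator` and `maxRemembered` are hypothetical names:

```javascript
// Wraps a message handler so that repeated deliveries of the same message id
// are handled once but acknowledged every time.
function createDeduplicator(handle, maxRemembered = 1000) {
  const seen = new Set(); // a Set iterates in insertion order (oldest first)
  return function onMessage(message, ack) {
    if (!seen.has(message.id)) {
      seen.add(message.id);
      if (seen.size > maxRemembered) {
        seen.delete(seen.values().next().value); // forget the oldest id
      }
      handle(message);
    }
    // Always acknowledge, so the server can delete its row even for a duplicate.
    if (typeof ack === 'function') ack();
  };
}
```

On the client this would be wired up as `socket.on('message', createDeduplicator(render))`, where `render` is whatever updates the UI. The bounded set keeps memory flat at the cost of forgetting very old ids.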