Hardening node.js for production part 3: zero downtime deployments with nginx

Below I’ll talk about deploying new node.js code for a HTTP server without ever suffering downtime. This is part of our series on hardening node.js for production use in the Silly Face Society – see part 1 and part 2.

Suffering downtime to perform an upgrade always feels a bit backwards. We do it because it is too technically complicated relative to the expense of an outage. In the Silly Face Society, I’m willing to suffer brief outages for PostgreSQL and Redis upgrades but not for code that we control: our day-to-day bug fixes. Besides bug fixes, the frequency of node.js module updates would require an outage multiple times a week just to stay fresh.

To solve this, the basic idea using nginx’s built-in failover mechanism for upstream servers / processes to shift traffic away from processes that are restarting. Each upstream server will allow existing connections to finish before shutting down and restarting with the new code. We perform a zero downtime upgrade of all processes on the machine by by iterating over each process, shutting it down and bringing it back up. For the rest of the tutorial, I’ll assume you have nginx set up to proxy node requests ala part 2 and are using express.js to handle HTTP requests.

All of this sounds much more complicated than it is, so let’s dive into code:

Graceful Shutdowns in Express.js

We will first implement graceful shutdowns of the process. When our process receives a kill signal we want it to refuse new connections and finish existing ones. I’ll assume you have an express.js server set up with something like:

app = express.createServer()

We can modify this slightly to perform graceful shutdowns on SIGTERM. Additionally, we’ll create a timeout that forcefully exits the process if connections are taking an unreasonable amount of time to close:

httpServer = app.listen(31337)
process.on 'SIGTERM', ->
  console.log "Received kill signal (SIGTERM), shutting down gracefully."
  httpServer.close ->
    console.log "Closed out remaining connections."

  setTimeout ->
    console.error "Could not close connections in time, forcefully shutting down"
  , 30*1000

In the above code, we extract the underlying http server object from express.js (the result of the app.listen call). Whenever we receive SIGTERM (the default signal from kill), we attempt a graceful shutdown by calling httpServer.close. This puts the server in a mode that refuses new connections but keeps existing ones open. If there is a connection hog that doesn’t quit within that time period, we perform an immediate exit (setTimeout does this after 30 seconds). Modify this timeout as appropriate. Note: I don’t use web sockets, but they would be considered connection hogs by the above logic. To achieve zero impactful downtime, you would have to close out these connections manually and have some nifty retry logic on the client.

There is one issue with the code: HTTP 1.1 keep-alive connections would also be considered “connection hogs” and continue to accept new requests on the same connection. Since I use keep-alive connections in nginx, this is a big problem. Ideally we would force node.js into a mode that closes all existing idle connections. Unfortunately, I can’t find any way of doing this with existing APIs (see this newsgroup discussion). Fortunately, we can add middleware that automatically sends 502 errors to new HTTP requests on the server. Nginx will handle the rest (see below). Here’s the modification:

app = express.createServer()
gracefullyClosing = false
app.use (req, res, next) ->
  return next() unless gracefullyClosing
  res.setHeader "Connection", "close"
  res.send 502, "Server is in the process of restarting"
httpServer = app.listen(31337)
process.on 'SIGTERM', ->
   gracefullyClosing = true

This should be mostly self-explanatory: we flip a switch that makes every new request stop with a 502 error. We also send a Connection: close header to hint that this socket should be terminated. As usual, this minimal example is available as a gist.

Ignoring Restarting Servers in Nginx

We will assume you have an nginx server with more than one upstream server in a section like:

upstream silly_face_society_upstream {
keepalive 64;

By default, if nginx detects an error (i.e. connection refused) or a timeout on one upstream server, it will fail over to the next upstream server. The full process is explained within the proxy_next_upstream section of the HttpProxy module documentation. The default is essentially the behaviour we want, modulo fail-overs on keep-alive connections. As mentioned above, we throw a 502 to indicate a graceful shutdown in progress. Insert a proxy_next_upstream directive like:

location @nodejs {
proxy_next_upstream error timeout http_502;
proxy_pass http://silly_face_society_upstream;

With the above addition nginx will failover to the next upstream whenever it gets an error, timeout or 502 from the current one.

Performing zero downtime deployments

Believe it or not, everything is in place to do zero downtime deployments. Whenever new code is pushed we have to bounce each process individually. To gracefully start the server with new code:

  1. Issue a SIGTERM signal (kill <pid> will do that)
  2. Wait for termination. As a simplification, wait the kill timeout and a bit of a buffer.
  3. Start the process up again.

That’s it: nginx will handle the hard work of putting traffic on the healthy processes! If you are running in a managed environment, you can even automate the restarts. I’ve put a new version of my child_monitor.coffee script from part 1 on github as a gist to show how you can go about it. The master process listens for a SIGHUP (indicating a code push). Whenever it receives the signal, it kills+restarts each monitored child with a short waiting period in between each kill.

Bingo bango. Ah, I forgot a shameless plug for our upcoming iPhone game: if you like silly faces, visit www.sillyfacesociety.com and get the app! It’s silltastic!