Hardening node.js for production part 3: zero downtime deployments with nginx

Below I’ll talk about deploying new node.js code for an HTTP server without ever suffering downtime. This is part of our series on hardening node.js for production use in the Silly Face Society – see part 1 and part 2.

Suffering downtime to perform an upgrade always feels a bit backwards. We do it because avoiding it is too technically complicated relative to the cost of a brief outage. In the Silly Face Society, I’m willing to suffer brief outages for PostgreSQL and Redis upgrades, but not for code that we control: our day-to-day bug fixes. Bug fixes aside, the frequency of node.js module updates alone would require an outage several times a week just to stay current.

To solve this, the basic idea is to use nginx’s built-in failover mechanism for upstream servers / processes to shift traffic away from processes that are restarting. Each upstream server allows existing connections to finish before shutting down and restarting with the new code. We perform a zero downtime upgrade of all processes on the machine by iterating over each process, shutting it down and bringing it back up. For the rest of the tutorial, I’ll assume you have nginx set up to proxy node requests as in part 2 and are using express.js to handle HTTP requests.

All of this sounds much more complicated than it is, so let’s dive into code:

Graceful Shutdowns in Express.js

We will first implement graceful shutdowns of the process. When our process receives a kill signal we want it to refuse new connections and finish existing ones. I’ll assume you have an express.js server set up with something like:

express = require 'express'

app = express.createServer()
...
app.listen(31337)

We can modify this slightly to perform graceful shutdowns on SIGTERM. Additionally, we’ll create a timeout that forcefully exits the process if connections are taking an unreasonable amount of time to close:

httpServer = app.listen(31337)
process.on 'SIGTERM', ->
  console.log "Received kill signal (SIGTERM), shutting down gracefully."
  httpServer.close ->
    console.log "Closed out remaining connections."
    process.exit()

  setTimeout ->
    console.error "Could not close connections in time, forcefully shutting down"
    process.exit(1)
  , 30*1000

In the above code, we extract the underlying http server object from express.js (the result of the app.listen call). Whenever we receive SIGTERM (the default signal from kill), we attempt a graceful shutdown by calling httpServer.close. This puts the server into a mode that refuses new connections but keeps existing ones open. If a connection hog doesn’t quit in time, we force an immediate exit (the setTimeout does this after 30 seconds). Modify this timeout as appropriate. Note: I don’t use web sockets, but they would be considered connection hogs by the above logic. To achieve zero impactful downtime, you would have to close out these connections manually (a rough sketch follows) and have some nifty retry logic on the client.
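
For illustration only, here is a minimal sketch of what closing lingering connections manually might look like. It assumes the httpServer from the snippet above; destroyOpenSockets is a hypothetical helper you would call from the SIGTERM handler.

openSockets = {}
nextSocketId = 0

# Keep a handle on every open socket so long-lived connections (e.g. web
# sockets) can be torn down explicitly during shutdown.
httpServer.on 'connection', (socket) ->
  socketId = nextSocketId++
  openSockets[socketId] = socket
  socket.on 'close', -> delete openSockets[socketId]

# Call this from the SIGTERM handler (e.g. just before the forced exit) to
# destroy anything that refuses to finish on its own.
destroyOpenSockets = ->
  socket.destroy() for own id, socket of openSockets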

There is one issue with the code: HTTP/1.1 keep-alive connections would also be considered “connection hogs” and continue to accept new requests on the same connection. Since I use keep-alive connections in nginx, this is a big problem. Ideally we would force node.js into a mode that closes all existing idle connections. Unfortunately, I can’t find any way of doing this with existing APIs (see this newsgroup discussion). Fortunately, we can add middleware that automatically sends 502 errors to new HTTP requests on the server. Nginx will handle the rest (see below). Here’s the modification:

app = express.createServer()
...
gracefullyClosing = false
app.use (req, res, next) ->
  return next() unless gracefullyClosing
  res.setHeader "Connection", "close"
  res.send 502, "Server is in the process of restarting"
...
httpServer = app.listen(31337)
process.on 'SIGTERM', ->
  gracefullyClosing = true
  ...

This should be mostly self-explanatory: we flip a switch that makes every new request stop with a 502 error. We also send a Connection: close header to hint that this socket should be terminated. As usual, this minimal example is available as a gist.

Ignoring Restarting Servers in Nginx

We will assume your nginx configuration has more than one upstream server defined in a section like:

upstream silly_face_society_upstream {
  server 127.0.0.1:61337;
  server 127.0.0.1:61338;
  keepalive 64;
}

By default, if nginx detects an error (e.g. connection refused) or a timeout on one upstream server, it will fail over to the next upstream server. The full process is explained in the proxy_next_upstream section of the HttpProxy module documentation. The default (“error timeout”) is essentially the behaviour we want, except that it doesn’t cover keep-alive connections: as mentioned above, a process that is shutting down gracefully answers those with a 502, so we need nginx to fail over on that too. Insert a proxy_next_upstream directive like:

...
location @nodejs {
  ...
  proxy_next_upstream error timeout http_502;
  ...
  proxy_pass http://silly_face_society_upstream;
}
...

With the above addition, nginx will fail over to the next upstream server whenever it gets an error, a timeout or a 502 from the current one.
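
For reference, here is a rough sketch of how the upstream and location blocks might fit together. The listen, server_name and try_files lines are assumptions based on the proxy setup from part 2; proxy_http_version 1.1 and the cleared Connection header are what allow the upstream keepalive directive to take effect.

upstream silly_face_society_upstream {
  server 127.0.0.1:61337;
  server 127.0.0.1:61338;
  keepalive 64;
}

server {
  listen 80;
  server_name www.sillyfacesociety.com;

  location / {
    # Serve static files directly and fall back to the node processes.
    try_files $uri @nodejs;
  }

  location @nodejs {
    proxy_http_version 1.1;          # required for keep-alive connections to the upstream
    proxy_set_header Connection "";  # clear the Connection header nginx would otherwise send
    proxy_next_upstream error timeout http_502;
    proxy_pass http://silly_face_society_upstream;
  }
}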

Performing Zero Downtime Deployments

Believe it or not, everything is now in place for zero downtime deployments. Whenever new code is pushed, we bounce each process individually. To gracefully restart each process with the new code:

  1. Issue a SIGTERM signal (kill <pid> will do that)
  2. Wait for termination. As a simplification, wait the kill timeout and a bit of a buffer.
  3. Start the process up again.

That’s it: nginx will handle the hard work of putting traffic on the healthy processes! If you are running in a managed environment, you can even automate the restarts. I’ve put a new version of my child_monitor.coffee script from part 1 on github as a gist to show how you can go about it. The master process listens for a SIGHUP (indicating a code push). Whenever it receives the signal, it kills and restarts each monitored child, with a short waiting period between each kill.
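
If you don’t want to dig through the gist, here is a heavily simplified sketch of the idea. The server.js script name, the ports and the 35 second wait are assumptions – adjust them to match your own setup.

# Heavily simplified sketch of a rolling restart on SIGHUP. The 'server.js'
# script name, ports and timeouts are assumptions -- adjust for your setup.
{spawn} = require 'child_process'

ports    = [61337, 61338]
children = {}

startChild = (port) ->
  children[port] = spawn 'node', ['server.js', '--port', "#{port}"]
  console.log "Started child for port #{port} (pid #{children[port].pid})"

startChild port for port in ports

# On SIGHUP (i.e. after a code push), bounce one child at a time so nginx
# always has at least one healthy upstream to route requests to.
process.on 'SIGHUP', ->
  console.log "Received SIGHUP, performing a rolling restart"
  restartNext = (remaining) ->
    return if remaining.length is 0
    [port, rest...] = remaining
    children[port].kill 'SIGTERM'   # triggers the graceful shutdown above
    setTimeout ->
      startChild port               # bring it back up with the new code
      restartNext rest
    , 35 * 1000                     # graceful shutdown timeout plus a buffer
  restartNext ports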

Bingo bango. Ah, I almost forgot the shameless plug for our upcoming iPhone game: if you like silly faces, visit www.sillyfacesociety.com and get the app! It’s silltastic!

  • http://twitter.com/dlerner daniel

    Super nice!
    I’ll reuse the proxy_next_upstream part – I thought that was the default behavior for nginx’s upstreams.

    What are you using for monitor/alerts/restart? I’ve installed a combination of upstart & monit.
    Good luck with the launch!

  • aqarynet

    Great article, I really liked it!

  • tony

    why coffeescript…

    • pavelnikolov

      why not?

      • chovysblog

        …oh boy…here we go again….

  • pavelnikolov

    I would use a 503 status code instead of 502 during maintenance. This would tell the client (nginx) that the server (node process) is temporarily unavailable.

  • http://twitter.com/binarykitchen m heuberger

    I doubt your code really works. When a node process is about to exit, the main event loop won’t accept any new events, in your case the setTimeout. See http://nodejs.org/api/process.html#process_event_exit

    • http://twitter.com/cosbynator Thomas Dimson

      While it is possible my code has bugs, I’m not listening for process exit events – I’m listening for SIGTERM.

  • Tom

    Pretty cool. I wish you guys had a webapp so I could try it.

  • jpap

    Great trio of posts!

    For each upstream Node “server”, do you use separate folders/filesystems for application code during your “deploy” round-robin restart?

    I can imagine that if the Node processes depend on the local filesystem, they might have issues unless each process has a separate filesystem copy. Even something simple like a require(…) that hasn’t yet been cached by the system (though unlikely).

    How do you manage database schema updates, or are you NoSQL?

    • tdimson

      Hey Jpap – I do not use separate folders for application code. I don’t think the file system will be doing any caching – after I issue the restart command, the processes all tear down and bring themselves back up. I guess it would give you a way to “abort” the restart if one of the restarts goes sour. That could be neat.

      I use a combination of PostgreSQL and redis. For PostgreSQL I use “liquibase” to handle my schema updates. It is a heavyweight java program with XML migrations but it does the job.

      • jpap

        Good to know about liquibase, looks like a nice generic ActiveRecord-like migration tool that is sure to be helpful until the Node-based Rails clones mature.

        On caching, I’m referring to Node’s require() cache [1]. If you’re updating the filesystem underneath one Node instance while it is still running, you might cause your application code to crash if it calls require() on a yet-to-be require()’d code file. If your backend app is completely warm, you might never see it.

        A related problem can occur if your Node app uses local disk. (Sounds like it doesn’t in your case.)

        [1] http://nodejs.org/docs/latest/api/modules.html#modules_caching

        • tdimson

          Aha, gotcha. My process supervisor does a full kill on child processes, so I think the cache would be empty when they restart. Still, good to remember if I ever decide to do some hot swapping.