Hardening node.js for production: a process supervisor

This post outlines a basic process supervisor for node.js that can be used to manage multiple PIDs underneath the same application. It contains sample code for our child_monitor with a deep health check for socket servers.

We started using node.js in the Silly Face Society because it seemed like the quickest way to get a (true, not web) socket server up for our “party mode”. Party mode is a stateful chat+game server in which a set of processes check into redis to advertise themselves as available to host games (I’ll expand on this architecture in another post). For development, node.js was a wonderful way to get a prototype out the door, but as our launch approaches we are getting nervous about the robustness of node.js in production. This is especially noticeable when you tread off the beaten path of HTTP.

To begin: keeping our servers up is hard. Whenever a node.js error goes unhandled, it unwinds the stack and leaves v8 in an unrecoverable state; the only recovery is to restart the process. There are a few modules that help, and forever seems to be the most popular. However, forever won’t let us easily manage a set of child processes that fail independently, nor will it give us deep health checks. In the Silly Face Society’s party mode, each process owns a set of game rooms, and users in the same room must connect to the same process. For multi-core utilization we run separate processes (i.e. distinct PIDs) on the same host, each listening on a different port for socket connections. To supervise these processes we have a script called child_monitor that handles logging, health checks and restarts across multiple PIDs.

Before we get to code, a couple of things to note:

  • forever could be used to manage the child_monitor, although I prefer using Ubuntu’s upstart for this purpose.
  • The cluster API isn’t applicable. Our traffic is routed to specific processes based on the game and thus can’t be load balanced.
  • By capturing exits of child processes, the supervisor can perform recovery tasks for the child’s unhandled exceptions from its own, still-safe v8 process (sketched right below).
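
For instance, a minimal sketch of that last point, assuming a hypothetical redis set named available_hosts that children check into (child, port and respawn come from the supervisor code below):

os = require 'os'
redis = require('redis').createClient()

child.on 'exit', (code, signal) ->
  # The supervisor's v8 instance is intact, so async cleanup is safe here
  # even though the child died from an uncaught exception.
  redis.srem "available_hosts", "#{os.hostname()}:#{port}", ->
    respawn()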

Here is child_monitor.coffee and a detailed walkthrough:

{_} = require 'underscore'
child_process = require 'child_process'

healthCheckInterval = 60 * 1000

delayTimeout = (ms, func) -> setTimeout func, ms #make setTimeout drink coffee
exports.spawnMonitoredChild = (script, port, healthCheck, environmentVariables) ->
  respawn = ->
    # Copy so we don't mutate the caller's environmentVariables; the
    # child-specific variables win over the inherited environment.
    child = child_process.spawn process.execPath, [script],
      env: _.extend({}, process.env, environmentVariables)

    console.log "Started child, port=#{port}, pid=#{child.pid}"
    child.stdout.pipe process.stdout
    child.stderr.pipe process.stderr

    healthCheckTimeout = null

    # Schedule each check only after the previous one completes, so a slow
    # response can never cause overlapping health checks.
    delayedHealthCheck = ->
      healthCheckTimeout = delayTimeout healthCheckInterval, ->
        start = new Date()
        healthCheck port, (healthy) ->
          if healthy
            console.log "#{port} is healthy - ping time #{new Date() - start}ms"
            delayedHealthCheck()
          else
            console.error "#{port} did not respond in time - killing it"
            child.kill()

    child.on 'exit', (code, signal) ->
      clearTimeout healthCheckTimeout
      console.error "Child exited with code #{code}, signal #{signal}, respawning"
      respawn()

    delayedHealthCheck()
  respawn()

Also available as part of this gist on github.

First, the method signature:

spawnMonitoredChild = (script, port, healthCheck, environmentVariables) ->

spawnMonitoredChild accepts a script path (i.e. the node.js script you want to spawn), the port of the child (used to disambiguate logging statements and passed to the health check), a deep healthCheck function and a set of environment variables. environmentVariables passes variables down to the child, e.g. the environment name (prod/development) or the child’s port.
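
On the child’s side, those variables arrive on process.env as usual. The SFS_* names below come from the spawn example at the end of this post:

# In the child script: read the configuration child_monitor passed down.
port = parseInt process.env.SFS_SOCKET_PORT, 10
host = process.env.SFS_SOCKET_HOST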

The implementation of spawnMonitoredChild declares the respawn function, and then calls respawn() to bootstrap the respawn cycle. respawn is where the magic happens – it starts up the subprocess and performs periodic health checks:

    child = child_process.spawn process.execPath, [script],
      env: _.extend({}, process.env, environmentVariables)

This uses spawn from node’s standard child_process module to bring up another instance of v8 (process.execPath is the running node binary) pointing at our child script ([script] is the argument list handed to the new process).

spawn is used instead of fork because spawn lets the supervisor capture the child process’s standard out / standard error.
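
For comparison, a sketch of the fork alternative, dropped into the same respawn function: newer node versions can pipe a forked child’s output back via the silent option, but spawn keeps us on the plain stdio interface without an IPC channel we don’t need.

# Hypothetical fork-based alternative: silent: true gives the parent pipes
# for the child's stdout/stderr instead of letting the child inherit ours.
child = child_process.fork script, [],
  silent: true
  env: _.extend({}, process.env, environmentVariables)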

respawn then proceeds to child process management. First, it wires up the child’s stderr and stdout:

child.stdout.pipe process.stdout
child.stderr.pipe process.stderr

The above code redirects the child’s stdout/stderr to the supervisor’s own streams. In our actual production deployment, I instead capture the output streams and log to winston with a statement like this:

child.stderr.on 'data', ((data) -> winston.fatal "ERROR from child #{port}: #{data}")

winston is configured to send me an e-mail whenever a fatal occurs. Winston’s e-mail mechanism generally can’t run inside the child process after an uncaught exception has occurred; logging from the child_monitor supervisor skirts around this issue.
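
For reference, a minimal sketch of that supervisor-side setup, assuming the winston-mail transport; the addresses are placeholders, and since stock winston lacks a fatal level this version logs at error:

winston = require 'winston'
{Mail} = require 'winston-mail'

# E-mail anything the supervisor logs at "error" or above.
winston.add Mail,
  to: "alerts@example.com"
  from: "child_monitor@example.com"
  level: "error"

child.stderr.on 'data', (data) ->
  winston.error "ERROR from child #{port}: #{data}"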

Moving on, we get to delayedHealthCheck. In a loop (i.e., recursively on callback) it calls the provided healthCheck function; since the next check is only scheduled after the current one completes, checks never pile up the way they could with a naive setInterval. If a check ever fails, it kills the child process and bails on further health checks, leaving the exit handler to take over.

Finally, there is the exit handler for the subprocess:

child.on 'exit', (code, signal) -> 

The code here explains itself: whenever the child process exits, we clear any pending health check, log the exit (a fatal in our production winston setup), and call up to respawn to bring up a replacement.
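
One refinement to consider: if a child dies immediately at startup, an unconditional respawn() will spin in a tight crash loop. Reusing the delayTimeout helper for a short breather is a minimal fix:

child.on 'exit', (code, signal) ->
  clearTimeout healthCheckTimeout
  console.error "Child exited with code #{code}, signal #{signal}, respawning"
  # A one-second pause keeps a boot-crashing child from pegging the CPU.
  delayTimeout 1000, respawn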

That’s it! However, it wouldn’t be much of an example without sample use. Here is the healthCheck / spawn code for our “party mode” socket server:

net = require 'net'

healthCheck = (port, cb) ->
  c = net.connect port, 'localhost'
  c.setEncoding "utf8"

  gotAuth = false
  c.on 'data', (data) ->
    d = null
    try
      d = JSON.parse(data)
    catch error
      c.end()
      console.error "Health check failed: bad initial response, #{data}"
      return cb(false)

    if !gotAuth
      if d.cmd == "PLSAUTH"
        gotAuth = true
        c.write JSON.stringify({cmd:"RING"}) + "\r\n"
      else
        c.end()
        console.error "Health check failed: bad initial response, #{data}"
        return cb(false)
    else
      c.end()
      console.info "Health check response", {res: d}
      return cb(true)

  c.on 'error', (e) ->
    console.error "Health check failed: error connecting #{e}"
    cb(false)

  # Fail explicitly on timeout: destroy() does not emit 'error', so cb would
  # otherwise never fire and the supervisor would stop checking this child.
  c.setTimeout config.healthCheckTimeout, ->
    console.error "Health check failed: timed out"
    c.destroy()
    cb(false)

numWorkers = 2
startPort = 31337
for i in [0..numWorkers-1]
  port = startPort + i
  child_monitor.spawnMonitoredChild './lib/sfs_socket', port, healthCheck,
    SFS_SOCKET_PORT: port
    SFS_SOCKET_HOST: socketHost

The details aren’t too important: healthCheck connects to the local server and waits for its PLSAUTH greeting before sending a RING command. The socket server is supposed to send AHOY-HOY as a response. Don’t ask.
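
For completeness, a hypothetical sketch of the server’s half of that handshake (the real sfs_socket does far more; this is just the minimum that would satisfy the health check):

net = require 'net'

server = net.createServer (socket) ->
  socket.setEncoding "utf8"
  # Greet every connection, then answer RING so the supervisor's deep
  # health check sees a full round trip through the protocol.
  socket.write JSON.stringify({cmd: "PLSAUTH"}) + "\r\n"
  socket.on 'data', (data) ->
    try
      msg = JSON.parse data
    catch error
      return socket.end()
    if msg.cmd == "RING"
      socket.write JSON.stringify({cmd: "AHOY-HOY"}) + "\r\n"

server.listen parseInt(process.env.SFS_SOCKET_PORT, 10)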

Beyond sockets, we also use child_monitor to manage a set of express.js instances that serve our HTTP actions. Because each instance listens on its own port, we can put nginx in front as a load balancer, HTTP cache and static file host; child_monitor ensures that a backend is always up for nginx to proxy to. I’ll try to write a follow-up post with the full details.
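
The same spawnMonitoredChild call works there; only the health check changes. A minimal sketch, assuming each express worker exposes a hypothetical /health route:

http = require 'http'

httpHealthCheck = (port, cb) ->
  called = false
  done = (healthy) ->
    # req.abort() also fires 'error', so guard against calling cb twice.
    return if called
    called = true
    cb healthy

  req = http.get {host: "localhost", port: port, path: "/health"}, (res) ->
    done(res.statusCode == 200)
  req.on 'error', -> done(false)
  req.setTimeout 5000, -> req.abort()

Wiring it up is the same call as before, e.g. child_monitor.spawnMonitoredChild './lib/sfs_http', port, httpHealthCheck, {} (the script path is made up for illustration).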

One final tip: killing the child_monitor supervisor should also take down its child processes. A Linux-only prctl system call in the child process will handle this for you. Add this to the top of your child script:

FFI = require('node-ffi')
# Bind libc's prctl(2): int prctl(int option, unsigned long arg2, ...).
current = new FFI.Library(null, {"prctl": ["int32", ["int32", "uint32"]]})
# PR_SET_PDEATHSIG (1): send this process SIGTERM (15) when its parent dies.
current.prctl(1, 15)

Naturally, I stole the prctl bit from this stackoverflow post.

For the latest version of this code, see the gist on github.