Hardening node.js for production: a process supervisor

This post outlines a basic process supervisor for node.js that can be used to manage multiple PIDs underneath the same application. It contains sample code for our child_monitor with a deep health check for socket servers.

We started using node.js in the Silly Face Society because it seemed like the quickest way to get a (true, not web) socket server up for our “party mode”. The party mode is a stateful chat+game server where a set of processes check into redis to advertise themselves as available to host games (I’ll expand on this architecture in another post). For development, node.js was a wonderful way to get a prototype out the door, but as our launch approaches we are getting nervous about the robustness of node.js in production. This is especially noticeable when you tread off the beaten path of HTTP.

To begin: keeping our servers up is hard. Whenever a node.js error is unhandled, it unwinds the stack and leaves v8 in an unrecoverable state. The only recovery is to restart the process. There are a few modules that help; forever seems to be the most popular. However, using forever won’t allow us to easily manage a set of child processes that fail independently, nor will it give us deep health checks. In the Silly Face Society’s party mode, each process has a set of game rooms and users in the same room must connect to the same process. For multi-core utilization we run separate processes (i.e. distinct PIDs) on the same host that listen on different ports for socket connections. To supervise these processes we have a script called child_monitor that handles logging, health checks and restarts across multiple PIDs.

Before we get to code, a couple of things to note:

  • forever could be used to manage the child_monitor, although I prefer using Ubuntu’s upstart for this purpose (a sample upstart job is sketched after this list).
  • The cluster API isn’t applicable. Our traffic is routed to specific processes based on the game and thus can’t be load balanced.
  • By capturing exits of child processes, you can perform recovery tasks for the child’s unhandled exceptions from a v8 process that is still in a safe state.
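
For the curious, a minimal upstart job for the supervisor might look like the following. This is only a sketch; the file name, paths and log location are hypothetical:

# /etc/init/sfs-supervisor.conf (hypothetical name and paths)
description "SFS process supervisor"
start on runlevel [2345]
stop on runlevel [016]
respawn
exec /usr/local/bin/node /var/www/sfs/supervisor.js >> /var/log/sfs-supervisor.log 2>&1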

Here is child_monitor.coffee and a detailed walkthrough:

{_} = require 'underscore'
child_process = require 'child_process'

healthCheckInterval = 60 * 1000

delayTimeout = (ms, func) -> setTimeout func, ms #make setTimeout drink coffee
exports.spawnMonitoredChild = (script, port, healthCheck, environmentVariables) ->
  respawn = ->
    # Spawn a fresh v8 child running `script`; extending a copy of process.env
    # lets the child-specific variables win on name collisions without
    # mutating the caller's object.
    child = child_process.spawn process.execPath, [script],
      env: _.extend({}, process.env, environmentVariables)

    console.log "Started child, port=#{port}, pid=#{child.pid}"
    child.stdout.pipe process.stdout
    child.stderr.pipe process.stderr

    healthCheckTimeout = null

    # Schedule a health check; on success, schedule the next one. On failure,
    # kill the child -- the 'exit' handler below takes care of respawning.
    delayedHealthCheck = ->
      healthCheckTimeout = delayTimeout healthCheckInterval, ->
        start = new Date()
        healthCheck port, (healthy) ->
          if healthy
            console.log "#{port} is healthy - ping time #{new Date() - start}ms"
            delayedHealthCheck()
          else
            console.error "#{port} did not respond in time - killing it"
            child.kill()

    # Whenever the child dies (crash or failed health check), log and respawn.
    child.on 'exit', (code, signal) ->
      clearTimeout healthCheckTimeout
      console.error "Child exited with code #{code}, signal #{signal}, respawning"
      respawn()

    delayedHealthCheck()
  respawn()

Also available as part of this gist on github.

First, the method signature:

spawnMonitoredChild = (script, port, healthCheck, environmentVariables) ->

spawnMonitoredChild accepts a script path (i.e. the node.js script you want to spawn), the port of the child (for disambiguation in logging statements and for the health check), a deep healthCheck function and a set of environment variables. environmentVariables can be used to pass variables down to the child, e.g. the environment (prod/development) or the child’s port.

The implementation of spawnMonitoredChild declares the respawn function, and then calls respawn() to bootstrap the respawn cycle. respawn is where the magic happens – it starts up the subprocess and performs periodic health checks:

    child = child_process.spawn process.execPath, [script],
      env: _.extend({}, process.env, environmentVariables)

This uses the spawn function of node’s standard child_process module to bring up another instance of v8 (process.execPath) pointed at our child script ([script] is the argument list passed to the node binary). Extending a copy of process.env means the child-specific variables take precedence without mutating the caller’s object.

spawn is used instead of fork because spawn allows capture of the child process’ standard error / standard out.
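
To illustrate the difference, here is a quick sketch against the node 0.6-era API of this post; it is not code from our app, and the script path is borrowed from the example further below:

child_process = require 'child_process'

# fork() sets up an IPC channel, but the child shares the parent's stdio,
# so there are no stdout/stderr streams to capture:
forked = child_process.fork './lib/sfs_socket'
forked.send {cmd: "hello"}   # IPC messages work...
console.log forked.stdout    # ...but this is null - nothing to pipe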

respawn then proceeds to child process management. First, it takes care of its stderr and stdout:

child.stdout.pipe process.stdout
child.stderr.pipe process.stderr

The above code redirects the child stdout/stderr to that of the supervisor. In our actual production deployment, I instead capture the output streams and log to winston using a statement like this:

child.stderr.on 'data', ((data) -> winston.fatal "ERROR from child #{port}: #{data}")

winston is configured to send me an e-mail whenever a fatal occurs. Normally, winston’s e-mail mechanism can’t run in the child process after an uncaught exception has occurred; logging from the child_monitor supervisor skirts around this issue.
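
For reference, here is a minimal sketch of that kind of configuration. The winston-mail transport, the level numbers and the address are my assumptions, not our production config:

winston = require 'winston'
{Mail} = require 'winston-mail'   # assumed transport

# Custom levels so that "fatal" outranks "error" (names/numbers illustrative).
winston.setLevels {info: 0, warn: 1, error: 2, fatal: 3}
# The mail transport only fires at the fatal level and above.
winston.add Mail, {to: "ops@example.com", host: "localhost", level: "fatal"}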

Moving on, we get to delayedHealthCheck. In a loop (i.e., recursively on callback) it calls the provided healthCheck function. If the check ever fails, it kills the child process and bails on further health checks.

Finally, there is the exit handler for the subprocess:

child.on 'exit', (code, signal) -> 

The code here is self-explanatory: whenever the child process exits, the handler logs a fatal and calls respawn to bring the child back up.

That’s it! However, it wouldn’t be much of an example without sample use. Here is the healthChecker / spawn code for our “party mode” socket server:

net = require 'net'
child_monitor = require './child_monitor'

healthCheck = (port, cb) ->
  c = net.connect port, 'localhost'
  c.setEncoding "utf8"

  gotAuth = false
  c.on 'data', (data) ->
    d = null
    try
      d = JSON.parse(data)
    catch error
      c.end()
      console.error "Health check failed: bad initial response, #{data}"
      return cb(false)

    if !gotAuth
      if d.cmd == "PLSAUTH"
        gotAuth = true
        c.write JSON.stringify({cmd:"RING"}) + "\r\n"
      else
        c.end()
        console.error "Health check failed: bad initial response, #{data}"
        return cb(false)
    else
      c.end()
      console.info "Health check response", {res: d}
      return cb(true)

  c.on 'error', (e) ->
    console.error "Health check failed: error connecting #{e}"
    cb(false)

  # On timeout, tear down the socket and fail the check so the supervisor
  # kills and respawns the child.
  c.setTimeout config.healthCheckTimeout, ->
    console.error "Health check failed: timed out"
    c.destroy()
    cb(false)

numWorkers = 2
startPort = 31337
for i in [0..numWorkers-1]
  port = startPort + i
  child_monitor.spawnMonitoredChild './lib/sfs_socket', port, healthCheck,
    SFS_SOCKET_PORT: port
    SFS_SOCKET_HOST: socketHost

The details aren’t too important. The healthCheck connects to the local server, waits for its PLSAUTH greeting and then sends a RING command. The socket server is supposed to send AHOY-HOY as a response. Don’t ask.

Beyond sockets, we also use child_monitor to manage a set of express.js instances that serve our HTTP actions. By listening on different ports we are able to put nginx in front as a load balancer, HTTP cache and static file host. child_monitor ensures that our backend servers are always available for nginx to proxy to. I’ll try to write a follow-up post with the full details.
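
As a sketch of that setup (the port numbers, script path and /ping route are hypothetical, not our actual configuration), the same spawnMonitoredChild works with a shallow HTTP health check:

http = require 'http'
child_monitor = require './child_monitor'

# A basic HTTP health check: GET a known route and expect a 200.
httpHealthCheck = (port, cb) ->
  req = http.get {host: "localhost", port: port, path: "/ping"}, (res) ->
    cb(res.statusCode == 200)
  req.on 'error', -> cb(false)

for i in [0...2]
  port = 8001 + i
  child_monitor.spawnMonitoredChild './lib/sfs_http', port, httpHealthCheck,
    SFS_HTTP_PORT: port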

One final tip: killing the child_monitor supervisor should also take down its child processes. A Linux-only prctl system call in the child process will handle this for you. Add this to the top of your child script:

FFI = require('node-ffi')
# prctl(PR_SET_PDEATHSIG = 1, SIGTERM = 15): ask the kernel to send SIGTERM
# to this process when its parent (the supervisor) dies.
current = new FFI.Library(null, {"prctl": ["int32", ["int32", "uint32"]]})
current.prctl(1, 15)

Naturally, I stole the prctl bit from this stackoverflow post.

For the latest version of this code, see the gist on github.

A Saner S3 PUT for Node.js

The state of node.js libraries is hit and miss. I have been using Knox for my S3 uploads and recently came across this gem of a stack trace:

assert.js:93
  throw new assert.AssertionError({

AssertionError: true == false
    at IncomingMessage.<anonymous> (http.js:1341:9)
    at IncomingMessage.emit (events.js:61:17)
    at HTTPParser.onMessageComplete (http.js:133:23)
    at Socket.ondata (http.js:1231:22)
    at Socket._onReadable (net.js:683:27)
    at IOWatcher.onReadable [as callback] (net.js:177:10)

Sure enough, there is an outstanding issue for Knox: calls to PUT crash the node process when Amazon returns a non-200 (https://github.com/LearnBoost/knox/issues/41). Digging deeper into the source code I noticed this comment:

/**
 * PUT the file at `src` to `filename`, with callback `fn`
 * receiving a possible exception, and the response object.
 *
 * NOTE: this method reads the _entire_ file into memory using
 * fs.readFile(), and is not recommended for large files.
 * ...

Yarg! An S3 PUT is not a complicated operation. All I want is a solution that:

  • Has a method signature that takes a file path and throws it into S3 (i.e. no mucking with request objects)
  • Supports timeouts and HTTP 100-continue (i.e. fails fast)
  • Uses callbacks and passes useful error objects (i.e. the text from Amazon)
  • Doesn’t read entire files (!) into memory (i.e. uses pipe from node.js)

Here is what I came up with (in CoffeeScript):

fs = require 'fs'
http = require 'http'
https = require 'https'
crypto = require 'crypto'

mime = require 'mime'
xml2js = require 'xml2js'

delayTimeout = (ms, func) -> setTimeout func, ms
class @S3Put
  constructor: (@awsKey, @awsSecret, @bucket, @secure=true, @timeout=60*1000) ->

  put: (filePath, resource, amzHeaders, callback) ->
    mimeType = mime.lookup(filePath)
    fs.stat filePath, (err, stats) =>
      return callback(err) if err?

      contentLength = stats.size
      md5Hash = crypto.createHash 'md5'

      rs = fs.ReadStream(filePath)
      rs.on 'data', (d) -> md5Hash.update(d)
      rs.on 'end',  =>
        md5 = md5Hash.digest('base64')
        date = new Date()
        httpOptions =
          host: "s3.amazonaws.com"
          path: "/#{@bucket}#{resource}"
          headers:
            "Authorization": "AWS #{@awsKey}:#{@sign(resource, md5, mimeType, date, amzHeaders)}"
            "Date": date.toUTCString()
            "Content-Length": contentLength
            "Content-Type": mimeType
            "Content-MD5": md5
            "Expect": "100-continue"
          method: "PUT"

        (httpOptions.headers[k] = v for k,v of amzHeaders)
        timeout = null

        req = (if @secure then https else http).request httpOptions, (res) =>
          clearTimeout(timeout)  # a response arrived; cancel the watchdog
          if res.statusCode == 200
            headers = JSON.stringify(res.headers)
            return callback(null, {headers: headers, code: res.statusCode})

          responseBody = ""
          res.setEncoding("utf8")
          res.on "data", (chunk) ->
            responseBody += chunk

          res.on "end", ->
            parser = new xml2js.Parser()
            parser.parseString responseBody, (err, result) ->
              return callback(err) if err?
              return callback(result)

        aborted = false
        timeout = delayTimeout @timeout, =>
          aborted = true
          req.abort()
          return callback({message: "Timed out after #{@timeout}ms"})

        # Without an 'error' handler, a connection failure would emit an
        # unhandled event and crash the process (the abort() above can also
        # land here, hence the guard).
        req.on "error", (err) ->
          return if aborted
          clearTimeout(timeout)
          callback(err)

        req.on "continue", ->
          rs2 = fs.ReadStream(filePath)
          rs2.on 'error', callback
          rs2.pipe(req)

  sign: (resource, md5, contentType, date, amzHeaders) ->
    data = ["PUT", md5, contentType, date.toUTCString(), @canonicalHeaders(amzHeaders).join("\n"), "/#{@bucket}#{resource}"].join("\n")
    crypto.createHmac('sha1', @awsSecret).update(data).digest('base64')

  canonicalHeaders: (headers) ->
    ("#{k.toLowerCase()}:#{v}" for k,v of headers).sort()

Use it like this (note the leading slash on the resource, since the request path is built as "/#{@bucket}#{resource}"):

S3Put = require('s3put').S3Put
s3Put = new S3Put("awsKey", "awsSecret", "s3Bucket")
s3Put.put "/path/to/file", "/key", {"x-amz-acl": "public-read"}, (err, res) ->
  # err will be the error object given from Amazon (converted from xml)
  # res will contain res.headers and res.code
  console.log "Hurrah"

I’ve also put a gist up here: https://gist.github.com/1347203