Orphaned Gunicorn Processes with Supervisor

Python web applications typically expose a WSGI interface rather than speaking HTTP themselves. This may seem unusual if you are used to newer ecosystems such as Go, where applications usually serve HTTP directly. Most deployments place an HTTP gateway in front of the application for load balancing, high availability or TLS termination, so a Python application such as a Django project needs a server that translates between HTTP and WSGI.

WSGI servers also take care of spawning workers. This matters because CPython’s Global Interpreter Lock allows only one thread per process to execute Python code at any given moment, so running several worker processes is the usual way to use more than one CPU core.

A popular WSGI server is Gunicorn. It is itself written in Python, so adding it to an application takes a single install command. Launching a web application takes just /app/venv/bin/gunicorn myapp.wsgi. The --workers parameter sets the number of worker processes and thereby the degree of parallelism.
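To make the WSGI side concrete, here is a minimal sketch of what a module like myapp.wsgi has to provide. The module path and the response body are placeholders for illustration; a Django project generates an equivalent file on its own.

```python
# Hypothetical contents of myapp/wsgi.py, for illustration only.

def application(environ, start_response):
    """Minimal WSGI callable.

    The server passes the request as the `environ` dict and expects
    an iterable of bytes back after `start_response` has been called.
    """
    body = b"Hello from a WSGI app\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Gunicorn looks for a callable named application by default, which is why gunicorn myapp.wsgi works without naming it explicitly.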

Now, to ensure that a web application is started on boot and restarted if it fails for some reason, you need an init system or a process supervisor. A well-regarded choice is supervisord, which makes it easy to write unit files (program sections, in supervisor’s own terminology) that define how an application is run:

[program:myapp]
environment = DJANGO_SETTINGS_MODULE="myapp.settings"
user = myapp
command = /app/venv/bin/gunicorn myapp.wsgi

A handy feature of Gunicorn is its graceful reload: you only need to send a HUP signal to the master process, which finishes handling all in-flight requests while rolling over to the new version of the application. However, if you change supervisord’s config, you may need to fully restart the units.

This can lead to unexpected behavior: the Gunicorn master process appears to shut down, but the child processes it spawned are still there. When supervisord tries to launch the new master, it gets stuck because the bound port is still occupied. A look at htop reveals the problem:

kernel
`- /sbin/init
   `- /app/venv/bin/gunicorn myapp.wsgi --workers 2
   `- /app/venv/bin/gunicorn myapp.wsgi --workers 2
   `- ...
   `- /usr/local/bin/supervisord

Normally, the Gunicorn application server processes should be children of supervisord. Instead, they have been orphaned and re-parented to PID 1, the init system. They can’t handle any traffic because the master is already gone, so they are effectively stuck while still occupying the port.

A temporary fix is to kill -9 the orphaned processes, which frees the port so that the application respawns the next time supervisor retries. This unblocks the application for the moment, but the problem will reappear with the next supervisor restart.
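Spotting the orphans can be scripted. The following is a rough sketch that lists processes via ps -eo pid,ppid,args and keeps those whose parent is PID 1 and whose command line mentions gunicorn. The function names and the "gunicorn" search string are choices made for this example, not part of any tool.

```python
import subprocess


def parse_ps(ps_output):
    """Parse 'PID PPID COMMAND' lines into (pid, ppid, command) tuples."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip the header line and anything malformed
        rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return rows


def find_orphans(rows, needle="gunicorn"):
    """Return (pid, command) for matching processes whose parent is PID 1."""
    return [(pid, cmd) for pid, ppid, cmd in rows if ppid == 1 and needle in cmd]


def list_processes():
    """Collect the current process table using ps."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout)
```

Calling find_orphans(list_processes()) yields the PIDs to inspect before killing anything. The check assumes that Gunicorn is supposed to run under supervisord; a Gunicorn master started directly by the init system would have PID 1 as its parent legitimately.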

The real cause of this problem is a specific supervisor behavior: by default it sends the stop signal only to the process it controls and does not cascade it to that process’s children. This can be resolved by setting stopasgroup=true in the unit file, which makes supervisor send the stop signal to the whole process group and thereby prevents orphaning.
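The difference between signalling one process and signalling its process group can be reproduced outside of supervisor. The sketch below is a stand-in, not supervisor’s actual code: a small "master" script forks one "worker", and the helper then sends SIGTERM either to the master alone or to its whole group. A pipe whose write end is inherited by both processes reports whether any of them survived. All names are made up for this example, and it only runs on Unix-like systems.

```python
import os
import select
import signal
import subprocess
import sys

# Stand-in for a master process: forks one worker, reports its PID, then idles.
MASTER_CODE = """
import os, time
worker = os.fork()
if worker == 0:
    time.sleep(60)
    os._exit(0)
print(worker, flush=True)
time.sleep(60)
"""


def tree_exits_after_stop(stop_as_group, timeout=3.0):
    """Return True if master and worker are both gone after SIGTERM."""
    read_end, write_end = os.pipe()
    master = subprocess.Popen(
        [sys.executable, "-c", MASTER_CODE],
        stdout=subprocess.PIPE,
        pass_fds=(write_end,),      # inherited by master and, via fork, by the worker
        start_new_session=True,     # master becomes leader of its own process group
    )
    os.close(write_end)
    worker_pid = int(master.stdout.readline())  # wait until the worker exists
    try:
        if stop_as_group:
            os.killpg(master.pid, signal.SIGTERM)  # signal the whole group
        else:
            os.kill(master.pid, signal.SIGTERM)    # signal the master only
        master.wait()
        # The pipe reaches EOF only once every holder of the write end has exited.
        readable, _, _ = select.select([read_end], [], [], timeout)
        return bool(readable) and os.read(read_end, 1) == b""
    finally:
        try:
            os.kill(worker_pid, signal.SIGKILL)  # clean up a surviving worker
        except ProcessLookupError:
            pass
        master.stdout.close()
        os.close(read_end)
```

With stop_as_group=False the worker outlives the master, which mirrors the orphaning described above; with stop_as_group=True both go away together.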

Note that restarting or reloading supervisor is not needed in most cases. For example, when you are rolling out a new application version, it suffices to run supervisorctl signal hup myapp, which avoids a service disruption.
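If a deployment script cannot go through supervisorctl, the same HUP signal can be sent to the master directly. The sketch below assumes that Gunicorn was started with its --pid option so that the master’s PID is written to a file; the helper names and whatever pidfile path you pass in are hypothetical.

```python
import os
import signal
from pathlib import Path


def read_master_pid(pidfile):
    """Read the Gunicorn master PID from a pidfile (see Gunicorn's --pid option)."""
    return int(Path(pidfile).read_text().strip())


def graceful_reload(pidfile):
    """Send SIGHUP to the master so it rolls over to fresh workers."""
    pid = read_master_pid(pidfile)
    os.kill(pid, signal.SIGHUP)
    return pid
```

Only the master should receive the signal here. It replaces its workers itself, so there is no need to address them individually.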

The recommended supervisor unit configuration for a Python application run with Gunicorn is:

[program:myapp]
environment = DJANGO_SETTINGS_MODULE="myapp.settings"
user = myapp
command = /app/venv/bin/gunicorn myapp.wsgi
redirect_stderr = true
stopasgroup = true
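
To catch units that are missing the setting, a small check can be run over the config files. This is a sketch using Python’s standard configparser. The function name is invented for this example, and the list of truthy values is an assumption about what supervisor accepts as "true".

```python
import configparser

# Spellings treated as "true" here; assumed to match supervisor's boolean parsing.
TRUTHY = {"true", "yes", "on", "1"}


def programs_without_stopasgroup(config_text):
    """Return names of [program:x] sections that do not enable stopasgroup."""
    # RawConfigParser leaves %(...)s expressions untouched instead of expanding them.
    parser = configparser.RawConfigParser()
    parser.read_string(config_text)
    missing = []
    for section in parser.sections():
        if not section.startswith("program:"):
            continue
        value = parser.get(section, "stopasgroup", fallback="false")
        if value.strip().lower() not in TRUTHY:
            missing.append(section.split(":", 1)[1])
    return missing
```

For the recommended configuration above the function returns an empty list, while the first version of the unit file would be reported as myapp.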