Bandit Doesn't Need a Drainer!
Phoenix Graceful Shutdown Explained
If you've deployed Phoenix applications with Cowboy, you've probably seen Plug.Cowboy.Drainer in tutorials and production configs. When we switched to Bandit, we wondered: where's the drainer?
Turns out, you don't need one. Here's why.
What Happens on Shutdown
When your deployment platform (Fly.io, Kubernetes, etc.) wants to stop your app, it sends SIGTERM (`kill -15`). The Erlang VM receives this and begins shutting down the supervision tree in reverse start order.
By default, each process gets 5 seconds to finish. If you have a request that takes 10 seconds, it gets killed mid-execution. In practice, most processes respond to the shutdown signal almost instantaneously—a GenServer with no cleanup work exits in microseconds.
The problem is HTTP connections. A request handler might be waiting on a database query, an external API call, or file processing. These need time to complete. And the default 5 seconds often isn't enough.
How Cowboy Handles Shutdown
Cowboy is a battle-tested HTTP server, but it wasn't designed with OTP shutdown semantics in mind. When the Erlang VM sends a shutdown signal to a Cowboy listener, the listener closes immediately, taking any active connections down with it.
Plug.Cowboy.Drainer exists to fill this gap. It's a GenServer that:
- Subscribes to the Ranch listener (Cowboy's connection acceptor)
- On shutdown, tells Ranch to stop accepting connections
- Polls active connection count until zero or timeout
# The Cowboy setup
children = [
  MyAppWeb.Endpoint,
  {Plug.Cowboy.Drainer, refs: [MyAppWeb.Endpoint.HTTP], shutdown: 30_000}
]
The drainer must be placed after the endpoint in the supervision tree. Shutdown happens in reverse order, so the drainer terminates first, triggering the drain before the endpoint dies.
It works well, but it's a bolt-on solution, and you have to know it exists in the first place to configure it.
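To make the "bolt-on" nature concrete, here is a rough sketch of the drain loop such a process runs at terminate time. This is an illustration, not Plug.Cowboy.Drainer's actual source; it assumes Ranch 1.6+, which provides `:ranch.suspend_listener/1` and `:ranch.procs/2`:

```elixir
defmodule ManualDrain do
  @moduledoc "Sketch of a Cowboy/Ranch connection drain, not the real Drainer."

  # Stop accepting new connections, then poll until active connections
  # reach zero or the timeout budget is spent.
  def drain(listener_ref, timeout \\ 30_000, interval \\ 100) do
    :ok = :ranch.suspend_listener(listener_ref)
    wait(listener_ref, timeout, interval)
  end

  defp wait(_ref, remaining, _interval) when remaining <= 0, do: :timeout

  defp wait(ref, remaining, interval) do
    case :ranch.procs(ref, :connections) do
      [] ->
        :ok

      _active ->
        Process.sleep(interval)
        wait(ref, remaining - interval, interval)
    end
  end
end
```

All of this machinery lives outside Cowboy's own supervision tree, which is exactly the gap the drainer fills.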
How Bandit Handles Shutdown
Bandit is built on ThousandIsland, a socket server designed around OTP principles. The difference shows in shutdown behavior.
ThousandIsland's architecture:
ThousandIsland.Server (Supervisor)
├── ThousandIsland.ShutdownListener
├── ThousandIsland.AcceptorPoolSupervisor
│   └── ThousandIsland.Acceptor (multiple)
└── ThousandIsland.ConnectionsSupervisor
    └── Handler processes (your requests)
When shutdown begins:
- ShutdownListener closes the listening socket immediately
- All Acceptor processes exit (no new connections)
- ConnectionsSupervisor waits for handler processes
- Handler processes trap exits and continue until complete or timeout
Handler processes are supervised children with configurable shutdown timeouts. Standard OTP. No special drainer needed.
# Bandit config - shutdown_timeout flows to ThousandIsland
config :my_app, MyAppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  http: [thousand_island_options: [shutdown_timeout: :timer.seconds(30)]]
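The same option applies outside Phoenix, too. A sketch of running Bandit directly in a supervision tree, assuming a plug module named `MyPlug` (hypothetical):

```elixir
# Plain Bandit child spec - shutdown_timeout still flows to ThousandIsland
children = [
  {Bandit,
   plug: MyPlug,
   port: 4000,
   thousand_island_options: [shutdown_timeout: :timer.seconds(30)]}
]

Supervisor.start_link(children, strategy: :one_for_one)
```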
Oban's Shutdown Dance
Oban follows a similar pattern. Each queue runs a producer that fetches jobs and a set of worker processes that execute them.
On shutdown:
- Producers stop fetching new jobs
- Workers continue executing current jobs
- After shutdown_grace_period, remaining jobs are abandoned
- A [:oban, :queue, :shutdown] telemetry event fires with orphaned job IDs
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10],
  shutdown_grace_period: :timer.seconds(30)
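If you want visibility into what gets abandoned, you can attach a handler to that telemetry event. A sketch, assuming the metadata keys (`:queue`, `:orphaned`) documented by Oban:

```elixir
# Log orphaned job IDs when a queue shuts down before its jobs finish.
:telemetry.attach(
  "log-orphaned-jobs",
  [:oban, :queue, :shutdown],
  fn [:oban, :queue, :shutdown], _measurements, meta, _config ->
    require Logger
    Logger.warning("queue #{meta.queue} orphaned jobs: #{inspect(meta.orphaned)}")
  end,
  nil
)
```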
Orphaned jobs aren't lost: they stay in the database, and on a later run Oban's rescue mechanism picks them up.
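In open-source Oban, that rescue mechanism is the Lifeline plugin, which has to be enabled explicitly. A sketch of the config (the `rescue_after` value here is an assumption, not a recommendation):

```elixir
# Lifeline periodically moves orphaned "executing" jobs back to "available"
config :my_app, Oban,
  plugins: [{Oban.Plugins.Lifeline, rescue_after: :timer.minutes(30)}]
```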
What About Mix Apps?
For a Mix app without a web server, the equivalent is the shutdown option in child specs, which defaults to 5 seconds.
# In your Supervisor's child spec
children = [
  %{
    id: MyWorker,
    start: {MyWorker, :start_link, []},
    shutdown: :timer.seconds(30)
  }
]

# Or with the shorthand tuple syntax
children = [
  Supervisor.child_spec({MyWorker, []}, shutdown: :timer.seconds(30))
]
The Platform Layer
Your app's shutdown timeout means nothing if the platform kills you first.
Fly.io sends SIGTERM, then waits kill_timeout seconds before SIGKILL:
# fly.toml
kill_signal = 'SIGTERM'
kill_timeout = 35
Kubernetes uses terminationGracePeriodSeconds:
spec:
  terminationGracePeriodSeconds: 35
Set these higher than your app timeout. We use 35 seconds for a 30-second app timeout.
Supervision Tree Ordering
Shutdown happens in reverse child order. Given:
children = [
  MyApp.Repo,        # 1st to start, last to stop
  {Oban, config},    # 2nd to start, 2nd-to-last to stop
  MyAppWeb.Endpoint  # last to start, first to stop
]
This ordering is correct:
- Endpoint stops accepting HTTP requests
- Oban drains background jobs (can still write to DB)
- Repo closes database connections
If you put Repo last in the children list, it would close first during shutdown, and any Oban jobs still draining would crash on their database calls.
Complete Setup
# config/config.exs
config :my_app, MyAppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  http: [thousand_island_options: [shutdown_timeout: :timer.seconds(30)]]

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10],
  shutdown_grace_period: :timer.seconds(30)
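And the supervision ordering from the previous section lives in the application module. A sketch, assuming the usual generated `MyApp.Application`:

```elixir
# lib/my_app/application.ex
# Start order determines shutdown order (reversed):
# Endpoint stops first, then Oban drains, then Repo closes.
def start(_type, _args) do
  children = [
    MyApp.Repo,
    {Oban, Application.fetch_env!(:my_app, Oban)},
    MyAppWeb.Endpoint
  ]

  Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
end
```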
# fly.toml
kill_signal = 'SIGTERM'
kill_timeout = 35
Three lines of config. Zero custom code. Deploys your users won't notice.