
Graceful shutdown in Phoenix Apps

Learn how to handle graceful shutdowns in Phoenix apps using Bandit and avoid common pitfalls with Cowboy's drainer. Ensure seamless user experiences during deployments.


Bandit Doesn't Need a Drainer!

Phoenix Graceful Shutdown Explained

If you've deployed Phoenix applications with Cowboy, you've probably seen Plug.Cowboy.Drainer in tutorials and production configs. When we switched to Bandit, we wondered: where's the drainer?

Turns out, you don't need one. Here's why.

What Happens on Shutdown

When your deployment platform (Fly.io, Kubernetes, etc.) wants to stop your app, it sends a SIGTERM (the `kill -15` signal). The Erlang VM receives it and begins shutting down the supervision tree in reverse start order.

By default, each process gets 5 seconds to finish. If you have a request that takes 10 seconds, it gets killed mid-execution. In practice, most processes respond to the shutdown signal almost instantaneously—a GenServer with no cleanup work exits in microseconds.

The problem is HTTP connections. A request handler might be waiting on a database query, an external API call, or file processing. These need time to complete. And the default 5 seconds often isn't enough.
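
To make that concrete, here's a hypothetical controller action that simulates a 10-second request; the module, action, and sleep are placeholders for whatever slow work your handlers actually do:

# Hypothetical slow endpoint - under the default 5-second shutdown,
# a deploy during this request kills the handler before it can reply
defmodule MyAppWeb.ReportController do
  use MyAppWeb, :controller

  def generate(conn, _params) do
    # Stand-in for a slow database query, external API call, or file processing
    Process.sleep(:timer.seconds(10))
    json(conn, %{status: "done"})
  end
end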

How Cowboy Handles Shutdown

Cowboy is a battle-tested HTTP server, but it wasn't designed with OTP shutdown semantics in mind. When the Erlang VM sends a shutdown signal to a Cowboy listener, the listener closes immediately, taking any active connections down with it.

Plug.Cowboy.Drainer exists to fill this gap. It's a GenServer that:

  1. Subscribes to the Ranch listener (Cowboy's connection acceptor)
  2. On shutdown, tells Ranch to stop accepting connections
  3. Polls active connection count until zero or timeout

# The Cowboy setup
  children = [
    MyAppWeb.Endpoint,
    {Plug.Cowboy.Drainer, refs: [MyAppWeb.Endpoint.HTTP], shutdown: 30_000}
  ]

The drainer must be placed after the endpoint in the supervision tree. Shutdown happens in reverse order, so the drainer terminates first, triggering the drain before the endpoint dies.
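
For intuition, the drain boils down to something like the sketch below. This is a simplified illustration of the idea, not the actual Plug.Cowboy.Drainer source; it assumes Ranch's :ranch.suspend_listener/1 and :ranch.procs/2 and a hypothetical MyApp.SketchDrainer module:

# Simplified sketch of the drain idea - not the real Plug.Cowboy.Drainer
defmodule MyApp.SketchDrainer do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Trap exits so terminate/2 runs when the supervisor shuts this process down
    Process.flag(:trap_exit, true)
    {:ok, Map.new(opts)}
  end

  @impl true
  def terminate(_reason, %{ref: ref, drain_timeout: timeout}) do
    # Stop the Ranch listener from accepting new connections...
    :ok = :ranch.suspend_listener(ref)
    # ...then poll the active connection count until zero or the budget runs out
    drain(ref, timeout)
  end

  defp drain(_ref, budget) when budget <= 0, do: :timeout

  defp drain(ref, budget) do
    case :ranch.procs(ref, :connections) do
      [] ->
        :ok

      _still_busy ->
        Process.sleep(100)
        drain(ref, budget - 100)
    end
  end
end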

It works well, but it's a bolt-on solution, and you have to know it exists in the first place to configure it.

How Bandit Handles Shutdown

Bandit is built on ThousandIsland, a socket server designed around OTP principles. The difference shows in shutdown behavior.

ThousandIsland's architecture:

ThousandIsland.Server (Supervisor)
  ├── ThousandIsland.ShutdownListener
  ├── ThousandIsland.AcceptorPoolSupervisor
  │   └── ThousandIsland.Acceptor (multiple)
  └── ThousandIsland.ConnectionsSupervisor
      └── Handler processes (your requests)

When shutdown begins:

  1. ShutdownListener closes the listening socket immediately
  2. All Acceptor processes exit (no new connections)
  3. ConnectionsSupervisor waits for handler processes
  4. Handler processes trap exit and continue until complete or timeout

Handler processes are supervised children with configurable shutdown timeouts. Standard OTP. No special drainer needed.

# Bandit config - shutdown_timeout flows to ThousandIsland
  config :my_app, MyAppWeb.Endpoint,
    adapter: Bandit.PhoenixAdapter,
    http: [thousand_island_options: [shutdown_timeout: :timer.seconds(30)]]
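
If you run Bandit directly in a supervision tree instead of through Phoenix, the same option applies; this sketch assumes a plug module named MyApp.Plug:

# Bandit supervised directly - shutdown_timeout still flows to ThousandIsland
children = [
  {Bandit,
   plug: MyApp.Plug,
   port: 4000,
   thousand_island_options: [shutdown_timeout: :timer.seconds(30)]}
]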

Oban's Shutdown Dance

Oban follows a similar pattern. Each queue runs a producer that fetches jobs and a set of worker processes that execute them.

On shutdown:

  1. Producers stop fetching new jobs
  2. Workers continue executing current jobs
  3. After shutdown_grace_period, remaining jobs are abandoned
  4. A [:oban, :queue, :shutdown] telemetry event fires with orphaned job IDs

  config :my_app, Oban,
    repo: MyApp.Repo,
    queues: [default: 10],
    shutdown_grace_period: :timer.seconds(30)

Orphaned jobs aren't lost. On next startup, Oban's rescue mechanism picks them up.
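
If you want visibility into what was cut off, you can attach a handler to that event. This is a sketch: the handler id is arbitrary, and it assumes the orphaned job ids arrive in the event metadata under an :orphaned key.

# Logs job ids that were still executing when the grace period expired
defmodule MyApp.ObanShutdownLogger do
  require Logger

  def attach do
    :telemetry.attach(
      "oban-shutdown-logger",
      [:oban, :queue, :shutdown],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, _measurements, %{queue: queue, orphaned: orphaned}, _config) do
    Logger.warning("Queue #{queue} shut down with #{length(orphaned)} orphaned jobs: #{inspect(orphaned)}")
  end
end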

What About Mix Apps?

For a Mix app without a web server, the equivalent is the `:shutdown` option in child specs, which defaults to 5 seconds.

# In your Supervisor's child spec
children = [
  %{
    id: MyWorker,
    start: {MyWorker, :start_link, []},
    shutdown: :timer.seconds(30)
  }
]

# Or with the shorthand tuple syntax
children = [
  Supervisor.child_spec({MyWorker, []}, shutdown: :timer.seconds(30))
]
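
One caveat: the shutdown value only buys a process time if it traps exits. A process that doesn't trap exits dies as soon as the supervisor sends the :shutdown exit signal, grace period or not. Here's a minimal sketch of a worker that uses its window (the cleanup function is a placeholder):

defmodule MyWorker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Trap exits so the supervisor's :shutdown signal invokes terminate/2
    # instead of killing the process outright
    Process.flag(:trap_exit, true)
    {:ok, opts}
  end

  @impl true
  def terminate(_reason, state) do
    # Placeholder cleanup - gets up to the child spec's :shutdown value (30s above)
    flush_pending_work(state)
  end

  defp flush_pending_work(_state), do: :ok
end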

The Platform Layer

Your app's shutdown timeout means nothing if the platform kills you first.

Fly.io sends SIGTERM, then waits kill_timeout seconds before SIGKILL:

# fly.toml
kill_signal = 'SIGTERM'
kill_timeout = 35

Kubernetes uses terminationGracePeriodSeconds:

spec:
    terminationGracePeriodSeconds: 35

Set these higher than your app timeout. We use 35 seconds for a 30-second app timeout.

Supervision Tree Ordering

Shutdown happens in reverse child order. Given:

children = [
  MyApp.Repo,           # 1st to start, last to stop
  {Oban, config},       # 2nd to start, 2nd-to-last to stop
  MyAppWeb.Endpoint     # last to start, first to stop
]

This ordering is correct:

  1. Endpoint stops accepting HTTP requests
  2. Oban drains background jobs (can still write to DB)
  3. Repo closes database connections

If you put the Repo last in the children list, it would close first during shutdown, and any Oban jobs still draining would crash as soon as they touched the database.
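
In a typical Phoenix project this list lives in the application callback module, which hands it to the top-level supervisor. The module names below are the usual generated ones and are assumptions for illustration:

# lib/my_app/application.ex
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.Repo,
      {Oban, Application.fetch_env!(:my_app, Oban)},
      MyAppWeb.Endpoint
    ]

    # Children stop in reverse order of this list when the VM shuts down
    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end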

Complete Setup

# config/config.exs
  config :my_app, MyAppWeb.Endpoint,
    adapter: Bandit.PhoenixAdapter,
    http: [thousand_island_options: [shutdown_timeout: :timer.seconds(30)]]

  config :my_app, Oban,
    repo: MyApp.Repo,
    queues: [default: 10],
    shutdown_grace_period: :timer.seconds(30)

# fly.toml
  kill_signal = 'SIGTERM'
  kill_timeout = 35

Three config settings. Zero custom code. Deploys your users won't notice.
