OTP, Supervision, and "Let It Crash"

How the BEAM turns failure into a feature: gen_servers, supervision trees, restart strategies, and the links and monitors that hold it all together.

Most runtimes treat a crash as a catastrophe to be prevented at all costs. The BEAM treats it as ordinary control flow. A process that hits an unexpected state simply dies, a supervisor notices, and a fresh process takes its place in a known-good state. That single inversion - let it crash instead of defend at every line - is the philosophical core of OTP (the Open Telecom Platform), the set of libraries and design principles that ship with Erlang/OTP and that every BEAM language inherits.

This article walks through the machinery that makes "let it crash" safe: generic servers, supervision trees, restart strategies, and the links and monitors that wire failure detection into the runtime itself.

Why crashing is a feature

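The argument rests on a few BEAM properties that, taken together, make recovery cheaper than prevention: processes share no memory, so a crash cannot corrupt anyone else's state; processes are cheap enough to create that replacing one costs little; and failure detection is built into the runtime through links and monitors.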

Joe Armstrong, Erlang's co-creator, framed the goal as building systems that keep running despite errors, not systems that pretend errors never happen. You write the code for the case you understand and let the supervisor handle the rest.
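
As a small illustration of "code for the case you understand", here is a minimal sketch. The module name happy_path, the shapes, and the helper try_area/1 are invented for this example. The function has no catch-all clause, so an unknown shape raises function_clause and only the process that ran it dies.

```erlang
-module(happy_path).
-export([area/1, try_area/1]).

%% Only the shapes we understand are handled. There is deliberately no
%% catch-all clause: any other input raises function_clause.
area({circle, R}) when is_number(R), R >= 0 -> math:pi() * R * R;
area({rect, W, H}) when is_number(W), is_number(H) -> W * H.

%% Run area/1 in a throwaway process and report how that process ended.
%% The caller only monitors it, so the caller itself never crashes.
try_area(Shape) ->
    {Pid, Ref} = spawn_monitor(fun() -> area(Shape) end),
    receive
        {'DOWN', Ref, process, Pid, normal} -> finished;
        {'DOWN', Ref, process, Pid, {Reason, _Stack}} -> {crashed, Reason}
    after 1000 -> timeout
    end.
```

Calling happy_path:try_area({triangle, 3, 4}) reports the crash as data while the calling process carries on; in a real system the observer would be a supervisor rather than a helper function.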

gen_server: the generic stateful process

Almost every long-lived process in an OTP system is a gen_server. Underneath it is just a tail-recursive receive loop holding some state - but hand-rolling that loop means re-implementing message decoding, synchronous calls with timeouts, system messages, code upgrade, and graceful shutdown every time. gen_server factors all of that out into a behaviour: a generic engine plus a handful of callbacks you provide.
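
To make "a tail-recursive receive loop holding some state" concrete, here is a hand-rolled sketch of such a loop. The module name raw_counter and its message shapes are assumptions made for illustration. It handles only the happy path: there are no system messages, no code upgrade, and a caller of value/1 simply waits out the timeout if the server is gone, which is the kind of gap gen_server fills.

```erlang
-module(raw_counter).
-export([start/0, bump/2, value/1]).

%% Spawn the server process with an initial count of 0.
start() -> spawn(fun() -> loop(0) end).

%% Asynchronous request: send and return immediately.
bump(Pid, N) -> Pid ! {bump, N}, ok.

%% Synchronous request: send, then block until the tagged reply arrives.
value(Pid) ->
    Ref = make_ref(),
    Pid ! {value, self(), Ref},
    receive
        {Ref, Count} -> Count
    after 5000 -> exit(timeout)
    end.

%% The server loop. State is the argument of the recursive call.
loop(Count) ->
    receive
        {bump, N} -> loop(Count + N);
        {value, From, Ref} -> From ! {Ref, Count}, loop(Count)
    end.
```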

The two request shapes are calls (synchronous: the caller blocks for a reply) and casts (asynchronous: fire-and-forget). Here is a small counter in Erlang:

-module(counter).
-behaviour(gen_server).

%% Public API
-export([start_link/0, bump/1, value/0]).
%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).

bump(N)  -> gen_server:cast(?MODULE, {bump, N}).   %% async
value()  -> gen_server:call(?MODULE, value).       %% sync

%% --- callbacks run inside the server process ---
init(Count) ->
    {ok, Count}.

handle_cast({bump, N}, Count) ->
    {noreply, Count + N}.

handle_call(value, _From, Count) ->
    {reply, Count, Count}.

The crucial discipline: the public functions (bump/1, value/0) run in the caller's process and just send a message; the callbacks (handle_call, handle_cast) run in the server's process and own the state. State is threaded through return values, never mutated in place.
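
The caller/server split can be observed directly. The following sketch uses a hypothetical module named whoami: the API function records self() in the caller, the callback replies with self() from inside the server, and the two pids differ.

```erlang
-module(whoami).
-behaviour(gen_server).
-export([start_link/0, pids/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() -> gen_server:start_link(?MODULE, [], []).

%% API function: runs in the caller, so self() is the caller's pid.
pids(Server) ->
    CallerPid = self(),
    ServerPid = gen_server:call(Server, whoami),
    {CallerPid, ServerPid}.

init([]) -> {ok, no_state}.

%% Callback: runs in the server, so self() is the server's pid.
handle_call(whoami, _From, State) -> {reply, self(), State}.

handle_cast(_Msg, State) -> {noreply, State}.
```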

Elixir wraps the identical behaviour with friendlier syntax. The semantics are exactly the same because it is the same :gen_server underneath:

defmodule Counter do
  use GenServer

  # Public API (runs in the caller)
  def start_link(_), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def bump(n), do: GenServer.cast(__MODULE__, {:bump, n})
  def value, do: GenServer.call(__MODULE__, :value)

  # Callbacks (run in the server)
  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_cast({:bump, n}, count), do: {:noreply, count + n}

  @impl true
  def handle_call(:value, _from, count), do: {:reply, count, count}
end

LFE implements the very same behaviour in Lisp syntax - gen_server is just an Erlang module, and LFE calls Erlang modules natively:

(defmodule counter
  (behaviour gen_server)
  (export (start_link 0) (bump 1) (value 0))
  (export (init 1) (handle_call 3) (handle_cast 2)))

(defun start_link ()
  (gen_server:start_link #(local counter) 'counter 0 '()))

(defun bump (n)  (gen_server:cast 'counter (tuple 'bump n)))
(defun value ()  (gen_server:call 'counter 'value))

(defun init (count) (tuple 'ok count))

(defun handle_cast
  ((`#(bump ,n) count) (tuple 'noreply (+ count n))))

(defun handle_call
  (('value _from count) (tuple 'reply count count)))

Gleam, being statically typed, does not use the untyped gen_server contract directly - callbacks that accept any term as a request and return one of several differently shaped tuples (a reply, a noreply, a stop) give the type checker little to verify. Instead Gleam ships its own type-safe actor in gleam/otp/actor, built on the same OTP primitives but with a typed message enum. The compiler guarantees you only ever send messages the actor understands:

import gleam/erlang/process.{type Subject}
import gleam/otp/actor

pub type Message {
  Bump(by: Int)
  Value(reply_to: Subject(Int))
}

fn handle(state: Int, msg: Message) -> actor.Next(Int, Message) {
  case msg {
    Bump(by) -> actor.continue(state + by)
    Value(reply_to) -> {
      process.send(reply_to, state)
      actor.continue(state)
    }
  }
}

pub fn start() {
  actor.new(0) |> actor.on_message(handle) |> actor.start
}

Note what static typing buys and costs: Value carries its own typed reply channel (Subject(Int)), and an actor can only receive Message values - a whole class of "unexpected message" bugs is gone at compile time. The trade is that you opt into Gleam's actor API rather than the raw gen_server callback shape.

Luerl is the outlier, and honestly so. Luerl runs Lua on the BEAM, but a Lua script is not a BEAM process - it executes synchronously inside whatever Erlang process is driving it, with the entire Lua state threaded through as one immutable Erlang value. Lua has no gen_server, no spawn, no mailbox. Concurrency and supervision live in the host: you put the Luerl interpreter inside an Erlang or Elixir gen_server, and OTP supervises that process. Lua code stays blissfully unaware:

-- Plain Lua, evaluated by a Luerl interpreter embedded in a BEAM host.
-- No processes here; the surrounding Erlang/Elixir gen_server owns
-- concurrency, crashes, and restarts on this script's behalf.
local count = 0
local function bump(n) count = count + n end
bump(5)
return count        -- 5, returned to the host process

Links: shared fate

Before supervisors, there are links. A link is a bidirectional bond between two processes: when either one exits, an exit signal is sent to the other. If the exit was abnormal, that signal by default kills the linked process too, so a cluster of linked processes lives or dies as a unit - which is exactly what you want when a worker is useless without its helper.

%% spawn_link atomically spawns a process AND links to it,
%% with no window where the child could die unlinked.
Pid = spawn_link(fun worker_loop/0),
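
A runnable sketch of that shared fate follows; the module name link_demo and the exit reason helper_failed are made up for the example. A parent links to a helper that exits abnormally, and the parent dies with the same reason. The demo function only monitors the parent, so it survives to report what happened.

```erlang
-module(link_demo).
-export([demo/0]).

%% Start a parent that links to a failing helper, and watch the parent
%% from the outside with a monitor.
demo() ->
    {Parent, Ref} = spawn_monitor(fun parent/0),
    receive
        {'DOWN', Ref, process, Parent, Reason} -> {parent_died, Reason}
    after 1000 -> timeout
    end.

parent() ->
    _Helper = spawn_link(fun() -> exit(helper_failed) end),
    %% The parent would wait here forever, but the exit signal from
    %% the linked helper terminates it first.
    receive never_sent -> ok end.
```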

A process that wants to survive its links and react to their deaths instead of dying sets the trap_exit flag. Exit signals then arrive as ordinary {'EXIT', Pid, Reason} messages in the mailbox rather than killing the process. This is precisely how a supervisor stays alive while its children come and go:

init([]) ->
    process_flag(trap_exit, true),   %% turn exit signals into messages
    {ok, no_state}.

handle_info({'EXIT', From, Reason}, State) ->
    error_logger:info_msg("child ~p died: ~p~n", [From, Reason]),
    {noreply, State}.

One subtlety worth internalising: a normal exit (reason normal) does not kill a linked process - only abnormal exits propagate. And exit(Pid, kill) sends the untrappable kill signal, which terminates the target even if it traps exits. That escape hatch is how supervisors guarantee they can always shut a misbehaving child down.
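
Both behaviours can be checked in a few lines. In this sketch (module name trap_demo and the reason boom are illustrative), a process that traps exits receives an ordinary exit signal as a message and keeps running, then is terminated by kill regardless. The ready handshake is there so that no signal is sent before the flag is set.

```erlang
-module(trap_demo).
-export([demo/0]).

demo() ->
    Self = self(),
    {Pid, Ref} = spawn_monitor(fun() ->
        process_flag(trap_exit, true),
        Self ! {ready, self()},
        trapper(Self)
    end),
    receive {ready, Pid} -> ok end,
    %% An ordinary exit signal is converted to a message; Pid survives.
    exit(Pid, boom),
    Trapped = receive {trapped, Pid, R1} -> R1 after 1000 -> timeout end,
    %% kill cannot be trapped; the monitor reports the reason as killed.
    exit(Pid, kill),
    Final = receive {'DOWN', Ref, process, Pid, R2} -> R2 after 1000 -> timeout end,
    {Trapped, Final}.

%% Forward every trapped exit reason to the reporting process.
trapper(Report) ->
    receive
        {'EXIT', _From, Reason} ->
            Report ! {trapped, self(), Reason},
            trapper(Report)
    end.
```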

Monitors: observation without entanglement

Sometimes you want to know when a process dies without your own fate being tied to it. That is a monitor - unidirectional and non-propagating. You set one up, and if the target dies you receive a single 'DOWN' message. No signal kills you; you just get told.

Ref = monitor(process, Pid),
receive
    {'DOWN', Ref, process, Pid, Reason} ->
        io:format("~p went down: ~p~n", [Pid, Reason])
end.

This is exactly what gen_server:call/2 uses under the hood: it monitors the callee for the duration of the request so that if the server crashes mid-call, the caller gets a clean error instead of hanging forever. The distinction is the whole game:

| | Link | Monitor |
|---|---|---|
| Direction | bidirectional | unidirectional |
| On peer death | exit signal (kills, unless trapping) | 'DOWN' message only |
| Set up by | link/1, spawn_link | monitor/2 |
| Used for | lifecycle coupling, supervision | "tell me when X dies" |

Elixir exposes both through the Process module with the same semantics:

ref = Process.monitor(pid)

receive do
  {:DOWN, ^ref, :process, ^pid, reason} ->
    IO.puts("#{inspect(pid)} went down: #{inspect(reason)}")
end
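
Returning to the gen_server:call/2 point above, here is a sketch of what "a clean error instead of hanging" looks like from the caller's side. The module call_demo and the reason simulated_failure are invented for the example. The server is started with gen_server:start/3 rather than start_link/3, so the caller is not linked to it.

```erlang
-module(call_demo).
-behaviour(gen_server).
-export([demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

demo() ->
    %% Unlinked start: the caller learns of the server's death only
    %% through gen_server:call/2 itself.
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    try gen_server:call(Pid, crash)
    catch
        exit:{Reason, {gen_server, call, _Args}} -> {call_failed, Reason}
    end.

init([]) -> {ok, no_state}.

%% Stop without replying: the caller is still waiting when the server dies.
handle_call(crash, _From, State) -> {stop, simulated_failure, State}.

handle_cast(_Msg, State) -> {noreply, State}.
```

The call exits promptly with the server's exit reason instead of waiting for the default timeout. The server's termination is also written to the log, which is expected.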

Supervisors and restart strategies

A supervisor is a process whose only job is to start children, link to them (trapping exits), and restart them according to a declared policy when they die. Supervisors and their children form a tree; the leaves are workers doing real work and the inner nodes are supervisors all the way up to the application root. This supervision tree is the structural backbone of every OTP application.

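Children are described by a child specification - what to start, and how to treat it. The two fields that drive recovery are restart, which says when a dead child is restarted (permanent: always; transient: only after an abnormal exit; temporary: never), and shutdown, which says how the child is stopped (brutal_kill, a timeout in milliseconds, or infinity).

The supervisor's restart strategy decides what happens to siblings when one child dies: one_for_one restarts only the dead child; one_for_all terminates and restarts every child; rest_for_one restarts the dead child and every child started after it.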

In Erlang the supervisor is itself a behaviour whose init/1 returns the strategy and the child specs:

-module(top_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one,
                 intensity => 3,    %% allow 3 restarts...
                 period => 10},     %% ...within 10 seconds
    Children = [
        #{id => counter,
          start => {counter, start_link, []},
          restart => permanent,
          shutdown => 5000,
          type => worker}
    ],
    {ok, {SupFlags, Children}}.
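
The effect of the strategy on siblings can be observed with a small experiment. This is a sketch under stated assumptions: the module strategy_demo, the child ids a, b, c, and the helper functions are all invented for the example, and the children are bare linked processes rather than gen_servers to keep everything in one module. It kills child b and reports which children ended up with new pids.

```erlang
-module(strategy_demo).
-behaviour(supervisor).
-export([demo/1, start_child/0, init/1]).

%% Strategy is one_for_one, one_for_all, or rest_for_one.
%% Kills child b and returns the ids of every child that was replaced.
demo(Strategy) ->
    {ok, Sup} = supervisor:start_link(?MODULE, Strategy),
    Before = pids(Sup),
    OldB = maps:get(b, Before),
    exit(OldB, kill),
    After = wait_for_restart(Sup, OldB),
    Replaced = [Id || Id <- [a, b, c],
                      maps:get(Id, Before) =/= maps:get(Id, After)],
    stop(Sup),
    Replaced.

%% Children are bare linked processes that idle until they are killed.
start_child() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

init(Strategy) ->
    Flags = #{strategy => Strategy, intensity => 3, period => 10},
    Specs = [#{id => Id, start => {?MODULE, start_child, []}}
             || Id <- [a, b, c]],
    {ok, {Flags, Specs}}.

%% Map of child id to current pid, as reported by the supervisor.
pids(Sup) ->
    maps:from_list([{Id, Pid} || {Id, Pid, _Type, _Mods}
                                 <- supervisor:which_children(Sup)]).

%% Poll until b has a new pid, then return the full id-to-pid map.
wait_for_restart(Sup, OldB) ->
    Now = pids(Sup),
    case maps:get(b, Now) of
        OldB -> timer:sleep(10), wait_for_restart(Sup, OldB);
        _NewB -> Now
    end.

%% Shut the supervisor down without taking the caller with it.
stop(Sup) ->
    Ref = monitor(process, Sup),
    unlink(Sup),
    exit(Sup, shutdown),
    receive {'DOWN', Ref, process, Sup, _} -> ok after 5000 -> timeout end.
```

With children started in the order a, b, c, killing b replaces only b under one_for_one, all three under one_for_all, and b plus c under rest_for_one.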

The intensity/period pair is the system's circuit breaker: if children crash more than intensity times within period seconds, the supervisor concludes the problem is not transient, gives up, and crashes itself - propagating the failure up to its supervisor, which may restart a larger subtree. Crashes climb the tree until something can recover; the alternative to an infinite restart loop is escalation, not silence.
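
The give-up behaviour can be provoked on purpose. In this sketch (module giveup_demo and the reason always_broken are illustrative names), the only child starts successfully and then dies at once, every time. The caller traps exits so that the supervisor's own death arrives as a message rather than propagating over the link.

```erlang
-module(giveup_demo).
-behaviour(supervisor).
-export([demo/0, start_flaky/0, init/1]).

demo() ->
    %% Trap exits so the linked supervisor's death becomes a message here.
    Old = process_flag(trap_exit, true),
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    Result = receive
                 {'EXIT', Sup, Reason} -> {supervisor_gave_up, Reason}
             after 5000 -> still_running
             end,
    process_flag(trap_exit, Old),
    Result.

%% A child that starts fine and then exits abnormally straight away.
start_flaky() ->
    {ok, spawn_link(fun() -> exit(always_broken) end)}.

init([]) ->
    Flags = #{strategy => one_for_one, intensity => 3, period => 10},
    Specs = [#{id => flaky, start => {?MODULE, start_flaky, []}}],
    {ok, {Flags, Specs}}.
```

Once the restart limit is exceeded the supervisor exits with reason shutdown, which is the signal its own parent would act on in a larger tree.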

Elixir's Supervisor expresses the same thing more declaratively. Modern Elixir favours child_spec/1 on the child module and a list of children:

defmodule TopSup do
  use Supervisor

  def start_link(_), do: Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    children = [
      Counter,                       # uses Counter.child_spec/1
      {Cache, size: 1_000}           # child + start args
    ]

    Supervisor.init(children,
      strategy: :one_for_one,
      max_restarts: 3,
      max_seconds: 10
    )
  end
end

LFE builds the same supervisor as an OTP behaviour, returning the identical strategy-plus-specs structure:

(defmodule top-sup
  (behaviour supervisor)
  (export (start_link 0) (init 1)))

(defun start_link ()
  (supervisor:start_link #(local top-sup) 'top-sup '()))

(defun init (_args)
  (let ((flags (map 'strategy 'one_for_one
                    'intensity 3
                    'period 10))
        (children (list (map 'id 'counter
                             'start #(counter start_link ())
                             'restart 'permanent
                             'shutdown 5000
                             'type 'worker))))
    (tuple 'ok (tuple flags children))))

Gleam supervises with its own typed module - recent versions use gleam/otp/static_supervisor, which mirrors the OTP strategies (OneForOne, OneForAll, RestForOne) while keeping child startup type-checked:

import gleam/otp/static_supervisor as supervisor
import gleam/otp/supervision

pub fn start() {
  supervisor.new(supervisor.OneForOne)
  |> supervisor.add(supervision.worker(start_counter))
  |> supervisor.start
}

And Luerl, once more, sits below this layer: there is no Lua supervisor because there is no Lua process. The robust pattern is one Erlang/Elixir worker per Lua interpreter, supervised normally. If a script misbehaves - runs away, errors, exhausts a step budget - the host worker crashes and the supervisor restarts it with a fresh interpreter state, giving untrusted Lua the same fault tolerance as native BEAM code without Lua needing to know OTP exists.

How it fits together

Stack the pieces and the philosophy becomes concrete:

  1. A worker is a gen_server written only for the cases it understands.
  2. When it meets a case it does not, it crashes - no defensive scaffolding required.
  3. Because it was started with start_link under a supervisor, it is linked to that supervisor, so its exit signal reaches the supervisor.
  4. The supervisor, trapping exits, applies its restart strategy and stands up a clean replacement in a known-good state.
  5. If failures come too fast, the supervisor escalates by crashing too, and recovery moves up the tree.
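
Steps 1 to 4 above can be sketched end to end. Everything named here is an assumption made for illustration: the module heal_demo, the registered name heal_worker, and the message shapes. To keep the sketch in a single module the worker is a bare receive loop rather than a gen_server. It understands only numeric additions, crashes on anything else, and comes back with its count reset to 0.

```erlang
-module(heal_demo).
-behaviour(supervisor).
-export([demo/0, start_worker/0, init/1]).

demo() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    First = whereis(heal_worker),
    heal_worker ! {add, 5},
    5 = total(),
    %% Step 2: adding an atom raises badarith and the worker crashes.
    heal_worker ! {add, oops},
    %% Steps 3 and 4: the supervisor sees the exit and starts a replacement.
    Second = wait_for_new_worker(First),
    Result = {First =/= Second, total()},
    stop(Sup),
    Result.

%% Called by the supervisor: spawn, link, and register the worker.
start_worker() ->
    Pid = spawn_link(fun() -> loop(0) end),
    true = register(heal_worker, Pid),
    {ok, Pid}.

init([]) ->
    Flags = #{strategy => one_for_one, intensity => 3, period => 10},
    Specs = [#{id => worker, start => {?MODULE, start_worker, []}}],
    {ok, {Flags, Specs}}.

%% Step 1: the worker handles only the cases it understands.
loop(Count) ->
    receive
        {add, N} -> loop(Count + N);
        {total, From, Ref} -> From ! {Ref, Count}, loop(Count)
    end.

total() ->
    Ref = make_ref(),
    heal_worker ! {total, self(), Ref},
    receive {Ref, Count} -> Count after 1000 -> timeout end.

%% Poll until a different process holds the registered name.
wait_for_new_worker(Old) ->
    case whereis(heal_worker) of
        Pid when is_pid(Pid), Pid =/= Old -> Pid;
        _ -> timer:sleep(10), wait_for_new_worker(Old)
    end.

%% Shut the supervisor down without taking the caller with it.
stop(Sup) ->
    Ref = monitor(process, Sup),
    unlink(Sup),
    exit(Sup, shutdown),
    receive {'DOWN', Ref, process, Sup, _} -> ok after 5000 -> timeout end.
```

The accumulated count of 5 is lost on restart. That is the "known-good state" trade: anything that must survive a crash has to live outside the process that crashes.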

Isolation makes crashing safe; links and monitors make it observable; supervisors make it automatic. The result is the BEAM's signature property - software that heals itself - and the reason "let it crash" is not recklessness but one of the most reliable engineering strategies ever shipped.