OTP & the Actor Model

The shared inheritance of all BEAM languages: processes, message passing, gen_server, supervision trees, and the "let it crash" path to fault tolerance.

Processes & the Actor Model

Why the BEAM's unit of concurrency is the share-nothing process, and how that implements the actor model.

Almost every concurrency story you've heard from other ecosystems is about threads sharing memory and then fighting over it with locks, mutexes, and atomics. The BEAM - the virtual machine that runs Erlang, Elixir, Gleam, LFE, and Luerl - throws that model out. Its unit of concurrency is the process, and the rule is simple: processes share nothing. They do not see each other's variables. The only way one process can affect another is by sending it a message.

This is, in practice, the actor model (Carl Hewitt, 1973), although Erlang's designers arrived at it independently rather than setting out to implement it. An actor is an isolated entity that, in response to a message, can do three things: send messages to other actors, create new actors, and decide how it will handle the next message. That's the whole vocabulary. The BEAM's processes are actors, and message passing is the entire concurrency API.

These are not OS threads

A BEAM process is not an operating-system thread or process. It is a lightweight, VM-managed green process. Spawning one costs a few microseconds and a couple of kilobytes of memory, so a single node routinely runs hundreds of thousands to millions of them. The VM has its own preemptive scheduler (one per CPU core by default) that time-slices processes fairly - a runaway loop in one process cannot starve the others, because the scheduler counts reductions (work units) and switches when a budget is hit.

Each process has:

  • its own heap (garbage-collected independently, so a GC pause in one process never freezes the whole system),
  • a mailbox (an inbox queue of messages waiting to be read),
  • a unique process identifier, the pid.
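To get a feel for how cheap these processes are, here is a minimal sketch that spawns N idle processes and reports the average spawn time. The module name spawn_many and the function run/1 are made up for this illustration, and the timing you see depends entirely on your machine.

```erlang
-module(spawn_many).
-export([run/1]).

%% Spawn N idle processes, count how many are alive, then stop them all.
%% Returns {AliveCount, AverageMicrosecondsPerSpawn}.
run(N) when is_integer(N), N > 0 ->
    %% Each process just parks in receive until it is told to stop.
    Idle = fun() -> receive stop -> ok end end,
    {Micros, Pids} = timer:tc(fun() ->
        [spawn(Idle) || _ <- lists:seq(1, N)]
    end),
    Alive = length([P || P <- Pids, is_process_alive(P)]),
    lists:foreach(fun(P) -> P ! stop end, Pids),
    {Alive, Micros / N}.
```

Calling spawn_many:run(100000) in a shell is a reasonable first experiment. The VM has a configurable upper limit on live processes (the +P flag to erl), so very large values of N may need that raised.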

Spawning a process

In Erlang you spawn a function. The call returns a pid immediately; the new process runs concurrently.

Pid = spawn(fun() ->
    io:format("hello from ~p~n", [self()])
end),
io:format("spawned ~p~n", [Pid]).

Elixir wraps the same VM primitive in spawn/1:

pid = spawn(fn -> IO.puts("hello from #{inspect(self())}") end)
IO.puts("spawned #{inspect(pid)}")

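Gleam, being statically typed, exposes processes through the gleam/erlang/process module of the gleam_erlang package. In the releases this example targets you start a process with process.start (the second argument says whether to link the new process to its parent), and the spawned function's type is checked at compile time. More recent gleam_erlang releases replace it with process.spawn and process.spawn_unlinked, so check the version you have installed: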

import gleam/erlang/process
import gleam/io

pub fn main() {
  let _pid = process.start(fn() { io.println("hello from a process") }, True)
  io.println("spawned")
}

LFE (Lisp Flavoured Erlang) is Erlang with s-expressions, so spawn is the same BEAM primitive, just parenthesized:

(let ((pid (spawn (lambda () (lfe_io:format "hello from ~p~n" (list (self)))))))
  (lfe_io:format "spawned ~p~n" (list pid)))

Why isolation is the whole point

Because processes share no memory, two facts fall out for free:

  1. There are no data races. You cannot corrupt another process's state by writing to it, because you have no reference to write to. There is nothing to lock.
  2. Failure is contained. When a process crashes, only its heap dies. The rest of the system keeps running. This is the seed of the "let it crash" philosophy you'll meet in lesson 5.
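The second point can be observed directly. The sketch below (isolation_demo is a hypothetical module name) spawns an unlinked process that exits abnormally and then checks who is still alive. It waits with a short sleep purely to keep the example simple.

```erlang
-module(isolation_demo).
-export([run/0]).

%% Returns {ChildAlive, ParentAlive} after the child has crashed.
run() ->
    %% Plain spawn/1 creates no link, so the child's exit affects nobody else.
    Child = spawn(fun() -> exit(boom) end),
    %% Give the child time to run and die (a sleep is crude but enough here).
    timer:sleep(100),
    {is_process_alive(Child), is_process_alive(self())}.
```

The expected result is {false, true}: the child is gone and the process that spawned it carries on untouched.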

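The trade-off is that communication means copying data into messages: the BEAM has no shared mutable objects to pass by reference. (The notable optimization is that large binaries live in a shared, reference-counted area and are passed by reference, which is safe because they are immutable.) For the message sizes typical of well-designed systems this is cheap, and it's what makes the isolation guarantees real rather than aspirational.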

Keep one mental model for the rest of this track: a BEAM program is a swarm of tiny, isolated processes that cooperate only by mailing each other.

Message Passing & Mailboxes

How processes talk: send, mailboxes, selective receive, request/reply, and timeouts.

If processes share nothing, then messages are everything. A message is just a value - a number, an atom, a tuple, a list, whatever - copied from the sender's heap into the receiver's mailbox. The mailbox is an ordered queue: messages arrive in the order they were sent (between any given pair of processes), and the receiver pulls them out when it's ready.

Send and receive in Erlang

You send with the ! operator (Pid ! Message) and read with a receive block. receive does selective matching: it scans the mailbox and removes the first message that matches one of its patterns, running the matching clause.

% A tiny echo loop
loop() ->
    receive
        {From, Text} ->
            From ! {self(), "you said: " ++ Text},
            loop();
        stop ->
            io:format("echo stopping~n")
    end.

start() ->
    Echo = spawn(fun loop/0),
    Echo ! {self(), "hi"},
    receive
        {Echo, Reply} -> io:format("~s~n", [Reply])
    end,
    Echo ! stop.

Two things to notice. First, there is no shared reply channel - if you want an answer, you put your own pid (self()) inside the message so the other process knows where to write back. Second, the loop calls itself (loop()) to wait for the next message. This tail-recursive "receive then loop" is the canonical shape of a stateful process; state is carried as arguments to the loop function, never as a mutable variable.
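The echo loop above carries no state, so here is a small sketch of the same shape with state threaded through the loop argument. The module loop_counter and its message formats are invented for this example and are not an OTP convention.

```erlang
-module(loop_counter).
-export([start/0, bump/1, value/1]).

%% Start a counter process whose state is the argument to loop/1.
start() ->
    spawn(fun() -> loop(0) end).

%% Fire-and-forget increment.
bump(Pid) ->
    Pid ! bump,
    ok.

%% Ask for the current count; our own pid goes in the request.
value(Pid) ->
    Pid ! {value, self()},
    receive
        {count, Pid, N} -> N
    after 1000 ->
        timeout
    end.

%% The "variable" N only changes by looping with a new argument.
loop(N) ->
    receive
        bump ->
            loop(N + 1);
        {value, From} ->
            From ! {count, self(), N},
            loop(N)
    end.
```

After P = loop_counter:start() and two calls to loop_counter:bump(P), loop_counter:value(P) returns 2, because messages from one sender to one receiver are handled in the order they were sent.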

The same idea in Elixir

Elixir spells the operator send/2 and uses receive ... end with pattern-matched clauses:

defmodule Echo do
  def loop do
    receive do
      {from, text} ->
        send(from, {self(), "you said: " <> text})
        loop()
      :stop ->
        IO.puts("echo stopping")
    end
  end
end

echo = spawn(&Echo.loop/0)
send(echo, {self(), "hi"})
receive do
  {^echo, reply} -> IO.puts(reply)
end
send(echo, :stop)

(The ^echo is a pin: it means "match against the existing value of echo," not "bind a new variable.")

Selective receive and message order

receive does not have to take the first message in the queue - it takes the first message that matches a pattern. Unmatched messages stay in the mailbox in order, to be considered by a later receive. This is powerful (you can wait for one specific reply while ignoring noise) but also a footgun: if messages arrive that never match any clause, they pile up forever and the mailbox grows without bound. A long mailbox also makes each selective scan slower, because every receive has to step over the unmatched messages again. Real systems guard against this - typically by having a catch-all clause or, better, by using the OTP behaviours from the next lesson, which take each message off the mailbox in arrival order and hand anything unexpected to a callback instead of leaving it queued.
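A minimal sketch of selective receive in action: a process sends itself two messages and then reads them out of arrival order. The module name selective and the {Tag, Payload} message shape are assumptions made for this example.

```erlang
-module(selective).
-export([run/0]).

%% Returns the payloads in the order they were received.
run() ->
    %% A unique tag so this demo only ever matches its own messages.
    Tag = make_ref(),
    self() ! {Tag, first},
    self() ! {Tag, urgent},
    %% Skips over {Tag, first}, which stays queued, and takes {Tag, urgent}.
    A = receive {Tag, urgent} -> urgent end,
    %% Now the older message is picked up.
    B = receive {Tag, Other} -> Other end,
    [A, B].
```

selective:run() returns [urgent, first], even though first was sent earlier.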

Timeouts and asynchrony

Sends are asynchronous and non-blocking: send returns immediately whether or not the receiver is alive or ever reads the message. There is no delivery guarantee and no backpressure built in - those are properties you design on top. To avoid blocking forever while waiting for a reply, receive takes an after clause:

receive
    {ok, Result} -> Result
after 5000 ->
    {error, timeout}
end.

This synchronous request/reply built from asynchronous sends - stamp your pid on the request, receive the matching reply, fall back on a timeout - is so common that OTP's gen_server packages it up for you as call. You'll almost never write raw receive loops in production code, but understanding them is essential, because every higher-level abstraction on the BEAM is built out of exactly this: send, mailbox, selective receive, loop.
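As a rough sketch of that pattern written by hand, the module below tags each request with a fresh reference so that only the matching reply is accepted, and gives up after a timeout. The names rpc_sketch, start_echo/0 and call/3, and the message tuples, are invented here. The real gen_server:call does more, for example it also monitors the server so a dead server is noticed before the timeout.

```erlang
-module(rpc_sketch).
-export([start_echo/0, call/3]).

%% A trivial server that answers every request with {echo, Msg}.
start_echo() ->
    spawn(fun echo_loop/0).

echo_loop() ->
    receive
        {request, From, Ref, Msg} ->
            From ! {reply, Ref, {echo, Msg}},
            echo_loop();
        stop ->
            ok
    end.

%% Synchronous call built from two asynchronous sends.
call(Pid, Msg, Timeout) ->
    Ref = make_ref(),
    Pid ! {request, self(), Ref, Msg},
    receive
        %% Ref is already bound, so only the reply to THIS request matches.
        {reply, Ref, Reply} -> {ok, Reply}
    after Timeout ->
        {error, timeout}
    end.
```

With P = rpc_sketch:start_echo(), the call rpc_sketch:call(P, hi, 1000) returns {ok, {echo, hi}}. Once P has been sent stop, the same call returns {error, timeout}, because sending to a finished process succeeds silently and no reply ever comes.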

gen_server: the Workhorse Behaviour

OTP's generic server: call vs cast, the callbacks, and how state is threaded without mutation.

Writing raw spawn/receive/loop code by hand is error-prone: you have to remember the request/reply protocol, handle unexpected messages, manage state, support clean shutdown, and deal with timeouts - the same plumbing every single time. OTP (the Open Telecom Platform, the standard library that ships with Erlang) solves this with behaviours. A behaviour is a generic, battle-tested process implementation that handles all the plumbing and calls your callbacks for the parts that are actually specific to your problem.

The most important behaviour by far is gen_server - a generic server process. It owns some state, accepts synchronous and asynchronous requests, and is exactly the right tool for "a process that holds state and responds to messages."

The callbacks you implement

A gen_server separates two kinds of request:

  • call - synchronous request/reply. The caller blocks until your handle_call/3 returns a reply (with a default 5-second timeout). Use it when you need an answer or need backpressure.
  • cast - asynchronous, fire-and-forget. handle_cast/2 runs but sends nothing back. Use it when you don't care about a reply.

You also implement init/1 (set up the initial state) and optionally handle_info/2 (for raw messages that aren't calls or casts, like a 'DOWN' from a monitor or a timer tick) and terminate/2 (cleanup).

A counter as a gen_server (Erlang)

-module(counter).
-behaviour(gen_server).

%% public API
-export([start_link/0, bump/0, value/0]).
%% callbacks
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).

bump()  -> gen_server:cast(?MODULE, bump).        %% async
value() -> gen_server:call(?MODULE, value).        %% sync

%% --- callbacks ---
init(Start) ->
    {ok, Start}.                                   %% State = Start

handle_cast(bump, N) ->
    {noreply, N + 1}.                              %% new State = N+1

handle_call(value, _From, N) ->
    {reply, N, N}.                                 %% reply N, keep State

Notice the two layers. The bottom three functions are callbacks the behaviour invokes; they each return a tagged tuple ({ok, State}, {noreply, State}, {reply, Value, State}) that tells gen_server what to do next. The top three functions (start_link, bump, value) are the public API that callers actually use - they hide the message protocol entirely. A caller just writes counter:bump() and never thinks about pids or receive.

The same server in Elixir

Elixir's GenServer is the identical behaviour with friendlier names. start_link, cast, and call map one-to-one:

defmodule Counter do
  use GenServer

  # public API
  def start_link(_), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def bump,  do: GenServer.cast(__MODULE__, :bump)
  def value, do: GenServer.call(__MODULE__, :value)

  # callbacks
  @impl true
  def init(start), do: {:ok, start}

  @impl true
  def handle_cast(:bump, n), do: {:noreply, n + 1}

  @impl true
  def handle_call(:value, _from, n), do: {:reply, n, n}
end

Where the state lives

There is no mutable variable anywhere in these examples. The server's state is just a value that gen_server threads through your callbacks: it hands you the current state, you return the next state, and the behaviour holds onto it between messages inside its own receive loop. Because that loop processes one message at a time, every gen_server is automatically serialized - your handle_call/handle_cast code never runs concurrently with itself, so you get the safety of a lock without ever writing one. The single mailbox is the synchronization mechanism.
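The serialization claim can be checked with a small experiment: many client processes increment one server at the same time, and no update is lost. This is a sketch under assumed names (serial_counter, incr/1, hammer/2), and it uses call rather than cast so each client knows its increments have been applied.

```erlang
-module(serial_counter).
-behaviour(gen_server).

-export([start_link/0, incr/1, value/1, hammer/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() -> gen_server:start_link(?MODULE, 0, []).
incr(Pid)    -> gen_server:call(Pid, incr).
value(Pid)   -> gen_server:call(Pid, value).

init(N) -> {ok, N}.

%% Read-modify-write with no lock: only one message is handled at a time.
handle_call(incr, _From, N)  -> {reply, ok, N + 1};
handle_call(value, _From, N) -> {reply, N, N}.

handle_cast(_Msg, N) -> {noreply, N}.

%% Start Clients processes that each increment Times times; return the total.
hammer(Clients, Times) ->
    {ok, Pid} = start_link(),
    Parent = self(),
    Workers = [spawn(fun() ->
                   lists:foreach(fun(_) -> ok = incr(Pid) end,
                                 lists:seq(1, Times)),
                   Parent ! {done, self()}
               end) || _ <- lists:seq(1, Clients)],
    %% Wait until every client has reported back.
    lists:foreach(fun(W) -> receive {done, W} -> ok end end, Workers),
    Total = value(Pid),
    ok = gen_server:stop(Pid),
    Total.
```

serial_counter:hammer(50, 100) returns 5000 every time, however the 50 clients happen to interleave.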

Other BEAM languages reach the same shape. LFE calls gen_server directly (it's just Erlang). Gleam offers its own typed actor in the gleam/otp/actor module of the gleam_otp package (an OTP-compatible process rather than a thin wrapper around gen_server), where messages are a custom type and the compiler checks that every message variant is handled - bringing static guarantees to OTP-style processes. But the shape is always the same: API functions on top, callbacks underneath, state threaded through, one message at a time.

Linking, Monitoring & Supervision Trees

Links vs monitors, supervisors, restart strategies, and how failure bubbles up the tree.

Isolated processes are safe, but a real system is a web of cooperating processes, and you need to know when one of them dies. The BEAM gives you two primitives for that, and one grand structure built on top of them.

Links: bidirectional, shared fate

A link is a symmetric, bidirectional connection between two processes. If either one exits abnormally (crashes), the other receives an exit signal - and by default that signal propagates the crash: the linked process dies too. This sounds alarming, but it's exactly what you want for tightly-coupled processes that have no reason to live without each other. spawn_link atomically spawns and links in one step (avoiding a race where the child dies before you can link it). This is why OTP start functions are named start_link - the new process is linked to its parent.

%% if this child crashes, the linked parent gets the exit signal too
Child = spawn_link(fun() -> exit(boom) end).

A process can opt out of dying by calling process_flag(trap_exit, true), which converts incoming exit signals into ordinary {'EXIT', Pid, Reason} messages in its mailbox instead of killing it. (The one exception is an exit signal with reason kill, which cannot be trapped.) This is precisely how a supervisor stays alive while its children crash around it.
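A minimal sketch of trapping exits, using a made-up module name trap_demo: the caller links to a child that crashes and receives the crash as a message instead of dying with it.

```erlang
-module(trap_demo).
-export([run/0]).

run() ->
    %% Turn exit signals into messages; remember the previous setting.
    Old = process_flag(trap_exit, true),
    Child = spawn_link(fun() -> exit(boom) end),
    Result =
        receive
            {'EXIT', Child, Reason} -> {child_exited, Reason}
        after 1000 ->
            no_exit_seen
        end,
    %% Restore the flag so the calling process behaves as before.
    process_flag(trap_exit, Old),
    Result.
```

trap_demo:run() returns {child_exited, boom}. Without the process_flag call, the same spawn_link would take the calling process down with the child.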

Monitors: unidirectional, observe-without-dying

Sometimes you want to know a process died without dying yourself, and without affecting it. That's a monitor - unidirectional and non-propagating. You call monitor, and if the target dies you receive a {'DOWN', Ref, process, Pid, Reason} message. Monitors are the right tool when one process merely depends on another (e.g. a client watching a server it called).

Ref = monitor(process, ServerPid),
receive
    {'DOWN', Ref, process, ServerPid, Reason} ->
        io:format("server died: ~p~n", [Reason])
end.

The rule of thumb: link when two processes should share fate; monitor when one wants to observe another's fate.

Supervisors and supervision trees

Now the payoff. A supervisor is an OTP behaviour whose only job is to start child processes, link to them, trap their exits, and restart them according to a policy when they crash. Supervisors don't contain business logic - they contain recovery logic. You arrange your application as a supervision tree: supervisors at the internal nodes, worker processes (your gen_servers) at the leaves. A supervisor can supervise other supervisors, so the tree can be many levels deep.

Each supervisor has a restart strategy:

  • one_for_one - if a child dies, restart only that child. (The default; children are independent.)
  • one_for_all - if any child dies, kill and restart all children. (Use when children depend on each other's shared state.)
  • rest_for_one - restart the crashed child and any children started after it. (Use for an ordered startup dependency chain.)

Each supervisor also has a restart intensity limit (e.g. "at most 3 restarts in 5 seconds"), counted across all of its children; if restarts happen more often than that, the supervisor gives up: it terminates its children and then itself, handing the problem to its parent supervisor. Failures bubble up the tree until some supervisor can absorb them - or the top-level supervisor gives up and the whole application stops.
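To watch a restart happen, the sketch below puts a supervisor and a deliberately minimal worker in one module, kills the worker, and waits for the supervisor to start a replacement. Everything named here (restart_demo, start_worker/0, demo/0, the polling helper) is hypothetical, and a real worker would normally be a gen_server rather than a bare spawned process.

```erlang
-module(restart_demo).
-behaviour(supervisor).

-export([demo/0, start_worker/0]).
-export([init/1]).

%% Supervisor callback: one worker, restarted on its own; give up after
%% more than 3 restarts within 5 seconds.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 5},
    Worker = #{id => worker, start => {?MODULE, start_worker, []}},
    {ok, {SupFlags, [Worker]}}.

%% Runs inside the supervisor, so spawn_link links the worker to it.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

worker_pid(Sup) ->
    [{worker, Child, _Type, _Modules}] = supervisor:which_children(Sup),
    Child.

%% Poll until the supervisor reports a pid different from the old one.
wait_for_restart(_Sup, _Old, 0) ->
    timeout;
wait_for_restart(Sup, Old, Retries) ->
    case worker_pid(Sup) of
        Pid when is_pid(Pid), Pid =/= Old -> Pid;
        _ ->
            timer:sleep(20),
            wait_for_restart(Sup, Old, Retries - 1)
    end.

demo() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    Old = worker_pid(Sup),
    exit(Old, kill),                          %% simulate a crash
    New = wait_for_restart(Sup, Old, 50),
    Restarted = is_pid(New) andalso is_process_alive(New),
    ok = gen_server:stop(Sup),                %% a supervisor is a gen_server underneath
    {restarted, Restarted}.
```

restart_demo:demo() returns {restarted, true}. Expect a supervisor report in the log as well, since the child was killed rather than shut down cleanly.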

A supervisor in Elixir

defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(_), do: Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    children = [
      Counter,                       # a worker gen_server
      {Cache, name: :main_cache}     # a worker with start args
    ]
    Supervisor.init(children, strategy: :one_for_one)
  end
end

The same structure in Erlang's supervisor behaviour returns the supervisor flags (strategy plus restart intensity) and a list of child specs from init/1:

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 5},
    Children = [
        #{id => counter, start => {counter, start_link, []}}
    ],
    {ok, {SupFlags, Children}}.

Why this matters

Supervision trees turn fault tolerance from a hopeful afterthought into the shape of the program. You don't sprinkle try/catch everywhere trying to anticipate every failure. Instead you declare, up front, the static tree of who restarts whom and how. A crash becomes a routine, recoverable event rather than a catastrophe - which is exactly the mindset of the final lesson.

"Let It Crash" & Fault Tolerance

The philosophy that ties it together: separate the happy path from recovery and let supervisors heal.

You've now met every piece: isolated processes, messages, gen_server, links, monitors, and supervisors. The philosophy that ties them together has a deliberately provocative name: "let it crash."

Defensive programming, and why it fails

In most languages you're taught to be defensive: anticipate every bad input, wrap risky calls in try/catch, check for nulls, handle the error right where it happens. The problem is that this scatters error-handling logic through your code, doubling its size and complexity, and - worse - you can only handle the failures you thought of. The bug that actually takes you down in production is, almost by definition, the one you didn't anticipate. Defensive code gives you a false sense of completeness.

The BEAM bet: separate the happy path from recovery

"Let it crash" makes a different bet. Write the happy path clearly, assuming inputs are valid and the world is sane. If a process hits a state it doesn't know how to handle - a bad message, a failed assertion, an unexpected nil - let it crash. Don't catch it; don't try to limp along in a corrupted state. The process dies, its (possibly corrupt) state is thrown away, and its supervisor restarts it from a known-good initial state. The error-handling logic lives in one place (the supervision tree) instead of being smeared across every function.

This works only because of properties we've already established:

  • Isolation means a crash destroys exactly one process's state and nothing else - no shared memory to corrupt.
  • Supervisors mean recovery is automatic, declarative, and rate-limited.
  • A clean restart from a fresh init/1 is often the most reliable way to fix an unknown bad state. Most transient bugs (a momentary network blip, a race, a weird input) simply don't recur after a restart. This is sometimes called fast-fail and recover - the system self-heals.

Jim Gray's research on fault tolerance distinguished Bohrbugs (deterministic, reproducible) from Heisenbugs (transient, timing-dependent). Restarting won't fix a Bohrbug - but it reliably clears Heisenbugs, which dominate in long-running concurrent systems. "Let it crash" is engineering for the bugs that actually happen at scale.

Crash on the assertion, not after it

In practice you write code that asserts its expectations via pattern matching and "happy-path" matches, and lets a mismatch crash:

handle_call({withdraw, Amount}, _From, Balance) when Amount =< Balance ->
    {reply, ok, Balance - Amount}.
%% A withdraw larger than the balance matches NO clause ->
%% the gen_server crashes, the supervisor restarts it clean.

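Compare the defensive version you'd write elsewhere - nested ifs, an error code, a partly-mutated balance to unwind. The OTP version says: a withdrawal larger than the balance should never reach this server, so it is a programming error; fail loudly and let the supervisor sort it out. Two consequences are worth knowing. The process blocked in gen_server:call exits too, unless it catches that exit. And the restarted server begins again from whatever init/1 builds, so any state that must survive a crash has to live somewhere else. If over-withdrawals were an expected user mistake, you would instead reply with something like {error, insufficient_funds} and keep running.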

The same idiom in Elixir leans on pattern matching in function heads and {:ok, _} / {:error, _} tuples - and crashes (often via a !-suffixed function or a failed match) when something is genuinely wrong:

def handle_call({:withdraw, amount}, _from, balance) when amount <= balance do
  {:reply, :ok, balance - amount}
end
# no matching clause for an over-withdrawal -> crash -> supervised restart

What "crash" does NOT mean

"Let it crash" is widely misunderstood, so be precise:

  1. It does not mean ignore errors. Expected outcomes (a user typo, a 404, a validation failure) are not crashes - they're normal results you return as {:error, reason} and handle in the happy path. You crash only on the genuinely unexpected - the states your code has no sane way to continue from.
  2. It does not mean no error handling - it means error handling lives in the supervision tree, designed up front, not in ad-hoc try/catch everywhere.
  3. It does not mean crash the whole node. Isolation + supervision mean a crash is local; the rest of the system never notices.
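The first point is the one most often blurred, so here is a small sketch of the split. The module age_input, the 0 to 150 range, and the invalid_age reason are all invented for the illustration. Bad user text is an expected outcome and comes back as a tagged tuple, while a call with the wrong type of argument is a programming error and is allowed to crash.

```erlang
-module(age_input).
-export([parse/1]).

%% Expected failure (the user typed something that is not a sensible age):
%% returned as an ordinary value for the caller to handle.
%% Unexpected failure (Text is not a string at all): no clause matches,
%% so the caller crashes with function_clause.
parse(Text) when is_list(Text) ->
    case string:to_integer(Text) of
        {Age, []} when is_integer(Age), Age >= 0, Age =< 150 ->
            {ok, Age};
        _ ->
            {error, invalid_age}
    end.
```

age_input:parse("42") returns {ok, 42} and age_input:parse("abc") returns {error, invalid_age}. age_input:parse(42) raises function_clause, which in a supervised process means a crash and a restart.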

The result: nine nines

This combination - isolation, message passing, supervision, and let-it-crash - is why the BEAM became famous for reliability. The canonical example is the Ericsson AXD301 telecom switch, written in Erlang, which reportedly achieved 99.9999999% availability ("nine nines") - roughly 32 milliseconds of downtime per year. No single language feature delivers that. It emerges from the whole model you've learned in this track: small isolated actors that talk only by message, organized into supervision trees, where crashing is not a failure of the design but a part of it.
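For a sense of scale, the arithmetic behind an availability figure is easy to reproduce. This sketch assumes a year of 365.25 days. The module name nines is made up.

```erlang
-module(nines).
-export([downtime_ms_per_year/1]).

%% Milliseconds of downtime per year allowed by N nines of availability,
%% e.g. N = 9 means 99.9999999% available, i.e. unavailable 1.0e-9 of the time.
downtime_ms_per_year(Nines) when is_integer(Nines), Nines > 0 ->
    SecondsPerYear = 365.25 * 24 * 60 * 60,
    Unavailability = math:pow(10, -Nines),
    SecondsPerYear * Unavailability * 1000.
```

nines:downtime_ms_per_year(9) comes to about 31.6 milliseconds, while five nines allows a little over five minutes per year.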