Hot Code Loading: upgrading a running system

How the BEAM swaps a module's code underneath running processes without dropping a single call - the two-version rule, code:load_file, code_change, and the release machinery telephone switches demanded.

Most deployments look the same: stop the old process, start the new one, and apologise for the gap in between. Behind a load balancer you can hide that gap, but the program itself still dies and is reborn - open connections drop, in-memory state evaporates, anything mid-flight is lost.

The BEAM was built by people who could not accept that gap. A telephone exchange that drops calls during an upgrade is a telephone exchange that drops calls, and "we were deploying" is not an answer a phone company gives. So Erlang's runtime can do something genuinely rare: replace the code of a running module while processes are still executing it, with no stop, no restart, and no lost state. This is hot code loading, and it reaches surprisingly deep into how the VM is designed.

This article walks through the mechanism from the bottom up: how the BEAM holds two versions of a module at once, how a process crosses from old code to new, the kernel functions that drive it, the OTP code_change callback that migrates state, the full release-upgrade machinery built on top, and why telecom needed every bit of it.

Why a phone switch can never stop

Start with the requirement, because it explains the whole design. In 1998 Ericsson shipped the AXD301, an ATM switch carrying over a million lines of Erlang (later more than two million). Its claim to fame was a reported availability of nine nines - 99.9999999% uptime, which works out to roughly 32 milliseconds of downtime per year.
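The downtime figure follows from simple arithmetic, sketched below so it can be checked for other availability levels. The module name nines and the choice of a 365.25-day year are assumptions made for illustration; they are not part of any reported measurement.

```erlang
-module(nines).
-export([downtime_ms_per_year/1]).

%% Downtime budget in milliseconds per year for an availability of
%% N nines (5 -> 99.999%, 9 -> 99.9999999%).
%% Assumes a 365.25-day year; the module name is illustrative only.
downtime_ms_per_year(N) when is_integer(N), N > 0 ->
    YearMs = 365.25 * 24 * 60 * 60 * 1000,
    YearMs * math:pow(10, -N).
```

Calling nines:downtime_ms_per_year(9) gives about 31.56 ms, which is the "roughly 32 milliseconds" quoted above; five nines, by comparison, allows a little over five minutes per year.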

You do not reach numbers like that by deploying carefully. You reach them by never taking the system down at all - including for the routine, constant business of fixing bugs and shipping features. A carrier-grade switch runs for years; in that time its software will be patched hundreds of times. If every patch meant a reboot, nine nines would be out of reach: a single one-second restart spends about thirty years of that downtime budget. As Joe Armstrong put it, hot swapping - changing code without stopping the system - is a defining feature of the Erlang runtime, not an add-on. The requirement came first; the BEAM mechanism was worked backwards from "the switch must never go down."

That heritage is why hot code loading is built into the virtual machine rather than bolted on by a library. Every language on the BEAM inherits it for free.

The two-version rule

The foundation is a deceptively simple rule. For any given module, the BEAM keeps at most two versions loaded at once: a current version and an old version. Both are valid, and - this is the key part - both can run concurrently.

When you load a module the first time, it becomes current. Load it again and the previously current code is demoted to old; the freshly loaded code becomes the new current. Processes that were already executing inside the old code keep running there, undisturbed. New external calls resolve to the current version. Two generations of the same module coexist in memory, each serving the processes that belong to it.

            load v1                 load v2                 load v3
   (none) ─────────▶  current=v1 ─────────▶  current=v2 ─────────▶  current=v3
                                              old=v1                  old=v2
                                                              (v1 purged: any process
                                                               still in v1 is killed)

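That last step is the hard limit. Because there are only two slots, a third version cannot arrive until the old slot has been emptied, and emptying it means purging: any process still lingering in the purged version is terminated. How the purge happens depends on the entry point. The shell's l(Module) and c(Module) purge first and then load, so stragglers are killed without warning; a bare code:load_file/1 instead refuses with {error, not_purged} until the old code has been purged explicitly. The mantra is worth memorising: you can be one version behind, but not two. Reload a module twice in quick succession from the shell, before stragglers have migrated out of the original, and you will kill them.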

This is not a bug; it is a forcing function. It guarantees the system can never accumulate an unbounded pile of stale code, and it gives upgrade tooling a clear contract: get everyone onto the current version before you load again.
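The two-slot behaviour can be observed directly on a node. The following is a minimal sketch, not a production tool: it builds a throwaway module in memory (hcl_demo, a hypothetical name), loads three versions with code:load_binary/3, and asserts what happens at each step. Note that load_binary/3, like load_file/1, does not purge for you; it refuses the third load with {error, not_purged}, whereas the shell helper l/1 purges first.

```erlang
-module(two_version_demo).
-export([run/0]).

-define(MOD, hcl_demo).   % hypothetical throwaway module, built in memory

run() ->
    reset(),
    {module, ?MOD} = load(1),
    Pid = spawn(?MOD, loop, [self()]),
    receive {Pid, running, 1} -> ok end,      % Pid is now parked inside v1

    {module, ?MOD} = load(2),                 % v1 -> old, v2 -> current
    2 = ?MOD:version(),                       % external calls reach v2
    true = erlang:check_old_code(?MOD),
    true = is_process_alive(Pid),             % still in v1, undisturbed

    {error, not_purged} = load(3),            % both slots full: load refused
    false = code:soft_purge(?MOD),            % Pid still lingers in v1

    Ref = erlang:monitor(process, Pid),
    true = code:purge(?MOD),                  % frees the old slot, kills Pid
    receive {'DOWN', Ref, process, Pid, killed} -> ok end,

    {module, ?MOD} = load(3),                 % v2 -> old, v3 -> current
    3 = ?MOD:version(),
    ok.

%% Unload leftovers from a previous run so run/0 is repeatable.
reset() ->
    code:purge(?MOD),
    code:delete(?MOD),
    code:purge(?MOD),
    ok.

%% Compile version Vsn in memory and hand the binary to the code server.
load(Vsn) ->
    {ok, ?MOD, Bin} = compile:forms(forms(Vsn)),
    code:load_binary(?MOD, "hcl_demo.beam", Bin).

%% Source of the demo module: version/0 returns Vsn, and loop/1 reports
%% in and then blocks in receive, so the process stays inside this version.
forms(Vsn) ->
    Src = lists:flatten(io_lib:format(
        "-module(hcl_demo). "
        "-export([version/0, loop/1]). "
        "version() -> ~p. "
        "loop(Parent) -> Parent ! {self(), running, version()}, "
        "                receive stop -> ok end. ", [Vsn])),
    {ok, Tokens, _EndLoc} = erl_scan:string(Src),
    to_forms(Tokens, [], []).

%% Split the token stream at each dot and parse one form at a time.
to_forms([], [], Forms) ->
    lists:reverse(Forms);
to_forms([{dot, _} = Dot | Rest], Acc, Forms) ->
    {ok, Form} = erl_parse:parse_form(lists:reverse([Dot | Acc])),
    to_forms(Rest, [], [Form | Forms]);
to_forms([Tok | Rest], Acc, Forms) ->
    to_forms(Rest, [Tok | Acc], Forms).
```

Running two_version_demo:run() should return ok; the refused third load also makes the code server log a "not_purged" error report, which is expected here.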

Crossing from old code to new

Here is the subtle part. If a process is sitting in a long-running loop in the old version, what makes it ever start running the new version? It does not happen automatically. A process moves to current code only when it makes a fully qualified function call - a call that names the module explicitly, like Module:Function(...), as opposed to a local call to a bare function(...).

The distinction is precise:

A local call (loop()) stays in the same version of the module the caller is already running. An old-code process that calls itself locally keeps running old code, forever.
A fully qualified / external call (m:loop()) is resolved through the module's export table at call time, and always lands in the current version.

So the classic Erlang idiom for an upgradable server loop is to bounce through a fully qualified call whenever it might need to pick up new code:

-module(m).
-export([loop/0]).

loop() ->
    receive
        code_switch ->
            %% Fully qualified call: jump to the CURRENT version of m.
            %% If new code is loaded, the new loop/0 runs from here on.
            m:loop();
        Other ->
            handle(Other),
            loop()      %% local call: stay in this (possibly old) version
    end.

%% Placeholder so the module compiles; real message handling goes here.
handle(_Msg) ->
    ok.

Send that process a code_switch message after loading new code and it crosses the boundary deliberately, at a point where its state is known-good. This is the raw machinery; in practice you almost never write it by hand, because OTP behaviours do the crossing for you (more on that below). But understanding it explains everything above it: the reason gen_server is upgradable is that OTP's generic loop already makes the fully qualified call into your callback module on every message.
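To watch a process cross the boundary, the sketch below builds a small looping module in memory (hcl_switch, a hypothetical name modelled on the idiom above), reloads it, and checks which version answers before and after a code_switch message. It is an illustration under those assumptions, with the version number baked into each compiled copy so the running code can report it.

```erlang
-module(code_switch_demo).
-export([run/0]).

-define(MOD, hcl_switch).   % hypothetical throwaway module, built in memory

run() ->
    reset(),
    {module, ?MOD} = load(1),
    Pid = spawn(?MOD, loop, []),
    1 = ask(Pid),                     % Pid is running version 1

    {module, ?MOD} = load(2),
    1 = ask(Pid),                     % local calls keep Pid in old code
    false = code:soft_purge(?MOD),    % so the old version is still in use

    Pid ! code_switch,                % triggers the fully qualified call
    2 = ask(Pid),                     % Pid now answers from current code
    true = code:soft_purge(?MOD),     % nobody left in old code: safe purge

    Pid ! stop,
    ok.

%% Ask the looping process which version of the code it is executing.
ask(Pid) ->
    Pid ! {version, self()},
    receive {version, V} -> V after 1000 -> timeout end.

%% Unload leftovers from a previous run so run/0 is repeatable.
reset() ->
    code:purge(?MOD),
    code:delete(?MOD),
    code:purge(?MOD),
    ok.

load(Vsn) ->
    {ok, ?MOD, Bin} = compile:forms(forms(Vsn)),
    code:load_binary(?MOD, "hcl_switch.beam", Bin).

%% The loop recurses locally except on code_switch, where it makes the
%% fully qualified call hcl_switch:loop() and lands in current code.
forms(Vsn) ->
    Src = lists:flatten(io_lib:format(
        "-module(hcl_switch). "
        "-export([loop/0]). "
        "loop() -> receive "
        "  {version, From} -> From ! {version, ~p}, loop(); "
        "  code_switch -> hcl_switch:loop(); "
        "  stop -> ok "
        "end. ", [Vsn])),
    {ok, Tokens, _EndLoc} = erl_scan:string(Src),
    to_forms(Tokens, [], []).

to_forms([], [], Forms) ->
    lists:reverse(Forms);
to_forms([{dot, _} = Dot | Rest], Acc, Forms) ->
    {ok, Form} = erl_parse:parse_form(lists:reverse([Dot | Acc])),
    to_forms(Rest, [], [Form | Forms]);
to_forms([Tok | Rest], Acc, Forms) ->
    to_forms(Rest, [Tok | Acc], Forms).
```

The process keeps its pid and its mailbox throughout; only the code it executes changes, and only at the moment it chooses.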

The kernel functions that drive it

All of this is orchestrated by the code server in the kernel application. You can drive it directly, which is exactly what tooling and the Erlang shell do under the hood.

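%% Load (or reload) an already-compiled module from the code path.
%% load_file/1 does not compile; the shell's c(Module) compiles, then loads.
code:load_file(my_module).
%% => {module, my_module}
%% Demotes any currently loaded my_module to "old", loads the new one as "current".
%% => {error, not_purged} if old code from an earlier load is still present.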

%% Force-remove the old version. Processes still running it are KILLED.
code:purge(my_module).
%% => true  if it had to kill anyone, false otherwise

%% The polite alternative: refuse to purge if anyone is still in old code.
code:soft_purge(my_module).
%% => true if purged successfully, false if processes were still running old code

%% Is the module loaded at all? Returns {file, Loaded} or false.
code:is_loaded(my_module).

%% Ask whether OLD code still exists for a module (this BIF lives in the
%% erlang module, not the code module):
erlang:check_old_code(my_module).   %% true if there is lingering old code

The pairing of purge/1 and soft_purge/1 is the whole safety story in miniature. purge/1 is the blunt instrument - it guarantees the old slot is freed, killing stragglers if it must, which is what lets you always load again. soft_purge/1 is the careful one - it fails rather than kill anyone, so an upgrade script can check "has everybody migrated yet?" and back off if not. A hand-rolled script can loop on soft_purge, nudging stragglers with code_switch-style messages, and fall back to a hard purge only as a last resort; the release handler exposes the same choice through the brutal_purge and soft_purge options on its upgrade instructions.
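The "check first, never kill" policy is small enough to sketch in a few lines. The module name safe_reload and the {error, old_code_in_use} return value are illustrative choices, not part of OTP; the two code: calls are the real ones discussed above.

```erlang
-module(safe_reload).
-export([reload/1]).

%% Reload Mod from the code path, but only if no process is still
%% running its old code. This function never kills anything.
reload(Mod) when is_atom(Mod) ->
    case code:soft_purge(Mod) of
        true ->
            %% Old slot is empty (or has just been emptied): safe to load.
            code:load_file(Mod);
        false ->
            %% Stragglers remain in old code; let the caller decide
            %% whether to wait, nudge them, or escalate to code:purge/1.
            {error, old_code_in_use}   % hypothetical error term
    end.
```

A caller might retry reload/1 a few times with a short sleep in between before deciding that a hard purge is acceptable.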

In the shell, l(my_module) is sugar for code:purge/1 followed by code:load_file/1, and c(my_module) compiles the source first and then loads the result. Either way hot loading is the default development experience: edit a file, type c(mod)., and the running node is patched - connections, state, and all.

Migrating state: the code_change callback

Reloading code is the easy half. The hard half is state. A gen_server that has been running for a week is holding a state term whose shape was defined by the old code. If your new version expects a different shape - a new field, a different record, a map instead of a tuple - simply swapping the code leaves the process holding a value the new code cannot understand.

OTP solves this with the code_change/3 callback. During a controlled upgrade, the behaviour calls it once, handing over the process's existing state so you can transform it into the new shape:

code_change(OldVsn, State, Extra) -> {ok, NewState} | {error, Reason}.

OldVsn is the version you are upgrading from (or {down, Vsn} for a downgrade), taken from the module's -vsn(...) attribute, or from the BEAM file's checksum when no such attribute is set.
State is the live state term, in its old shape.
Extra is whatever you passed in the upgrade instruction - handy for parameterising the migration.

A concrete example: version 1 stored a plain integer count; version 2 wants a map so it can also track a timestamp. The callback bridges the gap, so a server that has been counting for days keeps its count:

%% In the NEW (v2) version of the module:
-vsn("2").

%% Old state was just an integer N.  New state is a map.
code_change("1", Count, _Extra) when is_integer(Count) ->
    {ok, #{count => Count, last_bumped => undefined}};
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.   %% nothing to migrate

If you have nothing to migrate, the trivial form is everywhere:

code_change(_OldVsn, State, _Extra) -> {ok, State}.

Elixir wraps the identical callback - it is the same :gen_server underneath - with @impl and the friendlier def:

defmodule Counter do
  use GenServer

  # Upgrading from the integer-state version to the map-state version.
  @impl true
  def code_change("1", count, _extra) when is_integer(count) do
    {:ok, %{count: count, last_bumped: nil}}
  end

  def code_change(_old_vsn, state, _extra), do: {:ok, state}
end

For this to fire correctly, the server is first suspended so it stops handling requests, then asked to change code, then resumed. The release handler does this for you via sys:suspend/1, sys:change_code/4, and sys:resume/1 - and you can do the same by hand for a single process:

sys:suspend(my_server),
sys:change_code(my_server, my_module, "1", []),  %% triggers code_change/3
sys:resume(my_server).

Suspending matters: it freezes the mailbox-draining loop at a clean boundary so the state migration is atomic with respect to incoming messages.
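The suspend, change_code, resume sequence can be tried end to end without building a release. In the sketch below a single module stands in for both versions, which is a simplification for illustration: init/1 starts with the old integer-shaped state, code_change/3 converts it to the map shape, and the handlers accept either. The module name counter_upgrade_demo is hypothetical.

```erlang
-module(counter_upgrade_demo).
-behaviour(gen_server).
-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2, code_change/3]).

%% Starts with the OLD state shape: a bare integer.
init(Start) when is_integer(Start) ->
    {ok, Start}.

handle_call(get, _From, State) ->
    {reply, count(State), State}.

%% Accepts both shapes so the demo works before and after migration.
handle_cast(bump, N) when is_integer(N) ->
    {noreply, N + 1};
handle_cast(bump, #{count := N} = S) ->
    {noreply, S#{count := N + 1,
                 last_bumped := erlang:system_time(second)}}.

count(N) when is_integer(N) -> N;
count(#{count := N}) -> N.

%% The migration itself: integer state becomes a map.
code_change("1", Count, _Extra) when is_integer(Count) ->
    {ok, #{count => Count, last_bumped => undefined}};
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

run() ->
    {ok, Pid} = gen_server:start(?MODULE, 41, []),
    ok = gen_server:cast(Pid, bump),
    42 = gen_server:call(Pid, get),

    ok = sys:suspend(Pid),                        % stop handling requests
    ok = sys:change_code(Pid, ?MODULE, "1", []),  % runs code_change/3
    ok = sys:resume(Pid),                         % carry on with new state

    #{count := 42, last_bumped := undefined} = sys:get_state(Pid),
    42 = gen_server:call(Pid, get),               % the count survived
    ok = gen_server:stop(Pid),
    ok.
```

In a real upgrade the new module version would be loaded between suspend and change_code; here the point is only that the state term is rewritten in place while the process, and its count, live on.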

Releases: appup and relup

Reloading one module by hand is great for development. Production wants something coordinated: dozens of changed modules across several applications, loaded in the right order, with state migrations, all as one transactional step that can be rolled back if anything goes wrong. That is what OTP's release handling provides, and it is built entirely on the primitives above.

The workflow has two artefacts you write or generate:

1. The .appup file describes how to upgrade one application from a previous version to the current one (and back down). Its shape is a version plus an up-list and a down-list of instructions:

%% ch_app.appup - upgrade ch_app to "2", from/to "1"
{"2",
 [{"1", [{load_module, ch3}]}],   %% UP:   from "1", just reload module ch3
 [{"1", [{load_module, ch3}]}]}.  %% DOWN: back to "1", same instruction

The two high-level instructions you reach for most:

{load_module, Module} - simple code replacement. Reload a stateless or compatibly-changed module. No suspend, no code_change; this is just load_file plus a purge, coordinated.
{update, Module, {advanced, Extra}} - synchronised code replacement. For modules whose processes hold state that must migrate. This is the instruction that triggers the full suspend → code_change/3 (with your Extra) → resume dance. There is also {update, Module, supervisor}, used when a supervisor's start specification (the value returned by its init/1) has changed.

2. The relup file describes how to upgrade the whole release - every application together. You rarely write it; systools:make_relup/3,4 generates it from the new and old .rel files plus the per-application .appup files, working out the correct global ordering.
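Returning to the .appup instructions for a moment: for a module whose processes carry state, the file uses update rather than load_module. The fragment below is a sketch with hypothetical names (counter_app for the application, counter for its gen_server callback module) and an empty Extra term.

```erlang
%% counter_app.appup - hypothetical application and module names.
%% Upgrade counter_app to "2" from "1", and allow going back down.
{"2",
 %% UP: suspend counter's processes, load the new code,
 %% call code_change("1", State, []), then resume them.
 [{"1", [{update, counter, {advanced, []}}]}],
 %% DOWN: the same instruction; code_change/3 then receives
 %% {down, Vsn} as its first argument.
 [{"1", [{update, counter, {advanced, []}}]}]}.
```

Whatever term is placed where the empty list sits arrives as the Extra argument of code_change/3, which is how a migration can be parameterised from the upgrade instruction.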

At runtime the release_handler first unpacks the new release package (release_handler:unpack_release/1) and then installs it by walking the relup instructions step by step:

%% On the live node:
release_handler:install_release("2").
%% => {ok, "1", []}    now running v2

%% If it sticks, commit it so it survives a reboot:
release_handler:make_permanent("2").

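If installation hits a recoverable error, install_release/1 returns {error, Reason} and the old release stays in place; after a non-recoverable error the node restarts into the old version. Either way the upgrade is, in effect, transactional. And until you call make_permanent/1, a plain reboot reverts to the previously permanent release, giving you a safety net while you watch the new version behave.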
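Put together, the runtime side of an upgrade is a short sequence of release_handler calls. The wrapper below is a hedged sketch rather than a recommended deployment script: the module name upgrade_sketch and the shapes of its return tuples are invented for illustration, it assumes the sasl application is running on the node, and it assumes the release package has already been copied into the node's releases directory.

```erlang
-module(upgrade_sketch).
-export([upgrade/1]).

%% RelPackage is the release package name without the ".tar.gz" suffix
%% (for example "ch_rel-2"; the name is hypothetical).
upgrade(RelPackage) ->
    case release_handler:unpack_release(RelPackage) of
        {ok, Vsn} ->
            install(Vsn);
        {error, Reason} ->
            {error, {unpack_failed, Reason}}
    end.

install(Vsn) ->
    case release_handler:install_release(Vsn) of
        {ok, FromVsn, _Descr} ->
            %% Deliberately NOT calling make_permanent/1 here: until an
            %% operator does, a reboot falls back to the old release.
            {installed, FromVsn, Vsn};
        Other ->
            %% {error, Reason} or {continue_after_restart, _, _}.
            {not_installed, Other}
    end.
```

Keeping make_permanent/1 out of the wrapper preserves the safety net described above: the new version runs, but the decision to keep it stays a separate, explicit step.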

Where the five languages stand

Hot code loading is a property of the runtime, so anything running on the BEAM benefits - but how directly each language engages with it varies.

Erlang is the reference: every mechanism above is Erlang's own, and code_change/3, .appup, and relup are first-class parts of OTP.

Elixir uses the exact same machinery - code_change/3 is the very same :gen_server callback, with the identical (old_vsn, state, extra) arguments, and iex reloads modules just like the Erlang shell. The wrinkle is tooling: Elixir's old hot-upgrade tool, Distillery, could generate appup/relup files for you (and even sported a mix release --upgrade flag), but when releases moved into core Mix (mix release), that machinery did not come along - core Mix deliberately does not support hot upgrades. You can still do hot upgrades on an Elixir release, but you hand-write the .appup files and drive the release handler yourself - the BEAM capability is fully there; the convenience layer thinned out.

# Elixir's code_change is the same OTP callback (three-arg form):
@impl true
def code_change(_old_vsn, state, _extra), do: {:ok, state}

LFE is Erlang with Lisp syntax, so it inherits all of it directly - code_change is just another callback in your gen_server behaviour module:

(defun code_change (old-vsn state extra)
  (tuple 'ok state))

Gleam is the interesting case, because it is statically typed and hot upgrades are fundamentally untyped at the seams. All the usual Erlang reloading works - Gleam compiles down to BEAM modules like everyone else - but the Gleam team is explicit that you cannot type-check the upgrade itself: there is no way to know, at compile time, the type of state held by code that is already running. So across an upgrade boundary you get the ordinary Erlang level of safety, not Gleam's usual guarantees. Gleam's OTP libraries are also tuned for type safety over upgradability (they model state with records rather than the atom-module conventions OTP's upgrade path assumes), which makes hand-written state-migration callbacks more involved. Hot reloading works; it is simply not where Gleam invests, because its whole premise is "catch it at compile time," and a live upgrade is the one thing the compiler cannot see.

Luerl sits below this entirely. A Lua script under Luerl is not a BEAM module - it is data interpreted by the Luerl library inside some Erlang process. There is no Lua .beam file to swap and no two-version rule for Lua code. To "hot reload" a Lua script you simply re-read its source and re-compile it to a fresh Luerl chunk inside the host process; the host Erlang/Elixir module that embeds Luerl is, of course, hot-loadable in the normal way. So Luerl gets a different, arguably simpler form of live update - replace a string, re-evaluate - precisely because Lua never became BEAM bytecode in the first place.

-- A Luerl-hosted script: the host can drop in a new version of THIS source
-- and recompile it at any time. No BEAM code-loading rules apply to Lua itself;
-- the surrounding Erlang/Elixir module is what gets hot-loaded normally.
local function tariff(minutes) return minutes * 0.05 end
return tariff(120)

Takeaway

Hot code loading looks like magic the first time you patch a running server without dropping a connection, but it is the orderly consequence of a few deliberate decisions: keep two versions of every module, let processes stay in old code until they choose to cross via a fully qualified call, force migration with a bounded purge, and migrate state through a single well-defined callback. OTP stacks releases on top so an entire system can move forward - or roll back - as one transaction.

It exists because a generation of engineers building telephone switches refused to accept downtime as the price of progress. The AXD301's nine nines were not won by deploying carefully; they were won by never stopping at all. That requirement is fossilised in the BEAM, and every language on the VM inherits the ability to upgrade the floor while still standing on it.