Supervisors in Elixir

In my previous article we talked about the Open Telecom Platform (OTP) and, more specifically, the GenServer abstraction, which makes it simpler to work with server processes. GenServer, as you probably remember, is a behaviour: to use it, you need to define a callback module that satisfies the contract dictated by this behaviour.

What we have not discussed, however, is error handling. Any system may eventually experience errors, and it is important to take care of them properly. You can refer to the How to Handle Exceptions in Elixir article to learn about the try/rescue block, raise, and some other generic solutions. These solutions are very similar to the ones found in other popular programming languages, like JavaScript or Ruby.

Still, there is more to this topic. After all, Elixir is designed to build concurrent and fault-tolerant systems, so it has other goodies to offer. In this article we will talk about supervisors, which allow us to monitor processes and restart them after they terminate. Supervisors are not that complex, but pretty powerful. They can be easily tweaked, set up with various strategies on how to perform restarts, and used in supervision trees.

So today we will see supervisors in action!

Preparations

For demonstration purposes, we are going to use some sample code from my previous article about GenServer. This module is called CalcServer, and it allows us to perform various calculations and persist the result.

All right, so first, create a new project using the mix new calc_server command. Next, define the module, add use GenServer, and provide the start/1 shortcut:

# lib/calc_server.ex

defmodule CalcServer do
  use GenServer

  def start(initial_value) do
    GenServer.start(__MODULE__, initial_value, name: __MODULE__)
  end
end

Next, provide the init/1 callback that will be run as soon as the server is started. It takes an initial value and uses a guard clause to check whether it is a number. If not, the server stops instead of starting:

  def init(initial_value) when is_number(initial_value) do
    {:ok, initial_value}
  end

  def init(_) do
    {:stop, "The value must be a number!"}
  end

Now code interface functions to perform addition, division, multiplication, calculation of square root, and fetching the result (of course, you can add more mathematical operations as needed):

  def sqrt do
    GenServer.cast(__MODULE__, :sqrt)
  end

  def add(number) do
    GenServer.cast(__MODULE__, {:add, number})
  end

  def multiply(number) do
    GenServer.cast(__MODULE__, {:multiply, number})
  end

  def div(number) do
    GenServer.cast(__MODULE__, {:div, number})
  end

  def result do
    GenServer.call(__MODULE__, :result)
  end

Most of these functions are handled asynchronously (they are casts), meaning we are not waiting for them to complete. The last one, result/0, is synchronous (a call) because we actually want to wait for the result to arrive. Therefore, add the handle_call and handle_cast callbacks:

  def handle_call(:result, _, state) do
    {:reply, state, state}
  end

  def handle_cast(operation, state) do
    case operation do
      :sqrt -> {:noreply, :math.sqrt(state)}
      {:multiply, multiplier} -> {:noreply, state * multiplier}
      {:div, number} -> {:noreply, state / number}
      {:add, number} -> {:noreply, state + number}
      _ -> {:stop, "Not implemented", state}
    end
  end

Also, specify what to do if the server is terminated (we’re playing Captain Obvious here):

  def terminate(_reason, _state) do
    IO.puts "The server terminated"
  end

The program can now be compiled using iex -S mix and used in the following way:

CalcServer.start(6.1)
CalcServer.sqrt
CalcServer.multiply(2)
CalcServer.result |> IO.puts
# => 4.9396356140913875

The problem is that the server crashes when an error is raised. For example, try to divide by zero:

CalcServer.start(6.1)
CalcServer.div(0)
# [error] GenServer CalcServer terminating
# ** (ArithmeticError) bad argument in arithmetic expression
#    (calc_server) lib/calc_server.ex:44: CalcServer.handle_cast/2
#    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
#    (stdlib) gen_server.erl:667: :gen_server.handle_msg/5
#    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
# Last message: {:"$gen_cast", {:div, 0}}
# State: 6.1
CalcServer.result |> IO.puts
# ** (exit) exited in: GenServer.call(CalcServer, :result, 5000)
#    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
#    (elixir) lib/gen_server.ex:729: GenServer.call/3

So the process is terminated and cannot be used anymore. This is indeed bad, but we are going to fix this really soon!

Let It Crash

Every programming language has its idioms, and so does Elixir. When dealing with supervisors, one common approach is to let a process crash and then do something about it—probably, restart and keep going.

Many programming languages use only try and catch (or similar constructs), which is a more defensive style of programming. We are basically trying to anticipate all the possible problems and provide a way to overcome them.
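For comparison, here is a minimal sketch of that defensive style applied to the division that crashed our server. The DefensiveCalc module and its safe_div/2 function are hypothetical and are not part of CalcServer:

```elixir
defmodule DefensiveCalc do
  # Wraps the division in try/rescue and turns the exception into a tagged tuple.
  def safe_div(a, b) do
    try do
      {:ok, a / b}
    rescue
      ArithmeticError -> {:error, :bad_arithmetic}
    end
  end
end

IO.inspect(DefensiveCalc.safe_div(6, 2))
# => {:ok, 3.0}
IO.inspect(DefensiveCalc.safe_div(6, 0))
# => {:error, :bad_arithmetic}
```

This works, but every caller now has to handle both tuples, and every new failure mode needs its own rescue clause.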

Things are very different with supervisors: if a process crashes, it crashes. But the supervisor, just like a brave battle medic, is there to help a fallen process recover. This may sound a bit strange, but in reality it is a very sane approach. What’s more, you can even create supervision trees and this way isolate errors, preventing the whole application from crashing if one of its parts is experiencing problems.

Imagine driving a car: it is composed of various subsystems, and you cannot possibly check them every time. What you can do is fix a subsystem if it breaks (or, well, ask a car mechanic to do so) and continue your journey. Supervisors in Elixir do just that: they monitor your processes (referred to as child processes) and restart them as needed.

Creating a Supervisor

You can implement a supervisor using the corresponding behaviour module. It provides generic functions for error tracing and reporting.

First of all, you need to link your process to the supervisor. Linking is an important technique in its own right: when two processes are linked together and one of them terminates, the other receives an exit signal carrying the exit reason. If the linked process terminated abnormally (that is, crashed), its counterpart exits as well, unless it traps exits.

This can be demonstrated using the spawn/1 and spawn_link/1 functions:

spawn(fn ->
  IO.puts "hi from parent!"
  spawn_link(fn ->
    IO.puts "hi from child!"
  end)
end)

In this example, we are spawning two processes. The inner function is spawned and linked to the process running the outer function. Now, if you raise an error in one of them, the other will terminate as well:

spawn(fn ->
  IO.puts "hi from parent!"
  spawn_link(fn ->
    IO.puts "hi from child!"
    raise("oops.")
  end)
  :timer.sleep(2000)
  IO.puts "unreachable!"
end)
# [error] Process #PID<0.83.0> raised an exception
# ** (RuntimeError) oops.
#    gen.ex:5: anonymous fn/0 in :elixir_compiler_0.__FILE__/1
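The "notification" a linked process receives is an exit signal. As a minimal sketch (separate from the CalcServer code), a process that traps exits gets this signal as an ordinary message instead of dying, which is essentially how a supervisor learns that a child has terminated:

```elixir
# Trap exits so that exit signals from linked processes arrive as messages.
Process.flag(:trap_exit, true)

# Spawn a linked process that terminates abnormally.
child = spawn_link(fn -> exit(:boom) end)

# Instead of crashing, the current process receives an :EXIT message.
exit_reason =
  receive do
    {:EXIT, ^child, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(exit_reason)
# => :boom
```

Note that if you paste this into iex, the shell process keeps trapping exits until you call Process.flag(:trap_exit, false).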

So, to create a link when using GenServer, replace the start/1 function with start_link/1, which calls GenServer.start_link/3 instead of GenServer.start/3:

defmodule CalcServer do
  use GenServer

  def start_link(initial_value) do
    GenServer.start_link(__MODULE__, initial_value, name: __MODULE__)
  end
  # ...
end

It’s All About Behaviour

Now, of course, a supervisor should be created. Add a new lib/calc_supervisor.ex file with the following contents:

defmodule CalcSupervisor do
  use Supervisor

  def start_link do
    Supervisor.start_link(__MODULE__, nil)
  end

  def init(_) do
    supervise(
      [ worker(CalcServer, [0]) ],
      strategy: :one_for_one
    )
  end
end

There is a lot going on here, so let’s move at a slow pace.

start_link/0 wraps Supervisor.start_link/2, the function that starts the actual supervisor. Note that the corresponding child process will be started as well, so you won’t have to type CalcServer.start_link(5) anymore.

init/1 is a callback that must be present in order to employ the behaviour. The supervise function, basically, describes this supervisor. Inside you specify which child processes to supervise. We are, of course, specifying the CalcServer worker process. [0] is the list of arguments passed to the worker's start_link function, so it is the same as calling CalcServer.start_link(0) and sets the initial state to 0.
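The supervise and worker helpers come from the Supervisor.Spec module, which is deprecated in recent Elixir releases. A rough sketch of the same setup with Supervisor.init/2 and a {module, argument} child tuple (available since Elixir 1.5) might look as follows. SketchServer and SketchSupervisor are hypothetical stand-ins, trimmed down so that the example runs on its own:

```elixir
# Hypothetical stand-in for CalcServer, trimmed to the essentials.
defmodule SketchServer do
  use GenServer

  def start_link(initial_value) do
    GenServer.start_link(__MODULE__, initial_value, name: __MODULE__)
  end

  def result, do: GenServer.call(__MODULE__, :result)

  @impl true
  def init(initial_value), do: {:ok, initial_value}

  @impl true
  def handle_call(:result, _from, state), do: {:reply, state, state}
end

defmodule SketchSupervisor do
  use Supervisor

  def start_link(initial_value) do
    Supervisor.start_link(__MODULE__, initial_value)
  end

  @impl true
  def init(initial_value) do
    # {Module, arg} expands to Module.child_spec(arg), which `use GenServer`
    # defines so that the child is started with Module.start_link(arg).
    children = [{SketchServer, initial_value}]
    Supervisor.init(children, strategy: :one_for_one)
  end
end

{:ok, _sup} = SketchSupervisor.start_link(0)
IO.inspect(SketchServer.result())
# => 0
```

The structure is the same as in CalcSupervisor: only the way the child list and the options are expressed differs.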

:one_for_one is the name of the process restart strategy (resembling a famous Musketeers motto). This strategy dictates that when a child process terminates, a new one should be started. There are a handful of other strategies available:

:one_for_all (even more Musketeer-style!)—restart all the processes if one terminates.
:rest_for_one—child processes started after the terminated one are restarted. The terminated process is restarted as well.
:simple_one_for_one—similar to :one_for_one but requires only one child process to be present in the specification. Used when the supervised process should be dynamically started and stopped.
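To see the difference between strategies, here is a small sketch that starts two workers, kills the first one, and reports which workers ended up with a new pid. The StrategyDemo module and the :first and :second ids are hypothetical, and plain Agent processes stand in for real workers:

```elixir
defmodule StrategyDemo do
  # Starts two Agent workers under the given strategy, kills the first one,
  # and returns the ids of the workers whose pid changed.
  def run(strategy) do
    children = [
      Supervisor.child_spec({Agent, fn -> 0 end}, id: :first),
      Supervisor.child_spec({Agent, fn -> 0 end}, id: :second)
    ]

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    old = pids(sup)

    # Kill the first worker and wait until it is really gone.
    ref = Process.monitor(old.first)
    Process.exit(old.first, :kill)

    receive do
      {:DOWN, ^ref, :process, _pid, _reason} -> :ok
    end

    # Give the supervisor a moment to finish restarting.
    Process.sleep(100)
    new = pids(sup)
    Supervisor.stop(sup)

    for id <- [:first, :second], old[id] != new[id], do: id
  end

  # Builds a map of child id => pid from the supervisor's child list.
  defp pids(sup) do
    for {id, pid, _type, _modules} <- Supervisor.which_children(sup), into: %{}, do: {id, pid}
  end
end

IO.inspect(StrategyDemo.run(:one_for_one))
# => [:first]
IO.inspect(StrategyDemo.run(:one_for_all))
# => [:first, :second]
```

With :rest_for_one the outcome depends on start order: killing :first would also restart :second, while killing :second would leave :first untouched.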

So the overall idea is quite simple:

Firstly, a supervisor process is started. The init callback must return a specification explaining what processes to monitor and how to handle crashes.
The supervised child processes are started according to the specification.
After a child process crashes, the information is sent to the supervisor thanks to the established link. The supervisor then follows the restart strategy and performs the necessary actions.

Now you can run your program again and try to divide by zero:

CalcSupervisor.start_link
CalcServer.add(10)
CalcServer.result # => 10
CalcServer.div(0)
# => error!
CalcServer.result # => 0

So the state is lost, but the process is running even though an error has happened, which means that our supervisor is working fine!

This child process is quite bulletproof, and you literally will have a hard time killing it:

Process.whereis(CalcServer) |> Process.exit(:kill)
CalcServer.result
# => 0
# HAHAHA, I am immortal!

Note, however, that technically the process is not restarted. Rather, a new one is started in its place, so the process id won’t be the same. It basically means that you should give your processes names when starting them, so that the rest of the code can reach them by name rather than by a pid that may change.
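The following sketch makes this visible. A named Agent stands in for CalcServer (the :demo_counter name is hypothetical): after the worker is killed, the name points to a different pid holding a fresh state.

```elixir
# A named Agent stands in for CalcServer; :demo_counter is a hypothetical name.
children = [
  %{id: :demo_counter, start: {Agent, :start_link, [fn -> 0 end, [name: :demo_counter]]}}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

old_pid = Process.whereis(:demo_counter)
ref = Process.monitor(old_pid)
Process.exit(old_pid, :kill)

# Wait for the old process to die, then give the supervisor time to restart it.
receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
end

Process.sleep(100)

new_pid = Process.whereis(:demo_counter)
IO.inspect(new_pid != old_pid)
# => true
IO.inspect(Agent.get(:demo_counter, & &1))
# => 0
```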

The Application

You may find it somewhat tedious to start the supervisor manually every time. Luckily, it is quite easy to fix by using the Application module. In the simplest case, you will only need to make two changes.

Firstly, tweak the mix.exs file located in the root of your project:

  # ...
  def application do
    # Specify extra applications you'll use from Erlang/Elixir
    [
      extra_applications: [:logger],
      mod: {CalcServer, []} # <== add this line
    ]
  end

Next, add use Application to the module and provide the start/2 callback that will be run automatically when your app is started:

defmodule CalcServer do
  use Application
  use GenServer

  def start(_type, _args) do
    CalcSupervisor.start_link
  end
  # ...
end

Now after executing the iex -S mix command, your supervisor will be up and running right away!
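Mixing the Application and GenServer behaviours in one module works for a demo, but the application callback is often kept in a module of its own. Here is a minimal sketch of that layout, assuming a hypothetical CalcApp module (mix.exs would then contain mod: {CalcApp, []}) and using an Agent in place of CalcServer so the example is self-contained:

```elixir
# Hypothetical module name; mix.exs would point to it with mod: {CalcApp, []}.
defmodule CalcApp do
  use Application

  @impl true
  def start(_type, _args) do
    # An Agent stands in for CalcServer to keep the sketch self-contained.
    children = [
      %{id: :calc_state, start: {Agent, :start_link, [fn -> 0 end, [name: :calc_state]]}}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: CalcApp.Supervisor)
  end
end

# Mix normally invokes start/2 for you; it is called by hand here only to
# show that it returns {:ok, pid} for the top-level supervisor.
{:ok, sup} = CalcApp.start(:normal, [])
IO.inspect(is_pid(sup))
# => true
```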

Infinite Restarts?

You may wonder what is going to happen if the process constantly crashes and the corresponding supervisor restarts it again. Will this cycle run indefinitely? Well, actually, no. By default, only 3 restarts within 5 seconds are allowed—no more than that. If more restarts happen, the supervisor gives up and kills itself and all the child processes. Sounds horrifying, eh?

You can easily check it by quickly running the following line of code over and over again (or by doing it in a loop):

Process.whereis(CalcServer) |> Process.exit(:kill)
# ...
# ** (EXIT from #PID<0.117.0>) shutdown

There are two options that you can tweak in order to change this behaviour:

:max_restarts—how many restarts are allowed within the timeframe
:max_seconds—the actual timeframe

Both of these options should be passed to the supervise function inside the init callback:

  def init(_) do
    supervise(
      [ worker(CalcServer, [0]) ],
      max_restarts: 5,
      max_seconds: 6,
      strategy: :one_for_one
    )
  end
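To watch the limit being hit without taking the shell down, you can trap exits and monitor the supervisor. This is only a sketch: the :fragile worker is a hypothetical named Agent, and the limit is lowered to 2 restarts so that the third crash exceeds it.

```elixir
# Trap exits so the supervisor's shutdown does not take this process down too.
Process.flag(:trap_exit, true)

children = [
  %{id: :fragile, start: {Agent, :start_link, [fn -> 0 end, [name: :fragile]]}}
]

# Allow at most 2 restarts within 5 seconds.
{:ok, sup} =
  Supervisor.start_link(children, strategy: :one_for_one, max_restarts: 2, max_seconds: 5)

sup_ref = Process.monitor(sup)

# Kill the worker three times: the third crash exceeds the limit.
for _ <- 1..3 do
  pid = Process.whereis(:fragile)
  ref = Process.monitor(pid)
  Process.exit(pid, :kill)

  receive do
    {:DOWN, ^ref, :process, _pid, _reason} -> :ok
  end

  # Give the supervisor a moment to restart the worker (or to give up).
  Process.sleep(100)
end

shutdown_reason =
  receive do
    {:DOWN, ^sup_ref, :process, _pid, reason} -> reason
  after
    1_000 -> :still_alive
  end

IO.inspect(shutdown_reason)
# => :shutdown
```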

Conclusion

In this article, we've talked about Elixir supervisors, which allow us to monitor child processes and restart them as needed. We've seen how to link a GenServer to a supervisor, start the supervisor automatically with the application, and tweak various settings, including restart strategies and frequencies.

Hopefully, you found this article useful and interesting. Thank you for staying with me, and until next time!
