Understanding Lists

Lists are a basic data structure in functional languages, but they are also quite confusing for people accustomed to dealing with mutable arrays in imperative languages. The two things: arrays and lists are used in similar situations, they are however more different than it may seem. I’d like to address that explaining how lists work and implementing couple of functions form Enum module ourselves.

What is a list?

Elixir (and Erlang, and most functional languages) as a primary collection data type use what is called in the literature a linked list. The basic building block of lists are cells (or cons cells) 1. [1 | 2] is a cell with 1 as head and 2 as tail. A cons cell simply joins the two values together2.

What it’s good for? It starts to shine when you add another small thing - the empty list []. Now we can make a full definition for what is a list:

A list is either and an empty list or a cell, where head is any element, and tail is another list.

With that definition we can see how all of the following are lists:

[]
[1 | []]
[1 | [2 | []]]
[1 | [2 | [3 | []]]]
[1 | [2 | [3 | [4 | []]]]]

As you can see it can become tedious to write those. That’s why we have syntactic sugar - [1 | []] is exactly the same as [1] and [1 | [2 | []]] is the same as [1, 2]. With that syntax we can rewrite the above examples to something more consumable:

[]
[1]
[1, 2]
[1, 2, 3]
[1, 2, 3, 4]

As with everything in Elixir, we can use all the constructors for destructing via pattern matching. With that knowledge it should be quite easy to understand the following examples:

iex> [a, b] = [1, 2]
iex> a
1
iex> b
2

iex> [a | [b | []]] = [1, 2]
iex> a
1
iex> b
2

iex> [a | rest] = [1, 2, 3]
iex> a
1
iex> rest
[2, 3]

iex> [a, b | rest] = [1, 2, 3]
iex> a
1
iex> b
2
iex> rest
[3]

We can mix and match the two notations for lists, whatever looks more natural in a given situation.

Improper lists

As you’ve seen in the beginning we can put a non-list as the tail of a cons cell. A list that ends with a non-list element is often called an improper list. But be careful with those - majority of functions that operate on lists expect proper lists and can behave in strange and unpredictable ways when given improper lists.

Working with lists

Now that we’ve defined what a list is, we should see what we can do with them. Let’s look at some functions from the standard library and create our own versions.

You need to be aware that normal functions from Enum module are polymorphic - they work not only with lists, but other collections too. Here we’ll only implement those functions for lists.

In reality many of those functions are also implemented differently or even direcly in C in the Erlang Virtual Machine (and called BIFs - built-in functions) for performance reasons.

Kernel.hd/1 and Kernel.tl/1

Kernel.hd/1 returns the first element of the list, i.e. the head of the list. It should be pretty easy to guess how to go about it with the pattern matching examples from above:

def hd([head | _tail]), do: head

Kernel.tl/1 complements the hd/1 function by returning the tail of the list:

def tl([_head | tail]), do: tail

Kernel.length/1 and tail recursion

Another pretty obvious thing we might want to do with a list is to calculate it’s length. To do this we need to traverse the whole list, but only using pattern matching won’t be enough. So how do we do it? Yes, you guessed right - recursion.

def length(list), do: length(list, 0)

defp length([], count), do: count
defp length([_head | tail], count), do: length(tail, count + 1)

What are we doing here? We define two functions - the main one that states that we know for sure that a length of any list is at least 0, and that delegates to the helper function. The helper function in turn states, that a length of a non-empty list is one more than the length of the tail. We complement the definition by saying that the length of an empty list is whatever we calculated so far.

If you wander why we haven’t simply defined it as:

def length([]), do: 0
def length([_head | tail]), do: 1 + length(tail)

That’s because this implementation will need to remember somewhere, that we need to finally calculate something of a form 1 + 1 + 1 + .... You can probably see that this won’t be very effective in terms of memory. So instead of remembering somewhere (this somewhere will be the stack) that we need to add all those ones together, we do it right away in our counter we carry with us. This style of recursion, the one that carries some sort of accumulator with you, is called tail recursion.

Enum.map/2 and body recursion

That’s probably the function we use the most with lists. It’s also the most natural one - you have a list of something and you want to transform each element. How does it work? Besides the collection, Enum.map/2 takes a function that will be applied to every element of the list. Let’s see how we can do it in terms of recursion.

def map([], _mapper), do: []
def map([head | tail], mapper), do: [mapper.(head) | map(tail, mapper)]

The implementation is quite simple: in an empty list all elements are already transformed - we’re done; in a non empty list, we need to ask some questions. What do we want to return? A list. What should be the head of that list? The head of the original list, but transformed with our function. And what should be the tail of our list? It should be the tail of the original list, transformed by Enum.map/2. That way we handled all possible cases - an empty list and a non-empty list. That means we’re done. Wasn’t that hard, was it?

You may notice that this function is different from the length/1 - we don’t have any accumulator. This style of recursion, done directly inside the function, is called body recursion. Of course we would still have the problem of having to remember to put all those elements in front of the list - but through some clever optimizations the Erlang VM guarantees that body-recursive functions on lists have the same performance as tail-recursive ones.

If we actually wanted to make this function tail recursive, we would need to construct the resulting list backwards, and reverse it in the end:

def map(list, mapper), do: map(list, mapper, [])

def map([], mapper, acc), do: reverse(acc)
def map([head | tail], mapper, acc), do: map(tail, mapper, [mapper.(head) | acc])

If you’re curious about performance of those solutions, I analyzed them further in this PR to Elixir (it turned out that Elixir was using a third solution, that is actually slower then both analyzed here): Optimize Enum.map/2 for lists.

And that takes us cleanly to another function I’d like to explore, that we used in this tail-recursive map/2:

Enum.reverse/2 and Enum.reverse/1

How do we reverse a list? We’ll take a small detour and first define a two argument version of reverse - reverse(to_reverse, reversed) function. The first argument is a list we need to reverse, and the second one is an already reversed list we’ll put at the tail of the list we actually want to reverse. You’ll see in a minute why we’re going with that one first.

def reverse([], reversed), do: reversed
def reverse([head | tail], reversed), do: reverse(tail, [head | reversed])

Once again it’s not very complicated. When the list to reverse is empty, we simply return the already reversed list. What in the case when it’s not empty? The only thing we need to do is to move the head from our list to the already reversed list and continue the recursion. By moving all the elements one by one we’ll put them in the reverse order they were in the beginning.

What should we do in reverse/1 when we don’t have the already reversed list? Let’s use a minimal list we already know is reversed - the empty list, and call our two-argument reversing function.

def reverse(to_reverse), do: reverse(to_reverse, [])

Kernel.++/2

The last function we’ll look at is the operator ++/2 that appends two lists. This will help you realize why it is discouraged to append to an accumulator in a loop.

def append([], rest), do: rest
def append([head | tail], rest), do: [head | append(tail, rest)]

Appending to an empty list is easy - we just return the second list. What should we do in case of a non-empty list? Let’s once again ask some questions. What do we want to return? A list. What should be the head of that list? It should be the head of the original list. What should be the tail of that list? It should be the second list appended to the tail of the original list.

I think this makes it clear why appending to the list is not recommended in the loops - the function needs to traverse the list we append to each time. It’s much better to prepend to the list, and reverse it at the end.

Conclusion

Lists are a basic data structure we’re dealing with constantly when doing functional programming, and it’s really important to understand how they work. It’s a really simple idea of cons cells that allows us all those great features. I feel like with functional programming there are extremely many cases of simple ideas turned into great and extremely helpful tools. It’s all about finding the minimal abstraction that will still work.

Further reading

If you’d like to explore more functions working on lists, there are couple of posts from the excellent “Core Elixir” series by Augie De Blieck Jr. that touch such functions:

  1. The name cons dates back to the first lisp language, where the function for __cons__tructing new cells was named cons. This name is still widely used in languages inspired by lisp.

  2. As the cons cell simply joins the two values together - you may wonder what’s the difference between that and a two element tuple. The truth is there’s conceptually none and we use two different data structures because they can be differently optimized, but we could create a list as a series of tuples like: {1 , {2, {3, {}}}}, and operate on it in the same way we operate on regular lists.