Functional programming
Another way to approach the structure of a program is Functional Programming.
One could be tempted to think that Functional Programming is the paradigm that uses functions, but that would be a mistake. As we have already seen, functions are the key concept in the Procedural approach, and functions, although they are called methods, are a key component of the classes used in the Object Oriented approach. It is called Functional Programming because in this approach functions are used in new ways: they can, for instance, be passed as arguments to other functions.
Functional Programming is normally explained in very theoretical terms. That kind of theoretical explanation might have its uses, but you won’t find it here. In my day to day work I use functional programming concepts all the time, but always motivated by a practical need. The functional approach is especially useful when dealing with big data sets, and that is why I use it.
Let’s do an example. Imagine that we want to sum all the numbers from 1 to 9. We could do it using a procedural approach with the following code.
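# a minimal sketch of the procedural version: accumulate the total in a variable
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
total = 0
for number in numbers:
    total += number
print(total)  # prints 45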
Or we could solve it using a functional approach.
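# a minimal functional sketch: reduce combines the stream of numbers with operator.add
from functools import reduce
import operator

print(reduce(operator.add, range(1, 10)))  # prints 45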
Iterators
Iterators and iterables
To sum some numbers, for instance the first 1000 integers, we don’t need to create a list to hold them; we only need to get those numbers one after the other, one at a time. In Python that’s the difference between a sequence and an iterator.
Lists or strings are examples of sequences: objects that:
- allow random access, so we can get every item by its position in the sequence
- have a length, so they can return the number of items that they contain
We could use those properties to sum a list of numbers.
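# a sketch using the sequence properties: length plus access by position
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
total = 0
for index in range(len(numbers)):
    total += numbers[index]
print(total)  # prints 45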
But we could accomplish the same goal with much less. We don’t need the items to be ordered, or to know the total number of items, or even to have all the numbers in memory at the same time; we just need access to each element, one at a time, that’s all.
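# the same sum without a list in memory: each number is produced on demand
total = 0
for number in range(1, 10):
    total += number
print(total)  # prints 45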
This time we have used range. range objects don’t hold the numbers in memory; they create them one at a time, when the iteration, the for loop, demands it. In this way, we have managed to add those numbers using much less memory. What we have done is to approach the problem using the iterator interface. An iterator is an object that:
- represents a stream of data, yielding its items one at a time
- raises a StopIteration exception when there are no more items to yield
Let’s sum the first ten numbers in a way that reflects the iterator protocol.
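# the iterator protocol by hand: keep calling next until StopIteration
iterator = iter(range(1, 11))
total = 0
while True:
    try:
        number = next(iterator)
    except StopIteration:
        break  # no more items, we are done
    total += number
print(total)  # prints 55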
We get each number, one at a time, by calling next on the iterator until the iterator raises a StopIteration, and then we break the loop; we are done. That is in fact how the for loop works internally: it keeps calling next on the given object until it gets a StopIteration.
Well, in fact, the for loop can make use of iterators, but also of other kinds of objects. Lists, for instance, are not iterators. If you try to call next on a list you’ll get an error: “TypeError: ‘list’ object is not an iterator”
Lists are iterable, and we can create iterators from iterables by using the iter built-in function.
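numbers = [1, 2, 3]
# next(numbers) would raise TypeError: 'list' object is not an iterator
iterator = iter(numbers)  # but we can create an iterator from the iterable
print(next(iterator))  # prints 1
print(next(iterator))  # prints 2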
This is, more or less, what for does for us and, in that way, it can deal with both iterables and iterators.
Python is fond of iterables and iterators and has many functions that use them. Many common Python objects, although you might think they are sequences, are in fact iterators or lazy iterables: the objects returned by the dict keys, values and items methods, range objects, or opened text files. Python also has the itertools module and the map, filter and reduce functions.
To sum the first numbers, we could even skip the for loop entirely.
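# sum accepts any iterable, so no explicit loop is needed
print(sum(range(1, 11)))  # prints 55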
Now that we don’t need to hold these numbers in memory, we can use much less memory and sum the numbers faster.
If we tried to do the same by creating the whole list first, we would need much more time and memory.
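# a sketch of the eager version: a ten-million item list is built in memory first
print(sum(list(range(10_000_000))))
# sum(range(10_000_000)) gives the same result holding only one number at a time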
Lazy and eager
range creates the numbers in a lazy way: it only creates a number when it is required. The opposite approach, the one used by the list, which creates all items ahead of time, is known as eager. Lazy approaches delay creation or computation as much as possible, while eager approaches create or compute as soon as they are instantiated or called.
Generators
Generators are a very convenient Python mechanism that allows us to create our own iterators. Generators look deceptively similar to a standard function, with the only difference being that they use the yield statement instead of return. However, generators behave quite differently than standard functions.
While the function only returns one item and stops, the generator is capable of yielding one item after another until it runs out of items. The call to the generator does not return any item at all, but a generator iterator object that, as any other iterator, will be able to yield items every time we ask for them using next.
While each function call is independent (the function restarts its execution from the very beginning every time it is called), generator iterators resume their execution every time we ask them to yield a new item.
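A minimal sketch of the difference:

def return_one():
    return 1  # a standard function: one result and it is done

def yield_numbers():
    yield 1  # execution pauses here and resumes on the next call to next
    yield 2

print(return_one())  # prints 1
generator_iterator = yield_numbers()  # no item yet, just a generator iterator
print(next(generator_iterator))  # prints 1
print(next(generator_iterator))  # prints 2, resuming where it paused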
Generators are a very convenient way of creating our own iterators. For instance, we could create a generator that yields the first n numbers (a functionality very similar to the one provided by the built-in range).
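# a sketch of a range-like generator (first_n is just an illustrative name)
def first_n(n):
    number = 0
    while number < n:
        yield number
        number += 1

print(sum(first_n(10)))  # prints 45, like sum(range(10))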
Or we could create a generator of endless random numbers.
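import random

# an endless stream of random numbers; the caller decides when to stop
def random_numbers():
    while True:
        yield random.random()

generator = random_numbers()
print(next(generator))  # a different random number every time we ask
print(next(generator))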
Iterators are consumed
Iterators are great tools, but you have to be aware of their behaviour; in particular, they are consumed.
Lists, on the contrary, are not consumed; we can go through them as many times as we want to.
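For instance:

iterator = iter([1, 2, 3])
print(sum(iterator))  # prints 6
print(sum(iterator))  # prints 0, the iterator is already consumed

numbers = [1, 2, 3]
print(sum(numbers))  # prints 6
print(sum(numbers))  # prints 6 again, the list is not consumed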
Think of iterators as a stream of items that is seen only once. You should take this behaviour into account when creating algorithms that work with iterators. Although once you get used to them that won’t be a problem for many algorithms, it can be challenging at the beginning to think about how to approach certain problems using iterators.
itertools
If you plan to use iterators at all you should take a look at the itertools module, and if you want more functionality, you could check out the more-itertools package.
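A couple of examples of what itertools offers (a small sample, there is much more):

import itertools

# count yields an endless stream of numbers; islice takes only the first few
print(list(itertools.islice(itertools.count(start=1), 5)))  # prints [1, 2, 3, 4, 5]

# chain joins several iterables into one stream
print(list(itertools.chain([1, 2], [3, 4])))  # prints [1, 2, 3, 4]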
Pure and immutable
Pure functions
A pure function is a function that:
- given the same arguments, it will return the same values
- does not have any side effect
Pure functions have some advantages over functions that have side effects, that is, functions that change some external state. As we discussed in the procedural approach, side effects break modularity and make code more difficult to reason about and maintain.
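A minimal sketch of the difference (pure_add and impure_add are illustrative names):

def pure_add(a, b):
    return a + b  # depends only on its arguments and changes nothing outside

total = 0

def impure_add(a):
    global total
    total += a  # side effect: it mutates external state
    return total

print(pure_add(2, 2))  # always 4
print(impure_add(2))  # prints 2
print(impure_add(2))  # prints 4: same argument, different result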
Immutable objects
Changing the state of the objects passed to a function breaks modularity. So programming is much easier (code is easier to debug, for instance) when we don’t mutate the state of the objects passed to the function.
It is nice to have functions that do not change the objects that we lend to them, but the caller does not always have control over the function implementation, and one way to enforce this is to pass only immutable objects to the function.
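For instance, passing a tuple instead of a list (sneaky_sum is just an illustrative name):

def sneaky_sum(numbers):
    numbers.append(0)  # mutates the caller's object if it is a list
    return sum(numbers)

frozen_numbers = (1, 2, 3)  # a tuple is immutable
# sneaky_sum(frozen_numbers) would fail: tuples have no append method
print(sum(frozen_numbers))  # prints 6, and the caller's data is safe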
Cache and parallelization
When using pure functions and immutable objects we can very easily apply tricks to improve the performance of our code.
We could, for instance, cache the result of time consuming computations. Since the function, given the same arguments, always returns the same values, and the arguments are not changed, we can return a precomputed instance of the result.
The lru_cache decorator will cache the result of the function, so if you call the function several times with the same arguments, it will return the cached result and save you some possibly expensive computations. Caching can, in some cases, improve code performance by a lot.
from functools import lru_cache

@lru_cache
def sum_numbers(a, b):
    print(f"Summing {a} + {b}")  # only printed when the result is actually computed
    return a + b

print(sum_numbers(1, 1))  # computed: prints "Summing 1 + 1", then 2
print(sum_numbers(1, 1))  # cached: prints only 2
print(sum_numbers(2, 2))  # computed: prints "Summing 2 + 2", then 4
print(sum_numbers(1, 1))  # cached
print(sum_numbers(2, 2))  # cached
Of course, lru_cache will not behave properly with non-pure functions or with mutable arguments.
Pure functions and immutable objects are also very convenient when dealing with large amounts of processing because it is much easier to parallelize programs built with them.
Functions are first-class citizens
In Python, like almost anything else, functions are just objects, and they can be used as arguments to other functions or can be returned by other functions. The functools module has many utilities that take functions as arguments and that, in most cases, return functions.
For instance, the lru_cache decorator, like any other decorator, is in fact a function that takes a function as an argument and returns another function.
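A minimal sketch of a decorator, just to show the idea (shout and greet are illustrative names):

def shout(func):
    # takes a function and returns a new function that wraps it
    def wrapper(*args, **kwargs):
        return str(func(*args, **kwargs)).upper()
    return wrapper

@shout
def greet(name):
    return f"hello {name}"

print(greet("ada"))  # prints HELLO ADA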
partial
partial is another great utility of the functools module; it returns a new function which has some of its arguments already filled in (frozen).
from functools import partial

def sum_numbers(a, b):
    return a + b

add_2 = partial(sum_numbers, b=2)  # add_2 is a new function, created by partial
print(add_2(1))  # prints 3
print(add_2(7))  # prints 9
lambda
lambdas are small anonymous functions, functions with no name. They are usually passed as arguments to other functions.
One very common use case is to use them as the key argument of the sorted function.
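# sort words by their length, using a lambda as the sorting key
words = ["pear", "fig", "banana"]
print(sorted(words, key=lambda word: len(word)))  # prints ['fig', 'pear', 'banana']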
Comprehensions
All these functional utilities can be combined with the different kinds of comprehensions.
Comprehensions provide a very concise way to create lists, dicts, sets and generators by computing their values.
The general idea is: for each element in an iterable, if the element matches a condition, do something with the element and store the result in a list/dict/set.
The if condition is optional, we can create comprehensions without it.
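For instance:

# squares of the even numbers: the if filters, the expression transforms
print([number ** 2 for number in range(10) if number % 2 == 0])  # prints [0, 4, 16, 36, 64]

# the same comprehension without the optional if condition
print([number ** 2 for number in range(10)])  # prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]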
The syntax for the dict and set comprehensions is very similar.
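# a dict comprehension: curly braces with key: value pairs
print({number: number ** 2 for number in range(4)})  # prints {0: 0, 1: 1, 2: 4, 3: 9}

# a set comprehension: curly braces with single values
print({number % 3 for number in range(10)})  # prints {0, 1, 2}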
And, finally, if we want, instead of creating a list eagerly, we can create a lazy generator by changing the square brackets to parentheses.
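# a generator expression: the squares are computed lazily, one at a time
lazy_squares = (number ** 2 for number in range(10))
print(sum(lazy_squares))  # prints 285, no list is ever built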
filter, map, reduce
Another very useful and common way of applying the functional approach is to use filter, map and reduce. All these utilities are designed to work with iterators, with streams of data, and the map/reduce pattern was conceived as a tool to process big quantities of data.
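For instance, we could sum the squares of the even numbers (a minimal sketch):

from functools import reduce

even_numbers = filter(lambda number: number % 2 == 0, range(10))  # keep the even ones
squares = map(lambda number: number ** 2, even_numbers)  # transform each item
total = reduce(lambda accumulated, number: accumulated + number, squares)  # combine the stream
print(total)  # prints 120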
This is a somewhat convoluted way of writing the following algorithm:
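total = 0
for number in range(10):
    if number % 2 == 0:  # the filter step
        total += number ** 2  # the map and reduce steps in one line
print(total)  # prints 120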
The map-reduce approach splits the iteration functionality into the map and reduce functions; we have transformed the for, a loop statement, into a series of functions.
This approach offers the possibility of doing the computations in parallel because each call to map and reduce is independent, as long as we are using pure functions (no side effects) and the functions do not modify the objects passed to them (something that we could enforce by using immutable objects).
Python, for instance, has a multiprocessing map that allows us to apply map in parallel.
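A minimal sketch using multiprocessing.Pool (note that the worker must be a top-level pure function):

from multiprocessing import Pool

def square(number):
    return number ** 2  # a pure function, safe to run in parallel

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        squares = pool.map(square, range(10))  # like map, but across several processes
    print(sum(squares))  # prints 285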
If we can transform the computation that we are interested in carrying out into a series of pure functions, we can use this trick to do massive, time and memory efficient computations in parallel. This is one of the main practical uses of the functional approach.