Documenting correctness in Haskell
In the Functional Programming 1 course we teach at Uppsala University, there is some confusion about how to document code. Every year, there is confusion. I suppose part of the problem is that many university students are not experienced enough to have seen large messy code bases. They have not yet worked on a single code base for more than 3 years continuously. Therefore, they have not yet understood that everyone needs documentation to help themselves sane.
For some reason, functional programmers seem extra careful about correctness, so we put extra emphasis on the documentation in relation to correctness. This blog post is a summary, recap and motivation for some of the documentation guidelines we use in the course. I don’t even remotely claim that the documentation in this post is enough to be a best practice, but I hope it provides some examples and motivation.
Function specifications
Functions are the core of our functional programming. Here, we put in the most effort.
The function name, argument names and the return
The first line in the function specification is just the function name and identifiers for the arguments
Since implementation of the function can have several patterns that introduce different code branches with different symbols in them, we define common names for arguments in the function specification.
In the below example, we match on a datatype that have different symbols in different matches.
If the vehicle is a train, we will in implementation introduce the symbol cars
but that is not available in the other pattern match.
Therefore, we make sure to introduce all symbols we need in the function specification, on this very first line.
The point of the RETURNS
clause is self explanatory most of the time.
Preconditions
PRE
, for preconditions. Similar to the idea in Design by Cotnract. We use the term in two principal ways
First, if the function operates on values of a type a
, it is possible that the function is not total, and the domain of definition is smaller. The preconditions define the reduced domain of definition. Example with non-total function:
Secondly, in stateful code (e.g. monadic), the preconditions may also describe valid states before the function is called.
Below comes an example with state monad with a counter. The count must always be nonnegative. This function decrease the counter, but does not check if the current count is 0
. So the caller must check this.
Side effects
When a function has side effects, we document them in the SIDE EFFECT
clause. All things that make a function non-pure are side effects.
The major example is working with the IO monad.
Nonterminating code is also a side effect.
Any non-pure function can break our equational reasoning, so we want to know about them. Something as innocent as reading from stdin can cause a program to not be reliable.
Variants
Recursion variants are not as well described on the web as loop variants are. See e.g. wikipedia for a discussion. The idea is the same, nonetheless.
The variant is an expression of the function arguments (remember, you named these on the first line of the function specification!), it takes values in a set with total order, such that all descending sequences in that set terminate. Typically, you pick the nonnegative integers as the set of values for the variant. The variant must also be strictly decreasing on every recursive call. Given this, there is a theorem saying that every recursive function with a variant will terminate.
One of the simplest cases, working with only integers
Next example is on lists. Here, it is not certain that the base case is when the variant is 0. But it might be.
We might omit the variant if we do structural recursion, and can assume the datastructure to be finite.
If we have an internal helper, we might omit the function specification, but we must still include a variant statement.
Datatype Representations
Every new type, and data
in specific, should have a description of what it is, how it is represented, and whether the data type has any invariants.
The use of an invariant here is similar to a Class Invariant in Design by Contract, but has just slightly different semantics. It is a restriction on the data type, stating some criterion that all valid values of the type must fullfil.
In the best of worlds, we would not need invariants, since the type system should make invalid values unrepresentable. In reality, we often have good tradeoffs where a type is more efficient or simple to code with even if it can hold invalid values.
In the example below, we back a Set with a list, and to make the type not the worst possible, we require that the backing list never contains duplicates. This is documented on the datatype representation, and all functions returning values of the type must respect this.
I extended this example in this gist