
Yabu Incremental Typesystem

Design of the typesystem for fast incremental typechecking

2026-04-23 · draft · #mathematics #programming-language #learning
Abstract

Yabu is all about interactivity, so the typechecking needs to be fast and furious. It also needs to deal with arbitrary staging state, and we need to ensure that we merge the maximal staging subset that typechecks.

Bidirectional Type Checking

In Yabu we use bidirectional typechecking; you can check this post for a reference that I used to learn how it works.

This section is under construction; I’ll add the full Yabu typing judgements in the near future.

Fast Incremental Type Checking with Staging Area

You can read a little more about the staging area in the main Yabu hub article, but the basic idea is that a Yabu image always typechecks, so any re-definition of a function that doesn’t typecheck is left in a staging area.

This staging area can grow arbitrarily big; you may have 10-20 entries waiting to be merged. These may be mutually recursive functions that are interconnected and will have to be merged all at once, or independent parts of the program that are just broken at the moment.

The problem with fast incremental type checking with a staging area is that any re-definition of a symbol should merge the maximal subset of the staging area it can, because you don’t want things sitting in the staging area that could already be merged.

We want the following property after resolving the staging area:

$\nexists\, T \subseteq S,\; T \neq \emptyset : \mathrm{Mergeable}(T)$

There is no $T$ that is a non-empty subset of $S$ (the staging area) such that the subset is mergeable, where mergeable is defined as (given $C$ is the committed image)

$\mathrm{Mergeable}(T) \iff \mathrm{Typecheck}(C \cup T)$

Naive Approach

At the start I embraced the exponentials: I simply went over every possible subset, ordered by decreasing size, and checked whether that subset is mergeable.

With small staging areas that’s fine, but as soon as you get into the 10-items-in-the-staging-area territory you are cooked ($2^{10} = 1024$), and once you reach 20 you are in for some pain ($2^{20} \approx 1{,}000{,}000$).
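A minimal sketch of this naive search, assuming a `typechecks` predicate that stands in for the real typechecker (all names here are hypothetical, not Yabu’s actual API):

```python
from itertools import combinations

def naive_merge(committed, staging, typechecks):
    """Exhaustively try every subset of the staging area, largest first,
    and return the first subset that typechecks against the committed
    image.  `typechecks` takes a set of definitions and returns a bool."""
    items = sorted(staging)
    for size in range(len(items), 0, -1):
        for subset in combinations(items, size):
            if typechecks(committed | set(subset)):
                return set(subset)  # maximal mergeable subset found
    return set()                    # nothing in staging can be merged
```

Because it walks subsets by decreasing size, the first hit is guaranteed maximal, but the loop visits up to $2^n$ candidates in the worst case.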

We needed something a bit smarter than that.

Pruning

The first realization is that, since we are adding or updating definitions one by one, only the parts of the staging area that are in some way connected to the symbol we are redefining matter. The new definition cannot affect the staging status of definitions that are not connected to it. This, while important, is not enough: most of the time the staging area will consist of whatever you are currently building, which tends to be interconnected. So for the most usual usage pattern, this alone barely reduces the number of subsets we need to check.
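A sketch of this pruning step. The graph representation is hypothetical: a map from each staged symbol to the symbols it references, treated as undirected since a dependency in either direction connects two definitions:

```python
def relevant_staging(changed, staging_deps):
    """Restrict the staging area to entries connected (in either
    dependency direction) to the symbol being redefined."""
    # Build an undirected adjacency view of the dependency graph.
    adj = {s: set() for s in staging_deps}
    for s, deps in staging_deps.items():
        for d in deps:
            if d in adj:
                adj[s].add(d)
                adj[d].add(s)
    # Flood-fill from the changed symbol over the undirected graph.
    seen, frontier = {changed}, [changed]
    while frontier:
        node = frontier.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen & set(staging_deps)
```

Everything outside the returned set keeps its staging status untouched and never enters the subset search.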

Connected Components

The second realization is that there will be disconnected components in the graph.

If symbol A depends on B, which in turn depends on C, and we also have D -> E -> C, then the set (A, B) is not connected to (D, E) except through C. If we are redefining C we can test the (A, B, C) and (D, E, C) sections of the graph independently, because they are independent from each other.

This reduces the cost of checking from $2^4 = 16$ to $2 \times 2^2 = 8$. With small staging graphs this is not that big of a win, but for a staging area of 10 symbols where each connected component has size 2 we go from $1024$ to $5 \times 2^2 = 20$. Strong gains!
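A sketch of splitting the staged symbols into components around the changed root, again assuming a hypothetical undirected adjacency map. Each returned component includes the root, since the root participates in every merge attempt:

```python
def components_around_root(root, adj):
    """Split the staged symbols into the connected components that
    remain after removing the root; each component can then be
    subset-checked together with the root, independently of the rest."""
    remaining = set(adj) - {root}
    components = []
    while remaining:
        start = remaining.pop()
        comp, frontier = {start}, [start]
        while frontier:
            node = frontier.pop()
            for nxt in adj.get(node, ()):
                if nxt != root and nxt in remaining:
                    remaining.discard(nxt)
                    comp.add(nxt)
                    frontier.append(nxt)
        components.append(comp | {root})
    return components
```

On the A, B, C, D, E example above this yields the two sections (A, B, C) and (C, D, E), which are then searched separately.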

Reverse-search like thingy

The third and last realization is that we don’t need to test all subsets of each connected component of the graph. We only need to check subsets that are connected to the root (the node we’ve changed).

If you have just changed $C$, and you have $C \sim B$ and $B \sim A$, then checking the subset $(A, B)$ is nonsensical because it has already been checked before (the only thing that changed is $C$), and checking $(C, A)$ is also nonsensical because they don’t affect each other except through $B$.

So after modifying $C$, the only subsets you need to check are those reachable from the root ($C$, the changed symbol). This significantly reduces the number of subsets we need to check as well.

For example, for a chain of 4 symbols $A \sim B \sim C \sim D$, after the connected-components optimization we would need to check $2^4 = 16$ subsets; with this optimization we reduce that to $4$. Of course, a chain like this is the perfect case: each branch in the chain increases the number of subsets quite a bit. But by combining the connected components and the reverse search together, the reverse-search degenerate cases stop being relevant.

So you start from the changed symbol and, for each connected component you find, run a kind of reverse-search algorithm to find all subsets of the graph that are connected to the root.
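A sketch of that enumeration, one standard way to list connected subgraphs containing a fixed root exactly once (the adjacency map is, as before, a hypothetical representation):

```python
def connected_subsets(root, adj):
    """Enumerate every subset that contains `root` and is connected in
    the undirected dependency graph `adj`; each subset appears once."""
    results = []

    def rec(subset, excluded):
        results.append(subset)
        # Frontier: neighbours of the subset not yet taken or banned.
        nbrs = set()
        for node in subset:
            nbrs |= set(adj.get(node, ()))
        nbrs -= subset | excluded
        banned = set(excluded)
        for node in sorted(nbrs):
            rec(subset | {node}, set(banned))
            banned.add(node)  # ban for later siblings to avoid duplicates

    rec(frozenset([root]), set())
    return results
```

For the chain $A \sim B \sim C \sim D$ rooted at $A$ this visits exactly the 4 subsets from the example above, instead of all 16.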

If you know a better name for this, please let me know!

Conclusion

With this the problem becomes a lot more tractable, and the amount of work needed for fairly large staging areas becomes manageable. There may be some more optimizations left in there, but I don’t think I need them for now.

To test this I used the very stupid but obviously correct naive approach as an oracle, together with some property-based testing.
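As an illustration of that testing pattern, here is a sketch that checks a compact copy of the root-connected enumeration against the obviously correct filter-every-subset oracle on random graphs. Everything here is hypothetical scaffolding, not Yabu’s actual test suite:

```python
import random
from itertools import combinations

def enum_connected(root, adj):
    """Compact copy of the root-connected enumeration (the optimized side)."""
    out = []
    def rec(subset, excluded):
        out.append(subset)
        nbrs = set().union(*(adj[n] for n in subset)) - subset - excluded
        banned = set(excluded)
        for node in sorted(nbrs):
            rec(subset | {node}, set(banned))
            banned.add(node)
    rec(frozenset([root]), set())
    return set(out)

def oracle(root, adj):
    """Naive oracle: filter all subsets for root-connectivity."""
    found = set()
    for size in range(1, len(adj) + 1):
        for subset in combinations(sorted(adj), size):
            s = set(subset)
            if root not in s:
                continue
            seen, frontier = {root}, [root]  # BFS restricted to s
            while frontier:
                n = frontier.pop()
                for m in adj[n]:
                    if m in s and m not in seen:
                        seen.add(m)
                        frontier.append(m)
            if seen == s:
                found.add(frozenset(s))
    return found

random.seed(1)
for _ in range(50):
    n = random.randint(2, 7)
    nodes = [f"s{i}" for i in range(n)]
    adj = {v: set() for v in nodes}
    for _ in range(random.randint(1, 2 * n)):
        a, b = random.sample(nodes, 2)
        adj[a].add(b)
        adj[b].add(a)
    root = random.choice(nodes)
    assert enum_connected(root, adj) == oracle(root, adj)
```

The oracle is exponential but trivially correct, which is exactly what you want from the reference side of a property-based test.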

— marce coll, 2026-04-23