Vectorization of builders
Currently, builders store the Hamiltonian as a Python dictionary of adjacency lists (interleaved with values), as described in a comment in builder.py:
# For a given `tail` site, H[tail] is a list alternately storing
# heads and values. (The heads occupy even locations followed by the
# values at odd locations.) Each pair of entries thus describes a single
# directed edge of the graph.
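As a toy illustration of this scheme (with strings standing in for sites; `set_hopping` is a hypothetical helper, not actual Kwant API):

```python
# Toy model of the current storage scheme; string "sites" stand in for
# real sites and set_hopping is a hypothetical helper, not Kwant API.
H = {}

def set_hopping(H, tail, head, value):
    """Set the directed edge tail -> head; O(degree) pure-Python work."""
    lst = H.setdefault(tail, [])
    for i in range(0, len(lst), 2):   # heads live at even indices...
        if lst[i] == head:
            lst[i + 1] = value        # ...their values at the following odd ones
            return
    lst.extend((head, value))

set_hopping(H, 'a', 'b', 1.0)
set_hopping(H, 'a', 'c', 2.0)
set_hopping(H, 'a', 'b', 3.0)        # overwrites the earlier value

heads = H['a'][0::2]                  # -> ['b', 'c']
values = H['a'][1::2]                 # -> [3.0, 2.0]
```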
Dealing with a system of N sites requires O(N) Python operations. With the vectorization of low-level systems, this can become a bottleneck, or at least a nuisance.
Since we need to update the builder finalization code anyway (so that builders with more than a single translational symmetry can be finalized), why not try to solve both problems in one go and vectorize builders?
Fortunately, in terms of API (almost) nothing needs to change. Builders already offer a vectorized interface in the sense that many elements can be set with a single Python operation: a builder can be indexed with a sequence of sites. A natural evolution of this would be to also allow site arrays and "tuple arrays", i.e. 2-tuples of site arrays. This would make it possible to vectorize shape, neighbors, and fill, and thereby most old user scripts, without changing anything on the user side.
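A minimal sketch of what such a "tuple array" setter could look like, assuming tags are rows of integer arrays. `VectorBuilder` is purely hypothetical, and the per-row Python loop merely stands in for what an actual implementation would vectorize:

```python
import numpy as np

# Hypothetical sketch: a builder-like object whose __setitem__ accepts a
# 2-tuple of tag arrays, so that many hoppings are set in one call.
class VectorBuilder:
    def __init__(self):
        self.hoppings = {}              # (tail tag, head tag) -> value

    def __setitem__(self, key, value):
        tails, heads = key              # each an (n, dim) integer array
        tails = np.atleast_2d(tails)
        heads = np.atleast_2d(heads)
        # A scalar value is broadcast over all the hoppings being set.
        values = np.broadcast_to(np.asarray(value), (len(tails),))
        for t, h, v in zip(tails, heads, values):
            self.hoppings[tuple(t), tuple(h)] = v

syst = VectorBuilder()
tails = np.array([[0, 0], [1, 0], [2, 0]])
heads = tails + [1, 0]                  # nearest neighbors along x
syst[tails, heads] = -1.0               # three hoppings in one operation
```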
The more complicated part is reworking the data structures used internally by builders to efficiently support vectorization. Let’s try to analyze in a language-agnostic way what data structure would be best suited:
- We should optimize for a small number of site families, just like we already do in the new low-level format. This means that instead of storing a single graph indexed by (pairs of) sites, we could store separate graphs for each pair of families. The vertices of these graphs would be indexed by tags. I think that we could restrict valid tags to sequences of numbers without losing anything. Together, this means that what we need is an efficient data structure to store graphs over vertices that are identified by sequences of numbers.
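A rough sketch of this layout, with hypothetical names: one edge list per pair of families, and vertex tags stored as rows of 2-D integer arrays rather than as Python objects:

```python
import numpy as np

# Hypothetical sketch: one edge list per (tail family, head family) pair.
# Vertices are identified by integer tag sequences, kept as 2-D arrays.
class FamilyPairGraphs:
    def __init__(self):
        self.graphs = {}    # (fam_a, fam_b) -> (tail tag array, head tag array)

    def add_edges(self, fam_pair, tails, heads):
        tails, heads = np.atleast_2d(tails), np.atleast_2d(heads)
        if fam_pair in self.graphs:
            old_tails, old_heads = self.graphs[fam_pair]
            tails = np.concatenate([old_tails, tails])
            heads = np.concatenate([old_heads, heads])
        self.graphs[fam_pair] = (tails, heads)

g = FamilyPairGraphs()
# Two edges within family 'A', one edge from family 'A' to family 'B'.
g.add_edges(('A', 'A'), np.array([[0, 0], [1, 0]]), np.array([[1, 0], [2, 0]]))
g.add_edges(('A', 'B'), np.array([[0, 0]]), np.array([[0, 1]]))
```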
- We could also optimize for a small number of distinct values, since this is by far the most common usage. Ideally, the use case of having a different constant value for each site/hopping would still be viable, but I am not even sure whether such systems were ever made.
A hash table of adjacency lists still seems the right data structure, only that it should be vectorized, and it would be indexed by sequences of numbers (all of the same length) rather than by generic Python objects.
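One way such a vectorized hash table could look from Python, assuming keys are rows of an integer array (all names here are hypothetical, and the inner per-row loop is exactly what a real implementation would push down into C):

```python
import numpy as np

# Hypothetical sketch of a "vectorized dictionary": keys are rows of an
# (n, dim) integer array, values live in a parallel list.  Many keys are
# set or looked up in a single call; a C implementation would replace the
# per-row Python loop below.
class VectorDict:
    def __init__(self, dim, dtype=np.intp):
        self._index = {}                 # row bytes -> slot in self.values
        self.keys = np.empty((0, dim), dtype=dtype)
        self.values = []

    def set_many(self, keys, values):
        keys = np.ascontiguousarray(keys, dtype=self.keys.dtype)
        for row, value in zip(keys, values):
            slot = self._index.setdefault(row.tobytes(), len(self.values))
            if slot == len(self.values):      # row was a new key
                self.keys = np.concatenate([self.keys, row[None]])
                self.values.append(value)
            else:                             # existing key: overwrite
                self.values[slot] = value

    def get_many(self, keys):
        keys = np.ascontiguousarray(keys, dtype=self.keys.dtype)
        return [self.values[self._index[row.tobytes()]] for row in keys]

d = VectorDict(dim=2)
d.set_many(np.array([[0, 0], [1, 0], [0, 1]]), [1.0, 2.0, 3.0])
d.set_many(np.array([[1, 0]]), [5.0])         # overwrite one entry
```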
I can’t see how to implement something like this efficiently in pure Python, but writing it as a C extension is not a very big deal - much simpler than, say, writing tinyarray. Perhaps, actually, a vectorized dictionary with keys that are sequences of numbers would be a useful addition to tinyarray?