Mutable tinyarrays?
Tinyarrays are currently immutable. This is necessary so that they can be used as dictionary keys, which is the original application of tinyarray (with Kwant). This issue discusses whether and how the tinyarray module could be extended to support mutable arrays as well. It is motivated by requests from tinyarray users.
Purpose
-
With today’s tinyarray it is possible to write code like
a /= sqrt(dot(a, a)) b += a
The above results in the allocation of two new tinyarray objects. With mutable tinyarrays, both allocations could be avoided.
-
Individual elements of an array could be overwritten, allowing to modify elements of a matrix or looping over a vector:
v = zeros(2, int) for v[0] in range(10): for v[1] in range(10): ...
-
What else?
Types
Like tuples and strings, tinyarrays are immutable by design. Immutability is a prerequisite to hashability, which in turn is necessary for tinyarrays to be usable as dictionary keys. Otherwise, there is technically nothing that prevents tinyarrays from being mutated. After all the underlying memory is writeable.
For some applications mutable tinyarrays would be useful. I can see two ways of achieving that without breaking current applications and backwards compatibility:
-
Introduce new tinyarray types that correspond to today’s
ndarray_int
,ndarray_float
,ndarray_complex
, only that they are mutable and not hashable. That could be probably done without much code duplication by adding a second template parameter to theArray
class. That parameter would specify whether the underlying arrays are mutable or not. -
Extend the current types by introducing a flag that determines for each tinyarray instance whether it is mutable or not. Technically, C-API equivalents of magic methods like
__hash__
and__setitem__
would always be defined, but item assignment would only work for mutable, and hashing only for immutable variants.
The advantage of the first solution is that it would not require accommodating an additional flag in the tinyarray object. Unless we find some trick, an additional boolean flag would increase the object size by 8 bytes (on 64 bit platforms) due to alignment. See the section “data layout” below.
One advantage of the second solution is that the implementation will be most likely simpler. Another advantage would be that while it should not be possible to turn an existing immutable tinyarray object into a mutable one, the other way (=freezing) would be possible and potentially useful.
API
Independently of the implementation, the API could mostly stay the way it is. One would have to come up with a way of constructing mutable arrays. I see the following possibilites:
-
Add an option to
tinyarray.array
,tinyarray.zeros
, etc. Since for performance reasons tinyarray constructors only accept positional arguments, adding such a “mutable” option would mean it has to go behind the existing “dtype” argument. Thus specifying “mutable” would require specifying “dtype” as well. More importantly, a positional argument for mutability would be incompatible with NumPy. -
Prefix the mutable constructors, e.g.
tinyarray.marray
, etc. Because of backwards compatibility, immutable would have to remain the default (=without prefix). -
Put the mutable constructors into a separate namespace, e.g
tinyarray.mut.array
. This seems attractive, especially if the submodule could have the same contents. Then a user interested in mutable arrays could sayimport tinyarray.mut as ta
. On the other hand,mut.array
means two dict lookups instead of on. Finally, realising a Python package with submodules as a single file (which is what we want for simplicity and code reuse) is quirky. -
Initially mark all newly constructed tinyarrays as mutable and freeze them upon hashing or on-demand (
freeze
method). This solution may sound a bit too clever, but it would have the advantage of keeping the API simple. This possibility is discussed for the case of lists in the Python design FAQ, and rejected because the list elements could be mutable themselves. This, however, is not a problem for tinyarrays! Immutable objects have other uses beyond dictionary keys, for example as default values for function arguments. This occasional need would be covered by afreeze
method - mutable as default otherwise seems useful. Detail:tinyarray.array
would always make a copy and thus return a new mutable array.tinyarray.asarray
would avoid a copy when possible and possibly return an immutable array.
One would also have to decide what the result should be of an operation involving two mutable arrays, and one that is mutable and one that is not. I would tend to make these immutable, except perhaps in the case when both arguments are mutable.
Views and mutability
Some operations on NumPy arrays create view objects that create a new array object that references (a part of) the data of the original array. Examples of such operations are indexing with slices or the transpose
method. It is possible to modify the original array’s contents through a view. Tinyarray so far does not support indexing with slices, but it has transpose
.
Views introduce a layer of indirection and require the introduction of strides (without strides, transpose
cannot be implemented using views). Tinyarray has so far avoided introducing both which has the benefit of keeping data structures and memory access pattern simple and compact. Both are important for tinyarray’s speed which is of crucial importance.
Anyway, the biggest benefit of views is with large arrays. A PyVarObject
object without any payload consists of five size_t
, that is 40 bytes on 64 bit architectures. Hence, allocating and creating a new array involves a certain ovehead. The data of a small array is typically comparable in size, so that views seem not worth it for typical (i.e. tiny) arrays.
So far, the absence of views does not break compatibility with NumPy: a copy of an immutable array behaves the same as a copy. However, if mutable tinyarrays were to be introduced, we would face the following choice:
-
Accept that tinyarrays behave differently from NumPy arrays when views are involved. This is not unheard of: if
l
is a list,l[:]
creates a view butl[1:]
creates a copy. And tinyarray already deviates from NumPy in other ways: the numerical comparison operators (==
,<
, etc.) work like for tuples: otherwise tinyarrays would not be usable as keys for dictionaries. -
Implement views for tinyarrays. This would have benefits for somewhat larger arrays, but also costs in terms of storage and runtime.
Data layout
Tinyarrays are PyVarObjects
and use a trick to reduce the storage size: the obligatory ob_size
field is used in a clever way:
-
When
ob_size >= 0
, it stores the length of a 1-dimensional array. -
When
ob_size < -1
, its absolute value stores the number of dimensions (ndim
). The shape is stored separately. -
The meaning of
ob_size == -1
isndim = 0
.
This way of storing things has two advantages:
-
In the common case of
ndim == 1
no shape needs to be stored. This saves onesize_t
, i.e. 8 bytes. -
When the method
Array_base::ndim_shape
(see filearray.hh
) returns the number of dimensions and shape, it is able to return a pointer to the shape even in the special casendim == 1
: a pointer to theob_size
field is returned, which corresponds to the shape!
Unfortunately, in the case where we want to store a mutability bit for each array, I do not see how to store it in ob_size
while keeping the above optimization. The problem is not the loss of one bit for ndim
or size, but that there is no way that ob_size
can correspond to the shape for both mutable and immutable arrays.
A practical solution would be to use ob_size
to store ndim
as well as mutability. The shape would then have to be stored separately, also in the case of one-dimensional arrays. That’s probably negligible.
Preliminary synthesis
I’d like to hear what others think is the best way to balance the above constraints.
Currently, I tend to:
-
Types: Do not introduce new types. Store mutability for each new array instance.
-
API: Make each new array initially mutable. Freeze upon first call of
__hash__
or when requested manually through thefreeze
method. -
Views: Do not introduce views. Accept that semantics differs from NumPy (as it does now).
-
Data layout: Accept the additional
size_t
for one-dimensional arrays.
Do you see serious problems caused by the above choices?