Storing and querying the arxiv data
The current implementation: Feather
In the current implementation the arxiv data is stored in one Feather file per category per year. This was the initial choice because these files load into memory very quickly. The downside is that the full file has to be loaded even if you only want to retrieve a single record. Right now that is not a big issue: a 23 MB file with 19k rows loads in about 35 ms.
Option 1: Memory
Another option is to keep the Feather files on disk but load them all into pandas DataFrames in memory when the program starts. This means quite a bit of memory usage, but we should be able to keep it under 1 GB: I found that ~200k records take up about 300 MB in memory as pandas DataFrames.
Option 2: SQLite
We could store all the data in one big SQLite file. With an index on the record ID, this should be faster for retrieving a single row, since SQLite only reads the pages it needs instead of a whole file.
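A sketch of Option 2 using the standard-library sqlite3 module (the schema and column names here are illustrative). The PRIMARY KEY gives SQLite an index, so a point lookup does not load the rest of the table:

```python
import sqlite3

# In practice this would be a file like "arxiv.db"; ":memory:" keeps
# the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers ("
    " arxiv_id TEXT PRIMARY KEY,"  # primary key implies an index
    " category TEXT,"
    " year INTEGER,"
    " title TEXT)"
)

def get_paper(conn, arxiv_id):
    # Indexed point lookup: only the relevant B-tree pages are read.
    cur = conn.execute(
        "SELECT arxiv_id, category, year, title FROM papers WHERE arxiv_id = ?",
        (arxiv_id,),
    )
    return cur.fetchone()
```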
Option 3: Other format
If there is a format similar to Feather, but with support for indices so that we can read a single record from it at random, that would be the ideal solution IMO. Feather itself supports indices, but I don't think the Python/pandas implementation does...