PDF processing using Wand uses a lot of memory
When we fall back to Wand for PDF processing, a lot of memory is used: even a small PDF can easily take 2 GB of RAM.
This issue arises because of multiple factors:
- The whole PDF is flattened at once and loaded in memory as a PIL image. A 24-bit RGB image of an A4 page at 300 dpi takes up ~26 MB of RAM.
- Wand's `Image.sequence` creates a copy of the image.
- Wand needs to load `MagickWand` and `ghostscript` resources.
- Possible memory leaks
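As a sanity check on the ~26 MB figure for a single A4 page, here is a minimal sketch of the arithmetic. The helper name `raw_rgb_bytes` is made up for illustration, and the calculation assumes an uncompressed buffer with 3 bytes per pixel and no alpha channel.

```python
MM_PER_INCH = 25.4


def raw_rgb_bytes(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Size in bytes of an uncompressed raster of the given physical size."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel


# A4 is 210 mm x 297 mm, which is 2480 x 3508 pixels at 300 dpi.
a4_bytes = raw_rgb_bytes(210, 297, 300)
print(a4_bytes)  # 26099520 bytes, roughly 26 MB per page
```

That is per page and per copy, so a flattened multi-page document plus any copies made along the way multiplies this number quickly.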
To fix this I'm going to apply the following:
- Split the PDF into pages before feeding it to Wand (1, 2)
- Use `with` statements everywhere (4)
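A rough sketch of what these two changes could look like together. This is not the actual implementation: `page_specs` and `render_pages` are hypothetical names, the caller is assumed to already know the page count, and PNG output is chosen only for illustration. It relies on ImageMagick's `file.pdf[N]` read syntax to load a single page instead of the whole document.

```python
from typing import Iterator


def page_specs(pdf_path: str, page_count: int) -> Iterator[str]:
    """Yield ImageMagick read specs such as 'doc.pdf[0]', one per page."""
    for index in range(page_count):
        yield f"{pdf_path}[{index}]"


def render_pages(pdf_path: str, page_count: int,
                 resolution: int = 300) -> Iterator[bytes]:
    """Render one page at a time and yield it as PNG bytes."""
    # Imported lazily so MagickWand is only loaded when the fallback runs.
    from wand.image import Image

    for spec in page_specs(pdf_path, page_count):
        # The with block destroys the native image before the next page
        # is read, so at most one rendered page is held by Wand at a time.
        with Image(filename=spec, resolution=resolution) as page:
            page.format = "png"
            blob = page.make_blob()
        yield blob
```

Because `render_pages` is a generator, the consumer decides how many pages are alive at once: processing each blob before requesting the next keeps peak memory close to that of a single page.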
We need to be extra careful with memory leaks, because in the future we may call `extract_image` from the main process instead of from a separate `celery` worker that can be restarted after handling x tasks.