Test scan recognition pipeline
The scan processing pipeline currently has very limited test coverage: `tests/test_scans.py` only tests whether PNG files containing various datamatrices can be read correctly.
In addition to unit tests for the individual parts of the pipeline (e.g. `extract_images`, `guess_dpi`, and the other functions mentioned in `tests/test_scans.py`), we also need integration tests that exercise the whole pipeline. Here, the "whole pipeline" means that the pages of the scans are correctly normalized (rotation and offsets detected and corrected) and correctly identified (datamatrices can be read). Note that even if the datamatrices can be read, the rotation and offsets may still not have been properly accounted for, which has consequences for later parts of the pipeline (student recognition).
The first kind of integration test generates PDFs in a controlled way:
1. start from an unmarked PDF
2. run zesje PDF generation to apply cornermarkers and datamatrices to the pages of the unmarked PDF
3. apply some fuzzing transformations (this simulates printing/scanning the marked PDF)
4. run the scan processing pipeline and ensure that all the pages are correctly processed and identified
We can imagine providing various kinds of unmarked PDFs for step 1 (e.g. a blank page, a page with some writing, etc.), applying various kinds of fuzzing for step 3 (the simplest being no transformation at all), and testing all combinations thereof (`pytest` makes this easy). Being able to identify which combinations are hard for our system to handle is important: because we control the content of the original PDF and the exact fuzzing applied, it will be easier to pinpoint what sorts of things will be difficult to deal with in the real world.
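The combinatorial test grid described above could be sketched roughly as follows. All file names, fuzzer labels, and the body of `run_case` are placeholders, not the real zesje API:

```python
import itertools

# Hypothetical inputs -- placeholder names, not actual files in tests/data/.
UNMARKED_PDFS = ["blank.pdf", "single_page_exam.pdf"]
FUZZERS = ["none", "rotate_1deg", "rescale_99pct", "gaussian_noise"]

# Every (pdf, fuzzer) combination.  With pytest this grid would be fed to
#   @pytest.mark.parametrize("pdf_name,fuzzer", CASES)
CASES = list(itertools.product(UNMARKED_PDFS, FUZZERS))


def run_case(pdf_name, fuzzer):
    """Outline of one integration-test case (the calls are placeholders):

    1. mark pdf_name with zesje's PDF generation (cornermarkers + datamatrices)
    2. apply fuzzer to the rendered pages, simulating printing/scanning
    3. run the scan pipeline and assert that every page is identified and
       that the detected rotation/scaling/offset matches the known ground truth
    """
    raise NotImplementedError("placeholder for the real pipeline calls")
```

Parametrizing over the product of inputs and fuzzers means a failure report immediately names the exact (document, transformation) pair the pipeline could not handle.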
The second kind of integration test replaces steps 1–3 with reading pre-prepared scans. This serves as a final check that the system works under actual real-world conditions. We will need to supply the scans, as well as the metadata (exam name, copy number, page number) encoded in the datamatrices of each page, to check that we have read them correctly (we could also elide this last check and assume that if a datamatrix was read at all, it was read correctly).
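A minimal sketch of how scans could be paired with their expected metadata, assuming a sidecar-file convention (same base name, `.yml` extension); `process_pipeline` and `load_metadata` stand in for the real zesje functions:

```python
from pathlib import Path


def metadata_path(scan_path):
    """Sidecar metadata file for a pre-prepared scan.

    Convention assumed here (not fixed anywhere yet): same base name with a
    .yml extension, e.g. exam1.pdf -> exam1.yml.  The YAML would hold the
    expected (exam name, copy number, page number) for each page.
    """
    return Path(scan_path).with_suffix(".yml")


def check_scan(scan_path, process_pipeline, load_metadata):
    """Sketch: run the pipeline on a scan and compare each page's decoded
    datamatrix against the expected metadata.  Both callables are
    placeholders for the actual implementation."""
    expected = load_metadata(metadata_path(scan_path))
    for decoded, meta in zip(process_pipeline(scan_path), expected):
        assert decoded == meta
```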
Tasks

- Create a test function in `tests/test_scans.py` that implements the first kind of integration test. It should take the filename of an unmarked PDF and the fuzzing transform to apply as parameters (check out the existing tests to see how to parametrize tests). In addition to testing that the datamatrices can be read at all, we also need to ensure that the detected rotation/scaling/offset is correct (so that the student ID and multiple-choice questions can be accurately read).
- Generate a few unmarked PDFs (e.g. blank page, example exam) to use with the test function.
- Generate a few fuzzing primitives (e.g. rotate by some angle, rescale by some factor, apply some noise).
- Make a few combinations of fuzzing primitives to use with the test function (we don't need all possible combinations, but a few would be good).
- Implement the second kind of test. The test function should take a filename, read a scan from that file (and perhaps metadata from a file with the same name but a `.yml` extension), and test the scanning pipeline against it.
- Generate some mock exams to use with the second kind of test (IIRC there is already a mock dataset somewhere...)
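To illustrate the fuzzing primitives and their combinations, they could be small composable closures. The nearest-neighbour, pure-Python implementations below (images as lists of rows of grayscale values) are dependency-free stand-ins; a real suite would likely use Pillow on the rendered pages:

```python
import math
import random


def rotate(angle_deg):
    """Rotate about the image centre, filling uncovered pixels with white."""
    def fuzz(img):
        h, w = len(img), len(img[0])
        cy, cx = (h - 1) / 2, (w - 1) / 2
        a = math.radians(angle_deg)
        out = [[255] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                # inverse-map each output pixel back into the source image
                sx = cx + (x - cx) * math.cos(a) + (y - cy) * math.sin(a)
                sy = cy - (x - cx) * math.sin(a) + (y - cy) * math.cos(a)
                si, sj = round(sy), round(sx)
                if 0 <= si < h and 0 <= sj < w:
                    out[y][x] = img[si][sj]
        return out
    return fuzz


def rescale(factor):
    """Resize by a factor using nearest-neighbour sampling."""
    def fuzz(img):
        h, w = len(img), len(img[0])
        nh, nw = max(1, round(h * factor)), max(1, round(w * factor))
        return [[img[min(h - 1, int(y / factor))][min(w - 1, int(x / factor))]
                 for x in range(nw)] for y in range(nh)]
    return fuzz


def noise(sigma, seed=0):
    """Add clamped Gaussian noise; seeded so failing cases are reproducible."""
    def fuzz(img):
        rng = random.Random(seed)
        return [[min(255, max(0, v + round(rng.gauss(0, sigma))))
                 for v in row] for row in img]
    return fuzz


def compose(*fuzzers):
    """Chain primitives into one transform, e.g. compose(rotate(1), noise(5))."""
    def fuzz(img):
        for f in fuzzers:
            img = f(img)
        return img
    return fuzz
```

Because each primitive returns a function of the same shape, the "combinations of fuzzing primitives" task reduces to listing a few `compose(...)` values alongside the single primitives in the parametrization.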