Certificate Scanning Service

A proof-of-concept FastAPI microservice that uses AI vision models (via OpenRouter) to extract structured data from certificate documents — without OCR.

Why not OCR?

Traditional OCR produces erratic results on documents with tabular layouts, merged cells, and complex formatting — exactly the kind of structure found in industrial certificates. This service takes a different approach:

  1. Find — An image-generation model receives a full page scan and redraws it with a blue rectangle painted over the target section. The rectangle is detected with HSV colour thresholding to obtain pixel coordinates.
  2. Read — The page is cropped to those coordinates and sent to a lightweight multimodal model that extracts key/value data as structured JSON.

Because the finder and reader are separate steps, temporal coherence can be exploited: when the document layout doesn't change between batches, the expensive finder step can be skipped entirely and previous scan coordinates reused. The reader step alone is fast and cheap, especially when the input crop is small and clear.

Architecture

PDF upload
  │
  ▼
pdf_to_image  ──►  page PNGs (at configured DPI)
  │
  ▼
┌─────────────────────────────────────────────┐
│  For each reading zone defined in the       │
│  document class:                            │
│                                             │
│  1. Letterbox page to nearest supported     │
│     aspect ratio (black bars, no distort)   │
│  2. Send to FINDER model (image → image)    │
│  3. Detect blue rectangle (HSV threshold)   │
│  4. Map coordinates back to original space  │
│  5. Crop page to found region               │
│  6. Send crop to READER model (image → JSON)│
│  7. Validate output with dynamic Pydantic   │
│     model built from the zone definition    │
└─────────────────────────────────────────────┘
  │
  ▼
Structured JSON response
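Steps 1 and 4 of the loop above can be sketched with Pillow. This is an assumed implementation; the actual helpers live in app/pipeline/util.py and may differ in name and signature:

```python
from PIL import Image

def letterbox(page: Image.Image, target_ratio: float) -> Image.Image:
    """Pad with centred black bars to reach target_ratio (w/h), no distortion."""
    w, h = page.size
    if w / h < target_ratio:
        new_w, new_h = round(h * target_ratio), h   # pad left/right
    else:
        new_w, new_h = w, round(w / target_ratio)   # pad top/bottom
    canvas = Image.new("RGB", (new_w, new_h), (0, 0, 0))
    canvas.paste(page, ((new_w - w) // 2, (new_h - h) // 2))
    return canvas

def unletterbox_box(box, orig_size, padded_size):
    """Map an (x, y, w, h) box from letterboxed space back to the original page."""
    x, y, w, h = box
    dx = (padded_size[0] - orig_size[0]) // 2
    dy = (padded_size[1] - orig_size[1]) // 2
    return (x - dx, y - dy, w, h)
```

Because the padding is symmetric, mapping back is a simple offset subtraction; any upscale factor applied before the finder would need an additional divide.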

Project structure

app/
  main.py            FastAPI app, endpoints, startup seeding
  config.py          Settings loaded from config.json + .env
  models.py          Pydantic models (API schemas, document class definition)
  database.py        SQLAlchemy async engine + ORM tables
  pipeline/
    scan.py          Scanning pipeline (finder + reader orchestration)
    pdf_to_image.py  PDF → PIL images via pdf2image
    util.py          Aspect ratio / resolution helpers

config.json          App configuration (models, DPI, upscale factors)
document_classes.json  Seed data defining document types and reading zones
run.py               Uvicorn entrypoint

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a .env file with your OpenRouter API key:

OPENROUTER_API_KEY=sk-or-...

Configuration

config.json controls the AI models and image processing parameters:

Key                      Description
finder_model             OpenRouter model for the finder step (image → image)
reader_model             OpenRouter model for the reader step (image → JSON)
finder_upscale_factor    Scale factor applied to images before sending to the finder
reader_upscale_factor    Scale factor applied to crops before sending to the reader
default_dpi              Resolution for PDF-to-image conversion
db_path                  Path to the SQLite database file
debug_output_dir         Directory for debug image output
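A minimal config.json illustrating these keys (model names and values are placeholders, not the repo's actual configuration):

```json
{
  "finder_model": "some-provider/image-edit-model",
  "reader_model": "some-provider/small-vision-model",
  "finder_upscale_factor": 2.0,
  "reader_upscale_factor": 2.0,
  "default_dpi": 200,
  "db_path": "app.db",
  "debug_output_dir": "debug"
}
```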

Running

python run.py

The server starts on http://127.0.0.1:8000. On startup it creates the SQLite database and seeds document classes from document_classes.json.

API

GET /health

Returns {"status": "ok"}.

GET /document-classes

Lists all registered document classes and their reading zone definitions.

POST /read?document-class=<name>&debug=<bool>

Upload a PDF and scan it against a document class.

  • document-class (required) — name of a registered document class (e.g. arcelor-gent)
  • debug (optional, default false) — when true, saves intermediate images (letterboxed input, finder output, reader crop) to debug/scan-<id>/
  • Body — multipart file upload (file field)

Returns structured JSON with extracted data per page and zone.

Document classes

A document class defines what to look for and read on each page. See document_classes.json for the seed format. Each class contains:

  • reading_zones — sections to locate on the page, each with:
    • finder_prompt — natural language description of the target section
    • read_lines — keys to extract, each with a prompt_snippet describing how to find the value and an expected type
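Step 7 of the pipeline builds a validation model from such a zone definition at runtime. A sketch using Pydantic v2's create_model; the field names ("key", "type") and type map are assumptions about the seed format, not confirmed from document_classes.json:

```python
from pydantic import create_model

# Hypothetical mapping from a read line's declared type to a Python type.
TYPE_MAP = {"string": str, "number": float, "integer": int}

def model_for_zone(zone_name: str, read_lines: list[dict]):
    """Build a Pydantic model with one required field per read line."""
    fields = {line["key"]: (TYPE_MAP[line["type"]], ...) for line in read_lines}
    return create_model(f"Zone_{zone_name}", **fields)
```

Validating the reader model's JSON against such a model catches missing keys and coerces string values into the expected types before they reach the response.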

Roadmap

This is a proof of concept. Planned improvements:

  • Temporal coherence — skip the finder when layout hasn't changed; re-find only when reader confidence drops
  • Confidence scoring — track reading reliability across scans
  • Batch processing — handle multi-document uploads
  • Pluggable models — easy switching between AI providers and models