Certificate Scanning Service

A proof-of-concept FastAPI microservice that uses AI vision models (via OpenRouter) to extract structured data from certificate documents — without OCR.

Why not OCR?

Traditional OCR produces erratic results on documents with tabular layouts, merged cells, and complex formatting — exactly the kind of structure found in industrial certificates. This service takes a different approach:

  1. Find — An image-generation model receives a full page scan and redraws it with a blue rectangle painted over the target section. The rectangle is detected with HSV colour thresholding to obtain pixel coordinates.
  2. Read — The page is cropped to those coordinates and sent to a lightweight multimodal model that extracts key/value data as structured JSON.

Because the finder and reader are separate steps, temporal coherence can be exploited: when the document layout doesn't change between batches, the expensive finder step can be skipped entirely and previous scan coordinates reused. The reader step alone is fast and cheap, especially when the input crop is small and clear.

Architecture

PDF upload
  │
  ▼
pdf_to_image  ──►  page PNGs (at configured DPI)
  │
  ▼
┌─────────────────────────────────────────────┐
│  For each reading zone defined in the       │
│  document class:                            │
│                                             │
│  1. Letterbox page to nearest supported     │
│     aspect ratio (black bars, no distort)   │
│  2. Send to FINDER model (image → image)    │
│  3. Detect blue rectangle (HSV threshold)   │
│  4. Map coordinates back to original space  │
│  5. Crop page to found region               │
│  6. Send crop to READER model (image → JSON)│
│  7. Validate output with dynamic Pydantic   │
│     model built from the zone definition    │
└─────────────────────────────────────────────┘
  │
  ▼
Structured JSON response
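Steps 1 and 4 of the loop above can be sketched with Pillow. This is an assumed implementation; the actual helpers live in app/pipeline/util.py and may differ in name and signature:

```python
from PIL import Image

def letterbox(page: Image.Image, target_ratio: float) -> Image.Image:
    """Pad with centred black bars to reach target_ratio (w/h), no distortion."""
    w, h = page.size
    if w / h < target_ratio:
        new_w, new_h = round(h * target_ratio), h   # pad left/right
    else:
        new_w, new_h = w, round(w / target_ratio)   # pad top/bottom
    canvas = Image.new("RGB", (new_w, new_h), (0, 0, 0))
    canvas.paste(page, ((new_w - w) // 2, (new_h - h) // 2))
    return canvas

def unletterbox_box(box, orig_size, padded_size):
    """Map an (x, y, w, h) box from letterboxed space back to the original page."""
    x, y, w, h = box
    dx = (padded_size[0] - orig_size[0]) // 2
    dy = (padded_size[1] - orig_size[1]) // 2
    return (x - dx, y - dy, w, h)
```

Because the padding is symmetric, mapping back is a simple offset subtraction; any upscale factor applied before the finder would need an additional divide.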

Project structure

app/
  main.py            FastAPI app, endpoints, startup seeding
  config.py          Settings loaded from config.json + .env
  models.py          Pydantic models (API schemas, document class definition)
  database.py        SQLAlchemy async engine + ORM tables
  pipeline/
    scan.py          Scanning pipeline (finder + reader orchestration)
    pdf_to_image.py  PDF → PIL images via pdf2image
    util.py          Aspect ratio / resolution helpers

config.json          App configuration (models, DPI, upscale factors)
document_classes.json  Seed data defining document types and reading zones
run.py               Uvicorn entrypoint

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a .env file with your OpenRouter API key:

OPENROUTER_API_KEY=sk-or-...

Configuration

config.json controls the AI models and image processing parameters:

Key                      Description
finder_model             OpenRouter model for the finder step (image → image)
reader_model             OpenRouter model for the reader step (image → JSON)
finder_upscale_factor    Scale factor applied to images before sending to the finder
reader_upscale_factor    Scale factor applied to crops before sending to the reader
default_dpi              Resolution for PDF-to-image conversion
db_path                  Path to the SQLite database file
debug_output_dir         Directory for debug image output
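A minimal config.json illustrating these keys (model names and values are placeholders, not the repo's actual configuration):

```json
{
  "finder_model": "some-provider/image-edit-model",
  "reader_model": "some-provider/small-vision-model",
  "finder_upscale_factor": 2.0,
  "reader_upscale_factor": 2.0,
  "default_dpi": 200,
  "db_path": "app.db",
  "debug_output_dir": "debug"
}
```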

Running

python run.py

The server starts on http://127.0.0.1:8000. On startup it creates the SQLite database and seeds document classes from document_classes.json.

API

GET /health

Returns {"status": "ok"}.

GET /document-classes

Lists all registered document classes and their reading zone definitions.

POST /read?document-class=<name>&debug=<bool>

Upload a PDF and scan it against a document class.

  • document-class (required) — name of a registered document class (e.g. arcelor-gent)
  • debug (optional, default false) — when true, saves intermediate images (letterboxed input, finder output, reader crop) to debug/scan-<id>/
  • Body — multipart file upload (file field)

Returns structured JSON with extracted data per page and zone.

Document classes

A document class defines what to look for and read on each page. See document_classes.json for the seed format. Each class contains:

  • reading_zones — sections to locate on the page, each with:
    • finder_prompt — natural language description of the target section
    • read_lines — keys to extract, each with a prompt_snippet describing how to find the value and an expected type
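Step 7 of the pipeline builds a validation model from such a zone definition at runtime. A sketch using Pydantic v2's create_model; the field names ("key", "type") and type map are assumptions about the seed format, not confirmed from document_classes.json:

```python
from pydantic import create_model

# Hypothetical mapping from a read line's declared type to a Python type.
TYPE_MAP = {"string": str, "number": float, "integer": int}

def model_for_zone(zone_name: str, read_lines: list[dict]):
    """Build a Pydantic model with one required field per read line."""
    fields = {line["key"]: (TYPE_MAP[line["type"]], ...) for line in read_lines}
    return create_model(f"Zone_{zone_name}", **fields)
```

Validating the reader model's JSON against such a model catches missing keys and coerces string values into the expected types before they reach the response.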

Roadmap

This is a proof of concept. Planned improvements:

  • Temporal coherence — skip the finder when layout hasn't changed; re-find only when reader confidence drops
  • Confidence scoring — track reading reliability across scans
  • Batch processing — handle multi-document uploads
  • Pluggable models — easy switching between AI providers and models