
GOJA NLP extraction

Local LLM evaluation for structured information extraction from German job advertisements under privacy and institutional constraints, connected to BIBB-style vocational-data questions.

local inference · schema validation · privacy-aware

Problem


The task was to read German job advertisements, extract entities and relations against a fixed schema, produce valid JSON, and do it reliably in a privacy-first setting where data could not casually leave the institution.
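The core loop this describes, generate, parse, and validate against a fixed schema, can be sketched as follows. The field names below are illustrative placeholders, not the actual GOJA schema:

```python
import json

# Hypothetical fixed schema: required fields and their types for one
# extracted record. These names are illustrative, not from GOJA.
REQUIRED_FIELDS = {
    "job_title": str,
    "skills": list,
    "location": str,
}

def validate_record(raw: str):
    """Parse model output as JSON and check it against the fixed schema.

    Returns (record, None) on success and (None, reason) on failure, so
    callers can count parse failures and schema violations separately.
    """
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"parse error: {e}"
    if not isinstance(record, dict):
        return None, "not a JSON object"
    for field, typ in REQUIRED_FIELDS.items():
        if field not in record:
            return None, f"missing field: {field}"
        if not isinstance(record[field], typ):
            return None, f"wrong type for {field}"
    return record, None
```

Separating parse errors from schema violations matters later: the two failure classes call for different interventions.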

This page summarizes existing GOJA project notes. It is not a new benchmark and does not add invented metrics.

Setup

Findings from the project notes

Parse reliability: Gemma 4 27B reached 72% parse reliability.

The project notes frame this as viable for assisted annotation, not autonomous database writes.

Small model limit: Llama 3.2 3B parsed correctly 34% of the time.

The practical lesson: model capacity mattered for this structured extraction task.

Entity quality: 0.45 entity F1 for Gemma 4 27B.

Useful enough to support human review, not enough to remove the reviewer.
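For context on what a 0.45 F1 means operationally, here is the standard micro-F1 computation over entity sets, assuming exact-match (span, type) pairs; the actual GOJA matching criterion may differ:

```python
def entity_f1(predicted, gold):
    """Micro F1 over two sets of (span, type) entity tuples.

    At F1 ~= 0.45, roughly half of pre-filled entities are wrong or
    missing, which is why the reviewer stays in the loop.
    """
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)  # exact-match true positives
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```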

Relation extraction: performance remained weak.

The draft points toward multi-stage pipelines rather than one-pass extraction.

Prompt ablation lesson

The project notes describe six prompt versions on Gemma 4 27B. Adding annotation guidelines, compound-splitting rules, JSON constraints, and namespace disambiguation created a tradeoff: improvements in extraction quality often reduced parse reliability.

The lesson is not "prompting is useless." It is that complex structured extraction has a ceiling if every requirement is pushed into one generation step.
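One way to lower that ceiling is to split the single generation step into stages, as the notes suggest. A hedged sketch, where `run_entities` and `run_relations` stand in for the actual model calls:

```python
def two_stage_extract(ad_text, run_entities, run_relations):
    """Sketch of splitting one-pass extraction into two stages.

    Stage 1 asks only for entities; stage 2 takes the entity list and
    asks only for relations between them. Each prompt then carries
    fewer simultaneous constraints than a one-pass prompt would.
    """
    entities = run_entities(ad_text)                # entity-only prompt
    relations = run_relations(ad_text, entities)    # relation-only prompt
    return {"entities": entities, "relations": relations}
```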

Failure modes

The notes describe specific, reproducible failure modes rather than random noise. That changes the intervention strategy: post-processing and staged extraction become more credible than endlessly adding prompt constraints.
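When failures are reproducible, a post-processing pass can target them directly. A hypothetical repair step for two common structured-output failures (markdown fences around the JSON, trailing commas); which repairs are actually worth applying depends on the failure modes observed in the notes:

```python
import json
import re

def repair_json(raw: str):
    """Attempt mechanical repairs before re-parsing; None if still broken."""
    text = raw.strip()
    # Strip a leading ```json fence and a trailing ``` fence.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Drop trailing commas before a closing } or ].
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```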

Production lesson

Privacy-first local LLMs are not a shortcut around evaluation rigor. Parse rate matters. Schema validity matters. Failure-mode analysis matters. Human-in-the-loop design matters.

The most honest product framing from the notes is assisted annotation: model pre-fills structured output, a human reviewer corrects it, and the system improves throughput without pretending to be autonomous.
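In pipeline terms, that framing is a routing decision: validated model output becomes a prefilled review form, everything else falls back to blank manual annotation rather than a misleading prefill. A minimal sketch, with `extract` and `validate` as placeholders for the real model call and schema check:

```python
def route_for_review(ads, extract, validate):
    """Split ads into a prefilled review queue and a blank-form queue."""
    prefilled, blank = [], []
    for ad in ads:
        record = validate(extract(ad))
        if record is not None:
            prefilled.append((ad, record))   # human corrects the prefill
        else:
            blank.append(ad)                 # human annotates from scratch
    return prefilled, blank
```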

Next steps