
GOJA NLP extraction

Local LLM evaluation for structured information extraction from German job advertisements under privacy and institutional constraints, connected to BIBB-style vocational-data questions.

local inference · schema validation · privacy-aware

Problem


The task was to read German job advertisements, extract entities and relations against a fixed schema, produce valid JSON, and do it reliably in a privacy-first setting where data could not casually leave the institution.
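The core loop this describes, generate, parse, and validate against a fixed schema, can be sketched as follows. The field names below are illustrative placeholders, not the actual GOJA schema:

```python
import json

# Hypothetical fixed schema: required fields and their types for one
# extracted record. These names are illustrative, not from GOJA.
REQUIRED_FIELDS = {
    "job_title": str,
    "skills": list,
    "location": str,
}

def validate_record(raw: str):
    """Parse model output as JSON and check it against the fixed schema.

    Returns (record, None) on success and (None, reason) on failure, so
    callers can count parse failures and schema violations separately.
    """
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"parse error: {e}"
    if not isinstance(record, dict):
        return None, "not a JSON object"
    for field, typ in REQUIRED_FIELDS.items():
        if field not in record:
            return None, f"missing field: {field}"
        if not isinstance(record[field], typ):
            return None, f"wrong type for {field}"
    return record, None
```

Separating parse errors from schema violations matters later: the two failure classes call for different interventions.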

This page summarizes existing GOJA project notes. It is not a new benchmark and does not add invented metrics.

Setup

Findings from the project notes

Parse reliability: Gemma 4 27B reached 72% parse reliability.

The project notes frame this as viable for assisted annotation, not autonomous database writes.

Small model limit: Llama 3.2 3B parsed correctly 34% of the time.

The practical lesson: model capacity mattered for this structured extraction task.

Entity quality: 0.45 entity F1 for Gemma 4 27B.

Useful enough to support human review, not enough to remove the reviewer.
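For context on what a 0.45 F1 means operationally, here is the standard micro-F1 computation over entity sets, assuming exact-match (span, type) pairs; the actual GOJA matching criterion may differ:

```python
def entity_f1(predicted, gold):
    """Micro F1 over two sets of (span, type) entity tuples.

    At F1 ~= 0.45, roughly half of pre-filled entities are wrong or
    missing, which is why the reviewer stays in the loop.
    """
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)  # exact-match true positives
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```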

Relation extraction: performance remained weak.

The draft points toward multi-stage pipelines rather than one-pass extraction.

Prompt ablation lesson

The project notes describe six prompt versions on Gemma 4 27B. Adding annotation guidelines, compound-splitting rules, JSON constraints, and namespace disambiguation created a tradeoff: improvements in extraction quality often reduced parse reliability.

The lesson is not "prompting is useless." It is that complex structured extraction has a ceiling if every requirement is pushed into one generation step.
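One way to lower that ceiling is to split the single generation step into stages, as the notes suggest. A hedged sketch, where `run_entities` and `run_relations` stand in for the actual model calls:

```python
def two_stage_extract(ad_text, run_entities, run_relations):
    """Sketch of splitting one-pass extraction into two stages.

    Stage 1 asks only for entities; stage 2 takes the entity list and
    asks only for relations between them. Each prompt then carries
    fewer simultaneous constraints than a one-pass prompt would.
    """
    entities = run_entities(ad_text)                # entity-only prompt
    relations = run_relations(ad_text, entities)    # relation-only prompt
    return {"entities": entities, "relations": relations}
```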

Failure modes

The notes describe specific, reproducible failure modes rather than random noise. That changes the intervention strategy: post-processing and staged extraction become more credible than endlessly adding prompt constraints.
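When failures are reproducible, a post-processing pass can target them directly. A hypothetical repair step for two common structured-output failures (markdown fences around the JSON, trailing commas); which repairs are actually worth applying depends on the failure modes observed in the notes:

```python
import json
import re

def repair_json(raw: str):
    """Attempt mechanical repairs before re-parsing; None if still broken."""
    text = raw.strip()
    # Strip a leading ```json fence and a trailing ``` fence.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Drop trailing commas before a closing } or ].
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```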

Production lesson

Privacy-first local LLMs are not a shortcut around evaluation rigor. Parse rate matters. Schema validity matters. Failure-mode analysis matters. Human-in-the-loop design matters.

The most honest product framing from the notes is assisted annotation: model pre-fills structured output, a human reviewer corrects it, and the system improves throughput without pretending to be autonomous.
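In pipeline terms, that framing is a routing decision: validated model output becomes a prefilled review form, everything else falls back to blank manual annotation rather than a misleading prefill. A minimal sketch, with `extract` and `validate` as placeholders for the real model call and schema check:

```python
def route_for_review(ads, extract, validate):
    """Split ads into a prefilled review queue and a blank-form queue."""
    prefilled, blank = [], []
    for ad in ads:
        record = validate(extract(ad))
        if record is not None:
            prefilled.append((ad, record))   # human corrects the prefill
        else:
            blank.append(ad)                 # human annotates from scratch
    return prefilled, blank
```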

Next steps