GOJA NLP extraction
Local LLM evaluation for structured information extraction from German job advertisements under privacy and institutional constraints, connected to BIBB-style vocational-data questions.
Problem
The task was to read German job advertisements, extract entities and relations against a fixed schema, produce valid JSON, and do it reliably in a privacy-first setting where data could not casually leave the institution.
This page summarizes existing GOJA project notes. It is not a new benchmark and does not add invented metrics.
Setup
- Task: German job-ad extraction with fixed-schema JSON output.
- Models in the project notes: Gemma 3 27B, Qwen 2.5 7B, Mistral 7B, and Llama 3.2 3B.
- Workflow context: local inference, Ollama, Hydra, MLflow, schema validation, and evaluation reports.
- Publication boundary: BIBB is linked as vocational-data context, not claimed here as an official site partnership.
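The fixed-schema requirement can be sketched as a minimal output validator. The schema below (`entities`/`relations` top-level keys, `text`/`type` entity fields) is an illustrative assumption, not the project's actual schema:

```python
import json

# Hypothetical minimal schema -- the real GOJA schema is not reproduced here.
REQUIRED_TOP_KEYS = {"entities", "relations"}
ENTITY_KEYS = {"text", "type"}

def validate_output(raw: str) -> tuple[bool, str]:
    """Parse a model response and check it against the fixed schema."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"parse error: {exc}"
    if not isinstance(doc, dict) or not REQUIRED_TOP_KEYS <= doc.keys():
        return False, "missing top-level keys"
    for ent in doc.get("entities", []):
        if not isinstance(ent, dict) or not ENTITY_KEYS <= ent.keys():
            return False, "entity missing required keys"
    return True, "ok"
```

Separating "did it parse" from "did it match the schema" matters later: the two failure rates respond to different interventions.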
Findings from the project notes
The project notes frame this as viable for assisted annotation, not autonomous database writes.
The practical lesson: model capacity mattered for this structured extraction task.
The models were useful enough to support human review, but not reliable enough to remove the reviewer.
The draft points toward multi-stage pipelines rather than one-pass extraction.
Prompt ablation lesson
The project notes describe six prompt versions on Gemma 3 27B. Adding annotation guidelines, compound-splitting rules, JSON constraints, and namespace disambiguation created a tradeoff: improvements in extraction quality often reduced parse reliability.
The lesson is not "prompting is useless." It is that complex structured extraction has a ceiling if every requirement is pushed into one generation step.
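That tradeoff only becomes visible if parse rate is tracked per prompt variant alongside extraction quality. A minimal sketch; the variants and outputs below are invented placeholders, not project data:

```python
import json

def parse_rate(outputs: list[str]) -> float:
    """Fraction of raw model outputs that are syntactically valid JSON."""
    ok = 0
    for raw in outputs:
        try:
            json.loads(raw)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs) if outputs else 0.0

# Hypothetical pattern: the constraint-heavy prompt variant produces
# richer extractions but more malformed JSON.
v1_outputs = ['{"entities": []}', '{"entities": []}']
v6_outputs = ['{"entities": []}', '{"entities": [}']
```

Here `parse_rate(v1_outputs)` is 1.0 and `parse_rate(v6_outputs)` is 0.5, the shape of tradeoff the notes describe: quality and parse reliability can move in opposite directions.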
Failure modes
- JSON corruption from misapplied compound-splitting rules on German slash-words (e.g. "Mechatroniker/in").
- Semantic confusion between entity types and relation types sharing similar vocabulary.
- Namespace bleeding across entity and relation contexts.
These are specific, reproducible failure modes. That changes the intervention strategy: post-processing and staged extraction become more credible than endlessly adding prompt constraints.
Production lesson
Privacy-first local LLMs are not a shortcut around evaluation rigor. Parse rate matters. Schema validity matters. Failure-mode analysis matters. Human-in-the-loop design matters.
The most honest product framing from the notes is assisted annotation: model pre-fills structured output, a human reviewer corrects it, and the system improves throughput without pretending to be autonomous.
Next steps
- Post-processing normalization for known JSON/schema failure modes.
- Multi-stage pipeline: separate entity extraction from relation extraction.
- Rewrite the full public article only after source notes and metrics are confirmed.
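The first two steps above can be sketched as two separate generation calls instead of one. `call_model` is a stand-in for a local inference call (e.g. via Ollama), and the prompts are illustrative only:

```python
import json
from typing import Callable

def extract_entities(ad_text: str, call_model: Callable[[str], str]) -> list:
    """Stage 1: entities only -- a smaller, easier-to-validate output."""
    raw = call_model(f"Extract entities as a JSON list:\n{ad_text}")
    return json.loads(raw)

def extract_relations(ad_text: str, entities: list,
                      call_model: Callable[[str], str]) -> list:
    """Stage 2: relations over already-validated entities."""
    prompt = (f"Given entities {json.dumps(entities, ensure_ascii=False)}, "
              f"extract relations as a JSON list:\n{ad_text}")
    return json.loads(call_model(prompt))

def pipeline(ad_text: str, call_model: Callable[[str], str]) -> dict:
    entities = extract_entities(ad_text, call_model)
    relations = extract_relations(ad_text, entities, call_model)
    return {"entities": entities, "relations": relations}
```

Splitting the stages means each prompt carries fewer simultaneous constraints, which is exactly the ceiling the one-pass ablation ran into.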