field note / budget build / case study

Best AI Stack for Building on a Budget

What I learned building myberuf with Claude, Codex, ChatGPT, Mistral, Vercel, GitHub, project docs, smoke tests, and a stubborn respect for small vertical slices.


I am building myberuf, a German-for-work learning product. The product is situation-first, not grammar-first: interviews, onboarding, workplace pragmatics, AI coaching, mistake review, and browser-local progress. This article is not a generic tool ranking. It is a field note from trying to build a real MVP on a budget.

$ build mvp --budget-conscious
lesson AI agents can move fast when the task is narrow.
risk without guardrails, they can also ship nonsense faster.
system strategy -> task -> execution -> validation -> handover.

1. I am building an MVP on a budget

myberuf exists because a learner can take many B2/C1 German classes and still freeze when the situation becomes real: introducing yourself in an interview, asking a polite question at the end, navigating onboarding, or responding to workplace feedback.

The product tries to turn studied German into retrieval under pressure. It has HR interview simulations, onboarding scenarios, AI coaching, mistake review, progress tracking, workplace pragmatics, and a private beta workflow built around fast preview testing.

I did not start by hiring a full engineering team or pretending I had a polished agency machine behind me. I used AI tools as a practical build team: ChatGPT, Claude, Claude Code, Codex, Mistral, Vercel, GitHub, project reports, handovers, and smoke tests.

Builder note Do not ask AI to build the whole product. Ask it to ship one vertical slice that a real user can try.

2. The real lesson: AI tools need an operating system

The stack matters, but the operating system matters more. A strong model with a vague task can still make a mess. A smaller model with a crisp scope, the right files, explicit non-goals, and a smoke test can be surprisingly useful.

The operating system I keep coming back to is:

01. Strategy: think through the product problem before touching files.

02. Task: turn ambiguity into one reviewable sprint.

03. Execute: use a repo-aware agent for targeted changes.

04. Validate: run checks and manually smoke-test the user flow.

05. Handover: write the memory so the next session starts clean.

That workflow sounds simple. It is also the difference between "AI made a cool demo" and "AI helped me keep building a product without losing the thread."

3. The stack I used

This is not a universal leaderboard. It is the division of labor that has been useful in my current builder workflow.

ChatGPT / OpenAI (best fit: strategy + synthesis)

Useful for brainstorming, product framing, content, messy notes, implementation prompts, and turning a long build journey into usable lessons.

Claude Opus (best fit: ambiguous reasoning)

Useful when the problem is not just code: pedagogy, product architecture, tradeoffs, and deciding what the product should actually teach.

Claude Sonnet / Claude Code (best fit: implementation)

Useful for implementation-heavy work when the slice is scoped and the agent needs to follow existing repo patterns.

Codex / OpenAI (best fit: repo-aware execution)

Useful for targeted fixes, validation commands, debugging, and executing from project reports without carrying every old chat forward.

Mistral (best fit: model layer exploration)

Promising for myberuf's German coaching layer because coaching quality, language nuance, latency, cost, deployment trust, and European AI context matter.

Vercel + GitHub (best fit: shipping + rollback)

Preview deployment, branches, commits, and rollback points turned AI work into something testable instead of just impressive in a chat.

Local and open models matter too. In the GOJA NLP project, local LLMs were evaluated for structured information extraction from German job ads under privacy and institutional constraints. Local inference, Ollama, schema validation, and evaluation reports are a different kind of learning lab from myberuf, but the lesson is related: tool choice is task choice.

4. What each tool is good for

The practical split is this: use stronger reasoning models when the work is ambiguous, and use repo-aware execution tools when the task is clear.

For myberuf, Claude and ChatGPT were useful for strategy, pedagogy, product thinking, and implementation prompts. A lot of the important work was not "write code"; it was deciding that Module 1 should feel like surviving an HR interview, not taking a grammar lesson.

Codex was useful when the target was concrete: inspect these files, fix this validation path, run these checks, update this handover, do not touch unrelated code. Claude Code and Sonnet were useful for implementation-heavy work when the product decision was already made.

Mistral is relevant for my current model layer, especially because myberuf is language-heavy and German workplace coaching has to be natural enough to be useful. The model layer is not only about raw intelligence; it is about coaching quality, language nuance, latency, cost, deployment trust, and whether the surrounding AI infrastructure fits the product's context.

That is why Mistral is interesting here. myberuf is a German/European workplace-language product, so European AI infrastructure and enterprise-trust questions are not decoration. They are part of the long-term product environment. Mistral fits naturally into a multi-model builder stack: not as a replacement for every other tool, but as a serious candidate for the coaching flow.

I am not claiming Mistral is objectively best. I am saying it is promising for this use case and relevant for European/privacy-conscious AI workflows. The practical builder lesson is to choose models by job-to-be-done, not by hype.

Model selection rule Use the strongest model for ambiguity, not for every small task. Once the decision is clear, make the implementation slice narrow.

5. The handover system

The most underrated part of the stack is not a model. It is the handover.

When a project moves between ChatGPT, Claude, Claude Code, Codex, terminal sessions, project reports, and task files, the bottleneck is no longer only model capability. The bottleneck is whether the next agent receives the right context without dragging in irrelevant history.

In myberuf, files like progress.md, next_sprint.md, weekly handover notes, smoke checklists, and rollback notes became the shared memory layer. They reduced context pollution, token waste, repeated work, and stale assumptions.

A good handover says what changed, what files were touched, what checks ran, what was manually tested, what risks remain, what to do next, and what not to start.

$ handover --to next-agent
include goal, files touched, checks, smoke results, risks, rollback point
exclude secrets, stale chat history, unrelated product dreams
result less guessing, fewer repeated fixes, cleaner continuation
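The handover fields above can be pinned down as a small type, so every session produces the same shape of note. This is a minimal TypeScript sketch; the field names and the `renderHandover` helper are illustrative, not the actual myberuf schema:

```typescript
// Hypothetical shape for a session handover note.
interface Handover {
  goal: string;            // what this session tried to achieve
  filesTouched: string[];  // every file the agent changed
  checksRun: string[];     // lint/typecheck/test commands that ran
  smokeResults: string;    // what was manually tested in a real browser
  risks: string[];         // known gaps the next session inherits
  rollbackPoint: string;   // commit hash to revert to if needed
  nextSteps: string[];     // what to do next
  nonGoals: string[];      // what NOT to start
}

// Render the note as markdown for progress.md / next_sprint.md.
function renderHandover(h: Handover): string {
  const list = (items: string[]) => items.map((i) => `- ${i}`).join("\n");
  return [
    `## Goal\n${h.goal}`,
    `## Files touched\n${list(h.filesTouched)}`,
    `## Checks\n${list(h.checksRun)}`,
    `## Smoke test\n${h.smokeResults}`,
    `## Rollback\n${h.rollbackPoint}`,
    `## Risks\n${list(h.risks)}`,
    `## Next\n${list(h.nextSteps)}`,
    `## Do not start\n${list(h.nonGoals)}`,
  ].join("\n\n");
}
```

The point of the type is not the code; it is that a missing field ("what checks ran?") becomes visible before the next session starts guessing.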

6. The Vercel and secrets lesson

Vercel made preview deployment fast. That mattered because several myberuf issues only became visible in a real browser: scroll behavior, confusing beta signup copy, and UI options that created expectations the model could not meet.

But fast deployment does not remove security basics. Hosted model calls require careful environment variable handling. API keys belong on the server, not in browser-exposed code. .env.local stays out of git. Vercel environment variables need to be set deliberately. Logs should be inspected without exposing secrets.
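The server-side pattern is simple to sketch. Assuming a Next.js App Router route (the route name, env var name, and model endpoint here are all hypothetical), the browser only ever talks to your route, and the key only ever exists in server code:

```typescript
// app/api/coach/route.ts -- hypothetical route name.
// The key is read from process.env on the server and is never
// part of the browser bundle.
export async function POST(req: Request): Promise<Response> {
  const apiKey = process.env.MODEL_API_KEY; // set in Vercel env vars, kept out of git
  if (!apiKey) {
    return new Response("Model key not configured", { status: 500 });
  }
  const { answer } = await req.json();
  // Call the hosted model from the server; the client never sees the key.
  const upstream = await fetch("https://api.example.com/v1/chat", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ input: answer }),
  });
  return new Response(await upstream.text(), { status: upstream.status });
}
```

If the env var is missing, the route fails fast with a server error instead of silently calling the model unauthenticated, which is also the failure you want to see in Vercel logs.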

Security note Do not paste secrets into AI chats. Do not expose API keys client-side. Commit before major AI changes so rollback is easy.

7. Automation lesson: voice changes the latency requirement

One myberuf experiment explored a more conversational direction using Vapi: what if the learner could speak to the platform instead of only typing?

That changes the product requirements. For text-based coaching, a short delay can be acceptable. For voice-based learning, latency becomes part of the simulation. If the response lags, the learner no longer feels like they are practicing a real workplace conversation.

$ voice-practice --latency-matters
goal spoken workplace practice should feel conversational.
risk slow responses break the illusion faster than in typed coaching.
question which model is good enough, fast enough, affordable enough, and trustworthy enough?
$ source --voice-experiment
Vapi voice experiment

The source video is from the myberuf voice/automation experiment; the public page links out to Loom rather than embedding an iframe player that can fail in local preview.

This is another reason model and infrastructure choice matters. The question is not only "which AI is smartest?" It is "which model fits the interaction?" For a European German-for-work product, Mistral is strategically interesting because model-layer choices affect latency, trust, deployment fit, cost, and privacy expectations. That is not a claim that Mistral is universally faster or best. It is a job-to-be-done argument.

8. The guardrails lesson

AI products need deterministic checks before expensive or subjective LLM evaluation.

myberuf taught this the uncomfortable way. Inputs like "test", "asdf", or "weiss nicht" should not advance an interview, call the LLM, or save to a mistake bank. Catching them takes a cheap, deterministic guard that runs before the model is ever asked to coach anything.

Later, the same lesson appeared in a harder form: long abusive, non-German, or nonsense answers can still reach coaching if validation is too weak. If the coaching UI always expects "what was good," the model can become too generous. Sometimes the honest answer is: this was not a meaningful attempt.

Session 18 hardened the beta with a progress page, content validation, retry-coach rate limiting, cleanup, and clearer navigation. Session 19 fixed active-beat scrolling and produced the Round 2 plan. Session 19B shipped the optional /modules/1/exercises?round=2 slice with its own progress key, profile.modules["1-round2"], so it could not corrupt Module 1 or Module 2 progression. Session 20 added request guards, clarified beta copy so email capture did not sound like login, refined scroll behavior, and added a tiny Rueckfragen (follow-up questions) slice.
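The request guards do not have to be sophisticated to be useful. A per-user sliding window is enough to stop a retry-coach loop from burning tokens. A minimal in-memory sketch (the endpoint name is hypothetical, and a Map like this only works within a single long-lived instance, which was fine for a small beta but is not a production rate limiter):

```typescript
// Hypothetical per-user limit for a retry-coach endpoint:
// at most `limit` calls per sliding `windowMs` window.
const calls = new Map<string, number[]>();

function allowRequest(userId: string, limit = 5, windowMs = 60_000): boolean {
  const now = Date.now();
  const recent = (calls.get(userId) ?? []).filter((t) => now - t < windowMs);
  if (recent.length >= limit) {
    calls.set(userId, recent);
    return false; // over the limit: skip the expensive LLM call
  }
  recent.push(now);
  calls.set(userId, recent);
  return true;
}
```

On a serverless platform the same idea would live in a shared store (a KV namespace, a database row) instead of process memory, but the guard logic is identical.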

Guardrail rule Never ship AI-generated evaluation without invalid-input handling, request limits, and a path for "no meaningful attempt."

9. What I would tell another builder

Start smaller than you want to. Pick one painful user situation. For myberuf, that was not "learn German"; it was "survive a German workplace situation without freezing."

Then ask AI for one slice. Not a platform. Not a complete product. One flow that can be tried, broken, improved, and documented.

Separate product thinking from repo execution. Let ChatGPT or Claude help you clarify the product decision. Let Codex or Claude Code execute the specific file-level task. Then smoke-test the result yourself.

End every session with a handover. A project built with AI agents is not one heroic prompt. It is a chain of scoped decisions.

10. Practical checklist

Before asking an AI agent to build

  • Write the user situation in one sentence.
  • Define the smallest useful vertical slice.
  • List the files the agent should inspect.
  • List explicit non-goals.
  • Commit or identify a rollback point.
  • Decide which checks and smoke tests must run.

Before showing users

  • Run the app in a real browser.
  • Try nonsense input and malicious-looking input.
  • Confirm invalid input does not call expensive LLM paths.
  • Check beta copy does not imply features that do not exist.
  • Confirm secrets are server-side and out of git.
  • Write the handover for the next session.

11. AI amplifies clarity or confusion

Building on a budget is now possible in a way that would have sounded unrealistic a few years ago. But it is not automatic.

AI does not remove the need for product judgment. It increases the value of clear thinking. If the task is vague, AI amplifies confusion. If the task is scoped, validated, and handed over properly, AI becomes a real building system.

The best AI stack is not just a list of tools. It is a way of working: strategy -> task -> execution -> validation -> handover.

Use the operating system, not just the tools.

Start with one narrow user situation, keep the build reviewable, and make the next AI session small enough to validate.