Engineering note
Making a small local model feel frontier-grade
Updated June 2026
Maali's "Ask anything" feature runs on a small language model on your own computer, not a frontier model in the cloud. This is the story of how it gets frontier-quality answers anyway: by moving the thinking out of the model and into deterministic code, and proving the result with measurement instead of vibes.
The constraint that started it
Maali is local-first. Your financial data is analyzed by a small model (Gemma, roughly 4B parameters) running on your machine, and it never leaves. That rules out the easy path of wrapping a cloud model. So the goal became awkwardly specific: take a model small enough to run on a laptop and make its answers feel like a frontier model's. Not "good for a small model." Actually good.
The privacy contract removed the easy answer, which forced the architecture to carry the quality instead of the model. That turned out to be the whole lesson.
Let the model plan, not compute
A small model is genuinely good at one thing and unreliable at another. It is good at turning an infinite variety of phrasings into a small set of intents, and at writing natural sentences. It is unreliable at arithmetic and at holding many numbers in its head. So we split the work along exactly that line.
The model maps language to intent; code maps intent plus your live data to the answer. For advice, code builds a plan of ranked facts and the model only writes the prose around it. It never picks the numbers and never ranks them.
Guards, because instructions are not enough
The prompt already said "do not invent numbers." The model ignored it, so we stopped asking and started enforcing. The model is handed every figure it is allowed to use, so the set of permitted numbers is exactly the set it was given. Any dollar amount in the prose that is not in that set is flagged, the answer is regenerated once, and if it still misbehaves the offending sentence is removed. The panel of numbers beside the prose is the source of truth; the prose can only ever be a phrasing of it. One rule catches both fabrication and corruption.
From vibes to numbers
We could not improve what we could only feel, so we built the instruments: invariant tests that need no model (the parts of a total must sum to the whole), evaluations that grade real answers against independently recomputed figures, latency ceilings, and a mode that runs the whole suite against real data while emitting only pass or fail, never a figure. Correctness can be proven without anyone, including us, ever seeing the numbers.
The two-judge loop
The most useful change had nothing to do with code. One frontier model implements a change; a second, separate frontier model judges it blind, scoring two unlabeled answer sets against a rubric written down before the work began. To be clear about what these models are: they are graders in development, the teacher in the loop, not the runtime. The shipped product is the small local model alone, and the frontier models never touch your data.
This catches the thing authors always do, score their own work too kindly. In one case an author measured a 2.4x improvement that a blind judge put at 1.26x. When the two agree, you can trust the result; when they diverge, you know where to dig.
Where it ended up
By the time every advice path was plan-backed, the deterministic core was already at frontier, so the last stretch was a different kind of work: not building paths but auditing the finished whole. We graded roughly forty answers spanning every path, blind, against a rubric fixed before each round, three rounds in all, each closing what the last exposed: 8.31, then 8.55, then 8.72 out of 10. The headline is not the top number; it is that by the third round no single answer scored below 8. The floor came up to meet the ceiling.
| Path | Round 3 | Note |
|---|---|---|
| Queries (about 80% of real use) | 8.9 | Correctness is structural; deterministic code does the arithmetic. The routing tail (a wrong period, a comparison read backwards) was found in the audit and closed. |
| Spending advice | 8.6 | Leads with the highest-leverage cut and names the driver behind it, with a one-time-versus-recurring rule, not generic filler. |
| Subscriptions and recurring | 8.6 | Code renders the list as scannable rows; the model writes only a one-sentence lead of judgment over it. |
| Wealth and retirement | 8.5 | Names the real levers and refuses to prescribe a move the data does not support. |
| Forecast | 8.5 | A plan-backed projection with an actionable target. |
| Broad "how am I doing" | 8.7 | A stable, balanced health read; once the most variable path, now one of the most reliable. |
| Declines and out of scope | 8.6 | Refuses a stock pick or a tax strategy and redirects to what it can actually show. |
| Multi-turn follow-ups | 8.1 | A follow-up keeps the prior frame. The lowest path, and the clearest remaining polish target. |
| Adversarial | 9.0 | Injection, gibberish, and out-of-data periods refused cleanly. |
| Latency | much improved | One wiring fix roughly halved advice generation and removed the worst-case waits; the figures panel now paints in under a tenth of a second. |
We stopped at 8.72 on purpose. For a model this small, frontier-grade does not mean prose indistinguishable from a cloud model; it means rock-solid, defect-free, and fast on your own machine. So we raised the floor (no leaks, no invented figures, no misroutes, no answer you cannot read) and left the last fraction of a point, which is prose elegance sitting inside the graders' own margin of noise. What lifted the floor was killing a long tail of small edge defects, the kind that make a person think "this is dumb," which you only find by grading the whole surface, not the part you just tuned. What remains is no longer correctness, safety, or even latency; it is polish we declined to buy by swapping in a bigger model and forfeiting the privacy and speed that are the entire point.
What we would tell the next person
Move the thinking out of the model
Every path we shifted from "model reasons and phrases" to "code plans, model phrases" jumped a notch and shed a class of bug.
Guards beat prompts
"Do not invent numbers" was ignored. A check that strips any unverifiable figure works where the instruction could not.
One source of truth
The figures are canonical; the prose is only a phrasing of them.
Measure, do not assert
"8.7 on the rubric, n=11" is a sentence you can act on. "Feels better" is not.
Separate implementer from judge
Authors over-score their own work. Judge blind, against a rubric fixed in advance.
Constraints can be a gift
Because we could not phone a frontier model, the architecture had to carry the quality, which made it more robust.
In one line
We did not make a small model smart. We built a system that stops relying on the model for the things small models get wrong, and measured the difference honestly. The model maps language to intent and intent to sentences; everything that has to be exact is exact because code, not the model, computes it.
Maali runs all of this on your own machine. Your data never leaves it.
All figures shown are illustrative examples from synthetic demo data, not real financial information. This note describes engineering method, not financial advice.