Scaling Trail from 200 to 100,000 Neurons: An Engineering Note

A compile-time knowledge engine has three workloads that fight for compute — ingest, lint, and the curation queue. Each one breaks at a different corpus size, for different reasons, and requires a different fix. This is the long-form companion to Work That Fits in a Night: the full accounting of where Trail's bottlenecks are, when they hit, and what a deployment looks like at 200, 8,000, 25,000, and 100,000 Neurons.

Why this note exists

Trail is a compile-time knowledge engine. Every source it ingests is read by a language model and turned into a persistent graph of Neurons — short wiki pages that cross-reference each other and the raw material they came from. The engine does three kinds of background work to keep that graph honest: it ingests new sources into Neurons, it lints the existing corpus for contradictions, orphans, and staleness, and it queues candidate edits for a human curator to approve.

All three workloads scale differently. Ingest is largely linear in the size of the incoming source. Lint is quadratic if you let it be — each new Neuron wants to be re-checked against every existing peer. The queue is bounded by the number of decisions a human can make per day, which does not change no matter how large the corpus gets.

At 200 Neurons none of this matters. A full lint pass finishes in twenty minutes. Ingest runs per-source without thinking about it. The queue holds a handful of candidates. Every shortcut in the engine falls inside a comfortable budget.

At 100,000 Neurons the same code, run the same way, doesn't finish. The scheduler overlaps with itself. The LLM token spend outruns the pricing model. The curator drowns.

The gap between "works" and "doesn't work" is not a cliff. It's a sequence of bottlenecks that arrive in a predictable order, each requiring a distinct architectural response. Work That Fits in a Night covers one of those bottlenecks in detail; the rest of this note takes the full accounting: what breaks, when, and how the engine is built to absorb it without a rewrite.

The three workloads

Ingest

Ingest turns a source into Neurons. A PDF gets parsed, chunked, described page-by-page by a vision model if it has images, and then handed to a primary LLM that extracts entities, writes or updates Neuron pages, and emits backlinks. The work per source is proportional to the source — a ten-page article costs about one minute of wall-clock time on Sonnet; a two-hundred-page book costs about fifteen.

What ingest does not scale with is corpus size. Processing page 47 of a new PDF does not require Trail to consult the other 99,953 Neurons in the base. It only consults the ones the LLM thinks are relevant, which in practice is five to twenty per compile step. That bound is set by Trail's prompt, not by the corpus.

This means ingest is the cheap workload. Double the corpus, ingest cost stays flat per source. The bottleneck is always "how many sources per hour can I afford in LLM spend," never "how expensive is this particular source because the corpus is big."

Lint

Lint is the nightly pass that keeps the graph internally consistent. It runs three checks: contradictions (does this Neuron disagree with any of its near-peers?), orphans (does this Neuron still have a source that supports it?), and staleness (has this Neuron gone untouched longer than the KB's freshness window?).

Orphan and stale detection are pure SQL. They cost milliseconds even on a corpus of a hundred thousand Neurons. There is nothing to ration.

Contradiction detection is the expensive one. For each Neuron under inspection, Trail pulls the top-K nearest peers via FTS5, then asks a small language model, one peer pair at a time, whether the two contradict. K is five by default, and each check is one LLM call costing one to three seconds on Haiku. A full pass at 8,000 Neurons runs about seventeen hours — longer than the night it has to finish in.
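The arithmetic behind that seventeen-hour figure is worth making explicit. The sketch below is illustrative, not Trail internals; K = 5 and ~1.5 seconds per Haiku call are the figures quoted in this note.

```typescript
// Back-of-envelope cost of the naive pass: one small-model call per
// (Neuron, peer) pair, run serially.
function naiveLintHours(
  neurons: number,
  peersPerNeuron: number,
  secondsPerCall: number
): number {
  const calls = neurons * peersPerNeuron; // one LLM call per pair
  return (calls * secondsPerCall) / 3600; // serial wall-clock hours
}

naiveLintHours(200, 5, 1.5);   // ≈ 0.4 h — comfortably inside a night
naiveLintHours(8_000, 5, 1.5); // ≈ 16.7 h — the scan wall
```

The cost is linear in corpus size but with a multiplier (K × seconds-per-call) large enough that it collides with the fixed length of a night well before the corpus feels big.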

That is the scan wall. It is covered in detail in Work That Fits in a Night; the short version is that Trail solves it by capping the per-pass budget and weighting the cap sixty-forty toward recent edits. Every Neuron gets re-scanned on average every two to three weeks; recent edits get re-scanned inside twenty-four hours; the scheduler finishes its pass in roughly an hour regardless of how big the KB is.

Queue

The curation queue is where LLM proposals wait for human approval. Every Neuron write, every contradiction finding, every chat answer flagged as worth saving goes through it. The queue is the one workload whose bottleneck is not compute — it is curator attention.

A single human can make roughly fifty to two hundred approve/reject decisions per hour in a focused session. That ceiling does not change when the corpus grows. At 200 Neurons the queue sees a handful of candidates a day and the curator handles them over coffee. At 100,000 Neurons the raw LLM activity would generate thousands per day if it were emitted without triage — which would be useless, because the curator can only process a few hundred.

The queue's scaling problem is therefore one of surfacing, not throughput: which candidates are worth the human's time, and which can be auto-merged, auto-deferred, or filtered entirely. Measured per curator-hour, the surfaced queue has to stay flat as the corpus grows, without dropping the material that actually needs a human.

Each of these three workloads breaks at a different scale. The rest of this note walks through four representative corpus sizes — 200, 8,000, 25,000, and 100,000 Neurons — and describes what each workload looks like at that scale, what infrastructure supports it, and what has to change at the next tier.

Four representative corpus sizes on a log axis. Each tier stacks its workload state — ingest, lint, queue — above the corpus circle, with amber marks where a workload's behaviour has changed from the previous tier. Infrastructure sits at the top as a growing schematic Machine.

200 Neurons — the single-tenant zero-friction baseline

This is Sanne Andersen's clinic, right after first-week onboarding. A couple of hundred protocol PDFs, patient-case notes, and treatment summaries have been ingested. The whole KB weighs about 12 MB on disk including the FTS5 index.

Ingest. Each new source drops into the pipeline and compiles within two to four minutes. There is no queue of pending ingests — one source goes through, then the next. LLM cost per source is negligible against a flat Pro-tier subscription: Haiku for extraction, Sonnet for the larger compile steps, costs measured in single cents.

Lint. The contradiction sampler's cap is set to 500, but the corpus only has 200 Neurons — so the cap is never hit, and the nightly pass just scans everything. The full pass takes about twenty minutes. Orphan and stale detection run in under a second.

Queue. The curator sees three to eight candidates a morning, mostly from ingest. The contradiction detector occasionally flags something real — a new protocol PDF disagreeing with a pre-existing one. The curator resolves the entire queue in fifteen minutes over the first coffee of the day.

Infrastructure. A single Fly.io shared-cpu Machine (256 MB RAM, 1 GB volume) running the Trail server with an embedded libSQL file. No caching layer, no queue broker, no object store — the filesystem holds everything. Monthly cost to operate: roughly $5 of compute plus whatever the LLM spend comes to, which for a clinic at this size is under $30 a month.

This scale is the default Trail deployment. It is what the Hobby plan targets. The engine doesn't need to be tuned to fit here — every default is already correct.

8,000 Neurons — where the first wall lives

This is the midpoint of a serious working KB. FysioDK Aalborg, a couple of years into daily use, runs here. A mid-sized engineering team's internal RFC archive plus incident log plus runbooks lands here. Trail's own self-hosted KB of its design docs and code-review conversations sits here.

Ingest. Same as 200 Neurons, because ingest does not scale with corpus size. A source is a source; the bill per source is the same.

Lint. This is where the scan wall would hit in a naive implementation. A full contradiction pass over 8,000 Neurons would cost forty thousand LLM calls at 1.5 seconds each — seventeen hours, more than the night the scheduler has. Left alone, passes would overlap, then stack, then the lint service would quietly stop functioning.

This is where the 60/40 sampler takes over. The scheduler caps each nightly pass at 500 Neurons, spends 300 of them on the most-recently-updated pages and 200 on a uniform random sample of the rest. The pass finishes in about seventy-five minutes on Haiku. Every Neuron gets scanned on average every sixteen nights; recent edits get re-scanned within the same day.
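The sampler's selection logic can be sketched in a few lines. The names and the Fisher–Yates draw here are illustrative, not Trail's actual internals; the 500-cap and 60/40 split are the figures given above.

```typescript
interface Neuron {
  id: string;
  updatedAt: number; // epoch millis of last edit
}

// Nightly 60/40 sample: 60% of the cap goes to the most-recently-updated
// Neurons, 40% to a uniform random draw over the rest of the corpus.
function sampleNightlyPass(corpus: Neuron[], cap = 500): Neuron[] {
  if (corpus.length <= cap) return [...corpus]; // small KBs: scan everything
  const recentBudget = Math.floor(cap * 0.6);   // e.g. 300 of 500
  const byRecency = [...corpus].sort((a, b) => b.updatedAt - a.updatedAt);
  const recent = byRecency.slice(0, recentBudget);
  const tail = byRecency.slice(recentBudget);
  // Partial Fisher–Yates shuffle: a uniform sample from the long tail.
  const tailBudget = cap - recentBudget;
  for (let i = 0; i < tailBudget; i++) {
    const j = i + Math.floor(Math.random() * (tail.length - i));
    [tail[i], tail[j]] = [tail[j], tail[i]];
  }
  return recent.concat(tail.slice(0, tailBudget));
}
```

At 8,000 Neurons this yields 500 scans a night, so any given tail Neuron comes up on average every sixteen nights, while anything edited today is guaranteed into the next pass.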

The practical effect: the curator never sees a reduction in contradiction-finding quality as the corpus grows from 200 to 8,000. They see the same one or two findings a morning, drawn from the same nightly-pass machinery, with the same latency against recent edits. The corpus has grown by 40x but the curator's queue looks identical.

Queue. Around 8,000 Neurons, the queue starts to require triage on the candidate side, not just on the curator side. Trail ships a source filter and a connector filter that let the curator grab, say, "only Slack-originated candidates from this week" or "only the ingest outputs from the batch I just fed the engine." Those filters don't matter at 200 Neurons. At 8,000 they are how the curator stays oriented.

Auto-approval becomes relevant here too. Candidates above a tuned confidence threshold (the S6 onboarding step shows a default of 0.85) can be auto-merged into the wiki without human review. At the 8k scale, roughly sixty to seventy percent of routine ingest candidates clear this bar. The curator only sees the uncertain ones, which is where their attention is most valuable.
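The triage rule itself is a one-liner. This is a minimal sketch assuming a single scalar confidence per candidate; 0.85 is the default quoted from the S6 onboarding step above.

```typescript
type Verdict = "auto-merge" | "needs-review";

// Threshold triage: candidates at or above the confidence bar merge
// without review; everything else waits for the curator.
function triage(confidence: number, threshold = 0.85): Verdict {
  return confidence >= threshold ? "auto-merge" : "needs-review";
}

triage(0.92); // "auto-merge" — routine ingest output
triage(0.61); // "needs-review" — the curator sees it
```

The interesting tuning question is not the rule but the threshold: set it too low and drift slips into the wiki unreviewed, too high and the queue swells back toward raw candidate volume.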

Infrastructure. Still a single Fly.io Machine, but the tier has moved to performance-2x (2 shared CPU, 2 GB RAM, 5 GB volume). The libSQL file is around 450 MB including indices. Monthly LLM spend for nightly lint plus any reactive emissions lands in the $40–80 range for a single-tenant deployment at this size. Ingest spend sits on top, varying by how actively the tenant is feeding in sources.

This is the largest corpus size where a single-node, single-database deployment is still the obviously-right answer. Everything past this point starts forcing architectural decisions.

25,000 Neurons — the second wall, and why it isn't about lint

A large research lab's literature archive lands here. A documentation-as-code org with several years of daily ingest hits this range. A personal knowledge base that has swallowed a decade of notes reaches it too.

Ingest. Still flat per source. The ingest pipeline doesn't care.

Lint. The 60/40 sampler continues to hold. At 25k Neurons the nightly cap is bumped up slightly (TRAIL_CONTRADICTION_SAMPLE_SIZE=1200 is a reasonable value for parallel-Haiku lanes) but the architecture is unchanged. The full-tail pass now takes about three weeks of real time, which is still the right scale — a Neuron untouched for three weeks is a Neuron nobody is modifying.
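The tier bump is applied as a configuration override rather than a code change. A minimal sketch, assuming the cap is read from the environment variable named above with the baseline 500 as the fallback (the parsing details are illustrative):

```typescript
// Resolve the nightly contradiction-sample cap from an env-style record.
// Falls back to the baseline 500 on a missing or malformed value.
function contradictionSampleSize(env: Record<string, string | undefined>): number {
  const raw = env["TRAIL_CONTRADICTION_SAMPLE_SIZE"];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : 500;
}

contradictionSampleSize({ TRAIL_CONTRADICTION_SAMPLE_SIZE: "1200" }); // 1200 — the 25k tier
contradictionSampleSize({});                                          // 500 — baseline cap
```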

Queue. This is where the second wall lives, and it is the one people consistently underestimate.

At 25,000 Neurons, raw candidate volume — ingest outputs plus lint findings plus chat-save suggestions — is high enough that even with auto-approval and filters, the work per day has crossed the threshold a single curator can sustain. The compounding pressure is this: a contradiction finding at 25k Neurons often references two other Neurons, which the curator needs to read to judge the finding. So a five-minute decision becomes a twenty-minute decision, and a dozen findings a day becomes four hours of real work.

Trail's response at this scale is not to generate fewer candidates — that would mean the engine is silently letting things drift. It is to introduce roles: an admin who onboards the corpus and sets policy, curators who approve material in their domain, and readers who use the wiki without write access. The curation work gets distributed. A lab with five researchers can assign contradiction findings by topic, so each curator sees only the twenty or thirty a day that fall in their area.

The RBAC layer (F28, S5 in onboarding) is designed for this. It is not a compliance feature; it is a scaling feature. Without it, a 25k-Neuron corpus is bottlenecked on one human.

Infrastructure. Still feasible on a single node, but the Fly.io Machine is now performance-4x (4 dedicated CPUs, 4 GB RAM, 20 GB volume). The libSQL file is around 2 GB. FTS5 query performance on the contradiction-candidate lookup starts to show in p95 latency if indexes aren't well-tuned — the engine ships with explicit indexes on kbId + updatedAt and kbId + kind precisely because those are the hot paths at this scale.

The single-node architecture is still correct. What's starting to matter is not the machine size; it's the operational cost of one machine being the only machine. Backups become non-optional. The nightly libSQL export gets pushed to R2 with a 30-day retention. The alerting story (is the scheduler finishing its passes? is the queue draining? is disk approaching the volume limit?) becomes a first-class concern rather than something you check manually every few days.

100,000 Neurons — the tier that rewrites the engine

This is a speculative scale. Trail has not shipped a tenant this large yet. The following is the engineering analysis of what would need to change if one arrived, based on where the bottlenecks lie and what the code currently assumes.

Ingest. Still roughly flat per source, but the prompt-context strategy starts to matter. At 100k Neurons, the "find the relevant peers before writing this new page" step — which uses FTS5 — returns more results than fit comfortably in a prompt. The engine has to decide which ten to send to the LLM, which is itself an optimization problem. In practice this is handled by scoring candidates on recency plus FTS5 rank plus graph distance; the default weighting works at this scale but may need tuning per tenant.
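The note only names the three signals — recency, FTS5 rank, graph distance — so the combination below is a hedged sketch: the weights and decay shapes are illustrative stand-ins for the "default weighting [that] may need tuning per tenant."

```typescript
interface PeerCandidate {
  ftsRank: number;       // normalized FTS5 relevance, higher is better
  ageDays: number;       // days since last edit
  graphDistance: number; // hops from the page being compiled
}

// Combine the three signals into one score; keep the top N for the prompt.
function scorePeer(
  p: PeerCandidate,
  w = { fts: 1.0, recency: 0.5, graph: 0.8 } // illustrative weights
): number {
  const recency = Math.exp(-p.ageDays / 90); // newer pages score higher
  const proximity = 1 / (1 + p.graphDistance); // fewer hops score higher
  return w.fts * p.ftsRank + w.recency * recency + w.graph * proximity;
}

function pickPromptPeers(cands: PeerCandidate[], limit = 10): PeerCandidate[] {
  return [...cands].sort((a, b) => scorePeer(b) - scorePeer(a)).slice(0, limit);
}
```

The point of the structure is that the prompt budget stays fixed at `limit` regardless of how many candidates FTS5 returns, which is what keeps ingest cost flat per source at 100k.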

Lint. The sampler architecture continues to hold, but the sample size has to go up. The 60/40 split across 100k Neurons at 500-per-pass means the long tail gets scanned once every six to eight months. That is too slow — an old Neuron that disagrees with a new one should surface in weeks, not seasons. The practical configuration at this scale is TRAIL_CONTRADICTION_SAMPLE_SIZE=3000 run across two or three parallel Haiku lanes, which brings the tail-coverage interval back down to about a month.

This is where the LLM token budget starts to dominate everything else. At 3,000 scanned Neurons per night times K=5 peers times roughly 250 tokens per contradiction check, a single tenant is spending around 3.7 million tokens per night on lint alone. At Haiku's pricing this is in the $30–40-per-night range per tenant, which is real money but still the right order of magnitude against a Business-tier subscription. The cost remains linear in sample size, not in corpus size — which is the whole point of the sampler.
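The budget above is three multiplications, using exactly the figures in this note:

```typescript
// Nightly lint token budget, term by term.
function nightlyLintTokens(
  sampledNeurons: number,  // nightly sample cap
  peersPerNeuron: number,  // K
  tokensPerCheck: number   // prompt + completion per contradiction check
): number {
  return sampledNeurons * peersPerNeuron * tokensPerCheck;
}

nightlyLintTokens(3_000, 5, 250); // 3,750,000 — "around 3.7 million tokens per night"
```

Note that corpus size does not appear anywhere in the formula; only the cap does. That is the sampler's guarantee made visible.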

Queue. The roles model developed at 25k Neurons has to be taken further. A 100k-Neuron deployment almost certainly has subject-matter silos — a legal KB of this size has a contracts group, a litigation group, and a compliance group, each with its own curator. Candidates need to be auto-routed to the correct reviewer based on the affected Neurons' tags or topic clusters. The curator-assignment logic is not in the engine today; a deployment at this scale would need it as a custom extension, probably as a pre-queue filter that writes assignedTo on the candidate record.
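Since the router does not exist in the engine today, the sketch below is hypothetical by construction: a pre-queue filter that maps candidate tags to curators and writes `assignedTo` on the record, with an admin fallback for anything untagged.

```typescript
interface Candidate {
  id: string;
  tags: string[];      // topic tags on the affected Neurons
  assignedTo?: string; // written by the router before the candidate is queued
}

// Hypothetical pre-queue router: first matching tag wins; unmatched
// candidates fall through to a default reviewer.
function routeCandidate(
  c: Candidate,
  routes: Record<string, string>,
  fallback: string
): Candidate {
  const tag = c.tags.find(t => t in routes);
  return { ...c, assignedTo: tag ? routes[tag] : fallback };
}

const routes = {
  contracts: "curator-a",
  litigation: "curator-b",
  compliance: "curator-c",
};
routeCandidate({ id: "c1", tags: ["litigation", "billing"] }, routes, "admin");
// assignedTo: "curator-b"; an untagged candidate would get "admin"
```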

The other thing that emerges at this scale is summarization within the queue. Instead of showing the curator every contradiction finding raw, Trail can group findings by topic and present them as "here are the twelve findings about hiring policy from this week, most of which turn on the same underlying disagreement about equity vesting." This is a UI feature more than an engine feature, but it requires the engine to surface the right grouping signal — which at 100k Neurons means tracking finding-clusters across the nightly passes.
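The grouping signal can be sketched as a simple bucketing step. Keying on an exact topic string is a stand-in here for whatever cluster identifier the engine would track across nightly passes:

```typescript
interface Finding {
  id: string;
  topic: string; // stand-in for a tracked finding-cluster key
}

// Bucket the week's findings by topic so the queue can present
// "twelve findings about hiring policy" instead of twelve raw rows.
function groupFindings(findings: Finding[]): Map<string, Finding[]> {
  const groups = new Map<string, Finding[]>();
  for (const f of findings) {
    const bucket = groups.get(f.topic) ?? [];
    bucket.push(f);
    groups.set(f.topic, bucket);
  }
  return groups;
}
```

The curator then approves or rejects per group, which is what turns a four-hour queue back into a coffee-length one.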

Infrastructure. Single-node is still viable — a performance-8x Fly.io Machine with a 50 GB volume can host this — but the operational story is no longer "a database file on a machine." Backup cadence needs to be sub-daily. The libSQL file is around 8 GB, which is fine for reads but long for a single backup cycle; incremental backups via libSQL's native replication become the right answer. An embedded read-replica on a second Machine is a cheap insurance policy against the primary going down during business hours.

At this scale, the argument for migrating to a split-compute architecture (separate ingest workers, separate lint workers, a shared database) is tempting but probably wrong. The ingest pipeline is embarrassingly parallel already — Fly.io's per-region auto-scaling can spin up additional ingest workers on demand, and they can write to the shared libSQL over the Turso-compatible network protocol without a full application rewrite. The lint pass is already bounded by the sampler cap, so throwing workers at it doesn't help. The database is the only truly stateful component, and at 8 GB it's still inside the comfort zone of a single embedded node.

The architecture that shows up around 100k is therefore: one primary Fly.io Machine running the server and the embedded libSQL, one read-replica in a different region for failover, a pool of ingest workers in the primary region that compile sources into the primary DB, and the usual Trail scheduler (unchanged from 8k) running the sampler on the primary.

Decision matrix — what to worry about, when

Tier              Ingest                 Lint                                     Queue                               Infra
200 Neurons       default                default                                  default                             shared-cpu single Machine
8,000 Neurons     default                sampler enabled, cap = 500/night         filters + auto-approval             perf-2x + backups
25,000 Neurons    default                cap = 1,200, parallel Haiku lanes        roles required (RBAC, per topic)    perf-4x + monitoring
100,000 Neurons   context-select tuned   cap = 3,000, multi-lane, ~1-month tail   auto-routing + grouping             perf-8x + read-replica
The same tiers as a decision matrix. Amber cells mark the tier where each workload's strategy changes; charcoal cells mean the previous tier's setting still applies. Ingest is the only column that never changes until the very last tier.

The headline: ingest never becomes the bottleneck. Lint becomes the bottleneck around 8k and is handled by the sampler forever after. The queue becomes the bottleneck in a way that is not about engine compute — it is about the human attention budget, which requires organizational structure rather than faster code. Infrastructure follows a predictable ladder, and migration between tiers is a vertical scale-up, not a rewrite.

What breaks honestly

There are things this analysis does not address because Trail has not hit them yet.

Multi-KB joins. The numbers above assume a single KB per tenant. A deployment with dozens of KBs per tenant — a university running a separate KB per department, say — has not been profiled. The scheduler runs per-KB, which means twenty KBs of 5,000 Neurons each cost twenty times the single-KB lint budget. This is linear and bounded, but it changes the pricing conversation.

Federation. Chat queries that should span KBs ("what do we know about frozen shoulder across the clinic KB and the research-paper KB?") require FTS5 lookups against multiple databases plus a cross-KB synthesis prompt. The engine has the primitives for this (the chat layer already handles multi-KB selection) but the token cost at large combined-corpus sizes has not been measured against real workloads.

Vector search at scale. Trail uses libSQL's native vector extension for certain spreading-activation queries, but the current indices are designed for single-KB corpora under 50k Neurons. At 100k+ inside a single KB, the vector index performance story needs validation. FTS5 remains the primary retrieval path and is well-understood at this size.

Curator-cluster dynamics. The roles + auto-routing + grouping UI described for 100k Neurons is engineering the engine can support, but the workflow design — how three curators collaborate on overlapping findings without stepping on each other — has not been exercised at this scale. This is the class of problem that surfaces only with a real tenant in the seat.

None of these are blockers. They are the edges where the analysis stops being grounded in observed behaviour and starts being projected from the architecture.

The through-line

Trail's scaling strategy is the same at every tier: decide what work has to be done, decide what part of that work can be bounded, bound it, and let the rest grow freely. Ingest is unbounded because it's cheap per unit. Lint is bounded by the sampler. Queue is bounded by auto-approval plus curator distribution. Infrastructure is bounded by the next Fly.io Machine size in the ladder.

The engine does not scale by becoming cleverer as the corpus grows. It scales by refusing to become more expensive than it was at the last tier, and letting the sampler — or the role structure, or the auto-router — absorb the difference between what would cost and what actually costs.

That choice has a consequence worth naming: at every scale, a small amount of work is being skipped. Not skipped silently — the sampler's bias guarantees recent edits get scanned inside a day and everything else gets scanned inside a month. But it is skipped, in the sense that the engine is not doing the O(N²) pairwise check that would nominally be "correct." The bet is that the pattern the sampler preserves — recency-biased with a long-tail reserve — is good enough because it mirrors how biological memory consolidates.

Sleep is bounded. Memory is bounded. The wiki, compiled continuously, is also bounded. That is not a limitation. That is what makes it keep working.

