Determinism contract
OpenCodeHub makes one load-bearing promise to agents and humans alike:
identical inputs produce a byte-identical graph hash. If you
analyze the same commit twice on the same machine — or on a different
machine with the same toolchain — you get the same graphHash. That
is the determinism contract.
Why it matters
Three concrete reasons:
- Reproducibility. An agent that reports a blast radius at graphHash=abc123 and a human reviewer who re-runs codehub analyze should see the same graph. If the hash diverges, the agent’s claim is not auditable.
- Cache-safety. codehub status and CI runners assume that two analyze runs at the same commit have the same output. Without determinism, incremental caches would drift silently and staleness detection would become unreliable.
- Regression testing. Every feat or refactor that touches the ingestion pipeline has to demonstrate it did not move the hash unintentionally. Determinism makes that assertion possible in one line of CI.
What “inputs” means
An input is:
- Source tree contents at the current commit.
- Toolchain versions (Node 22 or 24, pnpm 10.x, tree-sitter grammars pinned in packages/ingestion/package.json, SCIP indexer versions pinned in .github/workflows/gym.yml per ADR 0006).
- OpenCodeHub version (the monorepo version pinned in release-please).
- Any user-supplied configuration (AGENTS.md overrides, .codehub/config).
Anything outside that list — wall-clock time, process ID, file-system inode ordering — must not influence the hash. The ingestion phases are pure: inputs in, relations out, no ambient state.
The graphHash invariant is backend-independent. A repo indexed
into LadybugDB (graph.lbug) and the same repo indexed into the
single-file DuckDB layout (graph.duckdb) at the same commit produce
the same hash. A parity gate in CI compares the two hashes on every
PR that touches the storage layer.
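One way to make a hash backend-independent is to compute it over a canonical serialization of the relations rather than over the bytes of the storage file. The sketch below assumes that approach. The Relation shape, the graphHashSketch name, and the choice of SHA-256 are illustrative assumptions, not OpenCodeHub's actual implementation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical relation shape; the real OpenCodeHub schema may differ.
interface Relation {
  from: string;
  kind: string;
  to: string;
}

// Compare by UTF-8 byte value so the order never depends on locale.
function byteCompare(a: string, b: string): number {
  return Buffer.compare(Buffer.from(a, "utf8"), Buffer.from(b, "utf8"));
}

// Serialize each relation to one canonical line, sort the lines, then hash.
// The digest depends on which relations exist, not on the order they were
// emitted in or on which backend later persists them.
function graphHashSketch(relations: readonly Relation[]): string {
  const lines = relations
    .map((r) => JSON.stringify([r.from, r.kind, r.to]))
    .sort(byteCompare);
  const hash = createHash("sha256");
  for (const line of lines) {
    hash.update(line, "utf8");
    hash.update("\n");
  }
  return hash.digest("hex");
}
```

Because the sort happens before hashing, two backends that return the same relations in different row orders would still agree on the digest under this scheme.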
How we test it
Acceptance gate 6 is the regression test. It:
- Copies a fixture repo into two temp directories.
- Runs git init + commit in each (identical tree → identical commit hash, provided author, committer, and timestamps are pinned).
- Runs codehub analyze --force --skip-agents-md against each, capturing the printed graphHash.
- Asserts the two hashes are byte-identical.
If the hashes diverge, the gate fails and the acceptance run exits
non-zero. See scripts/acceptance.sh gate 6 for the exact script.
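The shape of the gate can be sketched in a few lines. This is not the contents of scripts/acceptance.sh. The function names, the assumption that the CLI is run with the repo as its working directory, and above all the output pattern used to find the hash are hypothetical; the real CLI may print the hash differently.

```typescript
import { execFileSync } from "node:child_process";

// ASSUMPTION: the CLI prints "graphHash" followed by "=" or ":" and a hex
// digest. Adjust the pattern to whatever codehub analyze really prints.
const GRAPH_HASH_PATTERN = /graphHash\s*[=:]\s*([0-9a-f]+)/i;

function extractGraphHash(output: string): string {
  const digest = GRAPH_HASH_PATTERN.exec(output)?.[1];
  if (digest === undefined) {
    throw new Error("no graphHash found in analyze output");
  }
  return digest;
}

// A runner takes a repo directory and returns the CLI's printed output.
type AnalyzeRunner = (repoDir: string) => string;

// ASSUMPTION: analyze operates on the current working directory.
const runCodehub: AnalyzeRunner = (repoDir) =>
  execFileSync("codehub", ["analyze", "--force", "--skip-agents-md"], {
    cwd: repoDir,
    encoding: "utf8",
  });

// Analyze both copies and fail loudly if the hashes differ.
function assertSameGraphHash(
  dirA: string,
  dirB: string,
  run: AnalyzeRunner = runCodehub,
): string {
  const hashA = extractGraphHash(run(dirA));
  const hashB = extractGraphHash(run(dirB));
  if (hashA !== hashB) {
    throw new Error(`graphHash diverged: ${hashA} vs ${hashB}`);
  }
  return hashA;
}
```

Passing the runner as a parameter keeps the comparison logic testable without the CLI installed.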
Two adjacent gates reinforce the contract:
- Gate 10: embeddings determinism. Runs the same double-analyze with --embeddings. Skipped if model weights are not present locally. Advisory-only today because embeddings do not yet propagate into the headline graphHash; the gate prints the hashes so a reviewer can spot drift manually.
- SCIP indexer regression CI (.github/workflows/gym.yml). Bit-exact re-invocation of the pinned SCIP indexers against the frozen baseline. Catches drift introduced by an indexer bump before it lands in main.
Full analyze and incremental re-analyze at the same commit must produce identical hashes (this is asserted explicitly in the determinism CI gate, not just on a clean tree). That is the “full vs incremental byte-identical” invariant called out in ADR 0002.
The --offline contract
codehub analyze --offline is a separate but related guarantee:
zero sockets opened during the run. The flag disables every
non-filesystem I/O path in the pipeline (no SCIP indexer downloads,
no remote embedder, no Bedrock summarize calls).
“Zero sockets” is the literal, measurable claim. It is testable by
running under strace -e connect or the equivalent on macOS
(dtruss); a socket attempt is a bug.
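A trace captured this way can be checked mechanically. The helper below is a hypothetical sketch that assumes a Linux strace text log in which each traced call appears as a line containing connect( (optionally prefixed by a pid). dtruss output is formatted differently and would need its own pattern.

```typescript
// Return every line of an strace log that records a connect() call.
// Hypothetical helper; not part of OpenCodeHub.
function findConnectCalls(traceLog: string): string[] {
  return traceLog
    .split("\n")
    .filter((line) => /(^|\s)connect\(/.test(line));
}

// The offline contract holds for a run only if no connect() was attempted.
function offlineContractHolds(traceLog: string): boolean {
  return findConnectCalls(traceLog).length === 0;
}
```

Returning the offending lines, rather than only a count, lets a failing check show which address the run tried to reach.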
Why it matters: OpenCodeHub is local-first. Your code never leaves
your machine by default. The --offline flag makes that an enforceable
contract for users who need to prove it.
Sources of non-determinism we actively guard against
Ingestion phases are reviewed for the usual suspects:
- Set / map iteration order. All emitted records are sorted by a stable key before being persisted. Providers that emit extractPropertyAccesses must return records sorted by (enclosingSymbolId, propertyName, startLine); see the LanguageProvider interface docstring.
- Date.now(), crypto.randomUUID(), any Math.random(). Banned in ingestion code. The graph-hash computation uses content hashes, never timestamps.
- File-system walk order. readdir results are sorted by byte value before dispatch.
- Parallel worker output ordering. Worker pools emit into per-worker buffers that are concatenated in deterministic file order at join time.
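The ordering guards above can be sketched as three small helpers. The record shape uses only the three sort-key fields named in the list. The function names, comparing string keys by UTF-8 bytes, and keying worker output by file path are assumptions made for illustration, not the LanguageProvider contract itself.

```typescript
// Hypothetical record shape built from the documented sort key.
interface PropertyAccess {
  enclosingSymbolId: string;
  propertyName: string;
  startLine: number;
}

// Byte-value comparison. JavaScript's default sort compares UTF-16 code
// units, which can disagree with UTF-8 byte order for some characters.
function byteCompare(a: string, b: string): number {
  return Buffer.compare(Buffer.from(a, "utf8"), Buffer.from(b, "utf8"));
}

// Order by (enclosingSymbolId, propertyName, startLine).
function comparePropertyAccess(a: PropertyAccess, b: PropertyAccess): number {
  return (
    byteCompare(a.enclosingSymbolId, b.enclosingSymbolId) ||
    byteCompare(a.propertyName, b.propertyName) ||
    a.startLine - b.startLine
  );
}

// Sort directory entries by byte value before dispatching work.
function sortDirEntries(names: readonly string[]): string[] {
  return names.slice().sort(byteCompare);
}

// Join worker output in sorted file order, whatever order workers finished in.
function mergeByFileOrder<T>(
  perFile: ReadonlyMap<string, readonly T[]>,
): T[] {
  const paths = Array.from(perFile.keys()).sort(byteCompare);
  const merged: T[] = [];
  for (const path of paths) {
    for (const record of perFile.get(path) ?? []) {
      merged.push(record);
    }
  }
  return merged;
}
```

Each helper copies its input before sorting, so callers cannot observe a partially ordered array.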
A fresh contributor reviewing a PR that adds a new phase should ask: “If I ran this twice on the same commit, would I get the same bytes?” If the answer is not obviously yes, the phase is wrong.
Related
- ADR 0001 (Storage backend): “Deterministic writes given identical INSERT order” is a listed positive of DuckDB vs. engines with random header UUIDs.
- ADR 0002 (Rust core deferred): calls out the “full vs incremental graphHash byte-identical” determinism CI gate explicitly.
- Contributing overview (Tenets): “Determinism is non-negotiable” is the first tenet in CONTRIBUTING.md.
- scripts/acceptance.sh gate 6: the runtime regression test.