Architecture overview
OpenCodeHub turns a source tree into a typed graph that agents can query over MCP. The pipeline has six phases, and each phase has one job. This page is the index. Each section names a phase, states its one job, and links to the page that covers it in depth.
Pipeline at a glance
Section titled “Pipeline at a glance”Fifteen tree-sitter grammars produce a unified ParseCapture stream.
Per-language resolvers turn captures into typed relations. SCIP
indexers (TypeScript, Python, Go, Rust, Java, C#, C/C++, Kotlin,
Ruby) upgrade heuristic edges to compiler-grade references where
available. The graph persists into LadybugDB by default, with DuckDB
carrying the temporal sibling. Communities and
processes are precomputed. An stdio MCP server with 29 tools answers
agent queries.
Where the data lives
Section titled “Where the data lives”The default backend is LadybugDB, with DuckDB as the temporal
sibling. A single-file DuckDB layout is the opt-in fallback when the
@ladybugdb/core binding cannot load (or when forced via
CODEHUB_STORE=duck). See Storage backend.
Embeddings live in the same physical store as the graph (one
embeddings table, one HNSW index, three granularities keyed by a
granularity discriminator). Findings reuse the nodes table with
kind='Finding'.
The six phases
Section titled “The six phases”1. Parse — source tree to captures
Section titled “1. Parse — source tree to captures”One job: lex every file with its tree-sitter grammar and emit a
ParseCapture[] stream in a unified schema (tag, text, start/end
line+col, nodeType). Lines are 1-indexed, columns 0-indexed.
Fifteen languages are registered via a compile-time exhaustive
satisfies Record<LanguageId, LanguageProvider> table: TypeScript,
TSX, JavaScript, Python, Go, Rust, Java, C#, C, C++, Ruby, Kotlin,
Swift, PHP, Dart. The runtime is web-tree-sitter (WASM) by default
on both Node 22 and Node 24; the native N-API addon is opt-in.
2. Resolve — captures to typed relations
Section titled “2. Resolve — captures to typed relations”One job: turn captures into typed edges (DEFINES, HAS_METHOD,
HAS_PROPERTY, IMPORTS, EXTENDS, IMPLEMENTS, CALLS,
REFERENCES, TYPE_OF) by resolving names against a per-language
symbol scope.
A three-tier resolver handles the common case (same-file 0.95, import-scoped 0.9, global 0.5). Python and the TS family opt into a stack-graphs backend for tighter cross-module resolution. Heritage linearization is per-language: C3, first-wins, single-inheritance, or no-op.
3. Augment — SCIP indexers upgrade edges
Section titled “3. Augment — SCIP indexers upgrade edges”One job: run each repo’s SCIP indexer, parse the resulting .scip
protobuf, and emit CALLS, REFERENCES, IMPLEMENTS, and TYPE_OF
edges with confidence=1.0 and reason=scip:<indexer>@<version>. The
confidence-demote phase then rescales any heuristic edge the SCIP
oracle contradicts from 0.5 to 0.2.
Pinned indexers cover TypeScript / TSX / JavaScript (scip-typescript),
Python (scip-python), Go (scip-go), Rust (rust-analyzer), Java
(scip-java), C# (scip-dotnet), C/C++ (scip-clang), Kotlin (scip-kotlin),
and Ruby (scip-ruby). Pins live in .github/workflows/gym.yml.
See SCIP reconciliation.
4. Index — BM25, HNSW, and scanners
Section titled “4. Index — BM25, HNSW, and scanners”One job: persist the graph into the selected backend with search indexes wired up.
- BM25 — over symbol names, signatures, and summaries.
- HNSW — filter-aware, with the granularity discriminator pushed into the predicate so all three tiers (symbol / file / community) share one index without recall collapse.
- Multi-hop traversal — Cypher-emitting dialect on the graph
backend; recursive CTEs (
USING KEY) on the legacy DuckDB layout.
Embeddings are optional, gated on PipelineOptions.embeddings. The
backend cascade is SageMaker → HTTP / OpenAI-compatible → local ONNX.
Scanners run separately through the scan MCP tool, merging SARIF
onto disk and indexing findings back into the nodes table.
See Embeddings and Scanners and SARIF.
5. Cluster — communities and processes
Section titled “5. Cluster — communities and processes”One job: group related symbols into communities (Louvain) and walk call chains to produce processes (handler → service → data access). Both are precomputed so MCP tools read them directly.
Symbol-level LLM summaries are produced here when enabled. Summaries are fused into the symbol-tier embedding text at ingestion time (not query time) so retrieval runs against a pre-fused vector.
6. Serve — MCP over stdio
Section titled “6. Serve — MCP over stdio”One job: expose the graph through an stdio MCP server (codehub mcp). Twenty-nine tools, seven resources, zero canned prompts. Every
tool returns a structured envelope with next_steps and, when the
index lags HEAD, a _meta["codehub/staleness"] block. No daemon, no
socket, no remote state.
See MCP overview and MCP tools.
Why this shape
Section titled “Why this shape”OpenCodeHub’s primary user is an AI coding agent that needs callers, callees, processes, and blast radius in one tool call — and needs the answer to be reproducible across runs. The six-phase shape is the cheapest configuration that hits all three:
- Local + offline. The default storage stack is embedded;
codehub analyze --offlineopens zero sockets. - Deterministic. Phases are pure: same inputs → same outputs,
byte-identical
graphHash. ThegraphHashinvariant holds across both the LadybugDB and DuckDB backends. See Determinism. - Apache-2.0, every transitive dep on the permissive allowlist. No BSL, no AGPL, no source-available engines in the core. See Supply chain.
Reference ADRs
Section titled “Reference ADRs”| ADR | Topic |
|---|---|
| 0001 | Storage backend selection — DuckDB + hnsw_acorn + fts (the v1.0 baseline). |
| 0002 | Rust core deferred — v2.0 stays pure TypeScript. |
| 0004 | Hierarchical embeddings — one table, three granularities, filter-aware HNSW. |
| 0005 | SCIP replaces LSP — compiler-grade edges without long-running language servers. |
| 0006 | SCIP indexer CI pins — current version table per language. |
| 0007–0010 | Artifact factory, document pattern, output conventions, dogfood findings. |
| 0011 | LadybugDB (phase-1) — graph-native backend behind the IGraphStore seam. |
| 0012 | Repo as a first-class graph node — repo_uri, group registry, AMBIGUOUS_REPO envelope. |
| 0013 (storage) | Storage default + interface segregation — LadybugDB by default, DuckDB temporal sibling. |
| 0013 (parse) | WASM-default parse runtime on Node 22 and Node 24. |
| 0014 | SCIP REFERENCES + TYPE_OF emission, embedder modelId stamping. |
See ADRs for the full list.
Related pages
Section titled “Related pages”- Monorepo map — every workspace package and what it owns.
- Storage backend — the graph + temporal interface segregation and the resolver.
- Cross-repo federation
—
repo_uri, the group registry, and theAMBIGUOUS_REPOenvelope. - Determinism — the reproducibility contract.
- Supply chain — SBOM, cosign, SLSA L3, license allowlist.