Architecture overview

OpenCodeHub turns a source tree into a typed graph that agents can query over MCP. The pipeline has six phases, and each phase has one job. This page is the index. Each section names a phase, states its one job, and links to the page that covers it in depth.

Fifteen tree-sitter grammars produce a unified ParseCapture stream. Per-language resolvers turn captures into typed relations. SCIP indexers (TypeScript, Python, Go, Rust, Java, C#, C/C++, Kotlin, Ruby) upgrade heuristic edges to compiler-grade references where available. The graph persists into LadybugDB, with DuckDB carrying the temporal sibling. Communities and processes are precomputed. An stdio MCP server answers agent queries.

The graph tier is always LadybugDB (graph.lbug); the temporal tier is always DuckDB (temporal.duckdb). Both files live under .codehub/. There is no selection knob, no probe, and no fallback — if the @ladybugdb/core binding cannot load, open() throws GraphDbBindingError and the operation aborts. See Storage backend.

Embeddings live in the same physical store as the graph (one embeddings table, one HNSW index, three granularities keyed by a granularity discriminator). Findings reuse the nodes table with kind='Finding'.

One job: parse every file with its tree-sitter grammar and emit a ParseCapture[] stream in a unified schema (tag, text, start/end line+col, nodeType). Lines are 1-indexed, columns 0-indexed.
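
To make the indexing convention concrete, here is a minimal sketch of what a capture record could look like. The field names (startLine, startCol, and so on) and the toCapture helper are assumptions for illustration; only the field list and the 1-indexed/0-indexed rule come from the description above.

```typescript
// Hypothetical shape: field names are assumed from the schema summary
// (tag, text, start/end line+col, nodeType), not taken from the codebase.
interface ParseCapture {
  tag: string;       // capture tag from the grammar query
  text: string;      // source text covered by the capture
  nodeType: string;  // tree-sitter node type
  startLine: number; // 1-indexed
  startCol: number;  // 0-indexed
  endLine: number;   // 1-indexed
  endCol: number;    // 0-indexed
}

// tree-sitter positions are 0-indexed { row, column } points, so only the
// row needs shifting to match the convention above.
function toCapture(
  tag: string,
  text: string,
  nodeType: string,
  start: { row: number; column: number },
  end: { row: number; column: number },
): ParseCapture {
  return {
    tag,
    text,
    nodeType,
    startLine: start.row + 1,
    startCol: start.column,
    endLine: end.row + 1,
    endCol: end.column,
  };
}

const exampleCapture = toCapture(
  "definition.function",
  "function f() {}",
  "function_declaration",
  { row: 0, column: 0 },
  { row: 0, column: 15 },
);
```

A capture on the first line of a file therefore reports startLine 1 and startCol 0.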

Fifteen languages are registered via a compile-time exhaustive satisfies Record<LanguageId, LanguageProvider> table: TypeScript, TSX, JavaScript, Python, Go, Rust, Java, C#, C, C++, Ruby, Kotlin, Swift, PHP, Dart. The runtime is web-tree-sitter (WASM) — the only parse runtime on Node 20, 22, and 24. There is no native parser and no opt-in (ADR 0015).
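
The exhaustiveness check can be sketched as follows. This is a trimmed-down illustration with three languages instead of fifteen; the LanguageProvider fields, the grammar file names, and the languageForFile helper are assumptions, and the satisfies operator requires TypeScript 4.9 or later.

```typescript
// Hypothetical, reduced versions of the real types.
type LanguageId = "typescript" | "python" | "go";

interface LanguageProvider {
  extensions: readonly string[];
  grammarWasm: string; // assumed: grammar file name loaded by web-tree-sitter
}

// `satisfies` checks the literal against Record<LanguageId, LanguageProvider>
// without widening it. A missing LanguageId key, or a key that is not a
// LanguageId, is a compile-time error.
const providers = {
  typescript: { extensions: [".ts"], grammarWasm: "tree-sitter-typescript.wasm" },
  python: { extensions: [".py"], grammarWasm: "tree-sitter-python.wasm" },
  go: { extensions: [".go"], grammarWasm: "tree-sitter-go.wasm" },
} satisfies Record<LanguageId, LanguageProvider>;

// Look up the language for a path by file extension.
function languageForFile(path: string): LanguageId | undefined {
  const ids = Object.keys(providers) as LanguageId[];
  for (const id of ids) {
    if (providers[id].extensions.some((ext) => path.endsWith(ext))) return id;
  }
  return undefined;
}
```

Adding a member to LanguageId without adding a matching entry to the table fails the build, which is the property the registration table relies on.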

See Parsing and resolution.

2. Resolve — captures to typed relations

One job: turn captures into typed edges (DEFINES, HAS_METHOD, HAS_PROPERTY, IMPORTS, EXTENDS, IMPLEMENTS, CALLS, REFERENCES, TYPE_OF) by resolving names against a per-language symbol scope.

A three-tier resolver handles the common case (same-file 0.95, import-scoped 0.9, global 0.5). Python and the TS family opt into a stack-graphs backend for tighter cross-module resolution. Heritage linearization is per-language: C3, first-wins, single-inheritance, or no-op.
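
A minimal sketch of the three-tier lookup, using the confidence values quoted above. The Scope shape and resolveName function are hypothetical, and picking the first global candidate is an assumption; the text does not say how the global tier breaks ties.

```typescript
interface Resolution {
  target: string;
  confidence: number;
  tier: "same-file" | "import-scoped" | "global";
}

// Hypothetical scope layout, for illustration only.
interface Scope {
  fileSymbols: Map<string, string>;     // name -> symbol id defined in the same file
  importedSymbols: Map<string, string>; // name -> symbol id reachable through imports
  globalSymbols: Map<string, string[]>; // name -> all symbol ids with that name
}

// Try the most specific tier first and stop at the first hit.
function resolveName(name: string, scope: Scope): Resolution | undefined {
  const local = scope.fileSymbols.get(name);
  if (local !== undefined) {
    return { target: local, confidence: 0.95, tier: "same-file" };
  }
  const imported = scope.importedSymbols.get(name);
  if (imported !== undefined) {
    return { target: imported, confidence: 0.9, tier: "import-scoped" };
  }
  // Assumption: take the first global candidate.
  const first = scope.globalSymbols.get(name)?.[0];
  if (first !== undefined) {
    return { target: first, confidence: 0.5, tier: "global" };
  }
  return undefined;
}
```

The confidence attached to each edge records which tier produced it, so later phases can tell a same-file match from a global guess.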

See Parsing and resolution.

3. Augment — SCIP indexers upgrade edges

One job: run each repo’s SCIP indexer, parse the resulting .scip protobuf, and emit CALLS, REFERENCES, IMPLEMENTS, and TYPE_OF edges with confidence=1.0 and reason=scip:<indexer>@<version>. The confidence-demote phase then rescales any heuristic edge the SCIP oracle contradicts from 0.5 to 0.2.
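
The emit-then-demote flow can be sketched as below. The Edge shape and both function names are hypothetical, the version string in the usage is a placeholder, and the definition of "contradicts" (SCIP knows edges of the same kind from the same source, but none to this target) is an assumption made for illustration.

```typescript
type EdgeKind = "CALLS" | "REFERENCES" | "IMPLEMENTS" | "TYPE_OF";

interface Edge {
  from: string;
  to: string;
  kind: EdgeKind;
  confidence: number;
  reason: string;
}

// SCIP-derived edges carry full confidence and a provenance string.
function scipEdge(from: string, to: string, kind: EdgeKind, indexer: string, version: string): Edge {
  return { from, to, kind, confidence: 1.0, reason: `scip:${indexer}@${version}` };
}

// Assumed contradiction rule: SCIP has edges of this kind from this source,
// and none of them points at the heuristic edge's target.
function demoteContradicted(heuristic: Edge[], scip: Edge[]): Edge[] {
  const scipTargets = new Map<string, Set<string>>();
  for (const e of scip) {
    const key = `${e.kind}\u0000${e.from}`;
    let targets = scipTargets.get(key);
    if (targets === undefined) {
      targets = new Set<string>();
      scipTargets.set(key, targets);
    }
    targets.add(e.to);
  }
  return heuristic.map((e) => {
    const targets = scipTargets.get(`${e.kind}\u0000${e.from}`);
    const contradicted = targets !== undefined && !targets.has(e.to);
    // Only 0.5-confidence edges are rescaled, per the description above.
    return contradicted && e.confidence === 0.5 ? { ...e, confidence: 0.2 } : e;
  });
}
```

Edges that SCIP has no opinion on keep their heuristic confidence; only contradicted ones drop to 0.2.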

Pinned indexers cover TypeScript / TSX / JavaScript (scip-typescript), Python (scip-python), Go (scip-go), Rust (rust-analyzer), Java (scip-java), C# (scip-dotnet), C/C++ (scip-clang), Kotlin (scip-kotlin), and Ruby (scip-ruby). Pins live in .github/workflows/gym.yml.

See SCIP reconciliation.

One job: persist the graph into LadybugDB with search indexes wired up.

  • BM25 — over symbol names, signatures, and summaries.
  • HNSW — filter-aware, with the granularity discriminator pushed into the predicate so all three tiers (symbol / file / community) share one index without recall collapse.
  • Multi-hop traversal — Cypher-emitting dialect on the LadybugDB graph store.
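
The reason the granularity discriminator belongs in the search predicate, rather than in a filter applied afterwards, can be shown with a small sketch. This uses exact brute-force cosine ranking, not HNSW, and every name in it is hypothetical; it only illustrates where the filter sits relative to the top-k cut.

```typescript
type Granularity = "symbol" | "file" | "community";

interface EmbeddingRow {
  id: string;
  granularity: Granularity;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    na += x * x;
    nb += y * y;
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Filter inside the search: only rows of the requested tier compete for top-k.
function searchFiltered(rows: EmbeddingRow[], query: number[], g: Granularity, k: number): string[] {
  return rows
    .filter((r) => r.granularity === g)
    .map((r) => ({ id: r.id, score: cosine(r.vector, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.id);
}

// Filter after the search: other tiers can crowd the requested tier out of top-k.
function searchPostFiltered(rows: EmbeddingRow[], query: number[], g: Granularity, k: number): string[] {
  return rows
    .map((r) => ({ row: r, score: cosine(r.vector, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .filter((x) => x.row.granularity === g)
    .map((x) => x.row.id);
}

const sampleRows: EmbeddingRow[] = [
  { id: "f1", granularity: "file", vector: [1, 0] },
  { id: "f2", granularity: "file", vector: [0.99, 0.1] },
  { id: "s1", granularity: "symbol", vector: [0.7, 0.7] },
  { id: "s2", granularity: "symbol", vector: [0.5, 0.8] },
];
```

With query [1, 0], k = 2, and the symbol tier requested, the filtered search returns s1 and s2, while the post-filtered search returns nothing because both top-2 slots go to file rows. That is the recall collapse the shared index has to avoid.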

Embeddings are optional, gated on PipelineOptions.embeddings. The backend cascade is SageMaker → HTTP / OpenAI-compatible → local ONNX.
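
One way to read the cascade is "first available backend wins". The sketch below assumes that reading, assumes PipelineOptions.embeddings is a boolean, and uses hypothetical backend names and an available() probe; the actual selection logic may differ.

```typescript
// Hypothetical backend descriptor; real backends would also expose an embed call.
interface EmbeddingBackend {
  name: string;
  available: () => boolean;
}

// Assumed shape: the source only names the `embeddings` option.
interface PipelineOptions {
  embeddings: boolean;
}

// Walk the cascade in order and return the first backend that reports itself
// usable. Returns undefined when embeddings are disabled or nothing is usable.
function pickBackend(options: PipelineOptions, cascade: EmbeddingBackend[]): EmbeddingBackend | undefined {
  if (!options.embeddings) return undefined;
  for (const backend of cascade) {
    if (backend.available()) return backend;
  }
  return undefined;
}

// Order mirrors the cascade: SageMaker, then HTTP / OpenAI-compatible, then local ONNX.
const exampleCascade: EmbeddingBackend[] = [
  { name: "sagemaker", available: () => false },
  { name: "http", available: () => false },
  { name: "local-onnx", available: () => true },
];
```

In this example only the local backend is usable, so an enabled pipeline would end up on local-onnx and a disabled one would skip embedding entirely.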

Scanners run separately through the scan MCP tool, which merges their SARIF output on disk and indexes the findings back into the nodes table.

See Embeddings and Scanners and SARIF.

One job: group related symbols into communities (Louvain) and walk call chains to produce processes (handler → service → data access). Both are precomputed so MCP tools read them directly.
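
The call-chain half of this phase can be sketched as a depth-first walk over CALLS edges from a set of entry points. The function name, the depth cap, and the choice of sorted traversal order are assumptions for illustration; community detection (Louvain) is not shown.

```typescript
// Walk CALLS edges from each entry point and record every maximal chain.
// `calls` maps a symbol id to the symbol ids it calls.
function walkProcesses(
  calls: Map<string, string[]>,
  entryPoints: string[],
  maxDepth: number,
): string[][] {
  const processes: string[][] = [];

  const visit = (node: string, path: string[]): void => {
    const next = [...path, node];
    // Skip callees already on the current path so cycles terminate.
    const callees = (calls.get(node) ?? []).filter((c) => next.indexOf(c) === -1);
    if (callees.length === 0 || next.length >= maxDepth) {
      processes.push(next);
      return;
    }
    // Sorted order keeps the output stable across runs.
    for (const callee of [...callees].sort()) visit(callee, next);
  };

  for (const entry of [...entryPoints].sort()) visit(entry, []);
  return processes;
}

const exampleCalls = new Map<string, string[]>([
  ["handler", ["service"]],
  ["service", ["repo", "cache"]],
  ["repo", []],
]);
const exampleProcesses = walkProcesses(exampleCalls, ["handler"], 5);
```

For this input the walk yields two chains, handler → service → cache and handler → service → repo, which is the handler → service → data access shape described above.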

Symbol-level LLM summaries are produced here when enabled. Summaries are fused into the symbol-tier embedding text at ingestion time (not query time) so retrieval runs against a pre-fused vector.
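
Ingestion-time fusion amounts to building the text that gets embedded before the vector is computed. A minimal sketch, assuming a hypothetical SymbolRecord shape and a newline-joined layout (the real template is not described here):

```typescript
// Hypothetical record; only name, signature, and summary are shown.
interface SymbolRecord {
  name: string;
  signature: string;
  summary?: string; // present only when summarization is enabled
}

// Build the text handed to the embedder. The summary is appended once, at
// ingestion, so query time never has to combine summary and symbol text.
function embeddingText(symbol: SymbolRecord): string {
  const parts = [symbol.name, symbol.signature];
  if (symbol.summary !== undefined && symbol.summary.trim() !== "") {
    parts.push(symbol.summary.trim());
  }
  return parts.join("\n");
}
```

When summaries are disabled the text falls back to name and signature alone, so the same code path serves both configurations.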

See Summarization and fusion.

One job: expose the graph through an stdio MCP server (codehub mcp). Twenty-nine tools, seven resources, zero canned prompts. Every tool returns a structured envelope with next_steps and, when the index lags HEAD, a _meta["codehub/staleness"] block. No daemon, no socket, no remote state.
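
A sketch of what such an envelope could look like. Only next_steps and the _meta["codehub/staleness"] key come from the description above; the result field, the contents of the staleness block, and the buildEnvelope helper are assumptions.

```typescript
// Hypothetical staleness payload; the real fields are not listed on this page.
interface Staleness {
  indexedCommit: string;
  headCommit: string;
}

interface ToolEnvelope<T> {
  result: T;            // assumed name for the tool's payload
  next_steps: string[];
  _meta?: { "codehub/staleness"?: Staleness };
}

// Attach the staleness block only when the index lags HEAD.
function buildEnvelope<T>(
  result: T,
  nextSteps: string[],
  indexedCommit: string,
  headCommit: string,
): ToolEnvelope<T> {
  const envelope: ToolEnvelope<T> = { result, next_steps: nextSteps };
  if (indexedCommit !== headCommit) {
    envelope._meta = { "codehub/staleness": { indexedCommit, headCommit } };
  }
  return envelope;
}
```

An agent can check for the staleness key before trusting an answer and decide whether to re-run analysis first.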

See MCP overview and MCP tools.

OpenCodeHub’s primary user is an AI coding agent that needs callers, callees, processes, and blast radius in one tool call, and needs the answer to be reproducible across runs. The six-phase shape is the cheapest configuration that meets that need under three constraints:

  • Local + offline. The default storage stack is embedded; codehub analyze --offline opens zero sockets.
  • Deterministic. Phases are pure: same inputs → same outputs, byte-identical graphHash. The graphHash invariant holds over the LadybugDB graph tier. See Determinism.
  • Apache-2.0. Every transitive dependency is on the permissive allowlist. No BSL, no AGPL, no source-available engines in the core. See Supply chain.
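
The determinism property in the list above depends on hashing a canonical form of the graph, so that insertion order cannot leak into the result. The sketch below shows that idea with a hypothetical GraphEdge shape and a 32-bit FNV-1a hash over UTF-16 code units as a stand-in; the real graphHash algorithm is not specified on this page and presumably uses a cryptographic hash.

```typescript
interface GraphEdge {
  from: string;
  to: string;
  kind: string;
}

// Serialize edges in a fixed, locale-independent order.
function canonicalize(edges: GraphEdge[]): string {
  return edges
    .map((e) => `${e.kind}|${e.from}|${e.to}`)
    .sort()
    .join("\n");
}

// Stand-in hash (FNV-1a, 32-bit), used here only to keep the sketch
// dependency-free. It is not collision-resistant.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return ("0000000" + (h >>> 0).toString(16)).slice(-8);
}

function sketchGraphHash(edges: GraphEdge[]): string {
  return fnv1a(canonicalize(edges));
}

const edgesA: GraphEdge[] = [
  { from: "a", to: "b", kind: "CALLS" },
  { from: "b", to: "c", kind: "IMPORTS" },
];
// Same edges, different insertion order.
const edgesB: GraphEdge[] = [edgesA[1], edgesA[0]];
```

Both edgesA and edgesB hash to the same value because the canonical form sorts before hashing.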

| ADR | Topic |
| --- | --- |
| 0001 | Storage backend selection — DuckDB + hnsw_acorn + fts (the v1.0 baseline). |
| 0002 | Rust core deferred — v2.0 stays pure TypeScript. |
| 0004 | Hierarchical embeddings — one table, three granularities, filter-aware HNSW. |
| 0005 | SCIP replaces LSP — compiler-grade edges without long-running language servers. |
| 0006 | SCIP indexer CI pins — current version table per language. |
| 0007–0010 | Artifact factory, document pattern, output conventions, dogfood findings. |
| 0011 | LadybugDB (phase-1) — graph-native backend behind the IGraphStore seam. |
| 0012 | Repo as a first-class graph node — repo_uri, group registry, AMBIGUOUS_REPO envelope. |
| 0013 (storage) | M7 default-flip + interface segregation. Superseded by 0016. |
| 0013 (parse) | WASM-default parse runtime, native opt-in. Superseded by 0015. |
| 0014 | SCIP REFERENCES + TYPE_OF emission, embedder modelId stamping. |
| 0015 | WASM-only parser — web-tree-sitter is the only runtime on Node 20/22/24; native opt-in removed. |
| 0016 | DuckDB graph backend ripped out — LadybugDB graph + DuckDB temporal, both always present, no selection knob. |

See ADRs for the full list.