SCIP reconciliation
SCIP is the augmenter, not the primary. OpenCodeHub’s default resolver produces a graph on its own; SCIP then runs for each detected language, produces compiler-grade occurrences, and reconciles against the heuristic edges. Heuristic edges never get deleted — they get demoted. This page covers the ingest path, reconciliation, and the corners that took a few iterations to get right.
Why SCIP is an augmenter
Section titled “Why SCIP is an augmenter”Three reasons SCIP does not replace the default resolver:
- Not every language has an indexer, and the indexers that exist have different install paths and runtime expectations.
- SCIP requires a buildable repo. Missing dependencies, unsettable credentials, or a half-written feature branch all make the indexer fall over. The heuristic resolver still produces a usable graph.
- Rust, Java, and the JVM-driven indexers need build scripts to
run. Build-script execution is gated behind
CODEHUB_ALLOW_BUILD_SCRIPTS=1. Heuristic parsing is always safe.
SCIP contributes CALLS, REFERENCES, IMPLEMENTS, and TYPE_OF
edges with confidence=1.0 — the oracle tier — and the reconciliation
phase rescales any colliding heuristic edge to confidence=0.2 with a
+scip-unconfirmed suffix on the reason. The REFERENCES + TYPE_OF
emission landed in ADR 0014.
Indexer inventory
Section titled “Indexer inventory”| Indexer | Languages | Install channel |
|---|---|---|
scip-typescript | TypeScript, TSX, JavaScript | npm install -g @sourcegraph/scip-typescript |
scip-python | Python | uv tool install scip-python |
scip-go | Go | go install github.com/scip-code/scip-go/cmd/scip-go@<pin> |
rust-analyzer | Rust | rustup component add rust-analyzer rust-src |
scip-java | Java | coursier install scip-java |
scip-dotnet | C# | dotnet tool install --global scip-dotnet |
scip-clang | C, C++ | Vendor binary; consumes a JSON compilation database. |
scip-kotlin | Kotlin | scip-kotlin@^0.6.0 (requires Kotlin 2.2+). |
scip-ruby | Ruby | scip-ruby via gem; reads sorbet/config if present. |
Pins live in .github/workflows/gym.yml so gym replay catches drift.
ADR 0006 covers the rationale for individual pins and install
channels.
The .scip ingest path
Section titled “The .scip ingest path”@opencodehub/scip-ingest hand-rolls the protobuf reader (~130 LOC)
instead of pulling in buf plus codegen — the SCIP schema is small
enough that the extra build-time dependency is not worth the
maintenance burden. The public API is narrow: parseScipIndex,
deriveIndex, deriveEdges, buildSymbolDefIndex, materialize,
runIndexer, detectLanguages, scipProvenanceReason.
The phase flow:
detectLanguages(repo)— fs-based heuristic (tsconfig.json, pyproject.toml, go.mod, Cargo.toml, pom.xml / build.gradle / build.sbt).- For each detected language,
runIndexer()spawns the per-language binary and writes.codehub/scip/<kind>.scip. Fan-out usesPromise.all; a per-language failure never aborts the run. parseScipIndexdecodes the protobuf into typed wire shapes.deriveIndex+deriveEdgesattribute each occurrence to a caller (via innermost-enclosingenclosing_range) and a callee (via asymbolDeftable keyed onSCIP_ROLE_DEFINITIONoccurrences).emitEdges()writesCALLSedges withconfidence=1.0andreason=scip:<indexer>@<version>.
A cached .scip artifact that passes the freshness check is reused;
re-running an indexer is expensive, especially rust-analyzer.
Confidence demote
Section titled “Confidence demote”The confidence-demote phase runs immediately after scip-index and
carries three constants:
HEURISTIC_CONFIDENCE = 0.5DEMOTED_CONFIDENCE = 0.2ORACLE_CONFIDENCE = 1.0UNCONFIRMED_SUFFIX = "+scip-unconfirmed"It iterates edges twice: first to build the set of
(from, type, to) triples that SCIP has confirmed, second to demote
any matching heuristic edge. Three edge types are demotable: CALLS,
REFERENCES, EXTENDS. The demoted edge keeps its original reason
with the +scip-unconfirmed suffix so provenance is visible.
The invariant: SCIP replaces (never rejects) heuristic edges — demote only, do not delete. Downstream consumers can still filter on confidence; the information is not lost.
Provenance tagging
Section titled “Provenance tagging”Every oracle-derived edge carries a reason of the form
scip:<indexer>@<version>, e.g. scip:scip-python@0.6.6. The
prefix set is declared once in @opencodehub/core-types and consumers
(summarizer trust filter, verdict, MCP tools) test against the
exported list rather than string-matching ad hoc. New indexers
(scip-clang, scip-dotnet, scip-kotlin, scip-ruby) are appended to the
same list as they land.
The pipeline slice
Section titled “The pipeline slice”reconcile is the phase that makes heuristic and oracle edges
coherent. Only CALLS edges currently flow from SCIP (see
limitations below).
Known gotchas
Section titled “Known gotchas”The design has been shaped by four durable lessons. Each one is a concrete bug that was found, fixed, and captured:
- Callee resolution must go through
symbolDefkeyed onSCIP_ROLE_DEFINITION. Resolving a callee from the first-seen call site routes same-named symbols to wrong local nodes — a Python method namedsavein multiple classes all collapse onto whicheversave()call happened first in the file. ThebuildSymbolDefIndexpath is the fix. See durable lessonarchitecture-patterns/scip-callee-definition-site.md. - TS monorepos emit
dist/paths in cross-package refs andsrc/paths in defs. ThesymbolDeftable aliases the two so a reference to@acme/core/dist/foo.jsbinds to its definition inpackages/core/src/foo.ts. See durable lessonarchitecture-patterns/scip-monorepo-dist-src-alias.md. - SCIP is 0-indexed, the graph is 1-indexed. The
+1conversion lives at the boundary inscip-index.ts. Getting this wrong shifts every caller attribution by one line. See durable lessonconventions/scip-0-indexed-vs-graph-1-indexed.md. - The protobuf reader is hand-rolled on purpose. SCIP’s schema
is small and stable; pulling in buf plus codegen would pay a
recurring build-time cost for decoding logic that fits in 130
lines. See durable lesson
conventions/scip-protobuf-hand-rolled-reader.md.
Status
Section titled “Status”Both REFERENCES and TYPE_OF are now emitted from SCIP alongside
CALLS and IMPLEMENTS. ADR 0014 describes the wire-up that landed
the missing edge classes plus the embedder-modelId fingerprint
refusal at query time.
Configuration knobs
Section titled “Configuration knobs”CODEHUB_DISABLE_SCIP=1— the phase is a full no-op.CODEHUB_ALLOW_BUILD_SCRIPTS=1— required for the rust + java runners (build.rs, gradle).PipelineOptions.offline === true— skips indexer runs entirely; cached.scipartifacts are still consumed if present.
Further reading
Section titled “Further reading”- ADR 0005 — SCIP replaces LSP — why SCIP (no long-running language server) over LSP.
- ADR 0006 — SCIP indexer pins — the version table and rationale.
- Determinism — gym replay catches indexer drift before it lands in main.
- Durable lessons:
architecture-patterns/scip-replaces-lsp.md,architecture-patterns/scip-callee-definition-site.md,architecture-patterns/scip-monorepo-dist-src-alias.md,conventions/scip-0-indexed-vs-graph-1-indexed.md,conventions/scip-protobuf-hand-rolled-reader.md.