
Parsing and resolution

This page covers phases 1 and 2 of the pipeline: from source files to typed CALLS / EXTENDS / IMPLEMENTS / FETCHES / ACCESSES edges on the graph. The goal is to explain the moving parts — grammars, the provider registry, resolver flavors, and import semantics — well enough that adding a new language is a mechanical exercise.

Fifteen grammars are pinned through packages/ingestion/package.json and loaded by a worker pool that clamps to max(2, min(cpus, 8)) threads. Each file is hashed and the resulting ParseCapture[] is cached keyed on (sha256, grammarSha, SCHEMA_VERSION), so a subsequent analyze with the same content skips tree-sitter entirely.
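The pool sizing and cache behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `workerCount`, `cacheKey`, and `ParseCache` are hypothetical names, the key format is an assumption, and the `force` flag mirrors the documented behavior of `PipelineOptions.force` (skip the lookup, still write a fresh entry).

```typescript
// Hypothetical sketch: all names here are illustrative assumptions.

/** Clamp the worker count to max(2, min(cpus, 8)). */
function workerCount(cpus: number): number {
  return Math.max(2, Math.min(cpus, 8));
}

/** Compose a cache key from the three components named in the text. */
function cacheKey(sha256: string, grammarSha: string, schemaVersion: number): string {
  return `${sha256}:${grammarSha}:${schemaVersion}`;
}

/** In-memory stand-in for the parse cache. */
class ParseCache<T> {
  private readonly entries = new Map<string, T>();

  /** Return the cached value, or parse and store it. `force` skips the lookup only. */
  getOrParse(key: string, parse: () => T, force = false): T {
    if (!force) {
      const hit = this.entries.get(key);
      if (hit !== undefined) return hit;
    }
    const fresh = parse();
    this.entries.set(key, fresh); // forced runs still write fresh entries
    return fresh;
  }
}
```

Because the grammar hash and schema version are part of the key, a grammar bump or a schema change invalidates old entries without any explicit eviction step.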

The default runtime is web-tree-sitter (WASM) on both Node 22 and Node 24. The native tree-sitter N-API addon is opt-in via OCH_NATIVE_PARSER=1 (or --native-parser) on Node 22 dev boxes where it is measurably faster on large repos. Kotlin, Swift, and Dart ship as .wasm blobs vendored at packages/ingestion/vendor/wasms/; rebuild via bash scripts/build-vendor-wasms.sh after a grammar bump.

The complexity-metrics phase still uses native tree-sitter for cyclomatic-complexity counting. On Node 24 (or Node 22 without the native opt-in) it degrades with a one-shot stderr warning; all other parsing continues through the WASM path. ADR docs/adr/0013-parse-runtime-wasm-default.md covers the decision.
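The runtime choice and the one-shot degradation can be pictured with a small sketch. It assumes, based on the two paragraphs above, that the native addon is only selectable on Node 22 with the opt-in set. The function names and the warning text are hypothetical.

```typescript
type ParseRuntime = "wasm" | "native";

// Assumption: native is selectable only on Node 22, and only when
// OCH_NATIVE_PARSER=1 or --native-parser is present.
function selectRuntime(
  nodeMajor: number,
  env: Record<string, string | undefined>,
  argv: readonly string[] = [],
): ParseRuntime {
  const optedIn =
    env["OCH_NATIVE_PARSER"] === "1" || argv.indexOf("--native-parser") !== -1;
  return nodeMajor === 22 && optedIn ? "native" : "wasm";
}

// Tracks whether the degradation warning has already been emitted.
let warned = false;

/** Returns whether complexity metrics can run; warns at most once when they cannot. */
function complexityAvailable(runtime: ParseRuntime, warn: (msg: string) => void): boolean {
  if (runtime === "native") return true;
  if (!warned) {
    warned = true;
    warn("complexity metrics skipped: native tree-sitter not loaded"); // illustrative text
  }
  return false;
}
```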

ParseCapture is the shared per-capture schema emitted by the worker — one interface with 7 readonly fields:

interface ParseCapture {
  readonly tag: string; // e.g. "definition.function"
  readonly text: string;
  readonly startLine: number; // 1-indexed
  readonly endLine: number;
  readonly startCol: number; // 0-indexed
  readonly endCol: number;
  readonly nodeType: string;
}

The tag vocabulary is a clean-room set (definition.*, reference.*, doc, name) that decouples the downstream providers from each grammar’s internal node naming.
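As a rough illustration of why the shared vocabulary helps, a provider can select captures by tag family without knowing any grammar-specific node names. The helpers `tagFamily` and `definitionsOf` are hypothetical, and concrete tags other than `definition.function` would be illustrative only.

```typescript
interface ParseCapture {
  readonly tag: string;
  readonly text: string;
  readonly startLine: number;
  readonly endLine: number;
  readonly startCol: number;
  readonly endCol: number;
  readonly nodeType: string;
}

/** "definition.function" -> "definition"; undotted tags such as "doc" map to themselves. */
function tagFamily(tag: string): string {
  const dot = tag.indexOf(".");
  return dot === -1 ? tag : tag.slice(0, dot);
}

/** Keep only definition.* captures, regardless of the grammar's own nodeType. */
function definitionsOf(captures: readonly ParseCapture[]): ParseCapture[] {
  return captures.filter((c) => tagFamily(c.tag) === "definition");
}
```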

Providers are registered via a compile-time exhaustive table:

export const PROVIDERS = {
  typescript: typescriptProvider,
  tsx: tsxProvider,
  javascript: javascriptProvider,
  python: pythonProvider,
  go: goProvider,
  rust: rustProvider,
  java: javaProvider,
  csharp: csharpProvider,
  c: cProvider,
  cpp: cppProvider,
  ruby: rubyProvider,
  kotlin: kotlinProvider,
  swift: swiftProvider,
  php: phpProvider,
  dart: dartProvider,
} satisfies Record<LanguageId, LanguageProvider>;

The satisfies clause is load-bearing: if LanguageId gains a new member and the table does not, the build fails. getProvider(lang) and listProviders() are the two helpers the pipeline uses to reach providers without hard-coding names.
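A reduced, two-language version shows the mechanism. The `LanguageProvider` shape and the helper signatures below are assumptions made for the demo; only the `satisfies Record<LanguageId, LanguageProvider>` pattern is taken from the table above.

```typescript
// Two-member stand-in for the real LanguageId union.
type LanguageId = "typescript" | "python";

interface LanguageProvider {
  readonly id: LanguageId; // hypothetical field, for the demo only
}

const typescriptProvider: LanguageProvider = { id: "typescript" };
const pythonProvider: LanguageProvider = { id: "python" };

// Adding a member to LanguageId without adding a key here is a compile error,
// because the object would no longer satisfy Record<LanguageId, LanguageProvider>.
const PROVIDERS = {
  typescript: typescriptProvider,
  python: pythonProvider,
} satisfies Record<LanguageId, LanguageProvider>;

function getProvider(lang: LanguageId): LanguageProvider {
  return PROVIDERS[lang];
}

function listProviders(): LanguageProvider[] {
  return (Object.keys(PROVIDERS) as LanguageId[]).map((lang) => PROVIDERS[lang]);
}
```

Unlike a type annotation, `satisfies` checks the table against the record type while keeping the literal key set, so a misspelled or extra key is also rejected.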

Each LanguageProvider exposes six hooks — extractDefinitions, extractCalls, extractImports, extractHeritage, detectOutboundHttp, extractPropertyAccesses — plus configuration fields (importSemantics, mroStrategy, optional resolverStrategyName).

Name resolution in the default walker runs in three tiers. It resolves a reference against three scopes in order, and the scope that matches determines the confidence of the resulting edge:

| Scope         | Confidence |
|---------------|------------|
| Same file     | 0.95       |
| Import-scoped | 0.9        |
| Global        | 0.5        |
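The scope order above can be sketched as a first-match lookup. The confidence values come from the table; the `SymbolTables` shape, the `resolve` function, and the node ID strings are hypothetical.

```typescript
type Scope = "same-file" | "import-scoped" | "global";

interface Resolution {
  readonly target: string;
  readonly scope: Scope;
  readonly confidence: number;
}

// Confidence per scope, as listed in the table.
const CONFIDENCE: Record<Scope, number> = {
  "same-file": 0.95,
  "import-scoped": 0.9,
  "global": 0.5,
};

// Hypothetical lookup tables: symbol name -> node id.
interface SymbolTables {
  readonly sameFile: Map<string, string>;
  readonly imported: Map<string, string>;
  readonly global: Map<string, string>;
}

/** Try each scope in order; the first hit wins and sets the confidence. */
function resolve(name: string, tables: SymbolTables): Resolution | undefined {
  const tiers: Array<[Scope, Map<string, string>]> = [
    ["same-file", tables.sameFile],
    ["import-scoped", tables.imported],
    ["global", tables.global],
  ];
  for (const [scope, table] of tiers) {
    const target = table.get(name);
    if (target !== undefined) {
      return { target, scope, confidence: CONFIDENCE[scope] };
    }
  }
  return undefined;
}
```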

Heritage linearization — which matters when super.foo() can come from any of several bases — is selected per language. Four flavors:

| Strategy           | Languages                        |
|--------------------|----------------------------------|
| c3                 | Python, Kotlin, Dart, C++, Ruby  |
| first-wins         | TypeScript, TSX, JavaScript, Rust |
| single-inheritance | Java, C#, PHP, Swift             |
| none               | Go, C                            |

The STRATEGIES record in providers/resolution/mro.ts is the source of truth; each provider declares mroStrategy: MroStrategyName and the resolver dispatches on it.
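To make the four flavors concrete, here is a minimal sketch of a strategy table keyed by name. The `c3` function is the standard C3 merge. The `first-wins` and `single-inheritance` bodies are one plausible reading of those names (depth-first with first occurrence kept, and following only the first base), not the project's confirmed behavior. The `Bases` shape and the `STRATEGIES` signature are assumptions.

```typescript
type MroStrategyName = "c3" | "first-wins" | "single-inheritance" | "none";
type Bases = Record<string, readonly string[]>; // class -> direct bases, in declaration order

/** Standard C3 linearization: merge the parents' linearizations with the parent list. */
function c3(cls: string, bases: Bases): string[] {
  const parents = bases[cls] ?? [];
  const sequences: string[][] = parents.map((p) => c3(p, bases));
  sequences.push([...parents]);
  const result = [cls];
  while (sequences.some((s) => s.length > 0)) {
    // Pick the first head that does not appear in the tail of any sequence.
    const head = sequences
      .filter((s) => s.length > 0)
      .map((s) => s[0] as string)
      .find((h) => sequences.every((s) => s.indexOf(h, 1) === -1));
    if (head === undefined) throw new Error(`inconsistent hierarchy for ${cls}`);
    result.push(head);
    for (const s of sequences) if (s[0] === head) s.shift();
  }
  return result;
}

/** Assumed reading: depth-first over bases, keeping the first occurrence of each class. */
function firstWins(cls: string, bases: Bases, seen: string[] = []): string[] {
  if (seen.indexOf(cls) !== -1) return seen;
  seen.push(cls);
  for (const b of bases[cls] ?? []) firstWins(b, bases, seen);
  return seen;
}

/** Assumed reading: follow only the first declared base up the chain. */
function singleInheritance(cls: string, bases: Bases): string[] {
  const chain = [cls];
  let current: string | undefined = (bases[cls] ?? [])[0];
  while (current !== undefined && chain.indexOf(current) === -1) {
    chain.push(current);
    current = (bases[current] ?? [])[0];
  }
  return chain;
}

const STRATEGIES: Record<MroStrategyName, (cls: string, bases: Bases) => string[]> = {
  "c3": c3,
  "first-wins": (cls, bases) => firstWins(cls, bases),
  "single-inheritance": singleInheritance,
  "none": (cls) => [cls],
};

/** Dispatch on the strategy name a provider declares. */
function linearize(strategy: MroStrategyName, cls: string, bases: Bases): string[] {
  return STRATEGIES[strategy](cls, bases);
}
```

On a diamond (`C` extends `A` and `B`, both extending `O`), C3 places `O` after both `A` and `B`, while the depth-first reading of first-wins reaches `O` through `A` before visiting `B`.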

The provider contract enforces one of three import semantics:

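| Value            | What it means                                          | Example languages         |
|------------------|--------------------------------------------------------|---------------------------|
| named            | Imports bring specific names into scope.               | TS/TSX/JS, Rust, Java, C# |
| namespace        | Imports bring a namespace; members accessed via dot.   | Python                    |
| package-wildcard | Whole package is re-exported as one bag.               | Go, Kotlin                |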

The package-wildcard value has a concrete consequence: the resolver does not chase cross-module names through the import, because the package re-exports everything and the exact origin file is undecidable from the import site alone. Go’s import "fmt" followed by fmt.Println does not tell the resolver which file inside fmt defines Println; the SCIP augmenter fills that in when present.
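The consequence for the resolver can be sketched as an early return. The three semantics values come from the table. The `ImportSite` shape, the export index, and `originCandidate` are hypothetical, and the real resolver's handling of `named` and `namespace` imports may differ from this simplified lookup.

```typescript
type ImportSemantics = "named" | "namespace" | "package-wildcard";

interface ImportSite {
  readonly semantics: ImportSemantics;
  readonly module: string; // e.g. "fmt" or "./util"
  readonly name: string;   // the member being referenced
}

// Hypothetical index: module -> (exported name -> defining file).
type ExportIndex = Map<string, Map<string, string>>;

/**
 * Return the file to chase into, or undefined when the origin cannot be
 * decided from the import site alone (package-wildcard).
 */
function originCandidate(site: ImportSite, index: ExportIndex): string | undefined {
  if (site.semantics === "package-wildcard") {
    return undefined; // left to the SCIP augmenter when present
  }
  const exportsOfModule = index.get(site.module);
  return exportsOfModule === undefined ? undefined : exportsOfModule.get(site.name);
}
```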

Parse emits seven edge types directly (DEFINES, HAS_METHOD, HAS_PROPERTY, IMPORTS, EXTENDS, IMPLEMENTS, CALLS). Two more edge types come from later dedicated phases:

  • ACCESSES (read/write): emitted by the accesses phase from extractPropertyAccesses captures. When no matching field is found, a synthetic Property:unresolved:<name> stub anchors the edge rather than dropping it. This anchoring is intentional, not a bug.
  • FETCHES: emitted by the fetches phase from detectOutboundHttp captures. When no local Route matches the URL pattern, the edge targets a fetches:unresolved:<id> pseudo-node, which group_contracts recognizes for cross-repo contract mapping.
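The fallback targets for both phases can be sketched as follows. The two ID prefixes are taken from the text. The function names, the lookup maps, and the example node IDs are hypothetical.

```typescript
const UNRESOLVED_PROPERTY_PREFIX = "Property:unresolved:";
const UNRESOLVED_FETCH_PREFIX = "fetches:unresolved:";

/** Target of an ACCESSES edge: the matching field, or a synthetic stub. */
function accessTarget(name: string, fields: ReadonlyMap<string, string>): string {
  return fields.get(name) ?? `${UNRESOLVED_PROPERTY_PREFIX}${name}`;
}

/** Target of a FETCHES edge: the matching local route, or a pseudo-node. */
function fetchTarget(id: string, urlPattern: string, routes: ReadonlyMap<string, string>): string {
  return routes.get(urlPattern) ?? `${UNRESOLVED_FETCH_PREFIX}${id}`;
}

/** True for either kind of synthetic target; useful when filtering query results. */
function isUnresolvedTarget(nodeId: string): boolean {
  return (
    nodeId.indexOf(UNRESOLVED_PROPERTY_PREFIX) === 0 ||
    nodeId.indexOf(UNRESOLVED_FETCH_PREFIX) === 0
  );
}
```

Keeping the edge with a stub target preserves the fact that an access or request happened, so a later phase or a cross-repo pass can still attach it to a real node.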

Four providers opt into the stack-graphs resolver by setting resolverStrategyName: "stack-graphs":

| Provider   | Gain over the default resolver       |
|------------|--------------------------------------|
| typescript | Tighter cross-file lookup            |
| tsx        | Same as typescript                   |
| javascript | Same as typescript                   |
| python     | Attribute resolution across modules  |

Stack-graphs adds incremental, precise name-binding over the heuristic three-tier walker — it models scope, inheritance, and imports as a graph whose path-finding produces a deterministic binding. The other 11 providers fall back to the default walker, which is cheaper and good enough given that SCIP is expected to augment the compiled languages.

Stack-graphs-enabled providers route through the stackGraphsRouter side of getResolver() instead of the default walker; the rest of the pipeline is unchanged.
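A minimal sketch of that routing, assuming `getResolver()` takes the provider and inspects its optional `resolverStrategyName`. The `Resolver` shape and both resolver objects are placeholders, not the real implementations.

```typescript
type ResolverStrategyName = "stack-graphs";

interface ProviderConfig {
  readonly resolverStrategyName?: ResolverStrategyName;
}

// Placeholder resolver shape; the real interface is not shown in this page.
interface Resolver {
  readonly kind: "default-walker" | "stack-graphs";
}

const defaultWalker: Resolver = { kind: "default-walker" };
const stackGraphsRouter: Resolver = { kind: "stack-graphs" };

/** Providers that opt in get the stack-graphs router; everyone else gets the walker. */
function getResolver(provider: ProviderConfig): Resolver {
  return provider.resolverStrategyName === "stack-graphs"
    ? stackGraphsRouter
    : defaultWalker;
}
```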

  • Properties without a matching field produce synthetic Property:unresolved:<name> stubs, not dropped edges. Queries that BM25-rank over node IDs will see these stubs compete with real symbols. See the durable lesson linked below.
  • FETCHES without a local route emit to fetches:unresolved:<id> pseudo-targets. These are recognized by group_contracts when fanning out cross-repo contract analysis.
  • DEBUG_PHASE_MEM=1 brackets graphHash with stderr telemetry for memory profiling.
  • PipelineOptions.force bypasses parse-cache lookups (still writes fresh entries). Useful for debugging but not day-to-day.
  • Adding a language provider — the step-by-step contract for adding a 16th language.
  • SCIP reconciliation — how compiler-grade edges demote heuristic ones.
  • Durable lesson: conventions/bm25-over-node-id-favors-stubs.md — why BM25 over node IDs needs to be gated against unresolved stubs.