Indexing a repo
codehub analyze is the full indexing pipeline: parse with
tree-sitter (and SCIP for every language with a pinned indexer —
TypeScript, Python, Go, Rust, Java, C#, C/C++, Kotlin, Ruby), resolve
imports and inheritance, detect processes and clusters, build BM25
and HNSW indexes, and write everything to .codehub/ under the repo
root.
The default backend is LadybugDB for the graph half +
DuckDB for the temporal sibling. Set CODEHUB_STORE=duck to
force the single-file DuckDB layout. See
Storage backend.
Basic indexing
Section titled “Basic indexing”codehub analyzeRe-run after significant changes. A no-op short-circuit skips work if
the index already matches HEAD; pass --force to rebuild.
Add semantic vectors
Section titled “Add semantic vectors”codehub analyze --embeddings--embeddings computes symbol and optional file/community vectors and
writes them to the HNSW index. After this, codehub query fuses BM25
and vector results via reciprocal-rank fusion (RRF).
Memory-constrained machines can use --embeddings-int8 for quantised
vectors, --embeddings-workers auto to tune the worker pool, or
--embeddings-batch-size 32 (default) to tune batch throughput.
Zero-network indexing
Section titled “Zero-network indexing”codehub analyze --offline--offline disables every code path that would open a socket. Combine
with cached embedder weights (see codehub setup --embeddings --model-dir <path>) to index fully air-gapped.
Staleness and status
Section titled “Staleness and status”codehub statusstatus compares the index against the working tree and reports
staleness. MCP responses also carry an envelope field
_meta["codehub/staleness"] whenever the index lags HEAD, so agents
can detect drift without polling.
Resetting the index
Section titled “Resetting the index”codehub cleancodehub clean --all deletes every index registered on the machine and
wipes ~/.codehub/registry.json.
Granularity
Section titled “Granularity”codehub analyze --granularity symbol,file,communityThe pipeline produces hierarchical embeddings so a single query can
surface a symbol, the file that contains it, and the community the
symbol participates in. The default granularity is symbol.
What lives in .codehub/
Section titled “What lives in .codehub/”The contents depend on the storage backend selected at index time. On the default LadybugDB layout:
| Path | Purpose |
|---|---|
graph.lbug | LadybugDB graph store — symbols, edges, embeddings, BM25 + HNSW indexes. |
temporal.duckdb | DuckDB sibling — cochanges, symbol-summary cache. |
meta.json | Index metadata (graph hash, node counts, CLI version, toolchain pins, embedder modelId). |
scan.sarif | SARIF scan output when codehub scan has run. |
sbom.cyclonedx.json / sbom.spdx.json | SBOMs when codehub analyze --sbom has run. |
On the single-file DuckDB fallback, graph.duckdb replaces both
graph.lbug and temporal.duckdb.
Other useful flags
Section titled “Other useful flags”--sbom— emit a CycloneDX SBOM alongside the index.--coverage— bridge coverage data into the graph.--summaries/--no-summaries— LLM-generated symbol summaries (default on; capped by--max-summaries, default auto = 10% of callables, hard cap 500).--skills— generate Claude Code skills from the graph.--native-parser— opt into the native tree-sitter N-API addon on Node 22 (the default runtime isweb-tree-sitter/ WASM).--strict-detectors— fail the build if a detector (DET-O-001) regresses.--verbose— noisier logs.
See CLI reference: analyze for the complete flag list.