Indexing a repo
codehub analyze is the full indexing pipeline: parse with
tree-sitter (and SCIP for every language with a pinned indexer —
TypeScript, Python, Go, Rust, Java, C#, C/C++, Kotlin, Ruby), resolve
imports and inheritance, detect processes and clusters, build BM25
and HNSW indexes, and write everything to .codehub/ under the repo
root.
The graph half is always LadybugDB (.codehub/graph.lbug) and the
temporal sibling is always DuckDB (.codehub/temporal.duckdb). Both
files are written on every analyze — there is no backend knob and no
single-file fallback. See Storage backend.
Basic indexing
    codehub analyze

Re-run after significant changes. A no-op short-circuit skips work if
the index already matches HEAD; pass --force to rebuild.
Add semantic vectors
    codehub analyze --embeddings

--embeddings computes symbol and optional file/community vectors and
writes them to the HNSW index. After this, codehub query fuses BM25
and vector results via reciprocal-rank fusion (RRF).
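
As a rough illustration of the fusion step, here is a minimal sketch of reciprocal-rank fusion over two ranked result lists. The function name, the symbol IDs, and the constant k = 60 (a commonly used default for RRF) are assumptions for illustration; the source does not state which constant or tie-breaking rule codehub query uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs; each item scores sum(1 / (k + rank)).

    k=60 is an assumed constant, not a documented codehub setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical symbol IDs returned by the two retrievers.
bm25_hits = ["parse_file", "load_config", "resolve_imports"]
vector_hits = ["resolve_imports", "parse_file", "build_graph"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Items that rank well in both lists rise to the top, while items found by only one retriever still appear further down the fused list.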
Memory-constrained machines can use --embeddings-int8 for quantised
vectors, --embeddings-workers auto to tune the worker pool, or
--embeddings-batch-size 32 (default) to tune batch throughput.
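
To give a feel for what quantised vectors trade away, the sketch below applies symmetric per-vector int8 quantisation in plain Python. This is one common scheme, shown under the assumption that it resembles what --embeddings-int8 does; the source does not describe the actual quantisation method, and the function names are hypothetical.

```python
def quantize_int8(vector):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max((abs(x) for x in vector), default=0.0)
    # Avoid dividing by zero for an all-zero (or empty) vector.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


original = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(original)
restored = dequantize_int8(codes, scale)
```

Each component is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per component.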
Zero-network indexing
    codehub analyze --offline

--offline disables every code path that would open a socket. Combine
with cached embedder weights (see codehub setup --embeddings --model-dir <path>) to index fully air-gapped.
Staleness and status
    codehub status

status compares the index against the working tree and reports
staleness. MCP responses also carry an envelope field
_meta["codehub/staleness"] whenever the index lags HEAD, so agents
can detect drift without polling.
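
An agent-side check for that envelope field might look like the following sketch. It assumes the MCP response has already been decoded into a Python dict; the helper name and the example payload are hypothetical, since the source names the key but not the shape of its value.

```python
STALENESS_KEY = "codehub/staleness"


def staleness_of(response):
    """Return the staleness entry from a decoded MCP response, or None."""
    meta = response.get("_meta") or {}
    return meta.get(STALENESS_KEY)


# Hypothetical responses; the payload under the key is an assumed shape.
fresh = {"content": [], "_meta": {}}
stale = {"content": [], "_meta": {STALENESS_KEY: {"note": "hypothetical payload"}}}

for name, response in (("fresh", fresh), ("stale", stale)):
    if staleness_of(response) is not None:
        print(f"{name}: index lags HEAD, consider re-running codehub analyze")
    else:
        print(f"{name}: no staleness reported")
```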
Resetting the index
    codehub clean

codehub clean --all deletes every index registered on the machine and
wipes ~/.codehub/registry.json.
Granularity
    codehub analyze --granularity symbol,file,community

The pipeline produces hierarchical embeddings so a single query can
surface a symbol, the file that contains it, and the community the
symbol participates in. The default granularity is symbol.
What lives in .codehub/
Every index writes the same two-file layout (LadybugDB for the graph, DuckDB for the temporal sibling):
| Path | Purpose |
|---|---|
| graph.lbug | LadybugDB graph store: symbols, edges, embeddings, BM25 + HNSW indexes. |
| temporal.duckdb | DuckDB sibling: cochanges, symbol-summary cache. |
| meta.json | Index metadata (graph hash, node counts, CLI version, toolchain pins, embedder modelId). |
| scan.sarif | SARIF scan output when codehub scan has run. |
| sbom.cyclonedx.json / sbom.spdx.json | SBOMs when codehub analyze --sbom has run. |
What runs by default
A bare codehub analyze produces a production-grade .codehub/ folder
in one command:
- Graph pipeline (tree-sitter parse + SCIP resolution + communities + processes + cochanges + ownership + dependencies + detectors).
- SBOM emission (CycloneDX + SPDX): default on; suppress with --no-sbom.
- Priority-1 scanners → .codehub/scan.sarif + findings ingested into the graph: default on; suppress with --no-scan. Network-backed scanners (osv-scanner, grype, npm/pip audit) self-skip under --offline, so the on-default stays honest.
- Coverage overlay: default auto, which runs only when a report is present at coverage/lcov.info, lcov.info, coverage.xml, build/reports/jacoco/test/jacocoTestReport.xml, or coverage.json. Silent no-op otherwise. Force with --coverage; force off with --no-coverage.
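
The coverage auto-detection described above can be sketched as a probe over the listed report paths. The paths come from the list above; the function name and the assumption that the first match in list order wins are illustrative only, since the source does not state a precedence.

```python
from pathlib import Path

# Report locations listed for the coverage overlay, relative to the repo root.
COVERAGE_CANDIDATES = (
    "coverage/lcov.info",
    "lcov.info",
    "coverage.xml",
    "build/reports/jacoco/test/jacocoTestReport.xml",
    "coverage.json",
)


def find_coverage_report(repo_root):
    """Return the first existing candidate report, or None (silent no-op)."""
    root = Path(repo_root)
    for relative in COVERAGE_CANDIDATES:
        candidate = root / relative
        if candidate.is_file():
            return candidate
    return None
```

Calling find_coverage_report(".") from a repo root returns a Path when one of the reports exists and None otherwise, which mirrors the "runs only when a report is present" behaviour.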
Everything else — embeddings, summaries, skills — is opt-in.
Opt-in flags
- --embeddings: compute semantic vectors for queries by meaning. Requires codehub setup --embeddings first.
- --summaries / --no-summaries: LLM-generated symbol summaries (default off; codehub analyze is fast, local, deterministic by default; opt in with --summaries or CODEHUB_BEDROCK_SUMMARIES=1). When enabled, the budget is capped by --max-summaries, default auto = 10% of callables, hard cap 500.
- --skills: generate Claude Code skills from the graph.
- --strict-detectors: fail the build if a detector (DET-O-001) regresses.
- --verbose: noisier logs.
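
The auto budget for --max-summaries (10% of callables, hard cap 500) works out as in the sketch below. The function name is hypothetical, and rounding the 10% figure up is an assumption; the source does not say how fractional results are rounded.

```python
def auto_summary_budget(callable_count, hard_cap=500):
    """Return 10% of callables (rounded up, by assumption), capped at hard_cap."""
    # Integer ceiling of callable_count / 10 avoids floating-point surprises.
    ten_percent = (callable_count + 9) // 10
    return min(ten_percent, hard_cap)


for count in (25, 1200, 10000):
    print(count, "callables ->", auto_summary_budget(count), "summaries")
```

Under these assumptions a repo with 1,200 callables gets a budget of 120 summaries, and anything above 5,000 callables hits the 500 cap.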
See CLI reference: analyze for the complete flag list.