Indexing a repo

codehub analyze is the full indexing pipeline: parse with tree-sitter (and SCIP for every language with a pinned indexer — TypeScript, Python, Go, Rust, Java, C#, C/C++, Kotlin, Ruby), resolve imports and inheritance, detect processes and clusters, build BM25 and HNSW indexes, and write everything to .codehub/ under the repo root.

The graph half is always LadybugDB (.codehub/graph.lbug) and the temporal sibling is always DuckDB (.codehub/temporal.duckdb). Both files are written on every analyze — there is no backend knob and no single-file fallback. See Storage backend.

Basic indexing

codehub analyze

Re-run after significant changes. A no-op short-circuit skips work if the index already matches HEAD; pass --force to rebuild.

Add semantic vectors

codehub analyze --embeddings

--embeddings computes symbol vectors, plus optional file and community vectors, and writes them to the HNSW index. After this, codehub query fuses BM25 and vector results via reciprocal-rank fusion (RRF).

On memory-constrained machines, use --embeddings-int8 for quantised vectors, --embeddings-workers auto to tune the worker pool, or --embeddings-batch-size (default 32) to tune batch throughput.
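To make the fusion step concrete, here is a minimal sketch of reciprocal-rank fusion over two ranked lists. It illustrates the general technique only: the function name, the sample symbol names, and the constant k=60 (the value commonly used in the RRF literature) are assumptions, not codehub's actual implementation or settings.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of IDs with reciprocal-rank fusion.

    Each item scores 1 / (k + rank) per list it appears in (rank starts at 1).
    k=60 is an assumed constant, not a documented codehub setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical result lists from a lexical and a vector search.
bm25_hits = ["parse_file", "resolve_imports", "build_index"]
vector_hits = ["resolve_imports", "load_graph", "parse_file"]

fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
```

Items that rank well in both lists rise to the top, while an item found by only one retriever still appears lower in the fused order.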

Zero-network indexing

codehub analyze --offline

--offline disables every code path that would open a socket. Combine with cached embedder weights (see codehub setup --embeddings --model-dir <path>) to index fully air-gapped.

Staleness and status

codehub status

status compares the index against the working tree and reports staleness. MCP responses also carry an envelope field _meta["codehub/staleness"] whenever the index lags HEAD, so agents can detect drift without polling.
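An agent consuming MCP responses could check for the staleness field along these lines. This is a sketch under assumptions: the response is treated as an already-parsed dict, and the payload stored under the key is hypothetical, since only the key name _meta["codehub/staleness"] is documented here.

```python
STALENESS_KEY = "codehub/staleness"


def staleness_info(response):
    """Return the staleness payload from a parsed MCP response, or None.

    Only the presence of the key is relied on; the payload shape is
    not assumed.
    """
    meta = response.get("_meta") or {}
    return meta.get(STALENESS_KEY)


# Hypothetical responses: one from a fresh index, one from a stale index.
fresh = {"content": [], "_meta": {}}
stale = {"content": [], "_meta": {STALENESS_KEY: {"note": "hypothetical payload"}}}

for name, response in (("fresh", fresh), ("stale", stale)):
    if staleness_info(response) is None:
        print(f"{name}: index is up to date")
    else:
        print(f"{name}: index lags HEAD, consider re-running codehub analyze")
```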

Resetting the index

codehub clean

codehub clean --all deletes every index registered on the machine and wipes ~/.codehub/registry.json.

Granularity

codehub analyze --granularity symbol,file,community

The pipeline produces hierarchical embeddings so a single query can surface a symbol, the file that contains it, and the community the symbol participates in. The default granularity is symbol.

What lives in `.codehub/`

Every index writes the same two database files, LadybugDB for the graph and DuckDB for the temporal sibling, alongside metadata and any scan or SBOM outputs:

| Path | Purpose |
| --- | --- |
| `graph.lbug` | LadybugDB graph store: symbols, edges, embeddings, BM25 + HNSW indexes. |
| `temporal.duckdb` | DuckDB sibling: cochanges, symbol-summary cache. |
| `meta.json` | Index metadata (graph hash, node counts, CLI version, toolchain pins, embedder modelId). |
| `scan.sarif` | SARIF output from the scanners run by `codehub analyze` (on by default; suppress with `--no-scan`) or by `codehub scan`. |
| `sbom.cyclonedx.json` / `sbom.spdx.json` | SBOMs emitted by `codehub analyze` (on by default; suppress with `--no-sbom`). |
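A quick way to confirm that an index directory matches the layout above is to check for each file. The sketch below is illustrative only: the helper names are hypothetical, and treating meta.json as a required file alongside the two databases is an assumption.

```python
from pathlib import Path

# File names taken from the layout above. The core/optional split is an assumption.
CORE_FILES = ("graph.lbug", "temporal.duckdb", "meta.json")
OPTIONAL_FILES = ("scan.sarif", "sbom.cyclonedx.json", "sbom.spdx.json")


def inspect_index(repo_root):
    """Map each documented file name to whether it exists under .codehub/."""
    index_dir = Path(repo_root) / ".codehub"
    return {name: (index_dir / name).is_file() for name in CORE_FILES + OPTIONAL_FILES}


def missing_core_files(repo_root):
    """List the core files that are absent, in the order given above."""
    present = inspect_index(repo_root)
    return [name for name in CORE_FILES if not present[name]]


if __name__ == "__main__":
    missing = missing_core_files(".")
    print("index looks complete" if not missing else f"missing: {missing}")
```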

What runs by default

A bare codehub analyze produces a production-grade .codehub/ folder in one command:

- Graph pipeline (tree-sitter parse + SCIP resolution + communities + processes + cochanges + ownership + dependencies + detectors).
- SBOM emission (CycloneDX + SPDX): on by default; suppress with --no-sbom.
- Priority-1 scanners write .codehub/scan.sarif and ingest their findings into the graph: on by default; suppress with --no-scan. Network-backed scanners (osv-scanner, grype, npm/pip audit) skip themselves under --offline, so leaving scans on by default never opens a network connection in offline mode.
- Coverage overlay: auto by default. It runs only when a report is present at coverage/lcov.info, lcov.info, coverage.xml, build/reports/jacoco/test/jacocoTestReport.xml, or coverage.json, and is a silent no-op otherwise. Force it on with --coverage or off with --no-coverage.

Everything else — embeddings, summaries, skills — is opt-in.
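The coverage auto-detection described above can be sketched as a simple path probe. The paths come from the list above; the function name, the assumption that paths are resolved relative to the repo root, and the first-match-wins order are illustrative guesses rather than documented behaviour.

```python
from pathlib import Path

# Report locations listed above, probed in the order they are documented.
COVERAGE_CANDIDATES = (
    "coverage/lcov.info",
    "lcov.info",
    "coverage.xml",
    "build/reports/jacoco/test/jacocoTestReport.xml",
    "coverage.json",
)


def find_coverage_report(repo_root):
    """Return the first existing coverage report path, or None if there is none."""
    root = Path(repo_root)
    for relative in COVERAGE_CANDIDATES:
        candidate = root / relative
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    report = find_coverage_report(".")
    print(f"coverage overlay would use {report}" if report else "no report found, overlay skipped")
```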

Opt-in flags

- --embeddings: compute semantic vectors so queries can match by meaning. Requires codehub setup --embeddings first.
- --summaries / --no-summaries: LLM-generated symbol summaries. Off by default, which keeps codehub analyze fast, local, and deterministic; opt in with --summaries or CODEHUB_BEDROCK_SUMMARIES=1. When enabled, the budget is capped by --max-summaries, whose default, auto, is 10% of callables with a hard cap of 500.
- --skills: generate Claude Code skills from the graph.
- --strict-detectors: fail the build if a detector (DET-O-001) regresses.
- --verbose: emit more detailed logs.
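As a worked example of the auto summary budget described above (10% of callables, hard cap 500), the arithmetic looks roughly like this. The function name is hypothetical, and rounding down is an assumption; the actual rounding rule is not stated here.

```python
HARD_CAP = 500  # hard cap on the auto budget, as stated above


def auto_summary_budget(callable_count):
    """Estimate the auto budget: 10% of callables, capped at HARD_CAP.

    Integer division (rounding down) is an assumption.
    """
    return min(HARD_CAP, callable_count // 10)


for count in (5, 1200, 20000):
    print(f"{count} callables -> up to {auto_summary_budget(count)} summaries")
```

Under these assumptions, a repo with 1,200 callables would get up to 120 summaries, while any repo with 5,000 or more callables would hit the cap of 500.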

See CLI reference: analyze for the complete flag list.