Indexing a repo

codehub analyze is the full indexing pipeline: parse with tree-sitter (and SCIP for every language with a pinned indexer — TypeScript, Python, Go, Rust, Java, C#, C/C++, Kotlin, Ruby), resolve imports and inheritance, detect processes and clusters, build BM25 and HNSW indexes, and write everything to .codehub/ under the repo root.

The default backend uses LadybugDB for the graph store and a sibling DuckDB file for temporal data. Set CODEHUB_STORE=duck to force the single-file DuckDB layout. See Storage backend.

index the current repo
codehub analyze

Re-run after significant changes. A no-op short-circuit skips work if the index already matches HEAD; pass --force to rebuild.

full index with embeddings
codehub analyze --embeddings

--embeddings computes symbol and optional file/community vectors and writes them to the HNSW index. After this, codehub query fuses BM25 and vector results via reciprocal-rank fusion (RRF).
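To make the fusion step concrete, here is a minimal sketch of reciprocal-rank fusion over two ranked result lists. It illustrates the general RRF technique, not codehub's implementation: the function name `rrf_fuse`, the symbol names, and the constant `k=60` (a value commonly used for RRF) are assumptions, and the constant codehub uses is not stated here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with reciprocal-rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, conventional value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists, best hit first.
bm25_hits = ["parse_file", "load_config", "resolve_import"]
vector_hits = ["resolve_import", "parse_file", "build_graph"]

fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
```

Because RRF uses only ranks, the BM25 and vector scores never need to be put on a common scale. A symbol that ranks well in both lists rises above one that appears in only one of them.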

On memory-constrained machines, use --embeddings-int8 for quantised vectors, --embeddings-workers auto to tune the worker pool, or --embeddings-batch-size (default 32) to tune batch throughput.

offline mode — no sockets
codehub analyze --offline

--offline disables every code path that would open a socket. Combine with cached embedder weights (see codehub setup --embeddings --model-dir <path>) to index fully air-gapped.

check index freshness
codehub status

status compares the index against the working tree and reports staleness. MCP responses also carry an envelope field _meta["codehub/staleness"] whenever the index lags HEAD, so agents can detect drift without polling.
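An agent-side check for that envelope field might look like the sketch below. Only the key `_meta["codehub/staleness"]` comes from the text above. The rest of the response shape, the payload contents, and the helper name `is_stale` are hypothetical.

```python
STALENESS_KEY = "codehub/staleness"


def is_stale(response):
    """Return True when an MCP response carries the staleness field.

    The field is described as present whenever the index lags HEAD,
    so its presence alone is treated as the signal here. The payload
    format is not assumed.
    """
    meta = response.get("_meta") or {}
    return STALENESS_KEY in meta


# Hypothetical responses; the payload value is a placeholder.
fresh = {"content": [], "_meta": {}}
lagging = {"content": [], "_meta": {STALENESS_KEY: {"note": "placeholder"}}}

print(is_stale(fresh), is_stale(lagging))
```

An agent could use a check like this to decide when to suggest re-running codehub analyze.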

delete the .codehub/ directory
codehub clean

codehub clean --all deletes every index registered on the machine and wipes ~/.codehub/registry.json.

index at symbol, file, and community level
codehub analyze --granularity symbol,file,community

The pipeline produces hierarchical embeddings so a single query can surface a symbol, the file that contains it, and the community the symbol participates in. The default granularity is symbol.

The contents of .codehub/ depend on the storage backend selected at index time. On the default LadybugDB layout:

| Path | Purpose |
| --- | --- |
| graph.lbug | LadybugDB graph store: symbols, edges, embeddings, BM25 + HNSW indexes. |
| temporal.duckdb | DuckDB sibling: cochanges, symbol-summary cache. |
| meta.json | Index metadata (graph hash, node counts, CLI version, toolchain pins, embedder modelId). |
| scan.sarif | SARIF scan output when codehub scan has run. |
| sbom.cyclonedx.json / sbom.spdx.json | SBOMs when codehub analyze --sbom has run. |

On the single-file DuckDB fallback, graph.duckdb replaces both graph.lbug and temporal.duckdb.
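A script that needs to know which layout an existing index uses could infer it from the files present. This is a rough sketch based only on the file names listed above. The helper name `detect_layout` and the labels it returns are hypothetical, not part of codehub.

```python
from pathlib import Path


def detect_layout(index_dir):
    """Guess the storage layout of a .codehub/ directory from its files.

    Returns "ladybug" when graph.lbug exists, "duckdb" when only
    graph.duckdb exists, and None when neither file is found.
    The labels are illustrative, not values codehub itself reports.
    """
    index_dir = Path(index_dir)
    if (index_dir / "graph.lbug").exists():
        return "ladybug"
    if (index_dir / "graph.duckdb").exists():
        return "duckdb"
    return None


# Inspect the index under the current repo root, if any.
print(detect_layout(".codehub"))
```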

  • --sbom — emit a CycloneDX SBOM alongside the index.
  • --coverage — bridge coverage data into the graph.
  • --summaries / --no-summaries — LLM-generated symbol summaries (default on; capped by --max-summaries, default auto = 10% of callables, hard cap 500).
  • --skills — generate Claude Code skills from the graph.
  • --native-parser — opt into the native tree-sitter N-API addon on Node 22 (the default runtime is web-tree-sitter / WASM).
  • --strict-detectors — fail the build if a detector (DET-O-001) regresses.
  • --verbose — noisier logs.
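As a worked example of the --max-summaries auto default described in the list above (10% of callables, hard cap 500), the arithmetic can be sketched as follows. The function name `auto_max_summaries` is hypothetical, and rounding down is an assumption. The text does not say how fractional results are rounded.

```python
HARD_CAP = 500  # hard cap stated for --max-summaries


def auto_max_summaries(callable_count):
    """Estimate the auto summary budget: 10% of callables, capped at 500.

    Integer division (rounding down) is an assumption of this sketch.
    """
    return min(callable_count // 10, HARD_CAP)


for count in (80, 1200, 10000):
    print(count, auto_max_summaries(count))
```

Under these assumptions, a repo with 1,200 callables gets a budget of 120 summaries, while any repo with 5,000 or more callables hits the cap of 500.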

See CLI reference: analyze for the complete flag list.