
Indexing a repo

codehub analyze is the full indexing pipeline: parse with tree-sitter (and SCIP for every language with a pinned indexer — TypeScript, Python, Go, Rust, Java, C#, C/C++, Kotlin, Ruby), resolve imports and inheritance, detect processes and clusters, build BM25 and HNSW indexes, and write everything to .codehub/ under the repo root.

The graph half is always LadybugDB (.codehub/graph.lbug) and the temporal sibling is always DuckDB (.codehub/temporal.duckdb). Both files are written on every analyze — there is no backend knob and no single-file fallback. See Storage backend.

index the current repo
codehub analyze

Re-run after significant changes. A no-op short-circuit skips work if the index already matches HEAD; pass --force to rebuild.

full index with embeddings
codehub analyze --embeddings

--embeddings computes symbol and optional file/community vectors and writes them to the HNSW index. After this, codehub query fuses BM25 and vector results via reciprocal-rank fusion (RRF).
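For intuition, reciprocal-rank fusion can be sketched in a few lines. This is a generic illustration of RRF, not codehub's actual implementation: the function name `rrf_fuse`, the constant `k=60` (a commonly used value), and the sample symbol names are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with reciprocal-rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common choice, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from a BM25 search and a vector search.
bm25_hits = ["parse_config", "load_config", "Config"]
vector_hits = ["load_config", "read_settings", "parse_config"]

fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
```

An id that ranks well in both lists beats one that tops only a single list, which is why fusion tends to help when the keyword and vector rankings disagree.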

Memory-constrained machines can use --embeddings-int8 for quantised vectors, --embeddings-workers auto to tune the worker pool, or --embeddings-batch-size (default 32) to tune batch throughput.

offline mode — no sockets
codehub analyze --offline

--offline disables every code path that would open a socket. Combine with cached embedder weights (see codehub setup --embeddings --model-dir <path>) to index fully air-gapped.

check index freshness
codehub status

status compares the index against the working tree and reports staleness. MCP responses also carry an envelope field _meta["codehub/staleness"] whenever the index lags HEAD, so agents can detect drift without polling.
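An agent-side check for that envelope field might look like the sketch below. It assumes the MCP response has already been decoded into a Python dict and relies only on the field being present when the index lags HEAD; the helper name `is_stale` and the sample payloads are hypothetical, and the shape of the field's value is not specified here.

```python
STALENESS_KEY = "codehub/staleness"


def is_stale(response):
    """Return True when a decoded MCP response carries the staleness field.

    Only the presence of the key is checked; the structure of its value
    is not assumed.
    """
    meta = response.get("_meta") or {}
    return STALENESS_KEY in meta


# Hypothetical decoded responses, for illustration only.
fresh_response = {"content": []}
stale_response = {
    "content": [],
    "_meta": {STALENESS_KEY: {"note": "hypothetical payload"}},
}

print(is_stale(fresh_response), is_stale(stale_response))
```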

delete the .codehub/ directory
codehub clean

codehub clean --all deletes every index registered on the machine and wipes ~/.codehub/registry.json.

index at symbol, file, and community level
codehub analyze --granularity symbol,file,community

The pipeline produces hierarchical embeddings so a single query can surface a symbol, the file that contains it, and the community the symbol participates in. The default granularity is symbol.

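Every index writes the same layout under .codehub/. The two stores are always present (LadybugDB for the graph, DuckDB for the temporal sibling); the remaining files hold index metadata and the output of scan and SBOM steps: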

| Path | Purpose |
| --- | --- |
| graph.lbug | LadybugDB graph store: symbols, edges, embeddings, BM25 + HNSW indexes. |
| temporal.duckdb | DuckDB sibling: cochanges, symbol-summary cache. |
| meta.json | Index metadata (graph hash, node counts, CLI version, toolchain pins, embedder modelId). |
| scan.sarif | SARIF scan output when codehub scan has run. |
| sbom.cyclonedx.json / sbom.spdx.json | SBOMs when codehub analyze --sbom has run. |
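A script that depends on the index could verify the two stores before proceeding. The sketch below uses only the file names listed above; the helper name `missing_index_files` is hypothetical, and the check looks at file presence only, not at whether the index is fresh.

```python
from pathlib import Path

# The two stores that every analyze run writes.
REQUIRED_FILES = ("graph.lbug", "temporal.duckdb")


def missing_index_files(repo_root):
    """Return the required index files that are absent under .codehub/."""
    index_dir = Path(repo_root) / ".codehub"
    return [name for name in REQUIRED_FILES if not (index_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_index_files(".")
    if missing:
        print("index incomplete, missing:", ", ".join(missing))
    else:
        print("both stores present")
```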

A bare codehub analyze produces a production-grade .codehub/ folder in one command:

  • Graph pipeline (tree-sitter parse + SCIP resolution + communities + processes + cochanges + ownership + dependencies + detectors).
  • SBOM emission (CycloneDX + SPDX) — default on; suppress with --no-sbom.
  • Priority-1 scanners → .codehub/scan.sarif + findings ingested into the graph — default on; suppress with --no-scan. Network-backed scanners (osv-scanner, grype, npm/pip audit) skip themselves under --offline, so leaving scanning on by default does not open a socket in offline mode.
  • Coverage overlay — default auto: runs only when a report is present at coverage/lcov.info, lcov.info, coverage.xml, build/reports/jacoco/test/jacocoTestReport.xml, or coverage.json. Silent no-op otherwise. Force with --coverage; force off with --no-coverage.
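The coverage auto-detection in the list above can be pictured as a simple path probe. This is a minimal sketch, not codehub's code: the helper name `find_coverage_report` is hypothetical, and checking the paths in the order they are listed (first match wins) is an assumption.

```python
from pathlib import Path

# Report locations named in the coverage overlay description.
COVERAGE_CANDIDATES = (
    "coverage/lcov.info",
    "lcov.info",
    "coverage.xml",
    "build/reports/jacoco/test/jacocoTestReport.xml",
    "coverage.json",
)


def find_coverage_report(repo_root):
    """Return the first coverage report found under repo_root, or None.

    The precedence order between candidates is assumed, not documented.
    """
    root = Path(repo_root)
    for relative in COVERAGE_CANDIDATES:
        candidate = root / relative
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    report = find_coverage_report(".")
    print(report if report else "no coverage report found")
```

Returning None when nothing matches corresponds to the silent no-op described for the auto default.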

Everything else — embeddings, summaries, skills — is opt-in.

  • --embeddings — compute semantic vectors for queries by meaning. Requires codehub setup --embeddings first.
  • --summaries / --no-summaries — LLM-generated symbol summaries. Off by default, so that codehub analyze stays fast, local, and deterministic; opt in with --summaries or CODEHUB_BEDROCK_SUMMARIES=1. When enabled, the number of summaries is limited by --max-summaries (default auto: 10% of callables, with a hard cap of 500).
  • --skills — generate Claude Code skills from the graph.
  • --strict-detectors — fail the build if a detector (DET-O-001) regresses.
  • --verbose — noisier logs.
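As a worked example of the auto summary budget described above (10% of callables, hard cap 500), the arithmetic can be sketched as follows. The function name `auto_summary_budget` is hypothetical, and rounding down to a whole number is an assumption; the actual rounding rule is not stated here.

```python
HARD_CAP = 500  # upper bound on summaries in auto mode


def auto_summary_budget(callable_count):
    """Return 10% of the callables, rounded down (assumed), capped at 500."""
    return min(callable_count // 10, HARD_CAP)


# A repo with 1,200 callables stays under the cap; one with 20,000 hits it.
small_repo_budget = auto_summary_budget(1_200)
large_repo_budget = auto_summary_budget(20_000)
print(small_repo_budget, large_repo_budget)
```

Under these assumptions the cap starts to bind once a repo has more than 5,000 callables.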

See CLI reference: analyze for the complete flag list.