Scanners and SARIF
Scanners are a tier-one MCP surface: the scan tool is the only tool
that spawns processes (openWorldHint=true) and the only tool that is
non-idempotent. SARIF is the on-disk exchange format. This page
covers the catalog, the license distinction between bundled and
wrapped tools, how SARIF enrichment stays GHAS-compatible, and how
baseline diffs get bucketized.
Scanner inventory (20)
The catalog at packages/scanners/src/catalog.ts is a flat module:
one exported ScannerSpec per tool plus aggregate arrays. Selection
is driven by the project profile (languages, IaC types, API contracts)
and can be overridden with an explicit scanners list on the scan
tool. After PR #72 added detect-secrets, the inventory is 20
scanners:
| Scanner | Scope |
|---|---|
| semgrep | Multi-language static analysis. |
| betterleaks | Secrets: permissive license. |
| detect-secrets | Secrets: entropy + pattern based. |
| osv-scanner | Lockfile vulnerability scan against OSV. |
| bandit | Python static security. |
| biome | TS/JS lint + format. |
| pip-audit | Python dependency CVE scan. |
| npm-audit | npm dependency CVE scan. |
| ruff | Python lint + format. |
| grype | Container image + filesystem vulnerability scan. |
| checkov-docker-compose | IaC policy: docker-compose. |
| vulture | Python dead-code detection. |
| trivy | Container / IaC / SBOM scanner. |
| checkov | IaC policy: Terraform, Kubernetes, CloudFormation, Helm. |
| hadolint | Dockerfile lint (subprocess-only; see license note). |
| tflint | Terraform lint (subprocess-only). |
| spectral | OpenAPI / AsyncAPI contract lint. |
| radon | Python complexity + maintainability metrics. |
| ty | Python type checker. |
| clamav | Malware scan, opt-in only. |
A 21st scanner — och self-scan — is integrated through the OCH
graph itself (dead code, orphan symbols, group-level findings) and
runs as a CI workflow rather than through the scan tool.
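A minimal sketch of how a flat catalog and profile-gated selection could fit together. The fields on ScannerSpec, the ProjectProfile shape, the ALL_SCANNERS array, and this selectScanners signature are assumptions for illustration; the real definitions live in packages/scanners/src/catalog.ts and may differ.

```typescript
// Hypothetical shapes: the real ScannerSpec likely carries more fields.
interface ScannerSpec {
  id: string;
  languages?: string[];    // selected when the profile has any of these
  iacTypes?: string[];     // selected when the profile has any of these
  apiContracts?: string[]; // selected when the profile has any of these
  optIn?: boolean;         // never selected by default (e.g. clamav)
}

interface ProjectProfile {
  languages: string[];
  iacTypes: string[];
  apiContracts: string[];
}

// A four-entry excerpt standing in for the aggregate array.
const ALL_SCANNERS: ScannerSpec[] = [
  { id: "bandit", languages: ["python"] },
  { id: "tflint", iacTypes: ["terraform"] },
  { id: "spectral", apiContracts: ["openapi", "asyncapi"] },
  { id: "clamav", optIn: true },
];

const overlaps = (wanted: string[] | undefined, have: string[]): boolean =>
  wanted !== undefined && wanted.some((w) => have.includes(w));

function selectScanners(profile: ProjectProfile, explicit?: string[]): ScannerSpec[] {
  if (explicit !== undefined && explicit.length > 0) {
    // An explicit list overrides profile gating entirely.
    return ALL_SCANNERS.filter((s) => explicit.includes(s.id));
  }
  return ALL_SCANNERS.filter(
    (s) =>
      !s.optIn &&
      (overlaps(s.languages, profile.languages) ||
        overlaps(s.iacTypes, profile.iacTypes) ||
        overlaps(s.apiContracts, profile.apiContracts)),
  );
}
```

With a Python-only profile this sketch selects bandit alone; passing an explicit list such as ["clamav"] is the only way the opt-in scanner gets picked.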
License-incompatible wrappers
hadolint (GPL-3.0) and tflint (MPL-2.0 + BUSL-1.1 depending on vendor build) are not on the permissive license allowlist. OpenCodeHub still supports them the same way it supports any other scanner: wrap, don’t link.
Concretely:
- packages/scanners/src/wrappers/hadolint.ts and .../tflint.ts spawn the OS binary, capture stdout as SARIF, and emit findings.
- The binary is a user-provided runtime dependency. OpenCodeHub does not bundle it, ship it, or require it at install time.
- License obligations flow to the user who installed the scanner, not to OpenCodeHub.
This is the same pattern GitHub CodeQL uses with third-party SARIF producers. See Supply chain for the broader policy.
A missing binary yields an empty SARIF run, not a crash — the catalog is built to degrade gracefully when a wrapper’s tool is not installed.
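The wrap-and-degrade behavior can be sketched as follows. This is an illustration under assumptions, not the actual wrapper code: runWrapped and emptyLog are hypothetical names, spawnSync is used only to keep the example short, and returning an empty run when stdout is not valid JSON is a choice made here rather than documented behavior. The arguments that make a given binary emit SARIF are left to the caller.

```typescript
import { spawnSync } from "node:child_process";

// Minimal SARIF 2.1.0 shapes, just enough for this sketch.
interface SarifRun {
  tool: { driver: { name: string } };
  results: unknown[];
}
interface SarifLog {
  version: "2.1.0";
  runs: SarifRun[];
}

const emptyLog = (toolName: string): SarifLog => ({
  version: "2.1.0",
  runs: [{ tool: { driver: { name: toolName } }, results: [] }],
});

// Hypothetical wrapper: spawn the user-installed binary and parse its stdout.
function runWrapped(toolName: string, binary: string, args: string[]): SarifLog {
  const proc = spawnSync(binary, args, { encoding: "utf8" });
  // proc.error is set when the process could not be spawned (e.g. ENOENT).
  if (proc.error || !proc.stdout) return emptyLog(toolName);
  try {
    return JSON.parse(proc.stdout) as SarifLog;
  } catch {
    return emptyLog(toolName); // stdout was not valid JSON (assumed handling)
  }
}
```

Because the binary is only ever reached through a process boundary, nothing from it is linked or bundled, and a machine without it installed still produces a well-formed, empty run.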
SARIF emission
@opencodehub/sarif owns the schema, merge, enrichment, suppressions,
and baseline logic. Every scanner run produces SARIF v2.1.0,
zod-validated against the spec.
Rule IDs and fingerprints
Two fingerprints are computed per result, under
properties.opencodehub.*:
- opencodehub/v1: sha256(scannerId \0 ruleId \0 filePath \0 contextHash)[:32]. The match key for baseline diffing.
- primaryLocationLineHash: sha256(ruleId \0 filePath \0 normalizedSnippet)[:16] + ":" + startLine. The GHAS dedup key.
Invariant: result.fingerprints, partialFingerprints, ruleId,
and artifactLocation.uri are never mutated by enrichment. All
enrichment goes under properties.opencodehub.*. This is how SARIF
output stays GHAS-compatible — GitHub’s deduplication on
primaryLocationLineHash still works.
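The two fingerprint formulas can be written out as a short sketch. It assumes the digest is hex-encoded and that [:32] and [:16] count hex characters; the function names are hypothetical, and how contextHash and normalizedSnippet are derived is not covered here, so both are taken as inputs.

```typescript
import { createHash } from "node:crypto";

// SHA-256 over the parts joined with a NUL byte, hex-encoded (assumed encoding).
const sha256Hex = (...parts: string[]): string =>
  createHash("sha256").update(parts.join("\0")).digest("hex");

// opencodehub/v1: first 32 hex characters of the NUL-joined tuple.
function fingerprintV1(
  scannerId: string,
  ruleId: string,
  filePath: string,
  contextHash: string,
): string {
  return sha256Hex(scannerId, ruleId, filePath, contextHash).slice(0, 32);
}

// primaryLocationLineHash: 16 hex characters, then ":" and the start line.
function primaryLocationLineHash(
  ruleId: string,
  filePath: string,
  normalizedSnippet: string,
  startLine: number,
): string {
  return `${sha256Hex(ruleId, filePath, normalizedSnippet).slice(0, 16)}:${startLine}`;
}
```

Note that startLine sits outside the hash in the second formula, so a finding that only moves to another line keeps the same 16-character prefix.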
Enrichment fields
enrichWithProperties adds graph-derived context to each result:
- blastRadius: dependent count from impact.
- community: the containing Louvain community.
- cochangeScore: temporal co-change coefficient.
- centrality: node centrality.
- temporalFixDensity: how often this file has been a fix target.
- busFactor: unique recent authors.
- cyclomaticComplexity: McCabe complexity of the enclosing function.
- ownershipDrift: recent change in top contributor.
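A sketch of enrichment that respects the invariant from the previous section: identity fields pass through untouched and everything new lands under properties.opencodehub. The SarifResult and GraphContext shapes are trimmed-down assumptions (only three of the eight fields are shown), and the real enrichWithProperties signature may differ.

```typescript
interface SarifResult {
  ruleId: string;
  fingerprints?: Record<string, string>;
  partialFingerprints?: Record<string, string>;
  properties?: Record<string, unknown>;
}

// Hypothetical subset of the graph-derived fields.
interface GraphContext {
  blastRadius: number;
  community: string;
  busFactor: number;
}

// Returns a copy; only properties.opencodehub is added or extended.
function enrichWithProperties(result: SarifResult, ctx: GraphContext): SarifResult {
  const existing = (result.properties?.["opencodehub"] ?? {}) as Record<string, unknown>;
  return {
    ...result,
    properties: { ...result.properties, opencodehub: { ...existing, ...ctx } },
  };
}
```

Returning a copy rather than mutating in place makes the invariant easy to check in a test: the input result is unchanged, and ruleId and both fingerprint maps on the output are the same values as on the input.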
Suppressions
Two paths, same output:
- External YAML: .codehub/suppressions.yaml declares {ruleId, filePathPattern, reason, expiresAt?}.
- Inline comment: // codehub-suppress: <ruleId> <reason> (or #, /* */ variants) in source.
Both write to result.suppressions[] with {kind: "external"|"inSource", justification}. Suppressions past their
expiresAt are dropped at load with a warning, so codehub verdict
can re-block the finding.
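Both paths can be sketched in a few lines. The names loadActiveRules and parseInline, the ISO 8601 format for expiresAt, and the exact comment regex are assumptions; matching filePathPattern against a result's path is left out because the pattern syntax is not specified on this page.

```typescript
interface SuppressionRule {
  ruleId: string;
  filePathPattern: string;
  reason: string;
  expiresAt?: string; // assumed to be an ISO 8601 date
}

interface SarifSuppression {
  kind: "external" | "inSource";
  justification: string;
}

// Drop rules whose expiresAt is in the past, collecting one warning per drop.
function loadActiveRules(
  rules: SuppressionRule[],
  now: Date,
): { active: SuppressionRule[]; warnings: string[] } {
  const active: SuppressionRule[] = [];
  const warnings: string[] = [];
  for (const rule of rules) {
    const expired =
      rule.expiresAt !== undefined && new Date(rule.expiresAt).getTime() < now.getTime();
    if (expired) {
      warnings.push(`suppression for ${rule.ruleId} expired at ${rule.expiresAt}`);
    } else {
      active.push(rule);
    }
  }
  return { active, warnings };
}

// Matches "// codehub-suppress: <ruleId> <reason>" and the "#" and "/* */" variants.
const INLINE_SUPPRESS = /(?:\/\/|#|\/\*)\s*codehub-suppress:\s*(\S+)\s+(.*?)\s*(?:\*\/)?\s*$/;

function parseInline(
  line: string,
): { ruleId: string; suppression: SarifSuppression } | undefined {
  const m = INLINE_SUPPRESS.exec(line);
  if (m === null) return undefined;
  return {
    ruleId: m[1] ?? "",
    suppression: { kind: "inSource", justification: m[2] ?? "" },
  };
}
```

An external rule that survives loadActiveRules would be written as {kind: "external", justification: rule.reason}, which is how both paths end up with the same output shape.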
Findings baseline and delta
Two SARIF files on disk:
- .codehub/scan.sarif: the current scan.
- .codehub/baseline.sarif: the frozen baseline written by codehub scan --baseline.
list_findings_delta reads both and runs diffSarif. The match key
is the opencodehub/v1 partial fingerprint, with a fallback to
(ruleId, uri, startLine) when the fingerprint is missing. Rename
follow-through is optional: if the storage layer supplies a
renameChainFor resolver (backed by FileNode.renameHistoryChain
from the temporal phase), a finding that followed a rename still
matches.
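The match-key rule with its fallback might look like the sketch below. The Finding shape and the matchKey name are hypothetical, and rename follow-through is deliberately omitted because the resolver's contract is not described here.

```typescript
interface Finding {
  ruleId: string;
  uri: string;
  startLine: number;
  fingerprint?: string; // the opencodehub/v1 value, when present
}

// Prefer the fingerprint; fall back to the (ruleId, uri, startLine) tuple.
// The "fp:" and "loc:" prefixes keep the two key spaces from colliding.
function matchKey(f: Finding): string {
  return f.fingerprint !== undefined && f.fingerprint !== ""
    ? `fp:${f.fingerprint}`
    : `loc:${f.ruleId}\0${f.uri}\0${f.startLine}`;
}
```

The fallback is weaker than the fingerprint: any edit that shifts the finding to a different line changes the key, so a result without a fingerprint that moves will show up as one fixed and one new finding.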
Four buckets:
| Bucket | Meaning |
|---|---|
| new | In current, not in baseline. |
| fixed | In baseline, not in current. |
| unchanged | Same fingerprint, same contextHash. |
| updated | Same fingerprint, changed line / snippet. |
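The four buckets reduce to a map lookup per finding. This sketch assumes each finding already carries its match key and treats a change in startLine or snippet as the signal for updated; the KeyedFinding shape and the bucketize name are hypothetical, and the real diffSarif may compare different fields.

```typescript
type Bucket = "new" | "fixed" | "unchanged" | "updated";

interface KeyedFinding {
  key: string; // match key (fingerprint or fallback tuple)
  startLine: number;
  snippet: string;
}

function bucketize(
  baseline: KeyedFinding[],
  current: KeyedFinding[],
): Record<Bucket, KeyedFinding[]> {
  const out: Record<Bucket, KeyedFinding[]> = { new: [], fixed: [], unchanged: [], updated: [] };
  const byKey = new Map<string, KeyedFinding>();
  for (const f of baseline) byKey.set(f.key, f);

  const matched = new Set<string>();
  for (const cur of current) {
    const prev = byKey.get(cur.key);
    if (prev === undefined) {
      out.new.push(cur); // in current, not in baseline
      continue;
    }
    matched.add(cur.key);
    const same = prev.startLine === cur.startLine && prev.snippet === cur.snippet;
    (same ? out.unchanged : out.updated).push(cur);
  }
  // Anything in the baseline that no current finding matched is fixed.
  for (const prev of baseline) {
    if (!matched.has(prev.key)) out.fixed.push(prev);
  }
  return out;
}
```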
When the current SARIF already carries baked-in baselineState tags
(written by codehub scan --baseline), list_findings_delta reuses
them instead of re-running the diff — the on-disk SARIF is the source
of truth.
The scan tool
scan is deliberately the odd one out. Annotations:

    readOnlyHint: false
    destructiveHint: false
    openWorldHint: true      // spawns subprocesses
    idempotentHint: false    // writes disk, state-changing

The tool picks scanners via selectScanners(), which honors an
explicit list or falls back to profile-gated defaults. Concurrency is
clamped to min(availableParallelism(), opts.concurrency ?? 4). A
per-wrapper failure does not abort the run — it just omits that
scanner’s results from the merged SARIF.
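The concurrency clamp and the failure-tolerant merge can be sketched on their own, leaving out the actual scheduling. The Outcome type, clampConcurrency, and mergeOutcomes are hypothetical names; only the min(...) expression comes from the description above.

```typescript
import { availableParallelism } from "node:os";

interface SarifRun {
  tool: { driver: { name: string } };
  results: unknown[];
}

// Hypothetical per-wrapper outcome: either a SARIF run or a recorded failure.
type Outcome =
  | { ok: true; run: SarifRun }
  | { ok: false; scannerId: string; error: string };

// min(availableParallelism(), opts.concurrency ?? 4)
function clampConcurrency(requested?: number): number {
  return Math.min(availableParallelism(), requested ?? 4);
}

// Failed wrappers are skipped, so the merged log holds only successful runs.
function mergeOutcomes(outcomes: Outcome[]): { version: "2.1.0"; runs: SarifRun[] } {
  const runs: SarifRun[] = [];
  for (const o of outcomes) {
    if (o.ok) runs.push(o.run);
  }
  return { version: "2.1.0", runs };
}
```

Modeling failure as data rather than as a thrown error is what keeps one broken wrapper from aborting the whole scan in this sketch.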
The merged SARIF is persisted to .codehub/scan.sarif; a summary
groups result counts by tool.driver.name and result.level
(defaulting to note when the scanner omits the level).
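The summary described above amounts to a two-level count. A minimal sketch, assuming a nested-object output shape (the real summary format is not shown on this page) and a hypothetical summarize function:

```typescript
type Level = "none" | "note" | "warning" | "error";

interface SummaryRun {
  tool: { driver: { name: string } };
  results: Array<{ level?: Level }>;
}

// Count results per tool.driver.name, then per level; a missing level counts as "note".
function summarize(runs: SummaryRun[]): Record<string, Record<string, number>> {
  const summary: Record<string, Record<string, number>> = {};
  for (const run of runs) {
    const name = run.tool.driver.name;
    const byLevel = summary[name] ?? {};
    summary[name] = byLevel;
    for (const result of run.results) {
      const level = result.level ?? "note";
      byLevel[level] = (byLevel[level] ?? 0) + 1;
    }
  }
  return summary;
}
```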
Configuration knobs
- ScanInput.timeoutMs: per-scanner timeout (default 300_000, max 600_000).
- ScanInput.scanners: explicit id list overrides profile gating.
- ProjectProfileGate.languages / iacTypes / apiContracts: stored in nodes WHERE kind='ProjectProfile'; drives default selection.
- .codehub/suppressions.yaml: external suppression rules.
Related
- scan tool reference: the full input schema.
- list_findings tool reference: querying findings stored as nodes.
- Supply chain: why subprocess invocation is the right pattern for non-permissive scanners.