Code health

A 1–10 health score for every file, built from deterministic markers and reported as three signals: defect risk, maintainability, and performance. Zero LLM calls, defect-calibrated weights, and validated to out-rank the leading commercial code-health tool on effort-aware defect ranking.

Code health is repowise's deepest differentiator. Linters check patterns. The health score predicts which files are likely to harbor the next bug, ranked and validated forward in time against a real defect corpus, and benchmarked head-to-head against the leading commercial tool in the space.

Repowise scores every file from 1 to 10 using deterministic markers computed over tree-sitter ASTs and git history, and surfaces the result as three separate signals: defect risk, maintainability, and performance. No LLM calls, no cloud requirement, no new runtime dependencies, just pure Python that finishes in under 30 seconds on a 3,000-file repo.

The markers

Each file starts at 10.0; marker findings deduct from the score, with a cap per category so no single category can dominate.

Category	Markers
Structural complexity	`brain_method`, `nested_complexity`, `bumpy_road`, `complex_conditional`, `complex_method`, `large_method`, `primitive_obsession`
Cohesion & size	`low_cohesion` (LCOM4), `god_class`
Duplication	`dry_violation` (native Rabin–Karp clone detection)
Error handling	`error_handling` (empty catches, swallowed errors, bare panics)
Test coverage	`untested_hotspot`, `coverage_gap`, `coverage_gradient`
Test quality	`large_assertion_block`, `duplicated_assertion_block`
Organizational / git	`developer_congestion`, `knowledge_loss`, `hidden_coupling`, `function_hotspot`, `code_age_volatility`, `ownership_risk`, `churn_risk`, `change_entropy`, `co_change_scatter`, `prior_defect`
Performance	`io_in_loop` (N+1), `string_concat_in_loop`, `blocking_sync_in_async`, `resource_construction_in_loop`, `serial_await_in_loop`, and more (see the performance signal)
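The deduct-with-caps rule can be sketched in a few lines. This is an illustration only: the weights, the caps, the category keys, the 1.0 floor, and the name `score_file` are all made-up assumptions, since the shipped constants are learned offline and not listed on this page.

```python
from collections import defaultdict

# Hypothetical values for illustration; the real calibrated weights and caps
# are not published in this document.
MARKER_CATEGORY = {
    "nested_complexity": "structural",
    "large_method": "structural",
    "god_class": "cohesion",
    "churn_risk": "organizational",
}
MARKER_WEIGHT = {
    "nested_complexity": 1.5,
    "large_method": 1.0,
    "god_class": 2.0,
    "churn_risk": 1.2,
}
CATEGORY_CAP = {"structural": 2.0, "cohesion": 2.5, "organizational": 3.0}


def score_file(findings):
    """Start at 10.0, sum deductions per category, clamp each sum to its cap."""
    per_category = defaultdict(float)
    for marker in findings:
        per_category[MARKER_CATEGORY[marker]] += MARKER_WEIGHT[marker]
    total = sum(min(d, CATEGORY_CAP[c]) for c, d in per_category.items())
    # Assumed floor so the result stays on the 1-10 scale.
    return max(1.0, 10.0 - total)


# Two structural findings would deduct 2.5, but the category cap limits it to 2.0.
capped = score_file(["nested_complexity", "large_method"])
```

In this sketch the cap is what keeps a file with many findings in one category from collapsing to the floor on that category alone.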

The three repo-level KPIs (all for the defect-risk signal):

Hotspot Health. NLOC-weighted average over the files the git layer classifies as hotspots (high churn percentile plus minimum-activity floors).
Average Health. NLOC-weighted average over all files.
Worst Performer. The single lowest-scoring file.
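As a rough sketch of how the three KPIs relate, assuming each file is a dict with hypothetical keys `path`, `score`, `nloc`, and `is_hotspot` (the hotspot flag would come from the git layer; the field names are not repowise's actual schema):

```python
def nloc_weighted_average(pairs):
    """pairs: list of (score, nloc). Returns None when there are no lines."""
    total_nloc = sum(nloc for _, nloc in pairs)
    if total_nloc == 0:
        return None
    return sum(score * nloc for score, nloc in pairs) / total_nloc


def repo_kpis(files):
    """Compute Hotspot Health, Average Health, and Worst Performer."""
    hotspots = [(f["score"], f["nloc"]) for f in files if f["is_hotspot"]]
    everything = [(f["score"], f["nloc"]) for f in files]
    worst = min(files, key=lambda f: f["score"]) if files else None
    return {
        "hotspot_health": nloc_weighted_average(hotspots),
        "average_health": nloc_weighted_average(everything),
        "worst_performer": worst["path"] if worst else None,
    }


# Toy input: one large unhealthy hotspot and one small healthy file.
example = repo_kpis([
    {"path": "core/engine.py", "score": 4.0, "nloc": 300, "is_hotspot": True},
    {"path": "util/fmt.py", "score": 9.0, "nloc": 100, "is_hotspot": False},
])
```

Weighting by NLOC means the large file pulls the average toward its own score: here the average is 5.25 rather than the unweighted 6.5.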

Calibrated, not hand-tuned

The marker weights are learned offline from a real defect corpus. Each file is scored at T0, the commit immediately before a 6-month defect window, so the measurement cannot leak future information. An L2-regularized logistic regression, with file size (NLOC) as an explicit control, then fits each marker's defect lift beyond size. Only the learned constants ship; the runtime stays fully deterministic.

The strongest calibrated predictors are co_change_scatter, change_entropy, ownership_risk, and nested_complexity.
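To make "defect lift beyond size" concrete, here is a toy L2-regularized logistic regression in plain Python. It is a sketch under stated assumptions, not the calibration pipeline: the data is invented, the gradient-descent settings are arbitrary, and `fit_l2_logistic` is a hypothetical helper. The size column plays the role of the NLOC control, so the marker's coefficient reflects what the marker adds once size is accounted for.

```python
import math


def fit_l2_logistic(rows, labels, l2=0.1, lr=0.1, epochs=2000):
    """Batch gradient descent on L2-penalized log-loss. Returns (weights, bias)."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        for j in range(d):
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])  # the L2 term shrinks weights
        b -= lr * grad_b / n  # the bias is left unpenalized
    return w, b


# Invented toy corpus: columns are [marker_fired, scaled_size].
rows = [[1, 0.2], [1, 0.8], [1, 0.5], [1, 0.4],
        [0, 0.3], [0, 0.9], [0, 0.6], [0, 0.1]]
labels = [1, 1, 1, 0, 0, 1, 0, 0]  # 1 = defect in the window after T0

weights, bias = fit_l2_logistic(rows, labels)
marker_weight, size_weight = weights
```

In this toy data the marker fires on files of the same average size as the files it misses, so its positive coefficient cannot be explained by size alone.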

Three health signals: defect risk, maintainability, performance

The score above is the defect-risk signal: calibrated against a defect corpus, with bands tuned to it (Alert files carry roughly 17× the defect rate of Healthy files). It is the overall number surfaced everywhere. But not every code smell predicts bugs, so repowise computes two co-equal companion signals from the same marker stream, never blended into the defect headline.

Maintainability. A handful of markers fire widely and matter a lot for how hard code is to read and change, yet proved weak as defect predictors under leakage-free scoring, so the defect calibration floors them (low_cohesion, brain_method, primitive_obsession, dry_violation, error_handling). Floored inside a defect score, they would get no credit for the real problem they describe. The maintainability signal instead deducts them at full weight against its own expert-set caps, so the smell is scored where it actually lives. Structural smells that are both defect predictors and core maintainability concerns (god_class, large_method, nested_complexity) count toward both.
Performance. Static performance risk (see below): high-precision, low-recall, always advisory, never folded into the defect number.

The signals are computed by one shared scoring kernel against independent weight/category/cap tables and never feed back into each other. The overall surfaced score stays exactly the defect score (a golden test locks this byte-for-byte). Every finding carries a dimension (defect / maintainability / performance) naming the pillar it homes under, so findings can be filtered per signal in the dashboard, the CLI, and the get_health MCP tool.

Performance: static performance risk

The third signal flags shapes that waste work: code whose structure performs redundant I/O, detected statically rather than by measuring runtime. It is deliberately high-precision, low-recall: a few real findings the rest of the toolchain can trust beat a wall of maybes. The headline detector is io_in_loop (the N+1): a database call, network request, filesystem read, or subprocess spawn that runs once per loop iteration. Two things make it more than a file-local lint:

Dependency classification. The loop-nested call is resolved through a shared I/O-boundary classifier (db / network / filesystem / subprocess / lock) and only fires on a classified execution sink (an actual round-trip), not a query-builder chain or a same-named pure helper.
Call-graph reachability. The loop and the I/O call need not be in the same function. A bounded-depth (≤3 hops) walk over the resolved call graph catches the interprocedural case no file-local linter can see; cross-function findings carry their resolved caller -> ... -> sink path.
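The reachability step can be pictured as a breadth-first walk with a hop limit. The sketch below assumes the call graph is already resolved into a plain dict and that the classified sinks are given as a set; `find_io_paths`, the function names in the graph, and the choice to count hops as call edges from the loop-nested callee are illustrative assumptions, not repowise internals.

```python
from collections import deque


def find_io_paths(call_graph, io_sinks, loop_callee, max_hops=3):
    """Return 'caller -> ... -> sink' paths reachable within max_hops call edges."""
    paths = []
    queue = deque([[loop_callee]])
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current in io_sinks:
            paths.append(" -> ".join(path))
            continue
        if len(path) - 1 >= max_hops:  # hop budget spent, stop expanding
            continue
        for callee in call_graph.get(current, ()):
            if callee not in path:  # skip recursion cycles
                queue.append(path + [callee])
    return paths


# Hypothetical resolved call graph; "db.execute" is the classified I/O sink.
call_graph = {
    "load_user": ["fetch_profile"],
    "fetch_profile": ["db.execute"],
    "deep_a": ["deep_b"],
    "deep_b": ["deep_c"],
    "deep_c": ["deep_d"],
    "deep_d": ["db.execute"],
}
io_sinks = {"db.execute"}

near = find_io_paths(call_graph, io_sinks, "load_user")  # sink is 2 hops away
far = find_io_paths(call_graph, io_sinks, "deep_a")      # sink is 4 hops away
```

A loop that calls `load_user` is reported with its full path, while the sink behind `deep_a` lies beyond the hop limit and is left out, which is consistent with the signal's low-recall, high-precision stance.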

Alongside it: string_concat_in_loop, blocking_sync_in_async, resource_construction_in_loop, lock_in_loop, serial_await_in_loop, membership_test_against_list_in_loop, nested_loop_with_io, plus language-specific markers (Go defer_in_loop / goroutine_in_unbounded_loop, Python pd_concat_in_loop, JS/TS array_spread_in_reduce, and more). The signal fires on Python, TypeScript/JavaScript, Java, Go, and C#; a language without a dialect emits no perf findings (never a wrong one). io_in_loop is hand-label validated across an 11-repo OSS corpus (Go 96.7%, TypeScript 100%, Python 96.2% precision).

On a 12k-file benchmark, repowise surfaced 557 cross-function I/O-in-loop cases; the standard single-file linters (clippy, ruff PERF, ESLint, golangci-lint) found none of them. Following the call graph across files is the whole point: a file-local lint cannot see a loop in one function and its I/O sink in another. The clippy line of that comparison is catalogue-level only: a full end-to-end clippy run on the corpus was blocked by a Windows build wall and never executed.

Performance is a static signal, so it under-reports rather than over-reports: dynamic dispatch, ORM lazy-load N+1, and unmodelled libraries are out of scope by design. That is why we call it performance risk.

Does the score predict real bugs?

Yes, validated across 21 open-source repositories spanning all nine Full-tier languages (Python, TypeScript, JavaScript, Java, Kotlin, Go, Rust, C++, C#):

Result	Value
Cross-project mean ROC AUC (21 repos, 9 languages)	0.74 (mean 0.737) [95% CI 0.68–0.79] (up to 0.90 on individual repos)
Survives controlling for file size	partial Spearman ρ = −0.16
Beats recent-churn baseline	+0.10 AUC (DeLong p < 1e-9)
Beats prior-defect baseline	+0.12 AUC
External, never-seen dataset (PROMISE/jEdit)	AUC 0.76–0.78
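For readers less familiar with the headline metric: ROC AUC is the probability that a randomly chosen defective file is ranked as riskier than a randomly chosen clean one. The sketch below computes it by direct pair counting on invented data, treating risk as `10 - health` because a lower health score means higher risk (which is also why the size-controlled correlation in the table is negative). It illustrates the metric only, not the benchmark harness.

```python
def roc_auc(risk_scores, labels):
    """Share of (defective, clean) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(risk_scores, labels) if y == 1]
    neg = [s for s, y in zip(risk_scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one defective and one clean file")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: five files with their health scores and defect labels.
health = [2.0, 4.5, 6.0, 8.0, 9.5]
defective = [1, 1, 0, 1, 0]

risk = [10.0 - h for h in health]  # low health means high risk
auc = roc_auc(risk, defective)     # 5 of 6 pairs ranked correctly
```

An AUC of 0.5 would mean the ranking is no better than chance; the one misranked pair here is the defective file that scored a healthy 8.0.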

Head-to-head vs the leading commercial tool

This is a different corpus from the cross-project number above: 2,770 shared files across 9 languages, scored at the same leakage-free commit against the same defect labels, with paired significance tests. The 0.731 ROC AUC below is on these 2,770 files; the 0.74 above is the cross-project mean over 21 repos. The two are measured on different file sets and are not the same number read twice.

Axis	repowise	Leading commercial tool
Recall @ 20%-of-lines budget	0.173	0.074
Effort-aware ranking (Popt)	0.607	0.462
Defect density, size-normalized (defects/KLOC)	2.18×	0.56×
Discrimination (ROC AUC)	0.731	0.705

The decisive wins here are the effort-aware ones, all paired and significant: ranking by repowise health surfaces 2.3× the defects under a fixed review budget (Popt Δ +0.144, recall Δ +0.098, density Δ, all at p = 0.003). The ROC AUC edge specifically is marginal (Δ +0.026, p = 0.054) and not significant, and precision@20% between the two tools is a tie (p = 0.64). The commercial tool is also the more mature product, with 28+ languages, a published Code Red defect study, and behavioral-analysis features repowise does not match.
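An effort-aware metric such as recall at a 20%-of-lines budget can be sketched as follows. The assumptions are mine: files are reviewed whole, worst health first, and the review stops at the first file that no longer fits the budget. The benchmark's exact tie-breaking and partial-file handling are documented in repowise-bench, not here.

```python
def recall_at_line_budget(files, budget=0.20):
    """files: list of (health_score, nloc, defect_count) tuples."""
    total_lines = sum(nloc for _, nloc, _ in files)
    total_defects = sum(d for _, _, d in files)
    if total_lines == 0 or total_defects == 0:
        return 0.0
    line_budget = budget * total_lines
    spent = found = 0
    for _, nloc, defects in sorted(files, key=lambda f: f[0]):  # worst first
        if spent + nloc > line_budget:
            break  # the next file would overrun the review budget
        spent += nloc
        found += defects
    return found / total_defects


# Invented repo: two small unhealthy files hold most of the defects.
files = [(2.0, 100, 3), (3.0, 80, 1), (7.0, 400, 1), (9.0, 400, 0)]
recall = recall_at_line_budget(files)  # 180 of the 196 budgeted lines get reviewed
```

Because the budget is counted in lines rather than files, a ranking that puts small, defect-dense files first scores far better than one that spends the budget on a single large file.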

Does the score find the bugs, on your repo?

After every index, repowise checks its own claim against the repo's history: of the 20 least-healthy files, how many had a `fix:` commit in the trailing ~180 days, versus the repo-wide base rate. It prints a one-line callout (e.g. "16/20 lowest-health files had a recent bug fix, 3.3× the 24% baseline") and surfaces the same precision@K / lift stat on the web dashboards and over MCP (get_health(include=["accuracy"])). It stays silent on repos with too little history to be honest (fewer than 25 scored files or fewer than 5 recently-fixed files), and discloses that prior_defect is itself one down-weighted input, so this is an association on indexed history, not a leakage-free forward prediction.
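The arithmetic behind that callout is simple enough to sketch. The function below is a hypothetical reconstruction from the description above (K = 20, the 25-file and 5-fix silence thresholds), with `accuracy_callout` and the input shape as my own assumptions; it reproduces the example numbers from the text.

```python
def accuracy_callout(files, k=20, min_files=25, min_fixed=5):
    """files: list of (health_score, had_recent_fix). Returns a string or None."""
    fixed_total = sum(1 for _, fixed in files if fixed)
    if len(files) < min_files or fixed_total < min_fixed:
        return None  # too little history to make an honest claim
    base_rate = fixed_total / len(files)
    worst = sorted(files, key=lambda f: f[0])[:k]
    hits = sum(1 for _, fixed in worst if fixed)
    lift = (hits / k) / base_rate  # precision@K relative to the base rate
    return (f"{hits}/{k} lowest-health files had a recent bug fix, "
            f"{lift:.1f}× the {base_rate:.0%} baseline")


# Invented repo of 100 files: 16 of the 20 worst were fixed, 24 fixed overall.
worst_20 = [(1.0 + i * 0.01, i < 16) for i in range(20)]
the_rest = [(5.0 + i * 0.01, i < 8) for i in range(80)]
message = accuracy_callout(worst_20 + the_rest)
```

Here precision@20 is 16/20 = 80%, the base rate is 24/100 = 24%, and the lift is 0.80 / 0.24, which rounds to 3.3×.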

Full methodology, confidence intervals, and reproduction steps live in repowise-bench: the health-defect report and the head-to-head comparison.

Using it

repowise health                       # KPIs + lowest-scoring files
repowise coverage add cov.lcov        # ingest LCOV/Cobertura/Clover -> untested-hotspot
repowise health --refactoring-targets # ranked by impact / effort
repowise health --trend               # snapshots + declining / predicted-decline alerts

Coverage lives behind its own command: repowise coverage add <report> ingests it once, and every later repowise health folds those test gaps into the scores automatically. When the report carries contexts (a coverage.py .coverage, or coverage run --contexts=test), coverage add also builds a per-test test-to-code map.

Three signals surfaced together. Defect risk is the headline; maintainability and performance averages sit alongside it on the dashboard, in repowise status, and in the generated CLAUDE.md.
Coverage ingestion. LCOV, Cobertura, Clover, or normalized JSON light up the test-coverage markers.
Trend tracking. A rolling 50-row snapshot history powers Declining Health and Predicted Decline alerts.
Refactoring targets. Deterministic, rule-based, ranked by impact / effort. A health score says a file is in trouble; refactoring intelligence names the specific fix: five detectors (Extract Class, Extract Helper, Move Method, Break Cycle, Split File) that emit one structured, graph-aware plan per opportunity. Code generation is on by default and can be turned off.
Per-repo policy. .repowise/health-rules.json disables markers per glob and remaps severities (including a named small-team profile); the calibrated weights stay locked so the benchmark claims hold.
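The rolling snapshot history behind trend tracking maps naturally onto a bounded queue. In the sketch below only the 50-row limit comes from the text; the `HealthHistory` class, the `is_declining` rule, and its `window` and `min_drop` thresholds are hypothetical stand-ins, since the actual alert criteria are not described on this page.

```python
from collections import deque

HISTORY_ROWS = 50  # from the text: a rolling 50-row snapshot history


class HealthHistory:
    def __init__(self):
        # A bounded deque drops the oldest snapshot once the limit is reached.
        self.rows = deque(maxlen=HISTORY_ROWS)

    def record(self, average_health):
        self.rows.append(average_health)

    def is_declining(self, window=3, min_drop=0.5):
        """Assumed rule: latest score is min_drop below the one `window` rows back."""
        if len(self.rows) <= window:
            return False
        return self.rows[-1 - window] - self.rows[-1] >= min_drop


history = HealthHistory()
for _ in range(60):
    history.record(8.0)       # 60 stable snapshots; only the last 50 are kept
stable = history.is_declining()
for score in (7.9, 7.6, 7.2):
    history.record(score)     # three consecutive drops
declining = history.is_declining()
```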

Your agent reaches the same data through the get_health MCP tool, and a single-line summary shows up in repowise status. Full CLI reference: repowise health.

How it connects to the other layers

Code health isn't a silo; it reuses signals from every other layer:

Git feeds the organizational markers (ownership, churn, co-change scatter, knowledge loss).
Graph feeds centrality (a brain_method must be central, not just long), hidden_coupling, and the call-graph reachability behind the performance N+1 detector.
Decisions surface as ungoverned_hotspot and stale_governance health findings.

That's why a repowise health score carries more signal than a complexity linter's. A linter sees that a file is complex. The health score sees that the file is complex, central, churned by many hands, and untested, and weighs all four together.
