Classification methodology

How PermitPulse trains, evaluates, and ships its classifier. Plain-language summary; the authoritative spec is permitpulse-engineering.md §22, §27.1, §30, §32.1.

What classification does

Every permit ingested from a municipal portal is labeled with trade (solar, roofing, hvac, electrical, plumbing, other), project_size (small/medium/large), est_value_usd, and confidence (0.00-1.00). Every row is stamped with classifier_version for reproducibility and rollback.
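As an illustration of the row shape described above, here is a minimal sketch of a classification record. The class name `PermitClassification` and its validation rules are assumptions made for this example; only the field names and allowed values come from the text.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Allowed values as listed in the summary above.
Trade = Literal["solar", "roofing", "hvac", "electrical", "plumbing", "other"]
ProjectSize = Literal["small", "medium", "large"]


@dataclass(frozen=True)
class PermitClassification:
    """Hypothetical record type; not taken from the PermitPulse codebase."""

    trade: Trade
    project_size: ProjectSize
    est_value_usd: float
    confidence: float          # 0.00-1.00
    classifier_version: str    # stamped on every row for rollback

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so check explicitly.
        if self.trade not in get_args(Trade):
            raise ValueError(f"unknown trade: {self.trade!r}")
        if self.project_size not in get_args(ProjectSize):
            raise ValueError(f"unknown project_size: {self.project_size!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0.0, 1.0]")
```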

Ground-truth eval sets

Both sets are locked and append-only. Expansions are published as new versions (_v2, _v3, …) and never mutate prior versions. CI gates on the highest version.

Ship gate

No prompt change deploys unless every metric passes on both eval sets:

| Metric | Threshold | What it catches |
|---|---|---|
| accuracy | ≥ 0.95 | Exact trade match. |
| macro F1 | ≥ 0.90 | Balanced precision/recall across 6 classes. |
| per-class recall (min, present classes) | ≥ 0.85 | Sparse-class collapse, e.g. plumbing is rare in the MA solar-first GT. |
| Brier score | ≤ 0.05 | Calibration: a confidence of 0.95 should be wrong only 5% of the time. |
| ECE (10-bin) | ≤ 0.10 | Calibration drift across confidence bands. |
| per-municipality accuracy | ≥ 0.90 each | Source-shape regression isolation (CitizenServe vs ViewPoint vs eTRAKiT vs OpenGov). |
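The gate logic can be sketched as a plain threshold check. This is an illustrative sketch, not the actual CI code: the function names, metric keys, and input shapes are assumptions, and the metrics are assumed to be computed elsewhere. Only the threshold values come from the ship gate above.

```python
from __future__ import annotations

# Thresholds copied from the ship gate; the key names are hypothetical.
THRESHOLDS = {
    "accuracy": (">=", 0.95),
    "macro_f1": (">=", 0.90),
    "min_class_recall": (">=", 0.85),  # minimum over classes present in the set
    "brier": ("<=", 0.05),
    "ece_10bin": ("<=", 0.10),
}
PER_MUNICIPALITY_MIN = 0.90


def gate_failures(metrics: dict, per_municipality_accuracy: dict) -> list[str]:
    """Return one message per failed check; an empty list means the set passes."""
    failures = []
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics[name]
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            failures.append(f"{name}={value:.3f} violates {op} {limit}")
    for muni, acc in sorted(per_municipality_accuracy.items()):
        if acc < PER_MUNICIPALITY_MIN:
            failures.append(f"{muni} accuracy={acc:.3f} below {PER_MUNICIPALITY_MIN}")
    return failures


def can_ship(results_by_eval_set: dict) -> bool:
    """Every metric must pass on every eval set.

    results_by_eval_set maps an eval-set name to a
    (metrics, per_municipality_accuracy) pair.
    """
    return all(
        not gate_failures(metrics, per_muni)
        for metrics, per_muni in results_by_eval_set.values()
    )
```

Returning the list of failures rather than a bare boolean keeps the CI log readable: a blocked deploy can report exactly which metric or municipality regressed.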

Drift detection

A weekly cron job runs an ordinary least squares (OLS) regression on each metric over the last 4 weeks. Slope thresholds:
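For reference, the slope fit itself can be sketched as follows. This is a minimal illustration, assuming one metric value per week at evenly spaced intervals. The `max_adverse_slope` parameter is a placeholder for this example and is not one of the spec's thresholds.

```python
def ols_slope(values):
    """OLS slope of values against their index (0, 1, 2, ...), per step."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    return sxy / sxx


def is_drifting(values, max_adverse_slope, higher_is_better=True):
    """Flag drift when the weekly trend moves the wrong way too fast.

    max_adverse_slope is a hypothetical threshold (metric units per week).
    For metrics where lower is better (Brier, ECE), a rising slope is adverse.
    """
    slope = ols_slope(values)
    adverse = -slope if higher_is_better else slope
    return adverse > max_adverse_slope
```

With four weekly accuracy readings of 0.97, 0.96, 0.95, 0.94, the fitted slope is -0.01 per week, which a placeholder threshold of 0.005 would flag.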

Confidence contract — what subscribers see

| Tier | Confidence | Where it appears |
|---|---|---|
| Headline | ≥ 0.95 | "Notable Permits" + "Permit of the Week" |
| Default | 0.85 ≤ c < 0.95 | Main digest body |
| Newly Observed | 0.70 ≤ c < 0.85 | Segregated section, labeled "lower confidence" |
| Excluded | < 0.70 | Never in paid digest; supervisor queue only |
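The tier boundaries translate directly into a lookup function. The sketch below is illustrative: the function name `tier_for` and the returned tier identifiers are assumptions, while the cut-offs are the ones listed above.

```python
def tier_for(confidence: float) -> str:
    """Map a classifier confidence to its digest tier.

    Lower bounds are inclusive, matching the contract:
    >= 0.95 headline, >= 0.85 default, >= 0.70 newly observed, else excluded.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0.0, 1.0]")
    if confidence >= 0.95:
        return "headline"
    if confidence >= 0.85:
        return "default"
    if confidence >= 0.70:
        return "newly_observed"
    return "excluded"  # supervisor queue only, never in the paid digest
```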

Outcome feedback (network effect)

Every digest permit row carries a signed magic link that lets subscribers report one of: won, lost, bidding, no_contact, wrong_trade, wrong_size. When the same permit pattern receives 3 or more wrong_trade reports, a label correction is proposed; once a user confirms it with a button push, it goes into a new GT version. In this way the classifier learns from production outcomes. See /api/outcome/health.
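The wrong_trade trigger can be sketched as a simple count per pattern. This is a hedged illustration: how a "permit pattern" is keyed is not defined in this summary, so `pattern_key` is a hypothetical identifier, and the sketch assumes every report counts once (no de-duplication per subscriber).

```python
from collections import Counter

# Outcome values from the feedback link; the threshold is the one stated above.
OUTCOMES = {"won", "lost", "bidding", "no_contact", "wrong_trade", "wrong_size"}
WRONG_TRADE_MIN_REPORTS = 3


def correction_candidates(reports):
    """Return pattern keys that qualify for a label correction proposal.

    reports is an iterable of (pattern_key, outcome) pairs, where
    pattern_key is a hypothetical identifier for a permit pattern.
    Proposals still require human confirmation before entering a new GT version.
    """
    counts = Counter()
    for pattern_key, outcome in reports:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        if outcome == "wrong_trade":
            counts[pattern_key] += 1
    return sorted(key for key, n in counts.items() if n >= WRONG_TRADE_MIN_REPORTS)
```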

Multi-rater GT protocol (v2+)

GT expansions past v1 use two independent LLM judges (Anthropic and OpenAI, both at temperature 0). Labels are auto-accepted when the judges agree; a user button push is required only on disagreement. Spec: data/eval/gt_label_protocol.md.
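The agreement rule can be sketched as a triage step over the two judges' outputs. This is an assumption-laden illustration: the function name `triage` and the dict-of-labels input shape are invented for the example, and the authoritative procedure is the protocol spec.

```python
def triage(labels_a: dict, labels_b: dict):
    """Split permits into auto-accepted labels and a human review queue.

    labels_a and labels_b map a permit id to the label each judge assigned.
    Both judges are assumed to have labeled exactly the same permits.
    """
    if labels_a.keys() != labels_b.keys():
        raise ValueError("judges must label the same set of permits")
    accepted = {}
    needs_review = []
    for permit_id in sorted(labels_a):
        if labels_a[permit_id] == labels_b[permit_id]:
            accepted[permit_id] = labels_a[permit_id]  # agreement: auto-accept
        else:
            needs_review.append(permit_id)  # disagreement: human decides
    return accepted, needs_review
```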

Reproducibility

The eval sets, evaluation code, prompt history, and metric definitions are all version-controlled in the PermitPulse repository. A third party can re-label a random 50-record sample of the GT and reproduce the published accuracy within ±2 percentage points.