PermitPulse — Classification Methodology

What classification does

Every permit ingested from a municipal portal is labeled with trade (solar, roofing, hvac, electrical, plumbing, other), project_size (small/medium/large), est_value_usd, and confidence (0.00-1.00). Every row is stamped with classifier_version for reproducibility and rollback.
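As a rough illustration of the row shape described above, the sketch below models one labeled permit as a Python dataclass. The class name PermitLabel, the validation rules, and the example values are assumptions made for illustration; only the field names and the allowed trade and size values come from this document.

```python
from dataclasses import dataclass

# Allowed values as listed in this document.
TRADES = ("solar", "roofing", "hvac", "electrical", "plumbing", "other")
SIZES = ("small", "medium", "large")


@dataclass(frozen=True)
class PermitLabel:  # hypothetical name, not taken from the codebase
    trade: str
    project_size: str
    est_value_usd: float
    confidence: float
    classifier_version: str

    def __post_init__(self) -> None:
        # Reject labels outside the documented vocabularies and range.
        if self.trade not in TRADES:
            raise ValueError(f"unknown trade: {self.trade!r}")
        if self.project_size not in SIZES:
            raise ValueError(f"unknown project_size: {self.project_size!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0.00, 1.00]")


# Example values are made up for illustration.
label = PermitLabel("solar", "small", 18500.0, 0.93, "clf-example-1")
```

Freezing the dataclass mirrors the idea that a stamped row is a record of what a given classifier_version produced, not something to edit in place.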

Ground-truth eval sets

classification_v1: 500 permits pulled by hand in week 0 from 5 MA municipalities and hand-labeled.
classification_adversarial_v1: ~100 hand-crafted edge cases across 10 categories: PV-repair vs install, panel-only-no-PV, HVAC+EV combos, roof-tear-off-as-solar-prep, truncated descriptions, license-only records, bilingual / code-switched descriptions, misspellings, ambiguous overlap, compound multi-trade.

Both sets are locked and append-only: expansion creates _v2, _v3, … and never mutates a prior version. CI gates on the highest version.
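One way CI could pick the highest version of an eval set is sketched below. The function name highest_version and the assumption that set names follow the pattern <base>_v<N> are illustrative; the actual storage layout is not described here.

```python
import re


def highest_version(names, base):
    """Return the name in `names` of the form <base>_v<N> with the largest N.

    Hypothetical helper: assumes versions are encoded as an integer suffix.
    """
    pattern = re.compile(rf"^{re.escape(base)}_v(\d+)$")
    versions = [(int(m.group(1)), n) for n in names if (m := pattern.match(n))]
    if not versions:
        raise LookupError(f"no versions found for {base!r}")
    # Compare numerically so that v10 sorts above v2.
    return max(versions)[1]


available = [
    "classification_v1",
    "classification_v2",
    "classification_v10",
    "classification_adversarial_v1",
]
latest = highest_version(available, "classification")
latest_adversarial = highest_version(available, "classification_adversarial")
```

Parsing the suffix as an integer avoids the string-sort trap where "v10" would rank below "v2".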

Ship gate

No prompt change deploys unless every metric passes on both eval sets:

Metric	Threshold	What it catches
accuracy	≥ 0.95	Exact trade match.
macro F1	≥ 0.90	Balanced precision/recall across 6 classes.
per-class recall (min, present classes)	≥ 0.85	Sparse-class collapse, e.g. plumbing is rare in the solar-first MA GT.
Brier score	≤ 0.05	Calibration: a confidence of 0.95 should be wrong only 5% of the time.
ECE (10-bin)	≤ 0.10	Calibration drift across confidence bands.
per-municipality accuracy	≥ 0.90 each	Source-shape regression isolation (CitizenServe vs ViewPoint vs eTRAKiT vs OpenGov).
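The calibration rows and the pass/fail logic of the table above can be sketched as follows. This is a minimal sketch under stated assumptions: the Brier score is taken in its top-label form (squared gap between the reported confidence and whether the predicted trade was correct), ECE uses 10 equal-width bins, and the metric key names in THRESHOLDS are hypothetical.

```python
def brier_score(confidences, correct):
    """Mean squared gap between confidence and correctness (assumed top-label form)."""
    pairs = list(zip(confidences, correct))
    return sum((c - float(ok)) ** 2 for c, ok in pairs) / len(pairs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, float(ok)))
    total = sum(len(b) for b in bins)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / total * abs(accuracy - mean_conf)
    return ece


# Thresholds from the table above; the key names are hypothetical.
THRESHOLDS = {
    "accuracy": (">=", 0.95),
    "macro_f1": (">=", 0.90),
    "min_class_recall": (">=", 0.85),
    "brier": ("<=", 0.05),
    "ece": ("<=", 0.10),
    "min_municipality_accuracy": (">=", 0.90),
}


def failed_metrics(metrics):
    """Return the names of metrics that miss their threshold."""
    failures = []
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics[name]
        passed = value >= limit if op == ">=" else value <= limit
        if not passed:
            failures.append(name)
    return failures
```

A deploy check would call failed_metrics once per eval set and block the change if either call returns a non-empty list.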

Drift detection

A weekly cron job fits an ordinary least squares (OLS) line to each metric over the last 4 weeks. Slope thresholds:

Warn: slope of -0.5 pp/week or worse (+0.5 pp/week or worse for Brier/ECE, where lower is better)
Crit: slope of -1.0 pp/week or worse → automatic prompt rollback to the last passing prompt_version
ECE > 0.15 → Content-Editorial paid_digest_min_confidence auto-tightens by +0.05 until calibration recovers
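The slope check can be sketched in a few lines. The assumptions here are that metrics are stored as fractions in [0, 1] (so a slope is multiplied by 100 to get pp/week), that the weekly points are evenly spaced, that thresholds are inclusive, and that the critical threshold applies to Brier/ECE with the sign flipped in the same way as the warning threshold. The function names are hypothetical.

```python
def ols_slope(values):
    """Least-squares slope of values against evenly spaced x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def drift_level(values, higher_is_better=True, warn_pp=0.5, crit_pp=1.0):
    """Classify a 4-week metric series as 'ok', 'warn', or 'crit'."""
    slope_pp = ols_slope(values) * 100  # fraction per week -> pp per week
    # Express the slope as degradation so one comparison covers both directions.
    degradation = -slope_pp if higher_is_better else slope_pp
    if degradation >= crit_pp:
        return "crit"
    if degradation >= warn_pp:
        return "warn"
    return "ok"


# Made-up series: accuracy falling 1.5 pp/week, Brier rising 0.7 pp/week.
accuracy_level = drift_level([0.970, 0.955, 0.940, 0.925])
brier_level = drift_level([0.030, 0.037, 0.044, 0.051], higher_is_better=False)
```

Fitting a line rather than comparing the first and last week makes the check less sensitive to a single noisy run.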

Confidence contract — what subscribers see

Tier	Confidence	Where it appears
Headline	≥ 0.95	"Notable Permits" + "Permit of the Week"
Default	0.85 ≤ c < 0.95	Main digest body
Newly Observed	0.70 ≤ c < 0.85	Segregated section, labeled "lower confidence"
Excluded	< 0.70	Never in paid digest; supervisor queue only
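The tier boundaries in the table above map directly onto a small function. The function name digest_tier and the lowercase tier identifiers are assumptions for illustration.

```python
def digest_tier(confidence):
    """Map a confidence in [0, 1] to the digest tier from the table above."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0.00, 1.00]")
    if confidence >= 0.95:
        return "headline"
    if confidence >= 0.85:
        return "default"
    if confidence >= 0.70:
        return "newly_observed"
    return "excluded"  # never in the paid digest; supervisor queue only


tiers = [digest_tier(c) for c in (0.97, 0.95, 0.90, 0.70, 0.69)]
```

Checking the bounds from the top down keeps each lower bound inclusive and each upper bound exclusive, matching the ranges as written.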

Outcome feedback (network effect)

Every permit row in a digest carries a signed magic link that lets subscribers report an outcome: won / lost / bidding / no_contact / wrong_trade / wrong_size. Three or more wrong_trade reports on the same permit pattern trigger a label-correction proposal, which a user confirms with a button push before it becomes a new GT version. This is how the classifier learns from production outcomes. See /api/outcome/health.
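The two mechanical pieces of this loop, signing a feedback link and counting wrong_trade reports, can be sketched as below. The signing scheme (HMAC-SHA256 over permit and subscriber IDs), the function names, and the idea of an opaque pattern key are all assumptions; this document does not specify how links are signed or how a "permit pattern" is defined.

```python
import hashlib
import hmac
from collections import Counter

WRONG_TRADE_THRESHOLD = 3  # from this document: at least 3 wrong_trade reports


def sign_link(permit_id, subscriber_id, secret):
    """Return a hex HMAC-SHA256 signature (assumed scheme) for a feedback link.

    Assumes the IDs contain no ':' so the joined message is unambiguous.
    """
    message = f"{permit_id}:{subscriber_id}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_link(permit_id, subscriber_id, signature, secret):
    """Constant-time check that a signature matches the given IDs."""
    expected = sign_link(permit_id, subscriber_id, secret)
    return hmac.compare_digest(expected, signature)


def correction_proposals(reports):
    """Return pattern keys with enough wrong_trade reports to propose a fix.

    `reports` is an iterable of (pattern_key, outcome) pairs; pattern_key is
    a hypothetical opaque identifier for a permit pattern.
    """
    counts = Counter(key for key, outcome in reports if outcome == "wrong_trade")
    return sorted(key for key, n in counts.items() if n >= WRONG_TRADE_THRESHOLD)


# Made-up reports: pattern "a" reaches the threshold, pattern "b" does not.
reports = [("a", "wrong_trade")] * 3 + [("b", "wrong_trade")] * 2 + [("a", "won")]
proposals = correction_proposals(reports)
```

A proposal produced this way would still wait for the user confirmation step before any GT version is created.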

Multi-rater GT protocol (v2+)

GT expansions past v1 are labeled by two independent LLM judges (Anthropic and OpenAI, both at temperature 0). A label is auto-accepted when the judges agree; a user button push is required only when they disagree. Spec: data/eval/gt_label_protocol.md.
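The agreement rule can be sketched without any model calls by treating each judge's output as a plain label string. The function name resolve_label and the status values are hypothetical; the authoritative rules live in the spec file named above.

```python
def resolve_label(judge_a, judge_b):
    """Auto-accept when both judges agree; otherwise route to a human."""
    if judge_a == judge_b:
        return {"label": judge_a, "status": "auto_accepted"}
    return {
        "label": None,
        "status": "needs_user_confirmation",
        "candidates": (judge_a, judge_b),
    }


# Made-up judge outputs for two permits.
agreed = resolve_label("solar", "solar")
disputed = resolve_label("solar", "roofing")
```

Keeping both candidate labels on a disputed record lets the confirming user choose between them instead of labeling from scratch.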

Reproducibility

The eval sets, evaluation code, prompt history, and metric definitions are all version-controlled in the PermitPulse repository. A third party can re-label a random 50-record sample of GT and reproduce the published accuracy within ±2pp.
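A third-party check along these lines might look like the sketch below. The function names, the seeded sampling, and the dictionary-based inputs are assumptions for illustration; the repository's own evaluation code is the reference.

```python
import random


def draw_sample(record_ids, seed, k=50):
    """Draw a reproducible random sample of k record IDs."""
    return random.Random(seed).sample(list(record_ids), k)


def sample_accuracy(sample_ids, predicted, relabeled):
    """Share of sampled records where the classifier matches the re-label."""
    hits = sum(predicted[i] == relabeled[i] for i in sample_ids)
    return hits / len(sample_ids)


def reproduces(published, observed, tolerance_pp=2.0):
    """True when observed accuracy is within tolerance_pp of the published one."""
    # Small epsilon guards against float rounding at the exact boundary.
    return abs(published - observed) * 100 <= tolerance_pp + 1e-9


# Made-up data: 500 records, one disagreement inside the sample.
ids = [f"permit-{i}" for i in range(500)]
sample = draw_sample(ids, seed=7)
predicted = {i: "solar" for i in ids}
relabeled = dict(predicted)
relabeled[sample[0]] = "roofing"
observed = sample_accuracy(sample, predicted, relabeled)
```

With 50 records, each disagreement moves the observed accuracy by 2pp, so the sample size sets the granularity of this check.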