How PermitPulse trains, evaluates, and ships its classifier. Plain-language summary; the authoritative spec is permitpulse-engineering.md §22, §27.1, §30, §32.1.
Every permit ingested from a municipal portal is labeled with trade (solar, roofing, hvac, electrical, plumbing, other), project_size (small/medium/large), est_value_usd, and confidence (0.00-1.00). Every row is stamped with classifier_version for reproducibility and rollback.
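The labeled row described above can be pictured as a small validated record. This is a minimal sketch only: the class name `PermitLabel`, the field types, and the example version string are assumptions for illustration, not taken from the engineering spec.

```python
from dataclasses import dataclass

# Allowed values as listed in the summary above.
TRADES = {"solar", "roofing", "hvac", "electrical", "plumbing", "other"}
SIZES = {"small", "medium", "large"}


@dataclass(frozen=True)
class PermitLabel:
    """Hypothetical in-memory shape of one classified permit row."""
    trade: str
    project_size: str
    est_value_usd: float      # assumed numeric; the spec may store it differently
    confidence: float         # 0.00-1.00
    classifier_version: str   # stamped on every row for reproducibility and rollback

    def __post_init__(self):
        if self.trade not in TRADES:
            raise ValueError(f"unknown trade: {self.trade!r}")
        if self.project_size not in SIZES:
            raise ValueError(f"unknown project_size: {self.project_size!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence!r}")


# Example row; the version string is a placeholder, not a real release name.
row = PermitLabel("solar", "small", 18500.0, 0.97, "example-version")
```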
classification_v1: 500 permits hand-pulled wk0 from 5 MA municipalities, hand-labeled.

classification_adversarial_v1: ~100 hand-crafted edge cases across 10 categories: PV-repair vs install, panel-only-no-PV, HVAC+EV combos, roof-tear-off-as-solar-prep, truncated descriptions, license-only records, bilingual / code-switched descriptions, misspellings, ambiguous overlap, compound multi-trade.

Both sets are locked, append-only. Expansion to _v2, _v3, … never mutates prior versions. CI gates on highest version.
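One way to read "CI gates on highest version" is that, for each eval-set family, CI picks the name with the largest `_vN` suffix. The sketch below assumes that reading; the function name `highest_versions` and the idea of passing a flat list of set names are illustrative, not the repository's actual layout.

```python
import re

# Matches names such as "classification_v1" or "classification_adversarial_v3".
_VERSION_RE = re.compile(r"^(?P<base>.+)_v(?P<num>\d+)$")


def highest_versions(names):
    """Map each eval-set family to its highest-numbered version name."""
    best = {}
    for name in names:
        match = _VERSION_RE.match(name)
        if match is None:
            continue  # not a versioned eval set; ignore it
        base, num = match.group("base"), int(match.group("num"))
        # Compare numerically so that v10 ranks above v2.
        if base not in best or num > best[base][0]:
            best[base] = (num, name)
    return {base: name for base, (num, name) in best.items()}


selected = highest_versions([
    "classification_v1",
    "classification_v2",
    "classification_v10",
    "classification_adversarial_v1",
])
```

Comparing the suffix as an integer rather than as text avoids the common mistake of ranking `_v2` above `_v10`.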
No prompt change deploys unless every metric passes on both eval sets:
| Metric | Threshold | What it catches |
|---|---|---|
| accuracy | ≥ 0.95 | Exact trade match. |
| macro F1 | ≥ 0.90 | Balanced precision/recall across 6 classes. |
| per-class recall (min, present classes) | ≥ 0.85 | Sparse-class collapse, e.g. plumbing is rare in the solar-first MA ground truth (GT). |
| Brier score | ≤ 0.05 | Calibration: a confidence of 0.95 should be wrong only 5% of the time. |
| ECE (10-bin) | ≤ 0.10 | Calibration drift across confidence bands. |
| per-municipality accuracy | ≥ 0.90 each | Source-shape regression isolation (CitizenServe vs ViewPoint vs eTRAKiT vs OpenGov). |
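
The gate in the table above can be sketched in plain Python. Several choices here are assumptions rather than the spec's definitions: the Brier score is computed on the single top-label confidence against a 0/1 correctness outcome, macro F1 averages over classes that appear in the labels or the predictions, and the function names are hypothetical. Inputs must be non-empty lists of equal length.

```python
from collections import defaultdict


def gate_metrics(y_true, y_pred, confidence, municipality):
    """Compute the six gate metrics for one eval set (assumed definitions)."""
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    accuracy = sum(correct) / n

    # Per-class F1 and recall from one-vs-rest counts.
    f1_scores, recalls = [], []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
        if tp + fn > 0:  # recall only for classes present in the labels
            recalls.append(tp / (tp + fn))

    # Brier score on top-label confidence vs. correctness.
    brier = sum((q - k) ** 2 for q, k in zip(confidence, correct)) / n

    # 10-bin expected calibration error; confidence 1.0 falls in the last bin.
    bins = defaultdict(list)
    for q, k in zip(confidence, correct):
        bins[min(int(q * 10), 9)].append((q, k))
    ece = sum(
        len(b) / n
        * abs(sum(k for _, k in b) / len(b) - sum(q for q, _ in b) / len(b))
        for b in bins.values()
    )

    by_muni = defaultdict(list)
    for m, k in zip(municipality, correct):
        by_muni[m].append(k)

    return {
        "accuracy": accuracy,
        "macro_f1": sum(f1_scores) / len(f1_scores),
        "min_class_recall": min(recalls),
        "brier": brier,
        "ece": ece,
        "min_municipality_accuracy": min(sum(v) / len(v) for v in by_muni.values()),
    }


def failed_checks(m):
    """Return the names of metrics that miss the thresholds in the table."""
    checks = {
        "accuracy": m["accuracy"] >= 0.95,
        "macro_f1": m["macro_f1"] >= 0.90,
        "min_class_recall": m["min_class_recall"] >= 0.85,
        "brier": m["brier"] <= 0.05,
        "ece": m["ece"] <= 0.10,
        "min_municipality_accuracy": m["min_municipality_accuracy"] >= 0.90,
    }
    return [name for name, ok in checks.items() if not ok]
```

Under this sketch a prompt change would deploy only when `failed_checks` returns an empty list for both eval sets.
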
A weekly cron job runs an ordinary least squares (OLS) regression on each metric over the last 4 weeks. Slope thresholds:
prompt_version

paid_digest_min_confidence auto-tightens by +0.05 until calibration recovers.

| Tier | Confidence | Where it appears |
|---|---|---|
| Headline | ≥ 0.95 | "Notable Permits" + "Permit of the Week" |
| Default | 0.85 ≤ c < 0.95 | Main digest body |
| Newly Observed | 0.70 ≤ c < 0.85 | Segregated section, labeled "lower confidence" |
| Excluded | < 0.70 | Never in paid digest; supervisor queue only |
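
The tier table above maps directly onto a threshold function. The sketch below is illustrative: the function name `digest_tier` is hypothetical, and treating `paid_digest_min_confidence` as a floor that can only raise the 0.70 exclusion cutoff is an assumption, since this summary does not say how that setting interacts with the tiers.

```python
def digest_tier(confidence, min_confidence=0.70):
    """Return the digest tier for a permit's confidence score.

    min_confidence is an ASSUMED stand-in for paid_digest_min_confidence;
    it is clamped so it never drops below the table's 0.70 cutoff.
    """
    floor = max(min_confidence, 0.70)
    if confidence < floor:
        return "Excluded"        # supervisor queue only
    if confidence >= 0.95:
        return "Headline"        # "Notable Permits" + "Permit of the Week"
    if confidence >= 0.85:
        return "Default"         # main digest body
    return "Newly Observed"      # segregated "lower confidence" section


tiers = [digest_tier(c) for c in (0.97, 0.90, 0.72, 0.40)]
```

With the floor tightened to 0.75, a permit at 0.72 would move from "Newly Observed" to "Excluded" under this reading.
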
Every digest permit row carries a signed magic link that lets subscribers report won / lost / bidding / no_contact / wrong_trade / wrong_size. Three or more wrong_trade reports on the same permit pattern trigger a label correction proposal; once a user confirms it with a button push, it becomes a new GT version. This is how the classifier learns from production outcomes. See /api/outcome/health.
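The reporting threshold can be sketched as a simple count per permit pattern. How a "permit pattern" is keyed is not defined in this summary, so `pattern_key` below is a hypothetical opaque string, and `correction_proposals` is an illustrative name rather than the production function.

```python
from collections import Counter

# Outcome values a subscriber can report through the magic link.
OUTCOMES = {"won", "lost", "bidding", "no_contact", "wrong_trade", "wrong_size"}
WRONG_TRADE_THRESHOLD = 3


def correction_proposals(reports):
    """Return pattern keys with at least 3 wrong_trade reports.

    reports: iterable of (pattern_key, outcome) pairs; pattern_key is a
    hypothetical identifier for "the same permit pattern".
    """
    counts = Counter()
    for pattern_key, outcome in reports:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        if outcome == "wrong_trade":
            counts[pattern_key] += 1
    return sorted(k for k, n in counts.items() if n >= WRONG_TRADE_THRESHOLD)


proposals = correction_proposals([
    ("pattern-a", "wrong_trade"),
    ("pattern-a", "wrong_trade"),
    ("pattern-a", "wrong_trade"),
    ("pattern-b", "wrong_trade"),
    ("pattern-b", "won"),
])
```

Each returned key would still need the user confirmation step before a new GT version is cut.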
GT expansions past v1 use independent LLM judges (Anthropic + OpenAI, both temp=0). Auto-accept on agreement; user button-push only on disagreement. Spec: data/eval/gt_label_protocol.md.
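The agreement rule can be sketched without committing to any vendor SDK. Here the two judges are plain callables supplied by the caller; the function name `resolve_label` and the returned status strings are assumptions for illustration, and the real protocol lives in data/eval/gt_label_protocol.md.

```python
def resolve_label(description, judge_a, judge_b):
    """Auto-accept when both judges agree; otherwise queue for a human.

    judge_a and judge_b are hypothetical callables that take a permit
    description and return a trade label. No real LLM API is called here.
    """
    label_a = judge_a(description)
    label_b = judge_b(description)
    if label_a == label_b:
        return {"status": "auto_accepted", "label": label_a}
    # Disagreement: a user decides with a button push.
    return {"status": "needs_review", "label": None, "candidates": (label_a, label_b)}


# Stub judges standing in for the two independent models.
agree = resolve_label("Install 8 kW rooftop PV", lambda d: "solar", lambda d: "solar")
split = resolve_label("Panel upgrade for PV", lambda d: "electrical", lambda d: "solar")
```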
The eval sets, evaluation code, prompt history, and metric definitions are all version-controlled in the PermitPulse repository. A third party can re-label a random 50-record sample of GT and reproduce the published accuracy within ±2 percentage points.
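A third-party check along these lines could look like the sketch below. The function names, the seeded sampling, and the dictionary shapes are all assumptions; the repository's own evaluation code is the authority on how the published figure is computed.

```python
import random


def sample_records(record_ids, k=50, seed=0):
    """Draw a reproducible random sample of record ids."""
    rng = random.Random(seed)
    # Sort first so the draw does not depend on input ordering.
    return rng.sample(sorted(record_ids), k)


def reproduces(published_accuracy, relabels, predictions, tolerance_pp=2.0):
    """Compare accuracy on independently re-labeled records to the published one.

    relabels: {record_id: third-party label}
    predictions: {record_id: classifier label}
    Returns (within_tolerance, sample_accuracy).
    """
    hits = sum(1 for rid, label in relabels.items() if predictions[rid] == label)
    sample_accuracy = hits / len(relabels)
    gap_pp = abs(sample_accuracy - published_accuracy) * 100
    return gap_pp <= tolerance_pp, sample_accuracy


ids = sample_records(range(500), k=50, seed=42)
predictions = {rid: "solar" for rid in ids}
# Hypothetical re-labeling in which 48 of 50 records match the classifier.
relabels = {rid: ("roofing" if i < 2 else "solar") for i, rid in enumerate(ids)}
ok, acc = reproduces(0.95, relabels, predictions)
```

On a 50-record sample each record moves accuracy by 2 percentage points, so the tolerance corresponds to roughly one record of difference in either direction.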