Reproducible evidence

Real plots, real datasets

This page separates measured demo evidence from conceptual architecture. The current plots are generated locally from public scikit-learn datasets, and the data behind each plot are saved as CSV outputs.

Selective routing run

Dataset: Wisconsin Diagnostic Breast Cancer, 569 cases, 30 features, 2 classes. Model: standardized logistic regression. Split: 200 held-out test cases. Full-test accuracy in this run is 99.0% (198 of 200 correct).

89.0%: auto-execute with 0 observed errors
11.0%: review at zero-error point
97.5%: auto-execute at 1% target risk
0.51%: observed error at that point
Figure: risk-coverage plot generated from the Wisconsin Diagnostic Breast Cancer dataset.
CSV outputs: risk-coverage curve, operating points.
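
The operating points above can be read off a risk-coverage curve. The following is a minimal sketch of that computation, not the script that produced the plot: the function names `risk_coverage_curve` and `max_coverage_at_risk` are hypothetical, and it assumes each test case comes with a confidence score (for example the top predicted class probability) and a flag saying whether the prediction was correct. The five-case input is toy data for illustration only.

```python
from typing import List, Optional, Sequence, Tuple


def risk_coverage_curve(
    confidence: Sequence[float], correct: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (coverage, risk) after auto-executing the k most confident cases."""
    # Rank cases from most to least confident. Ties are broken by input order,
    # which is an assumption of this sketch.
    order = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    curve = []
    errors = 0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        # coverage = share of cases auto-executed, risk = error rate among them
        curve.append((k / len(order), errors / k))
    return curve


def max_coverage_at_risk(
    curve: Sequence[Tuple[float, float]], target_risk: float
) -> Optional[Tuple[float, float]]:
    """Largest-coverage point whose observed risk is at or below target_risk."""
    feasible = [point for point in curve if point[1] <= target_risk]
    return max(feasible, key=lambda point: point[0]) if feasible else None


# Toy data (hypothetical): five cases, one of which is misclassified.
toy_confidence = [0.99, 0.97, 0.95, 0.80, 0.60]
toy_correct = [True, True, True, False, True]
toy_curve = risk_coverage_curve(toy_confidence, toy_correct)
zero_error_point = max_coverage_at_risk(toy_curve, target_risk=0.0)
```

With real model output, the review rate at any operating point is one minus its coverage, which is how a zero-error coverage of 89.0% pairs with an 11.0% review rate.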

Conformal prediction-set run

Dataset: sklearn Digits, 1,797 handwritten digit images, 64 features, 10 classes. Model: standardized logistic regression. Split: a training set, 405 calibration cases, and 450 test cases. A case is auto-executed only when its conformal prediction set contains exactly one label; every other case goes to review.

1%: alpha operating point
100%: observed test coverage
68.9%: singleton auto-execution
31.1%: review rate
Figure: conformal review budget plot from sklearn Digits.
CSV output: digits_conformal_sets.csv.
Figure: conformal empirical coverage plot from sklearn Digits.
This is one random split, not a production guarantee. Production use needs rolling calibration and drift monitoring.
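
The singleton rule for this run can be sketched with split conformal prediction. The page does not say which nonconformity score the run uses, so this sketch assumes the common choice of one minus the probability assigned to the true label. The function names `conformal_threshold`, `prediction_set`, and `route` are hypothetical, and the eight calibration cases are toy data, not the Digits calibration set.

```python
import math
from typing import List, Sequence


def conformal_threshold(
    cal_probs: Sequence[Sequence[float]], cal_labels: Sequence[int], alpha: float
) -> float:
    """Calibrate a score threshold from held-out calibration cases."""
    # Assumed nonconformity score: 1 - probability of the true label.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration cases for this alpha: keep every label.
        return math.inf
    return scores[rank - 1]


def prediction_set(probs: Sequence[float], threshold: float) -> List[int]:
    """Labels whose nonconformity score does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


def route(probs: Sequence[float], threshold: float) -> str:
    """Auto-execute only when the prediction set holds exactly one label."""
    return "auto" if len(prediction_set(probs, threshold)) == 1 else "review"


# Toy calibration data (hypothetical): three classes, true label always 0.
cal_probs = [[p, 1.0 - p, 0.0] for p in (0.95, 0.9, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3)]
cal_labels = [0] * len(cal_probs)
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

confident_case = route([0.9, 0.05, 0.05], threshold)
ambiguous_case = route([0.5, 0.5, 0.0], threshold)
```

Under this sketch, rolling calibration would amount to recomputing `threshold` on a recent window of labeled cases, so that the review rate tracks the data the model currently sees.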

Real agent benchmark context

The site also includes a fetched copy of the SWE-bench Verified test split: 500 real GitHub issue-fix tasks from public repositories. This is not a router result. It grounds the enterprise-agent story in an actual software-agent benchmark rather than in classifier proxies alone.