Reproducible evidence

Real plots, real datasets

This page separates measured demo evidence from conceptual architecture. The current plots are generated locally from public sklearn datasets and saved with CSV outputs.

Selective routing run

Dataset: Wisconsin Diagnostic Breast Cancer, 569 cases, 30 features, 2 classes. Model: standardized logistic regression. Split: 200 held-out test cases. Full-test accuracy in this run is 99.0%.

89.0% auto-execute with 0 observed errors

11.0% review at the zero-error point

97.5% auto-execute at 1% target risk

0.51% observed error at that point

Risk coverage plot generated from the Wisconsin Diagnostic Breast Cancer dataset — CSV outputs: risk-coverage curve, operating points.
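A risk-coverage curve of this kind can be sketched in a few lines. The sketch below is a minimal illustration, not the code that produced the plot. It assumes each test case has a confidence score (for example the model's top predicted probability) and a flag for whether the prediction was correct. The function names `risk_coverage_curve` and `max_coverage_at_risk` and the toy inputs are hypothetical.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, selective_risk) pairs, most confident cases first.

    coverage       = fraction of cases auto-executed
    selective_risk = error rate among the auto-executed cases
    """
    n = len(confidences)
    # Accept cases in order of decreasing confidence (ties keep input order).
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    curve = []
    errors = 0
    for accepted, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        curve.append((accepted / n, errors / accepted))
    return curve


def max_coverage_at_risk(curve, target_risk):
    """Largest coverage whose observed selective risk is <= target_risk."""
    eligible = [coverage for coverage, risk in curve if risk <= target_risk]
    return max(eligible, default=0.0)


# Hypothetical toy inputs, NOT the Wisconsin run.
confidences = [0.99, 0.95, 0.90, 0.80, 0.60]
correct = [True, True, True, False, True]

curve = risk_coverage_curve(confidences, correct)
zero_error_coverage = max_coverage_at_risk(curve, 0.0)
print(curve)
print(zero_error_coverage)
```

Everything not auto-executed goes to review, so the review rate is one minus the chosen coverage. The reported operating point is consistent with this arithmetic: 97.5% of 200 test cases is 195 cases, and one error among 195 gives 1/195, or about 0.51%.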

Conformal prediction-set run

Dataset: sklearn Digits, 1,797 handwritten digit images, 64 features, 10 classes. Model: standardized logistic regression. Split: a training set, 405 calibration cases, and 450 test cases. Auto-execution means the conformal prediction set contains exactly one label.

1% alpha operating point

100% observed test coverage

68.9% singleton auto-execution

31.1% review rate

Conformal review budget plot from sklearn Digits — CSV output: digits_conformal_sets.csv.

Conformal empirical coverage plot from sklearn Digits. This is one random split, not a production guarantee; production use needs rolling calibration and drift monitoring.
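The routing rule described in this section (auto-execute only when the prediction set is a singleton) can be sketched with split conformal prediction. The page does not state which nonconformity score the run uses, so the sketch below assumes a common choice: one minus the predicted probability of the true label. The function names and toy probabilities are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a calibration set (split conformal).

    Assumed score: 1 - predicted probability of the true label.
    """
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample rank: ceil((n + 1) * (1 - alpha)), 1-indexed.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration cases: every label is included
    return scores[rank - 1]


def prediction_set(probs, qhat):
    """All labels whose score 1 - p is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]


def summarize(test_probs, test_labels, qhat):
    """Empirical coverage and singleton (auto-execution) rate."""
    sets = [prediction_set(p, qhat) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    singleton_rate = sum(len(s) == 1 for s in sets) / len(sets)
    return coverage, singleton_rate


# Hypothetical toy probabilities over 3 classes, NOT the Digits run.
cal_probs = [[0.95, 0.03, 0.02], [0.05, 0.90, 0.05],
             [0.10, 0.10, 0.80], [0.30, 0.60, 0.10]]
cal_labels = [0, 1, 2, 0]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

test_probs = [[0.90, 0.05, 0.05], [0.50, 0.40, 0.10]]
test_labels = [0, 1]
coverage, singleton_rate = summarize(test_probs, test_labels, qhat)
print(qhat, coverage, singleton_rate)
```

In this toy example the first test case yields a one-label set and would be auto-executed, while the second yields two labels and would go to review. With 405 calibration cases and alpha of 1%, the same rank formula selects the 402nd smallest calibration score.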

Real agent benchmark context

The site also includes a downloaded copy of the SWE-bench Verified test split: 500 real GitHub issue-fix tasks from public repositories. This is not a router result, but it grounds the enterprise-agent story in an actual software-agent benchmark rather than in classifier proxies alone.

SWE-bench Verified difficulty distribution — CSV output: swebench_verified_difficulty_counts.csv.

Top repositories in SWE-bench Verified — CSV outputs: repo counts, task metrics.
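Count tables like these can be produced with a simple tally over task records. The sketch below is illustrative only. It assumes each task is a dictionary with `repo` and `difficulty` fields, and the records, repository names, and difficulty labels shown are hypothetical placeholders rather than benchmark data.

```python
import csv
import io
from collections import Counter


def count_by_field(records, field):
    """Tally how many task records share each value of `field`."""
    return Counter(record[field] for record in records)


def counts_to_csv(counts, key_header):
    """Render counts as CSV text, most frequent value first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([key_header, "count"])
    for key, count in counts.most_common():
        writer.writerow([key, count])
    return buffer.getvalue()


# Hypothetical placeholder records; field names are assumptions.
records = [
    {"repo": "example/alpha", "difficulty": "easy"},
    {"repo": "example/alpha", "difficulty": "hard"},
    {"repo": "example/beta", "difficulty": "easy"},
]

repo_counts = count_by_field(records, "repo")
difficulty_counts = count_by_field(records, "difficulty")
repo_csv = counts_to_csv(repo_counts, "repo")
print(repo_csv)
```

Writing the returned text to a file would give one CSV per plot, which keeps each figure traceable to the counts behind it.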