The Open Model Test is a self-evaluation tool for pitch model makers, not a competition. A CSV benchmark fundamentally can't prevent cheating, so ranking submissions against each other would be meaningless. Here's why.
A submitter doesn't have to use the same model for every season. They could train one model optimized for the 2022→2023 outcome pair, another for 2023→2024, and so on. Since all outcome data is public and historical, someone could reverse-engineer "grades" that correlate perfectly with known outcomes. Forward prediction was supposed to prevent this, but it only works if the same model produces all grades — something a CSV-only benchmark cannot verify.
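To make that exploit concrete, here is a minimal sketch of per-season overfitting. The column names (`season`, `next_season_xwoba`, a handful of physics features) are hypothetical, and nothing says any submitter actually does this; the point is that nothing in a CSV of grades can rule it out.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical column names: a few physics features plus 'season' and
# 'next_season_xwoba', the already-public outcome a grade is judged against.
FEATURES = ["release_speed", "ivb", "hb", "release_pos_x", "release_pos_z", "extension"]

def overfit_grades(pitches: pd.DataFrame) -> pd.DataFrame:
    """Fit a separate flexible model for each season, using the known
    next-season outcome as the training target. The resulting 'grades'
    correlate almost perfectly with outcomes while modeling nothing about
    pitch quality, and a CSV of grades cannot reveal the difference."""
    graded = []
    for _, chunk in pitches.groupby("season"):
        model = GradientBoostingRegressor(n_estimators=500, max_depth=6)
        model.fit(chunk[FEATURES], chunk["next_season_xwoba"])  # target leakage by design
        graded.append(chunk.assign(grade=model.predict(chunk[FEATURES])))
    return pd.concat(graded, ignore_index=True)
```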
The alternative — submitting grades before outcomes are known — introduces a different problem. Models that factor in age regression, injury risk, workload decline, and other non-physics signals would outperform pure pitch-quality models. The benchmark would reward projection systems, which is the opposite of what a pitch model evaluation should measure.
Any set of pitch features rich enough to grade a pitch is rich enough to identify who threw it. Release point alone has an ICC > 0.94 across pitchers. A sufficiently flexible model can memorize "this pitcher's trait cluster → known outcomes" and produce grades that look predictive but are really just an identity lookup. Even an API-based approach that sends pitch physics without outcomes can be defeated by cross-referencing those features against public Statcast data.
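A rough sketch of that cross-referencing attack, assuming Statcast-style column names (`release_pos_x`, `pfx_x`, and so on) and a simple nearest-neighbor match. A determined attacker would do better, but given how stable release points are, even something this simple goes a long way.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical inputs: 'public' is Statcast-style pitch data with pitcher ids,
# 'anon' is the de-identified physics a benchmark API would send.
ID_FEATURES = ["release_pos_x", "release_pos_z", "release_extension",
               "release_speed", "pfx_x", "pfx_z"]

def reidentify(anon: pd.DataFrame, public: pd.DataFrame) -> pd.Series:
    """Match each anonymous pitch to its nearest public pitch in feature
    space. Release point and movement are so stable per pitcher (release
    point ICC > 0.94) that the nearest neighbor recovers identity, and
    identity unlocks every known outcome."""
    nn = NearestNeighbors(n_neighbors=1).fit(public[ID_FEATURES])
    _, idx = nn.kneighbors(anon[ID_FEATURES])
    return public["pitcher"].iloc[idx.ravel()].reset_index(drop=True)
```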
The fundamental limitation: from outputs alone, you cannot distinguish between a model that genuinely understands pitch physics, a lookup table of known outcomes, and a projection system wearing a pitch model costume. The only way to verify is to run the model yourself on controlled inputs — which would exclude proprietary models and make the benchmark inaccessible.
The evaluation framework is rigorous: forward xwOBA correlation, Marcel baseline comparison, command leakage detection, team-switcher portability, volume-tier stability, and per-pitch-type diagnostics. Together these checks help modelers understand their model's strengths and blind spots. That's genuinely useful. A competitive leaderboard would just invite gaming. The people who want honest feedback don't need a leaderboard, and the people who want a leaderboard might not provide honest grades.
The diagnostics are still here
Submit your grades and get a full scorecard with forward correlations, command leakage analysis, team-switcher portability, and more. The evaluation framework is the same — it just doesn't rank you against others.
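To give a sense of what the forward-correlation piece of that scorecard involves, here is a rough sketch. The column names, the pitcher-season aggregation, and the last-year-xwOBA stand-in for the Marcel baseline are assumptions for illustration, not the tool's actual scoring code.

```python
import pandas as pd
from scipy.stats import spearmanr

def forward_xwoba_report(grades: pd.DataFrame, outcomes: pd.DataFrame) -> dict:
    """Correlate year-N pitcher grades with year-N+1 xwOBA and compare against
    a naive last-year-xwOBA baseline standing in for Marcel. Assumed columns:
    grades has pitcher/season/grade, outcomes has pitcher/season/xwoba."""
    # Average per-pitch grades up to one value per pitcher-season.
    by_pitcher = grades.groupby(["pitcher", "season"], as_index=False)["grade"].mean()

    # Shift outcomes back one season so year-N rows line up with year-N+1 results.
    nxt = outcomes.rename(columns={"xwoba": "xwoba_next"}).assign(
        season=lambda d: d["season"] - 1
    )

    model_rows = by_pitcher.merge(nxt, on=["pitcher", "season"])
    base_rows = outcomes.merge(nxt, on=["pitcher", "season"])

    model_corr, _ = spearmanr(model_rows["grade"], model_rows["xwoba_next"])
    base_corr, _ = spearmanr(base_rows["xwoba"], base_rows["xwoba_next"])
    return {"model_forward_corr": model_corr, "baseline_forward_corr": base_corr}
```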
Go to the evaluation tool