Score your pitch grading model against next-year outcomes. Upload grades, get a full diagnostic scorecard — forward correlations, Marcel gate, command leakage checks, and more.
Why this isn't a leaderboard. A CSV benchmark can't prevent cheating. Submissions can be historically overfit — someone can train a different model per season against known outcomes. And for prospective tests, models that include projection features unrelated to pitch physics (age regression, injury risk, workload trends) will outperform pure pitch-quality models, rewarding the wrong thing. There's no way to verify from outputs alone whether grades came from pitch physics or a projection system. Full explanation →
This is a diagnostic tool, not a competition. We built pitch grading models, ran into the same evaluation problems everyone else has, and built a standardized scorecard. Submit your grades, get honest diagnostics about where your model works and where it doesn't. No rankings, no leaderboard — just useful feedback for model makers.
Every public pitch model reports different validation metrics, measured against different populations, over different time periods, using different minimum-IP thresholds. Comparing them head-to-head is effectively impossible.
There's no shared test. Nobody submits their grades to a common evaluation. So “our model correlates at 0.45 with FIP” is unfalsifiable — it depends on how you filter the population, which year you test, and whether you test same-year or forward.
The Open Model Test gives you a standardized diagnostic framework: forward prediction against next-year outcomes, Marcel baseline comparison, command leakage detection, team-switcher portability, and more. It's designed to help you honestly assess your own model's strengths and blind spots — not to compare against other submissions.
One test that catches both bad models and overfit models: Year N grades must predict Year N+1 outcomes.
If you overfit to 2024, your 2025 predictions will be bad. Forward prediction catches most overfitting naturally. The one failure mode it doesn't catch on its own is identity leakage — when a model implicitly memorizes pitcher trait clusters rather than learning physics. Since the same pitcher shows up next year with similar traits, memorization looks like prediction. The benchmark includes a dedicated diagnostic for this (see Debut-Pitcher Diagnostic below).
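Mechanically, the forward check amounts to joining year-N grades to year-N+1 outcomes and correlating. A minimal sketch, using toy data and assumed column names (not the evaluator's actual schema):

```python
import pandas as pd

# Toy data -- illustrative values, not real pitchers.
grades = pd.DataFrame({
    "mlbam_id": [1, 2, 3, 4],
    "year": [2023] * 4,
    "pitching_grade": [60.0, 45.0, 55.0, 50.0],
})
outcomes = pd.DataFrame({
    "mlbam_id": [1, 2, 3, 4],
    "year": [2024] * 4,
    "xwoba": [0.280, 0.330, 0.300, 0.310],
})

# Join year-N grades to year-N+1 outcomes.
merged = grades.assign(outcome_year=grades["year"] + 1).merge(
    outcomes,
    left_on=["mlbam_id", "outcome_year"],
    right_on=["mlbam_id", "year"],
    suffixes=("_grade", "_outcome"),
)

# Higher grade should mean less damage allowed, so a good
# forward correlation against xwOBA allowed is negative.
forward_corr = merged["pitching_grade"].corr(merged["xwoba"])
```

A model overfit to year N will show a strong same-year correlation but a weak forward one; the scorecard's forward-to-same-year ratio makes that gap visible.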
Submit a CSV with one row per (pitcher, pitch type, season, batter hand) combination. Each pitcher gets grades for each pitch type they throw, split by batter handedness. Higher grade = better. The benchmark covers three grade seasons — 2022, 2023, and 2024 — evaluated against 2023, 2024, and 2025 outcomes respectively. All three years go in one CSV.
| Column | Type | Description |
|---|---|---|
| mlbam_id | int | MLB Advanced Media pitcher ID (Statcast pitcher field) |
| year | int | Grade season: 2022, 2023, or 2024 |
| batter_hand | str | L or R — the batter handedness the grade applies to |
| pitch_type | str | Statcast pitch classification (FF, SI, FT, SL, CH, CU, FC, ST, FS, KC, SV) — FT (two-seam) is accepted and mapped to SI internally |
| stuff_grade | float | Pitch quality grade (physics / movement / “nastiness”) |
| pitching_grade | float | Overall pitching grade (stuff + command + context) |
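A minimal pre-submission validator for this schema might look like the following. This is a sketch of the checks described on this page, not the official evaluator; the FT-to-SI mapping and key columns follow the table above:

```python
import pandas as pd

VALID_PITCH_TYPES = {"FF", "SI", "SL", "CH", "CU", "FC", "ST", "FS", "KC", "SV"}
KEY = ["mlbam_id", "year", "batter_hand", "pitch_type"]

def validate_submission(df: pd.DataFrame) -> pd.DataFrame:
    """Check a submission against the schema above and return a
    normalized copy, or raise ValueError. Unofficial sketch."""
    df = df.copy()
    # FT (two-seam) is accepted and folded into SI.
    df["pitch_type"] = df["pitch_type"].replace({"FT": "SI"})
    if not set(df["pitch_type"]).issubset(VALID_PITCH_TYPES):
        raise ValueError("unknown pitch_type present")
    if not df["year"].isin([2022, 2023, 2024]).all():
        raise ValueError("year must be 2022, 2023, or 2024")
    if not df["batter_hand"].isin(["L", "R"]).all():
        raise ValueError("batter_hand must be 'L' or 'R'")
    if df[KEY + ["stuff_grade", "pitching_grade"]].isna().any().any():
        raise ValueError("null values are not allowed")
    if df.duplicated(subset=KEY).any():
        raise ValueError("duplicate (mlbam_id, year, batter_hand, pitch_type) rows")
    return df
```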
Why per pitch type? A real pitch model produces different grades for a pitcher's fastball vs their slider. A projection system that assigns one number per pitcher cannot distinguish a pitcher's best pitch from their worst. Requiring pitch-type-level grades is the single most effective filter against “projection systems in a trenchcoat” — submissions that predict outcomes well by modeling the pitcher, not the pitches.
Why split by handedness? Platoon effects are one of the strongest signals in baseball. A model that averages across batter hands is throwing away information. Splitting forces models to capture real matchup skill rather than blending two different pitcher profiles into one number.
Why two grades? Stuff and Pitching measure different things. A pitcher can throw nasty pitches (high Stuff) but put them in bad locations (low Pitching), or vice versa. Models that only produce one grade can submit the same value for both columns — they'll just score lower if the single grade can't capture both dimensions.
No model internals required. Participants submit grades, not code. The benchmark aggregates pitch-type grades to pitcher-hand level internally for the primary evaluation using usage-weighted average (fraction of pitches thrown per pitch type). The evaluator validates the schema: one row per (mlbam_id, year, batter_hand, pitch_type) combination, no duplicates, no nulls. Submissions must also cover every pitcher who threw at least 150 pitches in the grade year — omitting pitchers from the required population is not permitted. The required population list is published alongside the frozen Marcel baselines for each season.
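The usage-weighted rollup described above can be sketched as follows. Column names, including `usage_frac`, are assumptions for illustration; the evaluator's internals may differ:

```python
import pandas as pd

def aggregate_to_pitcher_hand(df: pd.DataFrame) -> pd.DataFrame:
    """Roll pitch-type grades up to pitcher-hand level via a
    usage-weighted average. Expects a `usage_frac` column holding the
    fraction of pitches thrown per pitch type (an assumed column name)."""
    keys = ["mlbam_id", "year", "batter_hand"]
    # Normalize usage within each pitcher-hand group so weights sum to 1.
    w = df.groupby(keys)["usage_frac"].transform(lambda s: s / s.sum())
    weighted = df.assign(
        stuff_grade=df["stuff_grade"] * w,
        pitching_grade=df["pitching_grade"] * w,
    )
    return weighted.groupby(keys, as_index=False)[
        ["stuff_grade", "pitching_grade"]
    ].sum()
```

For example, a pitcher throwing a 60-stuff fastball 60% of the time and a 50-stuff slider 40% of the time aggregates to a 56 stuff grade against that hand.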
Full diagnostic scorecard. Every submission gets correlations across all outcome metrics, volume tiers, team-switcher results, and forward-to-same-year ratios. Marcel baseline is shown as a reference point.
Marcel baseline comparison. Your forward xwOBA correlation is compared against a Marcel projection baseline — a useful reference for whether your model adds signal beyond a simple weighted-average projection.
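For reference, a Marcel-style projection is just a weighted average of recent seasons regressed toward the league mean. A simplified sketch: the 5/4/3 season weights follow Tango's original Marcel, but the ballast size and league mean here are illustrative, and the benchmark's frozen baselines may differ:

```python
def marcel_xwoba(seasons, league_mean=0.320, ballast=1200.0):
    """Simplified Marcel-style xwOBA-allowed projection.

    `seasons` is up to three (xwoba, batters_faced) tuples, most recent
    first. Recent seasons get weights 5/4/3; `ballast` batters faced of
    league-average performance pull the estimate toward the mean.
    """
    weights = (5.0, 4.0, 3.0)
    num = league_mean * ballast
    den = ballast
    for (xwoba, bf), w in zip(seasons, weights):
        num += w * xwoba * bf
        den += w * bf
    return num / den
```

Beating this kind of baseline is the minimum bar: a pitch model that can't out-predict a three-year weighted average isn't adding physics signal.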
Raw grades stay private. Pitcher-by-pitcher grades are never published. Only aggregate evaluation metrics are shown, so commercial models can participate without giving away their output.
Score any time. Submit your model against historical outcomes whenever you want. Use this to iterate on your model, test a new feature, or diagnose a specific weakness. The benchmark data is not a secret holdout. The diagnostics — team-switcher retention, debut-pitcher gap, variance filter — are designed to help you understand what your model actually learned.
Upload a CSV with your 2022–2024 pitch grades and get an instant scorecard — forward correlations, Marcel gate, command leakage, and more. Download your results as CSV when you're done.
We're looking for feedback from anyone who builds, studies, or uses pitch grading models. What are we missing? What's wrong? What would make you actually want to submit your model's grades to a benchmark like this?
Prefer email? Reach us at info@stockyardbaseball.com — or DM @Robertstock6 on X.