The fundamental challenge with pitch grading is this: individual pitches are noisy. A perfectly executed pitch can still result in a home run. A terrible pitch can still induce a weak groundout. The value of the models shows up when you aggregate grades over a season — noise cancels, skill emerges.
1. The Noise Problem
Imagine a pitcher throws a perfect slider: sharp break, ideal location, tunneled off the fastball. What happens next depends on dozens of factors the pitcher cannot control.
A Single Pitch: Controlled vs. Uncontrolled
The pitch was identical in both cases; the outcome was determined by what the batter did, not the pitcher. This is why judging a model by any single pitch outcome is fundamentally flawed. The per-pitch R² is about 0.003, and that is expected, not a failure.
🎲
Analogy: Imagine rating a poker player by a single hand. They can play perfectly and still lose to a lucky river card. You need hundreds of hands to see if they are a winning player. Pitching works the same way: skill reveals itself over volume, never on any single pitch.
2. How Averaging Reveals Skill
The magic happens when you aggregate. As you average more pitches together, random noise cancels out and the real skill signal emerges. This is the central insight of our entire evaluation framework.
Interactive: Watch Noise Cancel Out
1 PITCH
R² = 0.003
Pure chaos. Outcome is dominated by batter, umpire, luck. The model's grade has almost zero predictive power on any single pitch.
50 PITCHES
Noise starts canceling
A faint pattern emerges. Better-graded pitches show slightly better average outcomes. But it's still shaky — a few unlucky pitches can skew the whole average.
200 PITCHES
Real skill emerges
The correlation tightens noticeably. Pitchers with higher grades consistently show better run prevention. This is roughly a month of starts for a starter.
500+ PITCHES
Clear signal
Full-season data. The correlation between model grades and actual outcomes is strong and stable. This is where we evaluate model quality: r = 0.53 for pitcher-season.
3. The Metric Hierarchy
Not all evaluation metrics are created equal. We use a strict hierarchy to decide whether a model change is an improvement. Getting this wrong means chasing noise instead of signal.
The Evaluation Pyramid
Why Seasonal Avg RV is the Gate
The question we are ultimately trying to answer is: "Over a full season, does this model correctly rank who threw better pitches?"
Seasonal average RV (run value) correlation answers this directly. It takes each pitcher's average model grade for a pitch type over the season and correlates it with their actual average run value over that same period.
Pitcher-PitchType-Season (r = 0.37)
"Does this model correctly rank individual pitch quality?"

Pitcher-Season (r = 0.53)
"Does this model correctly rank overall pitcher quality?"
e.g., "Is Pitcher A overall better than Pitcher B?"
4. Why Not FIP?
FIP (Fielding Independent Pitching) is a well-known metric, so why is it only a secondary reference? Because FIP only sees part of the picture.
What FIP Captures vs. What It Misses
What FIP Misses
Weak Contact
A pitcher who induces constant weak grounders creates enormous value. FIP literally cannot see this. Run value captures it fully.
Called Strikes
Pitches that expand the zone or freeze hitters looking. These change the count and create value even when the AB continues.
Foul Balls That Shift Counts
A nasty slider that gets fouled off on 0-2 keeps the pitcher ahead. The run expectancy shifts in the pitcher's favor, but FIP sees nothing.
Setup Pitch Value
A fastball that tunnels with the slider creates swing-and-miss later in the AB. Each pitch contributes to the outcome even if it is not the AB-ending pitch.
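FIP's blind spot is mechanical: its formula admits only home runs, walks, hit-by-pitches, strikeouts, and innings. A minimal sketch (the stat lines are invented and the league constant, roughly 3.1 in most seasons, is illustrative):

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """FIP sees only HR, BB, HBP, K, and innings; balls in play never
    enter the formula."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Two hypothetical pitchers with identical FIP inputs over 180 IP.
# Pitcher A induces weak grounders on contact; Pitcher B allows hard
# line drives. FIP cannot distinguish them:
weak_contact = fip(hr=18, bb=45, hbp=5, k=180, ip=180)
hard_contact = fip(hr=18, bb=45, hbp=5, k=180, ip=180)
assert weak_contact == hard_contact
```

Run value has no such blind spot: every pitch, including a weak grounder or a count-shifting foul, moves run expectancy and gets credited.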
5. Sign Convention
This is a common source of confusion, so let us make it crystal clear.
From Raw xRV to Stuff+ Grade
The key takeaway: in the raw data, negative xRV = runs prevented = good for the pitcher. Statcast uses the same convention. The grades flip and scale this so that 100 = average and higher = better, which is the intuitive reading for end users.
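A minimal sketch of the flip-and-scale step. Only the sign flip (lower xRV maps to a higher grade, 100 = average) is the documented convention; the 10-points-per-standard-deviation scaling is an assumption for illustration, not the model's actual calibration.

```python
import statistics

def to_grade(xrv_values, scale=10.0):
    """Flip and scale raw xRV so 100 = average and higher = better.
    The 10-points-per-SD scale is an illustrative assumption."""
    mean = statistics.fmean(xrv_values)
    sd = statistics.pstdev(xrv_values)
    # Sign flip: more negative xRV = more runs prevented = higher grade.
    return [100 + scale * (mean - x) / sd for x in xrv_values]

grades = to_grade([-0.02, 0.00, 0.02])
# Most runs prevented (-0.02) gets the highest grade; average maps to 100.
```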
6. What We Trust
Not all sample sizes are created equal. Here is our confidence framework for interpreting Stockyard grades.
Confidence by Sample Size
🌡️
Analogy: Checking the temperature once tells you almost nothing about the climate. Checking it every day for a year tells you everything. One hot day does not mean you live in a desert. One cold pitch grade does not mean the pitcher's stuff is bad.
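The statistics behind this confidence framework: the standard error of an observed mean shrinks with the square root of sample size, so per-pitch noise that swamps the skill signal at one pitch becomes manageable over a full season. The per-pitch spread below is an illustrative assumption, not a fitted value.

```python
import math

noise_sd = 0.25  # illustrative per-pitch run-value spread (runs)

for n in (1, 50, 200, 500):
    se = noise_sd / math.sqrt(n)
    print(f"{n:>4} pitches: SE of mean RV ≈ {se:.3f} runs")
```

At 500 pitches the standard error is about 1/22 of its single-pitch value, which is why the full-season tier earns the most trust.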