How We Evaluate Model Quality

The art of finding real signal in a sea of randomness

Contents
1. The Noise Problem
2. How Averaging Reveals Skill
3. The Metric Hierarchy
4. Why Not FIP?
5. Sign Convention
6. What We Trust

The fundamental challenge with pitch grading is this: individual pitches are noisy. A perfectly executed pitch can still result in a home run. A terrible pitch can still induce a weak groundout. The value of the models shows up when you aggregate grades over a season — noise cancels, skill emerges.

1. The Noise Problem

Imagine a pitcher throws a perfect slider: sharp break, ideal location, tunneled off the fastball. What happens next depends on dozens of factors the pitcher cannot control.

A Single Pitch: Controlled vs. Uncontrolled
WHAT THE PITCHER CONTROLS (this is what our models grade):
- Velocity: 96.3 mph
- Spin & movement: 2847 rpm, 14" IVB
- Location: low-and-away corner
- Release point: 6.5 ft extension
- Sequencing: after 3 fastballs up

WHAT THE PITCHER CANNOT CONTROL (this is what makes outcomes random):
- Batter timing: guessed right? late? early?
- Umpire call: strike 3 or ball 2?
- Contact quality: barrel? foul tip? weak grounder?
- Fielder positioning: shift on? outfielder depth?
- Environment: wind, altitude, temperature

Two outcomes of the same pitch: a home run (batter guessed slider, timed it perfectly) or a swinging strike (batter sat fastball, fooled completely). Same pitch. Different outcome.

The pitch was identical in both cases. The outcome was determined by what the batter did, not by what the pitcher did. This is why judging a model on any single pitch outcome is fundamentally flawed: the per-pitch R² is about 0.003, and that is expected, not a failure.

🎲 Analogy: Imagine rating a poker player by a single hand. They can play perfectly and still lose to a lucky river card. You need hundreds of hands to see if they are a winning player. Pitching works the same way: skill reveals itself over volume, never on any single pitch.

2. How Averaging Reveals Skill

The magic happens when you aggregate. As you average more pitches together, random noise cancels out and the real skill signal emerges. This is the central insight of our entire evaluation framework.

Watch Noise Cancel Out

- 1 pitch (R² = 0.003, r ≈ 0.05): pure chaos. The outcome is dominated by batter, umpire, and luck; the model's grade has almost zero predictive power on any single pitch.
- 50 pitches: noise starts canceling. A faint pattern emerges, with better-graded pitches showing slightly better average outcomes, but it is still shaky; a few unlucky pitches can skew the whole average.
- 200 pitches: real skill emerges. The correlation tightens noticeably, and pitchers with higher grades consistently show better run prevention. This is roughly a month of starts for a starter.
- 500+ pitches: clear signal. With full-season data, the correlation between model grades and actual outcomes is strong and stable. This is where we evaluate model quality: r = 0.53 for pitcher-season.

3. The Metric Hierarchy

Not all evaluation metrics are created equal. We use a strict hierarchy to decide whether a model change is an improvement. Getting this wrong means chasing noise instead of signal.

The Evaluation Pyramid
1. PRIMARY GATE (promotes or blocks changes): seasonal avg RV correlation. Pitcher-PitchType-Season r = 0.37; Pitcher-Season r = 0.53; minimum 100 pitches per group. Must improve to ship.
2. SECONDARY REFERENCE (sanity check): FIP correlation, r = 0.51. A useful reference, but biased toward K/BB/HR and blind to setup pitch value and weak contact. Should not get worse.
3. INFORMATIONAL ONLY (ignore): per-pitch R² ≈ 0.003. Expected to be tiny, and misleading if used as a gate.

THE RULE: a model change ships ONLY if seasonal avg RV correlation improves. Everything else is supplementary.
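The gate logic can be sketched as a small decision function. This is a hypothetical illustration, not Stockyard's actual code; the metric names and the FIP regression tolerance are assumptions.

```python
# Hypothetical gatekeeper for model changes. Only the primary gate
# (seasonal avg RV correlation) can promote or block; FIP correlation is a
# sanity check; per-pitch R² is informational and never consulted.
def evaluate_change(old, new, fip_tolerance=0.02):
    """old/new: dicts of evaluation metrics for current and candidate models."""
    # Primary gate: seasonal avg RV correlation must improve to ship.
    if new["seasonal_rv_r"] <= old["seasonal_rv_r"]:
        return "block: seasonal avg RV correlation did not improve"
    # Secondary sanity check: FIP correlation should not get much worse.
    if new["fip_r"] < old["fip_r"] - fip_tolerance:
        return "flag: FIP correlation regressed noticeably; investigate"
    return "ship"

old = {"seasonal_rv_r": 0.53, "fip_r": 0.51, "per_pitch_r2": 0.003}
new = {"seasonal_rv_r": 0.55, "fip_r": 0.51, "per_pitch_r2": 0.002}
print(evaluate_change(old, new))
```

Note that per-pitch R² appears in the metric dicts but is never read: that is the point of the hierarchy.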

Why Seasonal Avg RV is the Gate

The question we are ultimately trying to answer is: "Over a full season, does this model correctly rank who threw better pitches?"

Seasonal average RV (run value) correlation answers this directly. It takes each pitcher's average model grade for a pitch type over the season and correlates it with their actual average run value over that same period.

- Pitcher-PitchType-Season (r = 0.37): does the model correctly rank individual pitch quality? For example, is Pitcher A's slider graded higher than Pitcher B's slider?
- Pitcher-Season (r = 0.53): does the model correctly rank overall pitcher quality? For example, is Pitcher A better overall than Pitcher B?

4. Why Not FIP?

FIP (Fielding Independent Pitching) is a well-known metric, so why is it only a secondary reference? Because FIP only sees part of the picture.

What FIP Captures vs. What It Misses
- FIP sees only PA-ending K, BB, and HR events, capturing roughly 35-40% of what matters.
- Seasonal avg RV sees every pitch outcome: K, BB, HR, weak contact, called strikes, fouls, and setup value.
- Head to head: FIP correlation r = 0.51, avg RV correlation r = 0.53. Avg RV wins even though FIP is also measured at the pitcher-season level.
What FIP Misses
Weak Contact
A pitcher who consistently induces weak grounders creates enormous value. FIP literally cannot see this. Run value captures it fully.
Called Strikes
Pitches that expand the zone or freeze hitters looking. These change the count and create value even when the AB continues.
Foul Balls That Shift Counts
A nasty slider that gets fouled off on 0-2 keeps the pitcher ahead. The run expectancy shifts in the pitcher's favor, but FIP sees nothing.
Setup Pitch Value
A fastball that tunnels with the slider creates swing-and-miss later in the AB. Each pitch contributes to the outcome even if it is not the AB-ending pitch.
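The blind spot is visible in FIP's own formula: its only inputs are home runs, walks, hit-by-pitches, strikeouts, and innings. The league constant below (~3.1) is an assumed placeholder; in practice it is recalibrated each season so league-average FIP matches league-average ERA.

```python
def fip(hr, bb, hbp, k, ip, constant=3.15):
    """Standard FIP: (13*HR + 3*(BB+HBP) - 2*K) / IP + league constant.
    Nothing about balls in play, called strikes, fouls, or setup pitches
    enters the calculation."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Two pitchers with identical K/BB/HBP/HR lines get identical FIPs,
# even if one of them induced weak contact all season and the other
# was barreled constantly.
print(round(fip(hr=20, bb=50, hbp=5, k=200, ip=180.0), 2))
```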

5. Sign Convention

This is a common source of confusion, so let us make it crystal clear.

From Raw xRV to Stuff+ Grade
1. Raw xRV (expected run value): -0.015 is a GREAT pitch; +0.010 is a BAD pitch.
2. Negate: flip the sign, so -0.015 becomes +0.015 and +0.010 becomes -0.010.
3. Scale to a Stuff+ grade, where 100 = average and higher = better: 130 is elite (prevented 30% more runs than average); 80 is below average (gave up 20% more runs than average).

Example transformation: -0.010 xRV (good: prevented runs) → negated to +0.010 → scaled to 110 via 100 + (0.010 / stdev) * 10. HIGHER = BETTER.

The key takeaway: in the raw data, negative xRV = runs prevented = good for the pitcher. Statcast uses the same convention. The grades flip and scale this so that 100 = average and higher = better, which is the intuitive reading for end users.
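The negate-and-scale step can be sketched in a few lines. This is a minimal illustration of the transformation described above; in practice the standard deviation would come from the league-wide xRV distribution, while here it is computed from whatever sample is passed in.

```python
import statistics

def stuff_plus(xrv, league_xrvs):
    """Convert raw xRV (negative = runs prevented = good) into a
    100-centered grade where higher = better, via 100 + (-xrv / stdev) * 10."""
    stdev = statistics.pstdev(league_xrvs)
    return 100 + (-xrv / stdev) * 10

sample = [-0.01, 0.01]  # toy distribution whose stdev is exactly 0.010
print(stuff_plus(-0.010, sample))  # great pitch: grades above 100
print(stuff_plus(+0.010, sample))  # bad pitch: grades below 100
```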

6. What We Trust

Not all sample sizes are created equal. Here is our confidence framework for interpreting Stockyard grades.

Confidence by Sample Size
- HIGH CONFIDENCE (season-level grades, 100+ pitches): full-season data. Noise has largely canceled, and these grades reflect real skill differences between pitchers. Use for rankings, comparisons, and talent evaluation.
- MODERATE CONFIDENCE (monthly grades, 50-100 pitches): useful for trend spotting, such as whether a pitcher's stuff is trending up or down. But a single bad outing can skew the whole month. Directional, not definitive.
- LOW CONFIDENCE (game-level or per-start grades): interesting as a snapshot but dominated by noise. A 140 Stuff+ game does not mean the pitcher was elite that day; it might mean 3 lucky whiffs on mediocre pitches.
- DO NOT TRUST (small samples under 50 pitches): not enough data to separate signal from noise. A reliever with 30 pitches graded at 70 might actually be average, or even good. The sample is simply too small to know.
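A hypothetical helper encoding the sample-size cutoffs of this framework; the thresholds follow the tiers above, though game-level grades are really a question of context (one start) rather than raw count.

```python
def confidence_tier(n_pitches):
    """Map a pitch count to the confidence tiers of the framework.
    Cutoffs (50, 100) are taken from the tiers described in the text."""
    if n_pitches < 50:
        return "do not trust"
    if n_pitches < 100:
        return "moderate: directional, not definitive"
    return "high: reflects real skill differences"

print(confidence_tier(30))   # small relief sample
print(confidence_tier(450))  # full-season starter sample
```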
🌡️ Analogy: Checking the temperature once tells you almost nothing about the climate. Checking it every day for a year tells you everything. One hot day does not mean you live in a desert. One cold pitch grade does not mean the pitcher's stuff is bad.
Current Pitching+ Performance Summary

- Pitch-Type-Season avg RV correlation: 0.37
- Pitcher-Season avg RV correlation: 0.53
- FIP correlation (reference): 0.51
- Per-Pitch R²: 0.003 (expected to be tiny)

Stockyard Baseball — Model Explainers