How We Evaluate Model Quality

The art of finding real signal in a sea of randomness

Contents
1. The Noise Problem
2. How Averaging Reveals Skill
3. The Metric Hierarchy
4. Why Not FIP?
5. Sign Convention
6. What We Trust

The fundamental challenge with pitch grading is this: individual pitches are noisy. A perfectly executed pitch can still result in a home run. A terrible pitch can still induce a weak groundout. The value of the models shows up when you aggregate grades over a season — noise cancels, skill emerges.

1. The Noise Problem

Imagine a pitcher throws a perfect slider: sharp break, ideal location, tunneled off the fastball. What happens next depends on dozens of factors the pitcher cannot control.

A Single Pitch: Controlled vs. Uncontrolled
WHAT THE PITCHER CONTROLS (this is what our models grade):
- Velocity: 96.3 mph
- Spin & movement: 2847 rpm, 14" IVB
- Location: low-and-away corner
- Release point: 6.5 ft extension
- Sequencing: after 3 fastballs up

WHAT THE PITCHER CANNOT CONTROL (this is what makes outcomes random):
- Batter timing: guessed right? late? early?
- Umpire call: strike 3 or ball 2?
- Contact quality: barrel? foul tip? weak grounder?
- Fielder positioning: shift on? outfielder depth?
- Environment: wind, altitude, temperature

Two outcomes of the same pitch: a home run (batter guessed slider, timed it perfectly) or a swinging strike (batter sat fastball, fooled completely). Same pitch. Different outcome.

The pitch was identical in both cases. The outcome was determined by what the batter did, not by what the pitcher did. This is why judging a model on any single pitch outcome is fundamentally flawed: the per-pitch R² is about 0.003, and that is expected, not a failure.

🎲 Analogy: Imagine rating a poker player by a single hand. They can play perfectly and still lose to a lucky river card. You need hundreds of hands to see if they are a winning player. Pitching works the same way: skill reveals itself over volume, never on any single pitch.

2. How Averaging Reveals Skill

The magic happens when you aggregate. As you average more pitches together, random noise cancels out and the real skill signal emerges. This is the central insight of our entire evaluation framework.

Watch Noise Cancel Out

- 1 pitch (R² = 0.003, r ≈ 0.05): pure chaos. The outcome is dominated by batter, umpire, and luck; the model's grade has almost zero predictive power on any single pitch.
- 50 pitches: noise starts canceling. A faint pattern emerges, with better-graded pitches showing slightly better average outcomes, but it is still shaky; a few unlucky pitches can skew the whole average.
- 200 pitches: real skill emerges. The correlation tightens noticeably, and pitchers with higher grades consistently show better run prevention. This is roughly a month of starts for a starter.
- 500+ pitches: clear signal. With full-season data, the correlation between model grades and actual outcomes is strong and stable. This is where we evaluate model quality: r = 0.53 for pitcher-season.

3. The Metric Hierarchy

Not all evaluation metrics are created equal. We use a strict hierarchy to decide whether a model change is an improvement. Getting this wrong means chasing noise instead of signal.

The Evaluation Pyramid
1. PRIMARY GATE (promotes or blocks changes): seasonal avg RV correlation. Pitcher-PitchType-Season r = 0.37; Pitcher-Season r = 0.53; minimum 100 pitches per group. Must improve to ship.
2. SECONDARY REFERENCE (sanity check): FIP correlation, r = 0.51. A useful reference, but biased toward K/BB/HR and blind to setup pitch value and weak contact. Should not get worse.
3. INFORMATIONAL ONLY (ignore): per-pitch R² ≈ 0.003. Expected to be tiny, and misleading if used as a gate.

THE RULE: a model change ships ONLY if seasonal avg RV correlation improves. Everything else is supplementary.
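The gate logic can be sketched as a small decision function. This is a hypothetical illustration, not Stockyard's actual code; the metric names and the FIP regression tolerance are assumptions.

```python
# Hypothetical gatekeeper for model changes. Only the primary gate
# (seasonal avg RV correlation) can promote or block; FIP correlation is a
# sanity check; per-pitch R² is informational and never consulted.
def evaluate_change(old, new, fip_tolerance=0.02):
    """old/new: dicts of evaluation metrics for current and candidate models."""
    # Primary gate: seasonal avg RV correlation must improve to ship.
    if new["seasonal_rv_r"] <= old["seasonal_rv_r"]:
        return "block: seasonal avg RV correlation did not improve"
    # Secondary sanity check: FIP correlation should not get much worse.
    if new["fip_r"] < old["fip_r"] - fip_tolerance:
        return "flag: FIP correlation regressed noticeably; investigate"
    return "ship"

old = {"seasonal_rv_r": 0.53, "fip_r": 0.51, "per_pitch_r2": 0.003}
new = {"seasonal_rv_r": 0.55, "fip_r": 0.51, "per_pitch_r2": 0.002}
print(evaluate_change(old, new))
```

Note that per-pitch R² appears in the metric dicts but is never read: that is the point of the hierarchy.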

Why Seasonal Avg RV is the Gate

The question we are ultimately trying to answer is: "Over a full season, does this model correctly rank who threw better pitches?"

Seasonal average RV (run value) correlation answers this directly. It takes each pitcher's average model grade for a pitch type over the season and correlates it with their actual average run value over that same period.

- Pitcher-PitchType-Season (r = 0.37): does the model correctly rank individual pitch quality? For example, is Pitcher A's slider graded higher than Pitcher B's slider?
- Pitcher-Season (r = 0.53): does the model correctly rank overall pitcher quality? For example, is Pitcher A better overall than Pitcher B?

4. Why Not FIP?

FIP (Fielding Independent Pitching) is a well-known metric, so why is it only a secondary reference? Because FIP only sees part of the picture.

What FIP Captures vs. What It Misses
- FIP sees only PA-ending K, BB, and HR events, capturing roughly 35-40% of what matters.
- Seasonal avg RV sees every pitch outcome: K, BB, HR, weak contact, called strikes, fouls, and setup value.
- Head to head: FIP correlation r = 0.51, avg RV correlation r = 0.53. Avg RV wins even though FIP is also measured at the pitcher-season level.
What FIP Misses
Weak Contact
A pitcher who consistently induces weak grounders creates enormous value. FIP literally cannot see this. Run value captures it fully.
Called Strikes
Pitches that expand the zone or freeze hitters looking. These change the count and create value even when the AB continues.
Foul Balls That Shift Counts
A nasty slider that gets fouled off on 0-2 keeps the pitcher ahead. The run expectancy shifts in the pitcher's favor, but FIP sees nothing.
Setup Pitch Value
A fastball that tunnels with the slider creates swing-and-miss later in the AB. Each pitch contributes to the outcome even if it is not the AB-ending pitch.
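The blind spot is visible in FIP's own formula: its only inputs are home runs, walks, hit-by-pitches, strikeouts, and innings. The league constant below (~3.1) is an assumed placeholder; in practice it is recalibrated each season so league-average FIP matches league-average ERA.

```python
def fip(hr, bb, hbp, k, ip, constant=3.15):
    """Standard FIP: (13*HR + 3*(BB+HBP) - 2*K) / IP + league constant.
    Nothing about balls in play, called strikes, fouls, or setup pitches
    enters the calculation."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Two pitchers with identical K/BB/HBP/HR lines get identical FIPs,
# even if one of them induced weak contact all season and the other
# was barreled constantly.
print(round(fip(hr=20, bb=50, hbp=5, k=200, ip=180.0), 2))
```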

5. Sign Convention

This is a common source of confusion, so let us make it crystal clear.

From Raw xRV to Stuff+ Grade
1. Raw xRV (expected run value): -0.015 is a GREAT pitch; +0.010 is a BAD pitch.
2. Negate: flip the sign, so -0.015 becomes +0.015 and +0.010 becomes -0.010.
3. Scale to a Stuff+ grade, where 100 = average and higher = better: 130 is elite (prevented 30% more runs than average); 80 is below average (gave up 20% more runs than average).

Example transformation: -0.010 xRV (good: prevented runs) → negated to +0.010 → scaled to 110 via 100 + (0.010 / stdev) * 10. HIGHER = BETTER.

The key takeaway: in the raw data, negative xRV = runs prevented = good for the pitcher. Statcast uses the same convention. The grades flip and scale this so that 100 = average and higher = better, which is the intuitive reading for end users.
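The negate-and-scale step can be sketched in a few lines. This is a minimal illustration of the transformation described above; in practice the standard deviation would come from the league-wide xRV distribution, while here it is computed from whatever sample is passed in.

```python
import statistics

def stuff_plus(xrv, league_xrvs):
    """Convert raw xRV (negative = runs prevented = good) into a
    100-centered grade where higher = better, via 100 + (-xrv / stdev) * 10."""
    stdev = statistics.pstdev(league_xrvs)
    return 100 + (-xrv / stdev) * 10

sample = [-0.01, 0.01]  # toy distribution whose stdev is exactly 0.010
print(stuff_plus(-0.010, sample))  # great pitch: grades above 100
print(stuff_plus(+0.010, sample))  # bad pitch: grades below 100
```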

6. What We Trust

Not all sample sizes are created equal. Here is our confidence framework for interpreting Stockyard grades.

Confidence by Sample Size
- HIGH CONFIDENCE (season-level grades, 100+ pitches): full-season data. Noise has largely canceled, and these grades reflect real skill differences between pitchers. Use for rankings, comparisons, and talent evaluation.
- MODERATE CONFIDENCE (monthly grades, 50-100 pitches): useful for trend spotting, such as whether a pitcher's stuff is trending up or down. But a single bad outing can skew the whole month. Directional, not definitive.
- LOW CONFIDENCE (game-level or per-start grades): interesting as a snapshot but dominated by noise. A 140 Stuff+ game does not mean the pitcher was elite that day; it might mean 3 lucky whiffs on mediocre pitches.
- DO NOT TRUST (small samples under 50 pitches): not enough data to separate signal from noise. A reliever with 30 pitches graded at 70 might actually be average, or even good. The sample is simply too small to know.
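A hypothetical helper encoding the sample-size cutoffs of this framework; the thresholds follow the tiers above, though game-level grades are really a question of context (one start) rather than raw count.

```python
def confidence_tier(n_pitches):
    """Map a pitch count to the confidence tiers of the framework.
    Cutoffs (50, 100) are taken from the tiers described in the text."""
    if n_pitches < 50:
        return "do not trust"
    if n_pitches < 100:
        return "moderate: directional, not definitive"
    return "high: reflects real skill differences"

print(confidence_tier(30))   # small relief sample
print(confidence_tier(450))  # full-season starter sample
```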
🌡️ Analogy: Checking the temperature once tells you almost nothing about the climate. Checking it every day for a year tells you everything. One hot day does not mean you live in a desert. One cold pitch grade does not mean the pitcher's stuff is bad.
Current Pitching+ Performance Summary

- Pitch-Type-Season avg RV correlation: 0.37
- Pitcher-Season avg RV correlation: 0.53
- FIP correlation (reference): 0.51
- Per-Pitch R²: 0.003 (expected to be tiny)

Stockyard Baseball — Model Explainers