A decision tree is the simplest building block. It asks a series of yes/no questions about a pitch's features, splitting the data at each step until it reaches a prediction.
Example: One Tree Asking About a Fastball
A single tree is weak. It can only carve the feature space into a few regions. The predictions at each leaf are just the average residual of whatever pitches landed there. One tree alone explains very little — maybe 0.1% of variance.
🎓 Analogy: One tree is like asking a scout a single question: "Is the spin rate high?" You get a rough yes/no grouping, but you miss velocity, movement, extension, and everything else. Each tree is a narrow specialist with a tiny piece of the picture.
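A single split is small enough to sketch in a few lines. This is a toy illustration, not the actual Stuff+ code, and every number in it is invented: a one-question "stump" that buckets pitches by a spin-rate threshold and predicts the average residual of whatever landed in each bucket.

```python
import statistics

# A one-split "tree" (a stump): bucket pitches by a spin-rate threshold
# and predict the mean residual of each bucket. All values are invented.
pitches = [
    {"spin_rate": 2500, "residual": -0.010},
    {"spin_rate": 2550, "residual": -0.006},
    {"spin_rate": 2100, "residual": 0.004},
    {"spin_rate": 2050, "residual": 0.008},
]
THRESHOLD = 2300  # the single yes/no question: "is spin rate high?"

def fit_stump(data, threshold):
    """Mean residual on each side of the split: that's the whole model."""
    high = [p["residual"] for p in data if p["spin_rate"] >= threshold]
    low = [p["residual"] for p in data if p["spin_rate"] < threshold]
    return {"high": statistics.mean(high), "low": statistics.mean(low)}

def predict(stump, spin_rate, threshold=THRESHOLD):
    return stump["high"] if spin_rate >= threshold else stump["low"]

stump = fit_stump(pitches, THRESHOLD)
print(round(predict(stump, 2600), 3))  # -0.008: high spin looked good here
```

Every pitch on the same side of the split gets the same prediction, which is why one tree can only carve out a few coarse regions.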
2. Boosting: Trees That Learn from Mistakes
XGBoost doesn't build one tree — it builds hundreds of them sequentially. Each new tree focuses specifically on what the previous trees got wrong. This is called gradient boosting.
The Boosting Process
Step 1: Tree 1 makes rough predictions. The first tree splits pitches by the single most informative feature (maybe spin rate). Predictions are crude: just the average residual RV per bucket.
Step 2: Calculate the errors. For each pitch, compute error = actual residual - Tree 1's prediction. These errors become the target for Tree 2.
Step 3: Tree 2 focuses on the errors. Tree 2 doesn't re-learn everything; it only tries to predict what Tree 1 got wrong. Maybe it catches that high-extension pitches with moderate spin were misclassified.
Step 4: Repeat, with each tree correcting less. Each successive tree corrects smaller and smaller errors; Tree 50 is fixing nuances Tree 49 missed. And each correction is multiplied by the learning rate (0.08), so no single tree can dominate.
The Learning Rate — Why 0.08?
Each tree's correction is multiplied by 0.08 before being added to the total. This means Tree 2 doesn't fully fix Tree 1's errors — it fixes 8% of them. This forces the model to need many trees to converge, which makes it more robust.
Think of it like focusing a telescope: small nudges (a low learning rate) land you on sharp focus more reliably than big twists. But it also means you need more nudges (more trees) to get there.
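The whole loop (fit a tree to the current errors, shrink its correction by the learning rate, add it to the running total, repeat) fits in a short sketch. This is a pure-Python toy on invented 1-D data, not the Stuff+ pipeline; the 0.08 learning rate is the only number taken from the text, and the "trees" here are one-split stumps.

```python
import statistics

# Invented 1-D toy data: feature x, target y (think residual run value).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.2, 0.3, 0.1, 0.9, 1.1, 1.0]
LEARNING_RATE = 0.08  # each tree's correction is shrunk to 8%

def fit_stump(xs, residuals):
    """Fit a one-split tree: pick the threshold minimizing squared error,
    predicting the mean residual on each side. Assumes xs is sorted."""
    best = None
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = statistics.mean(left), statistics.mean(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

preds = [0.0] * len(xs)
for _ in range(200):                          # "hundreds of trees"
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)          # each tree targets the errors
    preds = [p + LEARNING_RATE * stump(x)     # ...but only 8% is applied
             for x, p in zip(xs, preds)]

mse = statistics.mean((y - p) ** 2 for y, p in zip(ys, preds))
```

Raising LEARNING_RATE toward 1.0 makes each round fix nearly all of the remaining error, which converges in fewer trees but is far more willing to chase noise.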
3. How Stuff+ Trains on Residuals
Here's where Stockyard's approach is unique. Stuff+ doesn't train on raw xRV. It trains on what Location+ couldn't explain.
The Residual Pipeline
Why Residuals Are Noisy
A single pitch's outcome (even expected run value) is dominated by context: count, baserunners, batter handedness, park, umpire tendencies. After removing what Location+ explains, what's left is a mix of:
- Real signal (tiny): pitch physics that make the pitch harder to hit (spin efficiency, movement profile, tunneling potential)
- Noise (huge): whether the batter guessed right, timing, luck on contact, umpire bias, and everything else that's random at the per-pitch level
The per-pitch R² is about 0.003. That's not a bug — it's the nature of the problem. The real value shows up when you average over hundreds of pitches per pitcher, where the noise cancels out and the physics signal emerges.
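That averaging effect is easy to demonstrate. The simulation below is illustrative, with invented signal and noise magnitudes chosen so the per-pitch R² lands near the 0.003 quoted above: a tiny true per-pitch effect is invisible pitch by pitch but emerges clearly once each pitcher's pitches are averaged.

```python
import random
import statistics

random.seed(42)

# Invented magnitudes: a tiny true per-pitch effect buried in big noise.
SIGNAL_SD = 0.002   # real physics signal per pitch
NOISE_SD = 0.04     # batter guessing, luck, umpire, everything else
N_PITCHERS, N_PITCHES = 200, 500

def corr(a, b):
    """Pearson correlation."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sum((x - ma) ** 2 for x in a)
                  * sum((y - mb) ** 2 for y in b)) ** 0.5

skills = [random.gauss(0, SIGNAL_SD) for _ in range(N_PITCHERS)]
pitch_truth, pitch_obs, pitcher_means = [], [], []
for s in skills:
    outcomes = [s + random.gauss(0, NOISE_SD) for _ in range(N_PITCHES)]
    pitch_truth += [s] * N_PITCHES
    pitch_obs += outcomes
    pitcher_means.append(statistics.mean(outcomes))

r2_pitch = corr(pitch_truth, pitch_obs) ** 2     # ~0.003: nearly nothing
r2_pitcher = corr(skills, pitcher_means) ** 2    # far higher after averaging
```

Averaging 500 pitches shrinks the noise variance by a factor of 500 while leaving the signal untouched, which is why the pitcher-level R² is orders of magnitude larger than the per-pitch one.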
4. When Trees Start Memorizing
Here's the core problem. Without constraints, XGBoost will keep building trees forever. The first ~50–150 trees learn real patterns. Everything after that is the model memorizing noise.
The Overfitting Story — Visualized
The training loss always improves — the model can always fit the training data better. But the validation loss (performance on unseen data) eventually stops improving and starts getting worse. The gap between the two curves is the model memorizing training-specific quirks.
📚 Analogy: It's like a student studying for an exam. At first, studying helps — they learn real concepts. But at some point, extra studying means memorizing specific practice problems word-for-word rather than understanding the material. They'd ace the practice tests but bomb a new exam with different questions. Trees 1-100 are learning concepts. Trees 200-600 are memorizing practice problems.
What "Memorizing" Looks Like for Stuff+
Without early stopping, Tree #347 might learn something like:
IF spin_rate between 2347 and 2351 AND extension between 6.42 and 6.44 AND pfx_z between 14.8 and 14.9 THEN predict a -0.003 RV correction
This isn't a real pattern — it's a coincidence that the ~12 pitches in the training data with these exact characteristics happened to have good outcomes. A new dataset would have completely different pitches in that tiny slice. The tree fit the noise, not the signal.
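A quick sketch of why that rule is worthless, with invented numbers mirroring the -0.003 example: a dozen lucky training pitches pin the rule's prediction, while fresh pitches in the same slice (which in truth has no effect at all) average out to roughly zero.

```python
import random
import statistics

random.seed(7)

# Twelve invented training pitches in a razor-thin feature slice that
# happened to run lucky (all slightly negative residuals).
train_slice = [-0.004, -0.003, -0.005, -0.002, -0.003, -0.004,
               -0.002, -0.003, -0.003, -0.004, -0.002, -0.005]
rule = statistics.mean(train_slice)   # the memorized "correction", ~-0.003

# The slice has no real effect, so fresh pitches are pure noise around 0.
fresh = [random.gauss(0, 0.03) for _ in range(10_000)]
print(round(rule, 4), round(statistics.mean(fresh), 4))
```

The memorized correction looks meaningful on the training sample, but applying it to new pitches in that slice just adds a systematic error, because the slice's true mean is zero.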
5. Early Stopping: The Safety Net
Early stopping watches the validation loss after every tree. If 50 consecutive trees fail to improve validation performance, it stops building. No more trees. Done.
Early Stopping In Action
Without Early Stopping
- Always builds 600 trees
- Slider model: 571 trees wasted memorizing noise
- Fastball model: 449 trees wasted
- All pitch types get the same treatment regardless of signal strength
- Overfits most on the noisiest pitch types

With Early Stopping
- Stops when learning stops
- Slider: stops at 29 trees, before noise takes over
- Fastball: runs to 151 trees, extracting more real signal
- Each pitch type gets exactly as many trees as the signal supports
- Automatically adapts to signal strength
The "Patience" Parameter
The early_stopping_rounds=50 setting means: "keep going for 50 more trees after the last improvement, in case there's a late breakthrough." If 50 trees pass with no improvement, it's safe to say the model has learned everything it can from the signal.
- 50: patience rounds
- 29-151: actual stopping range
- 600: old max (unused)
The fact that models stop at 29-151 out of a possible 600 means 75-95% of the old trees were pure overfitting. The signal ran out fast — the residual target is just that noisy at the per-pitch level.
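The patience logic itself is simple to sketch. This is a generic early-stopping loop in plain Python, not XGBoost's internals; the validation curve is invented, shaped so improvement stops after round 29 to mirror the slider model above.

```python
PATIENCE = 50

def train_with_early_stopping(build_round, val_loss, max_rounds=600,
                              patience=PATIENCE):
    """build_round(i) adds tree i; val_loss() scores the current model on
    held-out data. Stop once `patience` rounds pass with no improvement."""
    best, best_round = float("inf"), 0
    for i in range(1, max_rounds + 1):
        build_round(i)
        loss = val_loss()
        if loss < best:
            best, best_round = loss, i   # a new best: reset the clock
        elif i - best_round >= patience:
            break                        # 50 rounds, no progress: done
    return best_round                    # keep only this many trees

# Invented validation curve: improves until round 29, then goes flat.
curve = {i: max(1.0 - 0.03 * i, 0.13) for i in range(1, 601)}
rounds_built = []
stop = train_with_early_stopping(rounds_built.append,
                                 lambda: curve[rounds_built[-1]])
print(stop, len(rounds_built))  # best model at tree 29, built 79 total
```

Note the distinction the loop makes visible: it builds 29 + 50 = 79 trees before giving up, but the model it keeps is rolled back to the 29-tree best.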
6. Putting It All Together
The Full Picture
Bottom Line
Early stopping doesn't sacrifice accuracy — it improves it by cutting out the trees that were hurting generalization.
Stuff+ with 600 forced trees was like a scout who watches so much film that they start seeing patterns that aren't there. Early stopping is the scout's manager saying "you've seen enough — trust your read." The model stops at 29-151 trees because that's genuinely where the physics signal runs out in a per-pitch residual. Everything after that was the model lying to itself.