How XGBoost Trees Work

And why early stopping saves Stuff+ from memorizing noise

Contents
1. A Single Decision Tree
2. Boosting: Trees That Learn from Mistakes
3. How Stuff+ Trains on Residuals
4. When Trees Start Memorizing
5. Early Stopping: The Safety Net
6. Putting It All Together

1. A Single Decision Tree

A decision tree is the simplest building block. It asks a series of yes/no questions about a pitch's features, splitting the data at each step until it reaches a prediction.

Example: One Tree Asking About a Fastball
Spin rate > 2400? (first question)
  Yes → IVB > 16 inches? (vertical movement)
    Yes → -0.015 RV (elite nastiness)
    No → -0.005 RV (above average)
  No → Velocity > 95 mph? (effective speed)
    Yes → +0.002 RV (average)
    No → +0.012 RV (below average)

Each leaf is a prediction: every pitch follows a path down the tree and lands in a leaf with a predicted residual RV.

A single tree is weak. It can only carve the feature space into a few regions. The predictions at each leaf are just the average residual of whatever pitches landed there. One tree alone explains very little — maybe 0.1% of variance.
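To make the leaf-average idea concrete, here is a minimal sketch with synthetic data and made-up column names (not the real Stuff+ training set): a depth-2 tree carves 2,000 pitches into four regions, and each leaf's prediction is just the mean residual RV of the pitches that landed there.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)
pitches = pd.DataFrame({
    "spin_rate": rng.normal(2300, 150, 2000),
    "ivb": rng.normal(15, 3, 2000),
    "velocity": rng.normal(93, 2.5, 2000),
})
# A faint physics signal buried in per-pitch noise
pitches["residual_rv"] = (
    -0.00005 * (pitches["spin_rate"] - 2300) + rng.normal(0, 0.1, 2000)
)

features = ["spin_rate", "ivb", "velocity"]
tree = DecisionTreeRegressor(max_depth=2)   # two yes/no questions deep
tree.fit(pitches[features], pitches["residual_rv"])
# Each leaf's value is the mean residual RV of the training pitches in it
print(export_text(tree, feature_names=features))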

🎓
Analogy: One tree is like asking a scout a single question: "Is the spin rate high?" You get a rough yes/no grouping, but you miss velocity, movement, extension, and everything else. Each tree is a narrow specialist with a tiny piece of the picture.

2. Boosting: Trees That Learn from Mistakes

XGBoost doesn't build one tree — it builds hundreds of them sequentially. Each new tree focuses specifically on what the previous trees got wrong. This is called gradient boosting.

The Boosting Process
Tree 1 (rough groupings) → errors (what Tree 1 got wrong) → Tree 2 (fixes Tree 1's errors) → smaller errors (what Trees 1+2 still miss) → Tree 3 → ... → final prediction:

Tree 1 prediction + Tree 2 correction + Tree 3 correction + ... + Tree N correction = Stuff+ (after scaling to a 100 mean)
STEP 1
Tree 1 makes rough predictions
The first tree splits pitches by the single most informative feature (maybe spin rate). Predictions are crude — just average residual RV per bucket.
STEP 2
Calculate the errors
For each pitch, compute: error = actual residual - Tree 1's prediction. These errors become the target for Tree 2.
STEP 3
Tree 2 focuses on errors
Tree 2 doesn't re-learn everything. It only tries to predict what Tree 1 got wrong. Maybe it catches that high-extension pitches with moderate spin were misclassified.
STEP 4
Repeat, each tree smaller
Each successive tree corrects smaller and smaller errors. Tree 50 is fixing nuances Tree 49 missed. But each correction is multiplied by the learning rate (0.08) — so no single tree can dominate.
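The four steps above are just a loop. Here is a minimal sketch of that loop, using scikit-learn's DecisionTreeRegressor as the weak learner; the 0.08 learning rate comes from the text, while the data and tree depth are illustrative stand-ins.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))                          # stand-in features
y = 0.01 * X[:, 0] + rng.normal(scale=0.1, size=5000)   # tiny signal, big noise

learning_rate = 0.08
pred = np.zeros_like(y)              # the ensemble's running prediction
trees = []
for _ in range(100):                 # boosting rounds
    residual = y - pred              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    pred += learning_rate * tree.predict(X)   # add a shrunken correction
    trees.append(tree)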
The Learning Rate — Why 0.08?

Each tree's correction is multiplied by 0.08 before being added to the total. This means Tree 2 doesn't fully fix Tree 1's errors — it fixes 8% of them. This forces the model to need many trees to converge, which makes it more robust.

Think of it like focusing a telescope: many small nudges (a low learning rate) bring the image into focus more precisely than a few big twists. The tradeoff is that you need more nudges (more trees) to get there.
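A quick back-of-the-envelope view of why 0.08 demands many trees: under the idealized assumption that each tree could fully fit the remaining error, every round would still leave 92% of it in place.

# Idealized: each round removes only 8% of the remaining error
for n in (10, 50, 150):
    print(n, round(0.92 ** n, 4))
# 10 rounds -> 0.4344 of the error left; 50 -> 0.0155; 150 -> ~0.0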

3. How Stuff+ Trains on Residuals

Here's where Stockyard's approach is unique. Stuff+ doesn't train on raw xRV. It trains on what Location+ couldn't explain.

The Residual Pipeline
Raw xRV (the per-pitch outcome, in runs prevented) minus the Location+ prediction ("this pitch's location is worth X") leaves the residual: what location alone can't explain, the pure pitch-quality signal. XGBoost then maps 24 physics features (spin, velocity, movement, extension, IVB, etc.) to that residual and produces the Stuff+ grade.

This residual target is why Stuff+ needs early stopping. The signal is real but tiny; the noise-to-signal ratio is very high.
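In code, the residual target is a single subtraction. A tiny sketch with hypothetical column names:

import pandas as pd

pitches = pd.DataFrame({
    "xrv": [-0.02, 0.01, 0.03],                  # per-pitch expected run value
    "location_plus_pred": [-0.01, 0.02, 0.01],   # what location explains
})
# What's left over is the target Stuff+ trains on
pitches["residual_rv"] = pitches["xrv"] - pitches["location_plus_pred"]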
Why Residuals Are Noisy

A single pitch's outcome (even expected run value) is dominated by context: count, baserunners, batter handedness, park, umpire tendencies. After removing what Location+ explains, what's left is a mix of:

Real signal (tiny)
Pitch physics that make it harder to hit — spin efficiency, movement profile, tunneling potential
Noise (huge)
Whether the batter guessed right, timing, luck on contact, umpire bias, and everything else that's random at the per-pitch level

The per-pitch R² is about 0.003. That's not a bug — it's the nature of the problem. The real value shows up when you average over hundreds of pitches per pitcher, where the noise cancels out and the physics signal emerges.
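A small simulation makes the averaging claim concrete. The signal and noise scales below are illustrative choices picked so the per-pitch R² works out to about 0.003, matching the text; averaging 500 pitches per pitcher shrinks the noise by √500 and a strong pitcher-level correlation emerges.

import numpy as np

rng = np.random.default_rng(7)
true_quality = rng.normal(0, 0.005, 200)     # per-pitcher physics signal
noise_sd = 0.09                              # per-pitch noise dwarfs it
# Per pitch: R^2 ~ (0.005 / 0.09)**2 ~ 0.003

# Average 500 pitches per pitcher: noise shrinks by sqrt(500)
est = true_quality + rng.normal(0, noise_sd / np.sqrt(500), 200)
print(np.corrcoef(true_quality, est)[0, 1])  # roughly 0.8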

4. When Trees Start Memorizing

Here's the core problem. Without constraints, XGBoost will keep building trees forever. The first ~50–150 trees learn real patterns. Everything after that is the model memorizing noise.

The Overfitting Story — Visualized
[Chart: training loss and validation loss vs. number of trees (boosting rounds), 0 to 600. Training loss keeps falling. Validation loss bottoms out at the best stopping point, where learning real patterns gives way to memorizing noise, then worsens; the gap between the curves is overfitting.]

The training loss always improves — the model can always fit the training data better. But the validation loss (performance on unseen data) eventually stops improving and starts getting worse. The gap between the two curves is the model memorizing training-specific quirks.
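You can watch both curves with XGBoost's evals_result. A sketch on synthetic noisy data (the hyperparameters mirror the text; the data is a stand-in):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 24))
y = 0.01 * X[:, 0] + rng.normal(scale=0.1, size=20000)   # tiny signal
X_tr, X_val = X[:16000], X[16000:]
y_tr, y_val = y[:16000], y[16000:]

model = xgb.XGBRegressor(n_estimators=600, learning_rate=0.08, max_depth=4)
model.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_val, y_val)], verbose=False)
hist = model.evals_result()
# hist["validation_0"]["rmse"] (training) keeps falling;
# hist["validation_1"]["rmse"] (validation) bottoms out, then drifts upward.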

📚
Analogy: It's like a student studying for an exam. At first, studying helps — they learn real concepts. But at some point, extra studying means memorizing specific practice problems word-for-word rather than understanding the material. They'd ace the practice tests but bomb a new exam with different questions. Trees 1-100 are learning concepts. Trees 200-600 are memorizing practice problems.

What "Memorizing" Looks Like for Stuff+

Without early stopping, Tree #347 might learn something like:

IF spin_rate between 2347-2351
AND extension between 6.42-6.44
AND pfx_z between 14.8-14.9
THEN predict -0.003 RV correction

This isn't a real pattern — it's a coincidence that the ~12 pitches in the training data with these exact characteristics happened to have good outcomes. A new dataset would have completely different pitches in that tiny slice. The tree fit the noise, not the signal.

5. Early Stopping: The Safety Net

Early stopping watches the validation loss after every tree. If 50 consecutive trees fail to improve validation performance, it stops building. No more trees. Done.

Early Stopping In Action
[Chart: validation loss vs. boosting rounds by pitch type. Slider stops at 29 trees (very noisy signal), changeup at 80 (moderate signal), fastball at 151 (strongest signal); the old config forced 600. Key insight: different pitch types have different signal strengths, and early stopping adapts to each.]
Without Early Stopping

Always builds 600 trees

  • Slider model: 571 trees wasted memorizing noise
  • Fastball model: 449 trees wasted
  • All pitch types get same treatment regardless of signal strength
  • Overfits most on the noisiest pitch types
With Early Stopping

Stops when learning stops

  • Slider: stops at 29, before noise takes over
  • Fastball: runs to 151, extracts more real signal
  • Each pitch type gets exactly as many trees as the signal supports
  • Automatically adapts to signal strength
The "Patience" Parameter

The early_stopping_rounds=50 setting means: "keep going for 50 more trees after the last improvement, in case there's a late breakthrough." If 50 trees pass with no improvement, it's safe to say the model has learned everything it can from the signal.
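In XGBoost's scikit-learn API (version 1.6 or later, where early_stopping_rounds is a constructor argument), the setup looks roughly like this; the data is the same kind of synthetic stand-in as above:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 24))
y = 0.01 * X[:, 0] + rng.normal(scale=0.1, size=20000)
X_tr, X_val, y_tr, y_val = X[:16000], X[16000:], y[:16000], y[16000:]

model = xgb.XGBRegressor(
    n_estimators=600,            # the old forced maximum
    learning_rate=0.08,
    max_depth=4,
    early_stopping_rounds=50,    # patience: 50 rounds without improvement
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)      # how many trees the signal actually supported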

  • 50: patience rounds
  • 29-151: actual stopping range
  • 600: old max (unused)

The fact that models stop at 29-151 out of a possible 600 means 75-95% of the old trees were pure overfitting. The signal ran out fast — the residual target is just that noisy at the per-pitch level.

6. Putting It All Together

The Full Picture
1. Location+ trains: 11 location features → xRV (400 trees, depth 4)
2. Compute residuals: residual = xRV - Location+ prediction
3. Stuff+ trains: 24 physics features → residual, with early stopping active (stops at 29-151 trees)

What Each Tree Group Learns

  • Trees 1-10: the big splits. High spin + high velo = better, slow + flat = worse. Captures ~60% of what the model will ever learn.
  • Trees 10-50: nuances. Extension matters for low-spin pitches, IVB interacts with arm slot, plate speed vs. release speed differences.
  • Trees 50-150: fine-tuning. Movement-shape details, specific velocity bands where deception is highest. Only fastballs reach this range.
  • Trees 150-600 ⚠: no real patterns left, just memorizing that pitch #847,291 with spin 2348.7 happened to produce a swinging strike.
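Putting the three stages into one sketch. Everything here is schematic (synthetic data, stand-in feature lists, and an illustrative 100-mean scaling); only the tree counts, depth, learning rate, and patience come from the text.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
n = 20000
X_loc = rng.normal(size=(n, 11))     # stand-in location features
X_phys = rng.normal(size=(n, 24))    # stand-in physics features
xrv = 0.05 * X_loc[:, 0] + 0.01 * X_phys[:, 0] + rng.normal(0, 0.1, n)

# 1. Location+: 11 location features -> xRV (400 trees, depth 4)
loc_model = xgb.XGBRegressor(n_estimators=400, max_depth=4).fit(X_loc, xrv)

# 2. Residuals: what Location+ can't explain
residual = xrv - loc_model.predict(X_loc)

# 3. Stuff+: 24 physics features -> residual, early stopping active
tr, val = slice(0, 16000), slice(16000, n)
stuff = xgb.XGBRegressor(n_estimators=600, learning_rate=0.08,
                         early_stopping_rounds=50)
stuff.fit(X_phys[tr], residual[tr],
          eval_set=[(X_phys[val], residual[val])], verbose=False)

# Scale so league average = 100 (illustrative; lower residual RV is better)
raw = stuff.predict(X_phys)
stuff_plus = 100 - 10 * (raw - raw.mean()) / raw.std()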
Bottom Line

Early stopping doesn't sacrifice accuracy — it improves it by cutting out the trees that were hurting generalization.

Stuff+ with 600 forced trees was like a scout who watches so much film that they start seeing patterns that aren't there. Early stopping is the scout's manager saying "you've seen enough — trust your read." The model stops at 29-151 trees because that's genuinely where the physics signal runs out in a per-pitch residual. Everything after that was the model lying to itself.

Stockyard Baseball — Model Explainers