A decision tree is the simplest building block. It asks a series of yes/no questions about a pitch's features, splitting the data at each step until it reaches a prediction.
Example: One Tree Asking About a Fastball
A single tree is weak. It can only carve the feature space into a few regions. The predictions at each leaf are just the average residual of whatever pitches landed there. One tree alone explains very little — maybe 0.1% of variance.
🎓 Analogy: One tree is like asking a scout a single question: "Is the spin rate high?" You get a rough yes/no grouping, but you miss velocity, movement, extension, and everything else. Each tree is a narrow specialist with a tiny piece of the picture.
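A single split is small enough to sketch in a few lines. This is a toy illustration, not the actual Stuff+ code, and every number in it is invented: a one-question "stump" that buckets pitches by a spin-rate threshold and predicts the average residual of whatever landed in each bucket.

```python
import statistics

# A one-split "tree" (a stump): bucket pitches by a spin-rate threshold
# and predict the mean residual of each bucket. All values are invented.
pitches = [
    {"spin_rate": 2500, "residual": -0.010},
    {"spin_rate": 2550, "residual": -0.006},
    {"spin_rate": 2100, "residual": 0.004},
    {"spin_rate": 2050, "residual": 0.008},
]
THRESHOLD = 2300  # the single yes/no question: "is spin rate high?"

def fit_stump(data, threshold):
    """Mean residual on each side of the split: that's the whole model."""
    high = [p["residual"] for p in data if p["spin_rate"] >= threshold]
    low = [p["residual"] for p in data if p["spin_rate"] < threshold]
    return {"high": statistics.mean(high), "low": statistics.mean(low)}

def predict(stump, spin_rate, threshold=THRESHOLD):
    return stump["high"] if spin_rate >= threshold else stump["low"]

stump = fit_stump(pitches, THRESHOLD)
print(round(predict(stump, 2600), 3))  # -0.008: high spin looked good here
```

Every pitch on the same side of the split gets the same prediction, which is why one tree can only carve out a few coarse regions.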
2. Boosting: Trees That Learn from Mistakes
XGBoost doesn't build one tree — it builds hundreds of them sequentially. Each new tree focuses specifically on what the previous trees got wrong. This is called gradient boosting.
The Boosting Process
Step 1: Tree 1 makes rough predictions. The first tree splits pitches by the single most informative feature (maybe spin rate). Predictions are crude: just the average residual RV per bucket.
Step 2: Calculate the errors. For each pitch, compute error = actual residual - Tree 1's prediction. These errors become the target for Tree 2.
Step 3: Tree 2 focuses on the errors. Tree 2 doesn't re-learn everything; it only tries to predict what Tree 1 got wrong. Maybe it catches that high-extension pitches with moderate spin were misclassified.
Step 4: Repeat, with each tree correcting less. Each successive tree corrects smaller and smaller errors; Tree 50 is fixing nuances Tree 49 missed. And each correction is multiplied by the learning rate (0.08), so no single tree can dominate.
The Learning Rate — Why 0.08?
Each tree's correction is multiplied by 0.08 before being added to the total. This means Tree 2 doesn't fully fix Tree 1's errors — it fixes 8% of them. This forces the model to need many trees to converge, which makes it more robust.
Think of it like focusing a telescope: small nudges (a low learning rate) land you on sharp focus more reliably than big twists. But it also means you need more nudges (more trees) to get there.
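The whole loop (fit a tree to the current errors, shrink its correction by the learning rate, add it to the running total, repeat) fits in a short sketch. This is a pure-Python toy on invented 1-D data, not the Stuff+ pipeline; the 0.08 learning rate is the only number taken from the text, and the "trees" here are one-split stumps.

```python
import statistics

# Invented 1-D toy data: feature x, target y (think residual run value).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.2, 0.3, 0.1, 0.9, 1.1, 1.0]
LEARNING_RATE = 0.08  # each tree's correction is shrunk to 8%

def fit_stump(xs, residuals):
    """Fit a one-split tree: pick the threshold minimizing squared error,
    predicting the mean residual on each side. Assumes xs is sorted."""
    best = None
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = statistics.mean(left), statistics.mean(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

preds = [0.0] * len(xs)
for _ in range(200):                          # "hundreds of trees"
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)          # each tree targets the errors
    preds = [p + LEARNING_RATE * stump(x)     # ...but only 8% is applied
             for x, p in zip(xs, preds)]

mse = statistics.mean((y - p) ** 2 for y, p in zip(ys, preds))
```

Raising LEARNING_RATE toward 1.0 makes each round fix nearly all of the remaining error, which converges in fewer trees but is far more willing to chase noise.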
3. How Stuff+ Trains on Residuals
Here's where Stockyard's approach is unique. Stuff+ doesn't train on raw xRV. It trains on what Location+ couldn't explain.
The Residual Pipeline
Why Residuals Are Noisy
A single pitch's outcome (even expected run value) is dominated by context: count, baserunners, batter handedness, park, umpire tendencies. After removing what Location+ explains, what's left is a mix of:
- Real signal (tiny): pitch physics that make the pitch harder to hit (spin efficiency, movement profile, tunneling potential)
- Noise (huge): whether the batter guessed right, timing, luck on contact, umpire bias, and everything else that's random at the per-pitch level
The per-pitch R² is about 0.003. That's not a bug — it's the nature of the problem. The real value shows up when you average over hundreds of pitches per pitcher, where the noise cancels out and the physics signal emerges.
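That averaging effect is easy to demonstrate. The simulation below is illustrative, with invented signal and noise magnitudes chosen so the per-pitch R² lands near the 0.003 quoted above: a tiny true per-pitch effect is invisible pitch by pitch but emerges clearly once each pitcher's pitches are averaged.

```python
import random
import statistics

random.seed(42)

# Invented magnitudes: a tiny true per-pitch effect buried in big noise.
SIGNAL_SD = 0.002   # real physics signal per pitch
NOISE_SD = 0.04     # batter guessing, luck, umpire, everything else
N_PITCHERS, N_PITCHES = 200, 500

def corr(a, b):
    """Pearson correlation."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sum((x - ma) ** 2 for x in a)
                  * sum((y - mb) ** 2 for y in b)) ** 0.5

skills = [random.gauss(0, SIGNAL_SD) for _ in range(N_PITCHERS)]
pitch_truth, pitch_obs, pitcher_means = [], [], []
for s in skills:
    outcomes = [s + random.gauss(0, NOISE_SD) for _ in range(N_PITCHES)]
    pitch_truth += [s] * N_PITCHES
    pitch_obs += outcomes
    pitcher_means.append(statistics.mean(outcomes))

r2_pitch = corr(pitch_truth, pitch_obs) ** 2     # ~0.003: nearly nothing
r2_pitcher = corr(skills, pitcher_means) ** 2    # far higher after averaging
```

Averaging 500 pitches shrinks the noise variance by a factor of 500 while leaving the signal untouched, which is why the pitcher-level R² is orders of magnitude larger than the per-pitch one.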
4. When Trees Start Memorizing
Here's the core problem. Without constraints, XGBoost will keep building trees forever. The first ~50–150 trees learn real patterns. Everything after that is the model memorizing noise.
The Overfitting Story — Visualized
The training loss always improves — the model can always fit the training data better. But the validation loss (performance on unseen data) eventually stops improving and starts getting worse. The gap between the two curves is the model memorizing training-specific quirks.
📚 Analogy: It's like a student studying for an exam. At first, studying helps — they learn real concepts. But at some point, extra studying means memorizing specific practice problems word-for-word rather than understanding the material. They'd ace the practice tests but bomb a new exam with different questions. Trees 1-100 are learning concepts. Trees 200-600 are memorizing practice problems.
What "Memorizing" Looks Like for Stuff+
Without early stopping, Tree #347 might learn something like:
IF spin_rate between 2347 and 2351 AND extension between 6.42 and 6.44 AND pfx_z between 14.8 and 14.9 THEN predict a -0.003 RV correction
This isn't a real pattern — it's a coincidence that the ~12 pitches in the training data with these exact characteristics happened to have good outcomes. A new dataset would have completely different pitches in that tiny slice. The tree fit the noise, not the signal.
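A quick sketch of why that rule is worthless, with invented numbers mirroring the -0.003 example: a dozen lucky training pitches pin the rule's prediction, while fresh pitches in the same slice (which in truth has no effect at all) average out to roughly zero.

```python
import random
import statistics

random.seed(7)

# Twelve invented training pitches in a razor-thin feature slice that
# happened to run lucky (all slightly negative residuals).
train_slice = [-0.004, -0.003, -0.005, -0.002, -0.003, -0.004,
               -0.002, -0.003, -0.003, -0.004, -0.002, -0.005]
rule = statistics.mean(train_slice)   # the memorized "correction", ~-0.003

# The slice has no real effect, so fresh pitches are pure noise around 0.
fresh = [random.gauss(0, 0.03) for _ in range(10_000)]
print(round(rule, 4), round(statistics.mean(fresh), 4))
```

The memorized correction looks meaningful on the training sample, but applying it to new pitches in that slice just adds a systematic error, because the slice's true mean is zero.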
5. Early Stopping: The Safety Net
Early stopping watches the validation loss after every tree. If 50 consecutive trees fail to improve validation performance, it stops building. No more trees. Done.
Early Stopping In Action
Without Early Stopping
- Always builds 600 trees
- Slider model: 571 trees wasted memorizing noise
- Fastball model: 449 trees wasted
- All pitch types get the same treatment regardless of signal strength
- Overfits most on the noisiest pitch types

With Early Stopping
- Stops when learning stops
- Slider: stops at 29 trees, before noise takes over
- Fastball: runs to 151 trees, extracting more real signal
- Each pitch type gets exactly as many trees as the signal supports
- Automatically adapts to signal strength
The "Patience" Parameter
The early_stopping_rounds=50 setting means: "keep going for 50 more trees after the last improvement, in case there's a late breakthrough." If 50 trees pass with no improvement, it's safe to say the model has learned everything it can from the signal.
- 50: patience rounds
- 29-151: actual stopping range
- 600: old max (unused)
The fact that models stop at 29-151 out of a possible 600 means 75-95% of the old trees were pure overfitting. The signal ran out fast — the residual target is just that noisy at the per-pitch level.
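The patience logic itself is simple to sketch. This is a generic early-stopping loop in plain Python, not XGBoost's internals; the validation curve is invented, shaped so improvement stops after round 29 to mirror the slider model above.

```python
PATIENCE = 50

def train_with_early_stopping(build_round, val_loss, max_rounds=600,
                              patience=PATIENCE):
    """build_round(i) adds tree i; val_loss() scores the current model on
    held-out data. Stop once `patience` rounds pass with no improvement."""
    best, best_round = float("inf"), 0
    for i in range(1, max_rounds + 1):
        build_round(i)
        loss = val_loss()
        if loss < best:
            best, best_round = loss, i   # a new best: reset the clock
        elif i - best_round >= patience:
            break                        # 50 rounds, no progress: done
    return best_round                    # keep only this many trees

# Invented validation curve: improves until round 29, then goes flat.
curve = {i: max(1.0 - 0.03 * i, 0.13) for i in range(1, 601)}
rounds_built = []
stop = train_with_early_stopping(rounds_built.append,
                                 lambda: curve[rounds_built[-1]])
print(stop, len(rounds_built))  # best model at tree 29, built 79 total
```

Note the distinction the loop makes visible: it builds 29 + 50 = 79 trees before giving up, but the model it keeps is rolled back to the 29-tree best.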
6. Putting It All Together
The Full Picture
Bottom Line
Early stopping doesn't sacrifice accuracy — it improves it by cutting out the trees that were hurting generalization.
Stuff+ with 600 forced trees was like a scout who watches so much film that they start seeing patterns that aren't there. Early stopping is the scout's manager saying "you've seen enough — trust your read." The model stops at 29-151 trees because that's genuinely where the physics signal runs out in a per-pitch residual. Everything after that was the model lying to itself.