Hyperparameters That Matter Most in Large Language Model Pretraining
Susannah Greenwood

I'm a technical writer and AI content strategist based in Asheville, where I translate complex machine learning research into clear, useful stories for product teams and curious readers. I also consult on responsible AI guidelines and produce a weekly newsletter on practical AI workflows.

5 Comments

  1. Sheila Alston
    January 27, 2026 AT 06:09 AM

    Wow, finally someone who gets it. I’ve seen so many teams waste months because they just copy-pasted Hugging Face’s defaults. I worked on a 47B MoE model last year: we used their suggested min_lr of 1e-7 and wondered why we kept plateauing at 4.2 perplexity. Changed it to 5e-5, retrained for 3 days, and dropped to 3.7. No magic, just math. Why do people still treat this like witchcraft?
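The min_lr floor this commenter describes sits at the tail of the cosine decay schedule. A minimal sketch of that schedule, with illustrative values (the peak learning rate and step count here are assumptions, not from the post; only the two min_lr floors come from the comment):

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float) -> float:
    """Cosine decay from peak_lr down to a min_lr floor.

    With a floor as low as 1e-7, the final stretch of training runs at
    an effectively zero learning rate; a floor like 5e-5 keeps the
    late-stage updates large enough to keep making progress.
    """
    progress = step / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

peak = 3e-4          # assumed peak LR for illustration
steps = 10_000       # assumed schedule length

low_floor = cosine_lr(steps, steps, peak, min_lr=1e-7)   # ends at ~1e-7
high_floor = cosine_lr(steps, steps, peak, min_lr=5e-5)  # ends at 5e-5
```

The schedule shape is identical in both cases; only where it bottoms out differs, which is why the commenter saw the change cost nothing extra in compute.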

  2. sampa Karjee
    January 27, 2026 AT 07:23 PM

    Step Law? How quaint. I suppose you’re also using Newtonian mechanics to calculate orbital trajectories. This formula assumes linearity in non-linear systems, ignores gradient noise covariance, and completely disregards the curvature of the loss landscape. The 52% variance claim is statistically dubious: correlation isn’t causation, especially when you’ve cherry-picked 3,700 runs from a biased dataset. Real researchers use Bayesian optimization with adaptive priors, not back-of-the-envelope scaling laws from arXiv.

  3. Patrick Sieber
    January 29, 2026 AT 02:56 PM

    Man, I love how this post cuts through the noise. I’ve been in the trenches with 80B models and let me tell you: the min_lr trap is everywhere. We had a client who was using a decay to 1e-7 and swearing their model was "converging fine." Turned out it was just frozen in a shallow valley. Changed to 2e-5, same compute, same data, and perplexity dropped 0.4. No drama. No tuning. Just follow the formula. Also, batch size scaling with D^0.75? That’s the golden nugget. Stop thinking bigger batch = better. It’s about signal-to-noise, not volume.

  4. Kieran Danagher
    January 30, 2026 AT 04:18 PM

    So you're telling me the secret to training LLMs isn't AI, isn't magic, isn't 17 layers of hyperparameter tuning... it's a fucking calculator? I'm shocked. Absolutely stunned. Next you'll tell me water is wet and the sun rises in the east. Guess I'll go delete my Optuna config and just do the math. Thanks for making me feel like a clown who spent $200k on a fancy tuner while the answer was in a 2025 arXiv paper.

  5. OONAGH Ffrench
    January 31, 2026 AT 05:39 AM

    Math is the quiet architect of intelligence. The model doesn’t care about your optimizer name or your batch size ritual. It only responds to the gradient. The learning rate is the pulse. The batch size is the rhythm. Everything else is noise. We have spent decades building temples to complexity when the answer was always in the scaling. Let the numbers speak. Let the data breathe. Let the model learn. That’s all it ever needed.
