Hyperparameters That Matter Most in Large Language Model Pretraining
Susannah Greenwood

I'm a technical writer and AI content strategist based in Asheville, where I translate complex machine learning research into clear, useful stories for product teams and curious readers. I also consult on responsible AI guidelines and produce a weekly newsletter on practical AI workflows.

2 Comments

  1. Sheila Alston
    January 27, 2026 AT 06:09 AM

    Wow, finally someone who gets it. I’ve seen so many teams waste months because they just copy-pasted Hugging Face’s defaults. I worked on a 47B MoE model last year; we used their suggested min_lr of 1e-7 and wondered why we kept plateauing at 4.2 perplexity. Changed it to 5e-5, retrained for 3 days, and dropped to 3.7. No magic, just math. Why do people still treat this like witchcraft?

  2. sampa Karjee
    January 27, 2026 AT 07:23 PM

    Step Law? How quaint. I suppose you’re also using Newtonian mechanics to calculate orbital trajectories. This formula assumes linearity in non-linear systems, ignores gradient noise covariance, and completely disregards the curvature of the loss landscape. The 52% variance claim is statistically dubious: correlation isn’t causation, especially when you’ve cherry-picked 3,700 runs from a biased dataset. Real researchers use Bayesian optimization with adaptive priors, not back-of-the-envelope scaling laws from arXiv.
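
The first comment's point about min_lr is easier to see with the schedule written out. In a standard cosine decay, min_lr is the floor the learning rate settles toward, so a floor of 1e-7 leaves the last stretch of training at an effectively zero learning rate. The sketch below is a generic illustration of that schedule; the function and parameter names are placeholders, not any particular framework's API.

```python
import math

def cosine_lr(step, total_steps, max_lr, min_lr, warmup_steps=0):
    # Linear warmup followed by cosine decay from max_lr down to min_lr.
    # Generic sketch; the names do not refer to any specific library.
    if step < warmup_steps:
        return max_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (max_lr - min_lr) * cosine

# Late in training (step 95k of 100k, peak LR 3e-4) the floor dominates:
# min_lr = 1e-7 gives roughly 2e-6, while min_lr = 5e-5 stays above 5e-5.
for floor in (1e-7, 5e-5):
    print(floor, cosine_lr(step=95_000, total_steps=100_000,
                           max_lr=3e-4, min_lr=floor))
```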

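For readers who haven't met it, the Step Law the second comment is dismissing is a fitted power law: it predicts the optimal learning rate (and batch size) from model parameter count N and training tokens D. The snippet below shows only that functional form; the coefficients are illustrative placeholders, not the values fitted in the paper.

```python
def step_law_style_lr(n_params, n_tokens, a=1.0, alpha=-0.7, beta=0.3):
    # Power-law form: eta_opt = a * N**alpha * D**beta.
    # a, alpha, beta are placeholders for illustration only,
    # NOT the coefficients fitted in the Step Law paper.
    return a * (n_params ** alpha) * (n_tokens ** beta)

def step_law_style_batch(n_tokens, b=1.0, gamma=0.5):
    # The batch-size rule has the same flavor: a power law in data size D.
    # b and gamma are likewise illustrative placeholders.
    return b * (n_tokens ** gamma)
```

Whether a closed-form fit like this beats a per-run Bayesian search, as the commenter prefers, is the trade-off being argued: the power law is cheap and zero-shot, while the search costs extra runs but adapts to the specific setup.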