Transformer Pre-Norm vs Post-Norm Architectures: Which One Keeps LLMs Stable?
Susannah Greenwood

I'm a technical writer and AI content strategist based in Asheville, where I translate complex machine learning research into clear, useful stories for product teams and curious readers. I also consult on responsible AI guidelines and produce a weekly newsletter on practical AI workflows.

8 Comments

  1. Wilda Mcgee
    December 16, 2025 AT 18:33 PM

    Pre-norm is honestly a game-changer for deep models. I was struggling with my 78-layer model for weeks: crashes every other day, gradients vanishing like smoke. Switched to pre-norm, bumped the learning rate by 20%, set gradient clipping to 1.5, and boom, it trained clean for 12 days straight. No more midnight panic emails to the team. Honestly, if you're building anything over 50 layers, don't even waste time with post-norm. It's like trying to run a marathon in concrete boots.
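
    For anyone who wants to copy the recipe, here's a rough sketch of that kind of setup in PyTorch. The tiny model and dummy loss are placeholders so it runs on its own; the 1.5 clip and the ~20% LR bump are just the values from my run, not universal defaults.

    import torch
    import torch.nn as nn

    # Toy stand-in for the real network; the clip and the LR are the point here.
    model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
    optimizer = torch.optim.AdamW(model.parameters(), lr=3.6e-4)  # ~20% above my old post-norm LR

    for step in range(100):
        x = torch.randn(32, 64)
        loss = model(x).pow(2).mean()  # dummy loss just to drive the loop
        optimizer.zero_grad()
        loss.backward()
        # Global gradient-norm clipping at 1.5: the knob that kept my 78-layer run from blowing up.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.5)
        optimizer.step()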

  2. Chris Atkins
    December 18, 2025 AT 01:52 AM

    post-norm still has its place, honestly, if you've got a small model and want that extra 0.4 BLEU point and don't care about the 3 weeks of tuning

  3. Jen Becker
    December 18, 2025 AT 11:58 AM

    pre-norm causes massive activations? that's just code rot. someone's doing it wrong.

  4. Ryan Toporowski
    December 19, 2025 AT 02:22 AM

    Yessss this is the real talk 😊 I switched from post-norm to pre-norm last month and my training time dropped from 14 days to 5. I was skeptical but now I’m evangelizing it to everyone. Just remember: crank up that LR and clamp those gradients. And yes, I’ve seen those activation spikes too; logging max activation every 100 steps saved my sanity. 🚀
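
    For anyone who wants the logging trick, one way to do it is with forward hooks. The model below is a toy stand-in and the hooks only watch the Linear layers; point them at whatever blocks you actually care about.

    import torch
    import torch.nn as nn

    # Track the peak absolute activation per layer and print it every 100 steps.
    model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
    peak = {}

    def track_peak(name):
        def hook(module, inputs, output):
            peak[name] = max(peak.get(name, 0.0), output.detach().abs().max().item())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            module.register_forward_hook(track_peak(name))

    for step in range(1000):
        model(torch.randn(8, 64))
        if step % 100 == 0:
            print(step, {k: round(v, 2) for k, v in peak.items()})
            peak.clear()  # reset so spikes show up per logging window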

  5. Samuel Bennett
    December 20, 2025 AT 06:40 AM

    Correction: the paper says gradient norm drops by 1/sqrt(layer), not 15% at layer 50. It's actually around 14.14%. Also, you said 'Google’s PaLM uses pre-norm'. Wrong. PaLM uses a modified variant called 'pre-LN with residual scaling.' And you didn't mention that post-norm doesn't need gradient clipping because it naturally suppresses outliers. You're oversimplifying. Again.
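
    The arithmetic, for anyone who'd rather check it than argue about it (this only evaluates 1/sqrt(L), it doesn't prove the scaling claim itself):

    import math

    # 1/sqrt(layer): at layer 50 this is ~0.1414, i.e. about 14.14%, not a round 15%.
    for layer in (10, 50, 100):
        print(layer, f"{1 / math.sqrt(layer):.4f}")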

    Also, 'Hugging Face defaults to pre-norm'? No. They default to what the model config says. Llama uses pre-norm. BERT uses post-norm. You're conflating defaults with common practice. This post reads like a Medium article written by someone who skimmed the abstracts.

  6. Rob D
    December 21, 2025 AT 00:51 AM

    Look, if you're still using post-norm you're basically clinging to 2017 tech like it's the American flag. Pre-norm is the future, and anyone who says otherwise is either an academic with tenure or someone who hasn't trained a real model since 2020. I trained a 120-layer model on a single A100: pre-norm, no warmup, no magic, just a 1.8 gradient clip and a 3e-4 LR. Post-norm? You'd need a cluster the size of a small country and a PhD in optimization just to get it to not explode. And don't even get me started on that 'representation collapse' nonsense; it sounds like someone who didn't monitor their latent space. Real engineers don't cry about activation norms, they clamp them and move on.
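
    And since this apparently needs spelling out: the whole pre-norm vs post-norm 'debate' comes down to where the LayerNorm sits relative to the residual. Rough sketch below with toy blocks only: no dropout, no masks, no KV cache, nobody's production code.

    import torch
    import torch.nn as nn

    class PreNormBlock(nn.Module):
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            # Normalize *before* each sublayer; the residual path stays untouched,
            # which is why gradients still reach the bottom of a very deep stack.
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ff(self.norm2(x))
            return x

    class PostNormBlock(nn.Module):
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            # Normalize *after* the residual add (the original 2017 layout).
            x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.norm2(x + self.ff(x))
            return x

    x = torch.randn(2, 16, 64)
    print(PreNormBlock(64, 4)(x).shape, PostNormBlock(64, 4)(x).shape)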

    Also, Peri-LN? That's just pre-norm with extra steps. And adaptive normalization? Sounds like a marketing buzzword. If it ain't broke, don't fix it. But post-norm is broke. Period.

  7. Franklin Hooper
    December 21, 2025 AT 12:10 PM

    It's amusing how casually people cite 'Google researchers found...' without linking to the actual paper. The 2024 activation explosion claim? That's from an internal Google blog post, not peer-reviewed. The 22% memory increase? Based on a single Hugging Face benchmark using a non-standard sequence length. And 'representation collapse' is a term coined in a 2024 arXiv preprint with no validation across datasets. The entire narrative here is built on anecdotal Reddit posts and speculative blog summaries.

    Pre-norm is not inherently superior. It's merely more forgiving of under-tuned setups. That's why it's popular in industry: because most teams lack the expertise to properly train post-norm. But that doesn't make post-norm inferior. It makes the practitioners inferior.

    Also, you say 'every major LLM since 2022 uses pre-norm.' Llama 2? Post-norm. Mistral? Post-norm. Gemma? Post-norm. You're misrepresenting the landscape to sell a narrative. This is dangerously misleading.

  8. Jess Ciro
    December 22, 2025 AT 10:43 AM

    Pre-norm is a trap. The real reason it 'works' is that it hides the model's instability behind gradient clipping and higher learning rates. It's like putting a band-aid on a broken leg and calling it a fix. The 'massive activations' aren't a bug, they're a warning. Your model is learning nothing but scale. The gradients are flowing, sure, but the representations are collapsing. You think you're training a 100-layer transformer? You're training a 5-layer transformer with noise amplification.
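
    If you'd rather measure it than take my word for it, here's a crude check: compare each layer's output direction to the previous one's. The residual blocks below are toy stand-ins for a real stack; if the cosine similarity creeps toward 1.0 while the norms keep growing, the layers are adding scale, not features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Block(nn.Module):
        def __init__(self, d):
            super().__init__()
            self.norm = nn.LayerNorm(d)
            self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

        def forward(self, x):
            # Pre-norm-style residual block: untouched skip path plus a normalized sublayer.
            return x + self.ff(self.norm(x))

    depth, d = 32, 64
    blocks = nn.ModuleList([Block(d) for _ in range(depth)])

    prev = torch.randn(16, d)
    with torch.no_grad():
        for i, block in enumerate(blocks):
            out = block(prev)
            cos = F.cosine_similarity(out, prev, dim=-1).mean().item()
            norm = out.norm(dim=-1).mean().item()
            print(f"layer {i:2d}  cos_to_prev={cos:.3f}  norm={norm:.1f}")
            prev = out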

    And don't get me started on 'adaptive normalization.' That's just corporate jargon for 'we don't know what we're doing so we'll let the algorithm guess.' The real breakthroughs are in layer-wise attention routing, not normalization gymnastics. This whole thread is a distraction. The real problem? We're stacking layers like LEGO bricks and pretending depth equals intelligence. We're not building brains. We're building glorified echo chambers with weights.

    Post-norm forces you to understand your model. Pre-norm lets you pretend you don't need to. That's why academia still uses it. Because they care about results, not just convergence.
