R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? Paper • 2510.08189 • Published Oct 9, 2025 • 26
SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7, 2025 • 202
Article LLM Inference on Edge: A Fun and Easy Guide to run LLMs via React Native on your Phone! Mar 7, 2025 • 89
Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping Paper • 2409.15241 • Published Sep 23, 2024 • 1
Scaling Laws for Floating Point Quantization Training Paper • 2501.02423 • Published Jan 5, 2025 • 26
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22, 2024 • 259
Small-scale proxies for large-scale Transformer training instabilities Paper • 2309.14322 • Published Sep 25, 2023 • 21
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets Paper • 2201.02177 • Published Jan 6, 2022 • 2
Article A failed experiment: Infini-Attention, and why we should keep trying? Aug 14, 2024 • 74
Grokfast: Accelerated Grokking by Amplifying Slow Gradients Paper • 2405.20233 • Published May 30, 2024 • 7
Transformer Explainer: Interactive Learning of Text-Generative Models Paper • 2408.04619 • Published Aug 8, 2024 • 174