Scratch to Scale

classroom

AI & ML interests

None defined yet.

Recent Activity

SJCaldwell authored a paper 2 months ago

AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models

SJCaldwell authored a paper 2 months ago

PentestJudge: Judging Agent Behavior Against Operational Requirements

Aurelien-Morgan

posted an update about 1 month ago

Post

1084

@retrain-pipelines v0.2.0 is out !
I'm at Station F, at my booth with GOSIM Paris 2026, today & tomorrow.
Come meet me for a live in-person demo and a chat !

1 reply

woojun-jung

authored a paper about 1 month ago

QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

Paper • 2604.24052 • Published Apr 27

Aurelien-Morgan

posted an update about 2 months ago

Post

226

Launching a workweek of @retrain-pipelines wheels.

Day #1 : Compose

4 replies

chansung

authored a paper 4 months ago

TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Paper • 2602.15449 • Published Feb 17 • 7

chansung

submitted a paper to Daily Papers 4 months ago

TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Paper • 2602.15449 • Published Feb 17 • 7

eliebak

submitted a paper to Daily Papers 6 months ago

SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Paper • 2512.14080 • Published Dec 16, 2025 • 9

woojun-jung

authored a paper 6 months ago

Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Paper • 2512.10362 • Published Dec 11, 2025 • 1

Aurelien-Morgan

posted an update 6 months ago

Post

372

Hey, I went to Hangzhou to talk about retrain-pipelines at the GOSIM Foundation's conference last September.
The recording just got released. Go check it out !
https://www.youtube.com/watch?v=nmrMachM5aM
Slides are here :
https://docs.google.com/presentation/d/1hnAzHJ0SbeAOtGJir-iH84RBtXT1OxVT/

2 replies

eliebak

posted an update 9 months ago

Post

4506

Super excited to announce that our research team at Hugging Face will be doing an AMA on reddit r/LocalLLaMA.

Come ask the team behind SmolLM, FineWeb and more anything you like! And who knows, maybe there’ll be a shiny new release to talk about?

Thursday 4th September, 8AM-11AM PST 🤗

science

Saurav2023

published a Space 9 months ago

README

👀

eliebak

posted an update 9 months ago

Post

773

The Motif 2.6B tech report is pretty insane, first time I've seen a model with differential attention and PolyNorm trained at scale!

> It's trained on 2.5T tokens, with a "data mixture schedule" that continuously adjusts the mixture over training.
> They use WSD with a "simple moving average", averaging the last 6 checkpoints every 8B tokens.
> They trained on FineMath, FineWeb2, DCLM, TxT360.
> Lots of details on the finetuning data they used; for instance, they used EvolKit and did some "dataset fusion" to compress more knowledge into the data.
> They mention they also tried Normalized GPT, QK-Norm and Cross Layer Attention.

Motif-Technologies/Motif-2.6B
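
The checkpoint averaging mentioned in the notes above can be illustrated with a small sketch. This is not the Motif team's code: it only shows a simple moving average over the most recent checkpoints, with parameters stored as plain lists of floats. The class name `CheckpointSMA`, the `window` parameter, and the toy parameter name `"w"` are all hypothetical, and the WSD learning-rate schedule itself is not modeled.

```python
from collections import deque


def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts (name -> list of floats)."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


class CheckpointSMA:
    """Keeps the last `window` checkpoints and returns their average."""

    def __init__(self, window=6):
        # deque(maxlen=...) silently drops the oldest checkpoint when full.
        self.recent = deque(maxlen=window)

    def add(self, state):
        # Copy the values so later in-place training updates do not leak in.
        self.recent.append({name: list(vals) for name, vals in state.items()})
        return average_checkpoints(list(self.recent))


# Toy run with a window of 2 instead of 6, to keep the numbers readable.
sma = CheckpointSMA(window=2)
first = sma.add({"w": [1.0, 2.0]})
second = sma.add({"w": [3.0, 4.0]})
third = sma.add({"w": [5.0, 6.0]})  # the first checkpoint has left the window
print(first, second, third)
```

In a real training loop, `add` would be called each time a checkpoint is saved (every 8B tokens in the setup described above), and the averaged weights would be evaluated or exported separately from the weights still being trained.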

DandinPower

authored 2 papers 10 months ago

MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning

Paper • 2505.23254 • Published May 29, 2025

Analysis and Optimized CXL-Attached Memory Allocation for Long-Context LLM Fine-Tuning

Paper • 2507.03305 • Published Jul 4, 2025

eliebak

posted an update 11 months ago

Post

4843

Kimi K2 tech report is full of gems as always. Here are my notes on it:

> MuonClip: Pretty crazy how after 70k steps the training stabilizes and the QK-clip is basically inactive. There is also no loss in perf with QK-clip, which is not trivial at all (at small scale but with an aggressive threshold). Also a cool explanation of why Muon makes the logits explode in appendix E (tl;dr: Muon makes the singular values of the update matrix higher).
> Sparsity scaling laws to justify their ratio. They have a very solid training infra that allows the model to be trained at this sparsity level; they could have increased it even more, but as sparsity increases the training becomes less efficient.
> They reduce the number of attention heads to make it more efficient for long context, since attention heads are a big bottleneck for long context. They also remove 2 of the 3 "first dense" layers in the dsv3 arch.

With the sparsity and attention heads (divided by 2) they achieve 83% increased flops compared to deepseek v3 arch at 128k.
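
To make the QK-clip idea from the first note concrete, here is a minimal sketch under simplifying assumptions. It handles a single attention head and uses plain Python lists. When the largest pre-softmax logit exceeds a threshold `tau`, it rescales the query and key projection weights, splitting the shrink factor evenly between the two. The function names, the toy matrices, and the value of `tau` are hypothetical, and the actual MuonClip procedure in the tech report has details this sketch leaves out.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def max_attention_logit(x, w_q, w_k):
    """Largest pre-softmax attention logit for one head over the inputs x."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    head_dim = len(w_q[0])
    return max(
        sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(head_dim)
        for q_row in q
        for k_row in k
    )


def qk_clip(w_q, w_k, s_max, tau):
    """Shrink W_q and W_k so that a max logit of s_max drops to tau."""
    if s_max <= tau:
        return w_q, w_k  # clip inactive: weights are left untouched
    # Logits are bilinear in (W_q, W_k), so scaling each by sqrt(tau / s_max)
    # scales every logit by tau / s_max.
    scale = math.sqrt(tau / s_max)
    return (
        [[v * scale for v in row] for row in w_q],
        [[v * scale for v in row] for row in w_k],
    )


# Toy inputs (hypothetical values, one head with head_dim = 2).
x = [[1.0, 0.0], [0.0, 2.0]]
w_q = [[3.0, 0.0], [0.0, 3.0]]
w_k = [[2.0, 0.0], [0.0, 2.0]]
tau = 4.0

before = max_attention_logit(x, w_q, w_k)
w_q_new, w_k_new = qk_clip(w_q, w_k, before, tau)
after = max_attention_logit(x, w_q_new, w_k_new)
print(before, after)
```

Once the logits stay below `tau` on their own, `qk_clip` returns the weights unchanged, which matches the observation above that the clip becomes basically inactive later in training.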

> Data: Rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus to have different styles; for longer documents they do it by chunk. I'm (half) surprised by the fact that ONLY 1 epoch (assuming the same number of training tokens, I think?) of data rephrased 10 times has better accuracy than 10 epochs of the same data rephrased once.
> They do rewriting for Math and Knowledge; for Math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal stuff/eval to test that; as always, still a bit unclear to me how to properly measure that.

The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, zero1
> No FP8 for computation, only for storage in specific layers; selective recomputation for inexpensive blocks; activation offloading to CPU

dsouzadaniel

authored a paper 11 months ago

When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs

Paper • 2506.20544 • Published Jun 25, 2025 • 11

chansung

posted an update 11 months ago

Post

4838

YAML engineering is becoming more important than ever, from infra provisioning to model training (recipes).

Here, I built a simple editor first for @dstackai, and I will share the live endpoint this week. Let me know what you think about this approach.

If people think this is useful, I am going to do the same thing for the LLM training recipes of popular frameworks such as Hugging Face open-r1, Axolotl, and so on. Let me know.

eliebak

authored a paper 12 months ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5, 2025 • 61

Aurelien-Morgan

posted an update about 1 year ago

Post

468

Hey, I'll be presenting @retrain-pipelines and almighty function-calling at the Hugging Face Paris HQ, you guys.
Monday evening. Lightning-talk style. With AI Tinkerers.

Come hang !

https://paris.aitinkerers.org/p/ai-tinkerers-paris-ai21-labs-takeover-on-may-19th

https://huggingface.co/blog/Aurelien-Morgan/the-almighty-function-caller

dsouzadaniel

authored a paper about 1 year ago

The Leaderboard Illusion

Paper • 2504.20879 • Published Apr 29, 2025 • 72

Aurelien-Morgan

posted an update about 1 year ago

Post

3162

The Almighty function-caller

How would you like to build smart GenAI infrastructure ?
Give your edge agentic system an extensive tools memory,
and optimize the resources it takes to run a high-performance set of agents ?

We came up with a novel approach to function-calling at scale for smart companies and corporate-grade use-cases.

Read our full-fledged blog article on this here on Hugging Face :
https://huggingface.co/blog/Aurelien-Morgan/the-almighty-function-caller

Team members 279
