Article The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix Nov 3, 2025 • 53
Article System Prompt Learning: Teaching LLMs to Learn Problem-Solving Strategies from Experience Jun 2, 2025 • 23
Prompt-MII: Meta-Learning Instruction Induction for LLMs Paper • 2510.16932 • Published Oct 19, 2025 • 7
Article LightOnOCR-1B: The Case for End-to-End and Efficient Domain-Specific Vision-Language Models for OCR Oct 23, 2025 • 62
Article ViDoRe V3: A Comprehensive Evaluation of Retrieval for Enterprise Use Cases Nov 5, 2025 • 57
Tarka Embed V1 Collection Efficient DFKD embeddings for language understanding • 5 items • Updated 19 days ago • 6
Article Train 400x Faster Static Embedding Models with Sentence Transformers Jan 15, 2025 • 222
POTION Collection These are the flagship POTION models. Load and use them with model2vec (https://github.com/MinishLab/model2vec) or sentence-transformers. • 6 items • Updated Nov 13, 2025 • 14
V-JEPA 2 Collection A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13, 2025 • 181
Meta CLIP 1 Collection Scaling CLIP data with transparent training distribution from an end-to-end pipeline. • 7 items • Updated Nov 24, 2025 • 21
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models Paper • 2502.09604 • Published Feb 13, 2025 • 37
Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine Paper • 2510.21614 • Published Oct 24, 2025 • 22
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents Paper • 2507.04009 • Published Jul 5, 2025 • 51