SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper • 2506.01844
Note: A reinforcement fine-tuning framework that uses a simulator. Looks very promising! It looks like they predict future actions/frames to generate data for reinforcement learning. They train a model on a dataset so that it predicts images and their rewards.
They work in two stages:
1. WM and policy pretraining -> train a world model (WM) on an existing dataset.
2. VLA optimization through WM interaction -> fine-tune the VLA using the world model, in chunks.
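
The two stages in the note can be sketched on a toy 1-D problem. This is only an illustration of the idea, not the paper's implementation: every name here (`collect_dataset`, `WorldModel`, `finetune_policy`) is hypothetical, a scalar state stands in for an image frame, a one-parameter controller stands in for the VLA, and simple hill climbing is swapped in for a real reinforcement learning algorithm.

```python
import random


def collect_dataset(n=200, seed=0):
    """Hypothetical offline dataset of (state, action, next_state, reward)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-1.0, 1.0)
        s_next = s + 0.5 * a   # hidden toy dynamics (assumption)
        r = -abs(s_next)       # toy reward: stay close to the origin
        data.append((s, a, s_next, r))
    return data


class WorldModel:
    """Stage 1: learns to predict the next state (the 'frame') and its reward."""

    def __init__(self):
        self.gain = 0.0
        self.reward_weight = 0.0

    def fit(self, data):
        # One-parameter least squares for the dynamics: (s_next - s) ~ gain * a
        num = sum(a * (s_next - s) for s, a, s_next, _ in data)
        den = sum(a * a for _, a, _, _ in data)
        self.gain = num / den
        # One-parameter least squares for the reward: r ~ reward_weight * |s_next|
        num = sum(abs(s_next) * r for _, _, s_next, r in data)
        den = sum(s_next * s_next for _, _, s_next, _ in data)
        self.reward_weight = num / den

    def step(self, s, a):
        s_next = s + self.gain * a
        return s_next, self.reward_weight * abs(s_next)


def rollout_return(wm, theta, s0, chunk_len):
    """Imagined return of the policy a = -theta * s over one chunk of steps."""
    s, total = s0, 0.0
    for _ in range(chunk_len):
        s, r = wm.step(s, -theta * s)
        total += r
    return total


def finetune_policy(wm, theta, start_states=(-1.0, -0.5, 0.5, 1.0),
                    chunk_len=5, step=0.1, iters=50):
    """Stage 2: improve the policy using only world-model rollouts."""

    def score(t):
        return sum(rollout_return(wm, t, s0, chunk_len) for s0 in start_states)

    for _ in range(iters):
        # Hill climbing: keep whichever neighbouring parameter scores best.
        theta = max((theta - step, theta, theta + step), key=score)
    return theta


wm = WorldModel()
wm.fit(collect_dataset())                # stage 1: world model from existing data
theta = finetune_policy(wm, theta=0.5)   # stage 2: 0.5 stands in for a pretrained policy
```

The point of the sketch is that stage 2 never touches the real system: the policy is scored and updated purely on chunks of transitions and rewards imagined by the fitted world model.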