VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
Abstract
Vision Language Models struggle with fine-grained visual perception tasks due to their language-centric training approach, performing poorly on unnamed visual entities despite having relevant information in their representations.
Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline, which focuses on mapping visual information into the textual space. Consequently, VLMs can only reason about visual entities that map to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. When no such mapping exists, VLMs fall back on brittle, often hallucinated textual descriptions of the visual entities, severely limiting several important multimodal capabilities. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are not. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more distinct matching tokens for them than for unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect learned shortcuts from their training, rather than a fundamental limitation of multimodal architectures.
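The nameable-vs-unnameable comparison described above amounts to splitting correspondence accuracy by whether the target entity has a common name. A minimal sketch of such an evaluation split (all trial records here are illustrative, not the paper's data):

```python
from dataclasses import dataclass

@dataclass
class Trial:
    nameable: bool  # does the target entity have a common language label?
    correct: bool   # did the VLM pick the matching entity?

def accuracy_by_nameability(trials):
    """Return (nameable_accuracy, unnameable_accuracy)."""
    def acc(group):
        group = list(group)
        return sum(t.correct for t in group) / len(group) if group else float("nan")
    return (acc(t for t in trials if t.nameable),
            acc(t for t in trials if not t.nameable))

# toy trials: nameable entities matched more often than unnameable ones
trials = [Trial(True, True), Trial(True, True), Trial(True, False),
          Trial(False, False), Trial(False, True), Trial(False, False)]
nameable_acc, unnameable_acc = accuracy_by_nameability(trials)
print(nameable_acc, unnameable_acc)
```

With the toy trials above, the nameable split scores higher, mirroring the gap the paper reports.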
Community
🤔 The VLM community has long had the intuition that vision-focused tasks that resist transcription into text are harder for VLMs. What makes this especially puzzling: prior work has shown the visual information IS there inside the LM's representations. VLMs just don't use it.
🔬 Our new work explains why. Across correspondence tasks on natural images, abstract shapes, and faces, VLMs perform dramatically better when the target entity has a name. Inspecting the internal representations, we find that VLMs explicitly convert visual entities into their text labels and solve the task linguistically. Teaching arbitrary names for novel shapes confirms this: once a shape has a label, performance jumps.
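The Logit Lens analysis behind this inspection projects each layer's hidden state through the model's unembedding (LM head) matrix to see which tokens the layer is "thinking of". A toy sketch with random matrices, assuming nothing about the paper's actual models or code:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab, n_layers = 16, 10, 5
# toy unembedding matrix; in a real VLM this is the LM head weight
W_U = rng.normal(size=(d_model, vocab))

def logit_lens(hidden_states, W_U):
    """Project per-layer hidden states through the unembedding matrix
    and return the top-1 token id at each layer."""
    layer_logits = hidden_states @ W_U   # (n_layers, vocab)
    return layer_logits.argmax(axis=-1)  # top token per layer

# toy hidden states at one visual-token position, one row per layer
hidden = rng.normal(size=(n_layers, d_model))
top_tokens = logit_lens(hidden, W_U)
print(top_tokens.shape)
```

On a real VLM one would read the hidden states at visual-token positions from each transformer layer; if the top tokens name the depicted entity, the model has mapped the visual input into language, which is the signature the authors report for nameable entities.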
🚨 But this is not an architectural limitation. Task-specific finetuning teaches a genuinely different mechanism that bypasses language entirely, and generalizes broadly. The problem is the training pipeline, not the transformer.
The following similar papers were recommended by the Semantic Scholar API:
- Seeing to Generalize: How Visual Data Corrects Binding Shortcuts (2026)
- Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness (2026)
- The Dual Mechanisms of Spatial Reasoning in Vision-Language Models (2026)
- LanteRn: Latent Visual Structured Reasoning (2026)
- What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models (2026)
- CrystaL: Spontaneous Emergence of Visual Latents in MLLMs (2026)
- Can Vision-Language Models Solve the Shell Game? (2026)
Get this paper in your agent:
hf papers read 2604.02486
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash