VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
Abstract
Vision Language Models struggle with fine-grained visual perception tasks due to their language-centric training approach, performing poorly on unnamed visual entities despite having relevant information in their representations.
Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline, which focuses on mapping visual information into the textual space. Consequently, VLMs can only reason about visual entities that map to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. When no such mapping exists, VLMs fall back on brittle, often hallucinated textual descriptions of the visual entities, severely limiting several important multimodal capabilities. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are not. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more distinct matching tokens for them than for unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect learned shortcuts from their training, rather than a fundamental limitation of multimodal architectures.
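The nameable-vs-unnameable comparison described above amounts to splitting correspondence accuracy by whether the target entity has a common name. A minimal sketch of such an evaluation split (all trial records here are illustrative, not the paper's data):

```python
from dataclasses import dataclass

@dataclass
class Trial:
    nameable: bool  # does the target entity have a common language label?
    correct: bool   # did the VLM pick the matching entity?

def accuracy_by_nameability(trials):
    """Return (nameable_accuracy, unnameable_accuracy)."""
    def acc(group):
        group = list(group)
        return sum(t.correct for t in group) / len(group) if group else float("nan")
    return (acc(t for t in trials if t.nameable),
            acc(t for t in trials if not t.nameable))

# toy trials: nameable entities matched more often than unnameable ones
trials = [Trial(True, True), Trial(True, True), Trial(True, False),
          Trial(False, False), Trial(False, True), Trial(False, False)]
nameable_acc, unnameable_acc = accuracy_by_nameability(trials)
print(nameable_acc, unnameable_acc)
```

With the toy trials above, the nameable split scores higher, mirroring the gap the paper reports.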
Community
🤔 The VLM community has long had the intuition that vision-focused tasks that resist transcription into text are harder for VLMs. What makes this especially puzzling: prior work has shown the visual information IS there inside the LM's representations. VLMs just don't use it.
🔬 Our new work explains why. Across correspondence tasks on natural images, abstract shapes, and faces, VLMs perform dramatically better when the target entity has a name. Inspecting the internal representations, we find that VLMs explicitly convert visual entities into their text labels and solve the task linguistically. Teaching arbitrary names for novel shapes confirms this: once a shape has a label, performance jumps.
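The Logit Lens analysis behind this inspection projects each layer's hidden state through the model's unembedding (LM head) matrix to see which tokens the layer is "thinking of". A toy sketch with random matrices, assuming nothing about the paper's actual models or code:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab, n_layers = 16, 10, 5
# toy unembedding matrix; in a real VLM this is the LM head weight
W_U = rng.normal(size=(d_model, vocab))

def logit_lens(hidden_states, W_U):
    """Project per-layer hidden states through the unembedding matrix
    and return the top-1 token id at each layer."""
    layer_logits = hidden_states @ W_U   # (n_layers, vocab)
    return layer_logits.argmax(axis=-1)  # top token per layer

# toy hidden states at one visual-token position, one row per layer
hidden = rng.normal(size=(n_layers, d_model))
top_tokens = logit_lens(hidden, W_U)
print(top_tokens.shape)
```

On a real VLM one would read the hidden states at visual-token positions from each transformer layer; if the top tokens name the depicted entity, the model has mapped the visual input into language, which is the signature the authors report for nameable entities.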
🚨 But this is not an architectural limitation. Task-specific finetuning teaches a genuinely different mechanism that bypasses language entirely, and generalizes broadly. The problem is the training pipeline, not the transformer.
The following similar papers were recommended by the Semantic Scholar API:
- Seeing to Generalize: How Visual Data Corrects Binding Shortcuts (2026)
- Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness (2026)
- The Dual Mechanisms of Spatial Reasoning in Vision-Language Models (2026)
- LanteRn: Latent Visual Structured Reasoning (2026)
- What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models (2026)
- CrystaL: Spontaneous Emergence of Visual Latents in MLLMs (2026)
- Can Vision-Language Models Solve the Shell Game? (2026)
Get this paper in your agent:
hf papers read 2604.02486
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash