Papers - Image - Encoders
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
Paper
• 2107.00652
• Published • 2
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
Paper
• 2403.09622
• Published • 17
Veagle: Advancements in Multimodal Representation Learning
Paper
• 2403.08773
• Published • 10
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Paper
• 2304.14178
• Published • 3
ViTAR: Vision Transformer with Any Resolution
Paper
• 2403.18361
• Published • 55
TextCraftor: Your Text Encoder Can be Image Quality Controller
Paper
• 2403.18978
• Published • 15
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper
• 2404.01331
• Published • 27
PointInfinity: Resolution-Invariant Point Diffusion Models
Paper
• 2404.03566
• Published • 16
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
Paper
• 2109.10282
• Published • 13
Text Role Classification in Scientific Charts Using Multimodal Transformers
Paper
• 2402.14579
• Published • 1