Interpreting OCR

Investigating textual semantic subspaces within image encoders. This work presents strong evidence for a textual semantic subspace inside the image encoder that captures the meaning of text rendered in images.

Certain self-supervised approaches to training image encoders, such as CLIP, align images with their text captions. However, these approaches have no a priori incentive to associate text appearing inside an image with that text's semantics. Our work studies the semantics of text rendered in images (Vennam et al., 2024). We show evidence suggesting that CLIP's image representations contain a subspace for textual semantics that abstracts away fonts. Furthermore, we show that the image encoder's rendered-text representations only slightly lag behind the text encoder's representations in preserving semantic relationships.
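To make the claim concrete, here is a minimal sketch, not the authors' code, of the kind of probe described above: render words as images, embed them with CLIP's image encoder, and check whether semantic neighbors among the rendered words mirror those given by the text encoder. The checkpoint name (`openai/clip-vit-base-patch32`) and the probe words are illustrative assumptions.

```python
# Sketch: compare semantic similarity of rendered-text image embeddings
# against ordinary text embeddings from the same CLIP model.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def render(word: str) -> Image.Image:
    """Render a word as black text on a white 224x224 canvas."""
    img = Image.new("RGB", (224, 224), "white")
    ImageDraw.Draw(img).text((20, 100), word, fill="black")
    return img

# Illustrative probe words: "cat"/"kitten" are semantic neighbors, "car" is not.
words = ["cat", "kitten", "car"]

with torch.no_grad():
    # Image-encoder embeddings of the *rendered* words.
    img_inputs = processor(images=[render(w) for w in words], return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    # Text-encoder embeddings of the same words, for comparison.
    txt_inputs = processor(text=words, return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)

img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# If the image encoder carries text semantics, "cat" vs "kitten" should be
# closer than "cat" vs "car" in *both* cosine-similarity matrices.
print("image-encoder sims:\n", img_emb @ img_emb.T)
print("text-encoder sims:\n", txt_emb @ txt_emb.T)
```

A font-invariance check would follow the same pattern: render the same word in several fonts and verify that its image embeddings stay mutually closer than embeddings of different words.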

Related Publications

2024

  1. Emergence of Text Semantics in CLIP Image Encoders
     Sreeram Vennam, Shashwat Singh, Anirudh Govil, and Ponnurangam Kumaraguru
     In UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models (NeurIPS), 2024