TY - CONF
T1 - Leverage Points in Modality Shifts
T2 - 12th Joint Conference on Lexical and Computational Semantics, *SEM 2023, co-located with ACL 2023
AU - Tikhonov, Aleksey
AU - Bylinina, Lisa
AU - Paperno, Denis
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - Multimodal embeddings aim to enrich the semantic information in neural representations of language compared to text-only models. While different embeddings exhibit different applicability and performance on downstream tasks, little is known about the systematic representation differences attributed to the visual modality. Our paper compares word embeddings from three vision-and-language models (CLIP, OpenCLIP and Multilingual CLIP, Radford et al. 2021; Ilharco et al. 2021; Carlsson et al. 2022) and three text-only models, with static (FastText, Bojanowski et al. 2017) as well as contextual representations (multilingual BERT, Devlin et al. 2018; XLM-RoBERTa, Conneau et al. 2019). This is the first large-scale study of the effect of visual grounding on language representations, including 46 semantic parameters. We identify meaning properties and relations that characterize words whose embeddings are most affected by the inclusion of visual modality in the training data; that is, points where visual grounding turns out to be most important. We find that the effect of visual modality correlates most with denotational semantic properties related to concreteness, but is also detected for several specific semantic classes, as well as for valence, a sentiment-related connotational property of linguistic expressions.
UR - http://www.scopus.com/inward/record.url?scp=85175398572&partnerID=8YFLogxK
U2 - 10.18653/v1/2023.starsem-1.2
DO - 10.18653/v1/2023.starsem-1.2
M3 - Conference contribution
AN - SCOPUS:85175398572
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 11
EP - 17
BT - Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)
A2 - Palmer, Alexis
A2 - Camacho-Collados, Jose
PB - Association for Computational Linguistics
Y2 - 13 July 2023 through 14 July 2023
ER -