Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization
Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed
Standard
Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization. / Bassani, Riccardo; Søgaard, Anders; Deoskar, Tejaswini.
Proceedings of the 1st Workshop on Multilingual Representation Learning. Association for Computational Linguistics, 2021. pp. 32–40.
RIS
TY - GEN
T1 - Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization
AU - Bassani, Riccardo
AU - Søgaard, Anders
AU - Deoskar, Tejaswini
PY - 2021
Y1 - 2021
N2 - Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.
AB - Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.
U2 - 10.18653/v1/2021.mrl-1.3
DO - 10.18653/v1/2021.mrl-1.3
M3 - Article in proceedings
SP - 32
EP - 40
BT - Proceedings of the 1st Workshop on Multilingual Representation Learning
PB - Association for Computational Linguistics
T2 - 1st Workshop on Multilingual Representation Learning
Y2 - 11 November 2021 through 11 November 2021
ER -
ID: 300080332