Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses
Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed
Standard
Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses. / Flachs, Simon Hellemann; Lacroix, Ophélie; Yannakoudakis, Helen; Rei, Marek; Søgaard, Anders.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020. pp. 8467–8478.
RIS
TY - GEN
T1 - Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses
T2 - The 2020 Conference on Empirical Methods in Natural Language Processing
AU - Flachs, Simon Hellemann
AU - Lacroix, Ophélie
AU - Yannakoudakis, Helen
AU - Rei, Marek
AU - Søgaard, Anders
PY - 2020
Y1 - 2020
N2 - Evaluation of grammatical error correction (GEC) systems has primarily focused on essays written by non-native learners of English, which however is only part of the full spectrum of GEC applications. We aim to broaden the target domain of GEC and release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website data is a common and important domain that contains far fewer grammatical errors than learner essays, which we show presents a challenge to state-of-the-art GEC systems. We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains. We hope this work shall facilitate the development of open-domain GEC models that generalize to different topics and genres.
U2 - 10.18653/v1/2020.emnlp-main.680
DO - 10.18653/v1/2020.emnlp-main.680
M3 - Article in proceedings
SP - 8467
EP - 8478
BT - Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
PB - Association for Computational Linguistics
Y2 - 16 November 2020 through 20 November 2020
ER -
ID: 258376622