Guideline Bias in Wizard-of-Oz Dialogues
Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed
Standard
Guideline Bias in Wizard-of-Oz Dialogues. / Bach Hansen, Victor Petrén; Søgaard, Anders.
BPPF 2021 - 1st Workshop on Benchmarking: Past, Present and Future, Proceedings. ed. / Kenneth Church; Mark Liberman; Valia Kordoni. Association for Computational Linguistics, 2021. pp. 8-14.
RIS
TY - GEN
T1 - Guideline Bias in Wizard-of-Oz Dialogues
AU - Bach Hansen, Victor Petrén
AU - Søgaard, Anders
N1 - Publisher Copyright: ©2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
AB - NLP models struggle with generalization due to sampling and annotator bias. This paper focuses on a different kind of bias that has received very little attention: guideline bias, i.e., the bias introduced by how our annotator guidelines are formulated. We examine two recently introduced dialogue datasets, CCPE-M and Taskmaster-1, both collected by trained assistants in a Wizard-of-Oz set-up. For CCPE-M, we show how a simple lexical bias for the word "like" in the guidelines biases the data collection. This bias, in effect, leads to poor performance on data without this bias: a preference elicitation architecture based on BERT suffers a 5.3% absolute drop in performance when "like" is replaced with a synonymous phrase, and a 13.2% drop when evaluated on out-of-sample data. For Taskmaster-1, we show how the order in which instructions are presented biases the data collection.
UR - http://www.scopus.com/inward/record.url?scp=85123954959&partnerID=8YFLogxK
U2 - 10.18653/v1/2021.bppf-1.2
DO - 10.18653/v1/2021.bppf-1.2
M3 - Article in proceedings
AN - SCOPUS:85123954959
SP - 8
EP - 14
BT - BPPF 2021 - 1st Workshop on Benchmarking: Past, Present and Future, Proceedings
A2 - Church, Kenneth
A2 - Liberman, Mark
A2 - Kordoni, Valia
PB - Association for Computational Linguistics
T2 - 1st Workshop on Benchmarking: Past, Present and Future, BPPF 2021
Y2 - 5 August 2021 through 6 August 2021
ER -
ID: 291812390
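The abstract describes a lexical perturbation check: replace the guideline-induced cue word "like" with a synonymous phrase and measure how a trained preference model degrades. Below is a minimal Python sketch of that kind of probe, under stated assumptions: the names (perturb, flip_rate, SYNONYM) and the toy biased classifier are illustrative, not the authors' code, and prediction flip rate stands in here for the paper's accuracy-drop metric on a BERT-based model.

# Sketch of a guideline-bias probe: swap the cue word "like" for a
# synonymous phrase and count how often a classifier's prediction changes.
# All names are hypothetical; a real run would pass a trained model.
import re
from typing import Callable, List

SYNONYM = "am fond of"  # assumed synonymous phrase

def perturb(utterance: str) -> str:
    # Replace the cue word "like" only at word boundaries.
    return re.sub(r"\blike\b", SYNONYM, utterance)

def flip_rate(utterances: List[str], classify: Callable[[str], int]) -> float:
    # Fraction of utterances whose predicted label changes under perturbation.
    flips = sum(classify(u) != classify(perturb(u)) for u in utterances)
    return flips / len(utterances)

if __name__ == "__main__":
    # Toy classifier that relies only on the cue word, i.e. a guideline-biased model.
    biased = lambda text: int("like" in text)
    data = ["I like horror movies", "I hate romantic comedies"]
    print(flip_rate(data, biased))  # 0.5: the "like" utterance flips under perturbation

A model that has learned the underlying preference signal rather than the guideline's surface wording should show a flip rate near zero under this substitution.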