Unsupervised Evaluation for Question Answering with Transformers
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Unsupervised Evaluation for Question Answering with Transformers. / Muttenthaler, Lukas; Augenstein, Isabelle; Bjerva, Johannes.
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 2020. p. 83-90.
RIS
TY - GEN
T1 - Unsupervised Evaluation for Question Answering with Transformers
AU - Muttenthaler, Lukas
AU - Augenstein, Isabelle
AU - Bjerva, Johannes
PY - 2020
Y1 - 2020
N2 - It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labelled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model’s answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in semi-automatic development of QA datasets.
U2 - 10.18653/v1/2020.blackboxnlp-1.8
DO - 10.18653/v1/2020.blackboxnlp-1.8
M3 - Article in proceedings
SP - 83
EP - 90
BT - Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
PB - Association for Computational Linguistics
T2 - The 2020 Conference on Empirical Methods in Natural Language Processing
Y2 - 16 November 2020 through 20 November 2020
ER -
ID: 254996871