Unsupervised Evaluation for Question Answering with Transformers

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Unsupervised Evaluation for Question Answering with Transformers. / Muttenthaler, Lukas; Augenstein, Isabelle; Bjerva, Johannes.

Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 2020. p. 83-90.

Harvard

Muttenthaler, L, Augenstein, I & Bjerva, J 2020, Unsupervised Evaluation for Question Answering with Transformers. in Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, pp. 83-90, The 2020 Conference on Empirical Methods in Natural Language Processing, 16/11/2020. https://doi.org/10.18653/v1/2020.blackboxnlp-1.8

APA

Muttenthaler, L., Augenstein, I., & Bjerva, J. (2020). Unsupervised Evaluation for Question Answering with Transformers. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (pp. 83-90). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.blackboxnlp-1.8

Vancouver

Muttenthaler L, Augenstein I, Bjerva J. Unsupervised Evaluation for Question Answering with Transformers. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics. 2020. p. 83-90. https://doi.org/10.18653/v1/2020.blackboxnlp-1.8

Author

Muttenthaler, Lukas; Augenstein, Isabelle; Bjerva, Johannes. / Unsupervised Evaluation for Question Answering with Transformers. Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 2020. pp. 83-90

Bibtex

@inproceedings{0814b8ae81b94c41a9ccb411626a6cfc,

title = "Unsupervised Evaluation for Question Answering with Transformers",

abstract = "It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labelled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model{\textquoteright}s answer is correct with 91.37\% accuracy on SQuAD, and 80.7\% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in semi-automatic development of QA datasets.",

author = "Lukas Muttenthaler and Isabelle Augenstein and Johannes Bjerva",

year = "2020",

doi = "10.18653/v1/2020.blackboxnlp-1.8",

language = "English",

pages = "83--90",

booktitle = "Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP",

publisher = "Association for Computational Linguistics",

note = "The 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020; conference date: 16-11-2020 through 20-11-2020",

url = "http://2020.emnlp.org",

}

RIS

TY  - GEN

T1  - Unsupervised Evaluation for Question Answering with Transformers

AU  - Muttenthaler, Lukas

AU  - Augenstein, Isabelle

AU  - Bjerva, Johannes

PY  - 2020

Y1  - 2020

N2  - It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labelled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model’s answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in semi-automatic development of QA datasets.

AB  - It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labelled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model’s answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in semi-automatic development of QA datasets.

U2  - 10.18653/v1/2020.blackboxnlp-1.8

DO  - 10.18653/v1/2020.blackboxnlp-1.8

M3  - Article in proceedings

SP  - 83

EP  - 90

BT  - Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

PB  - Association for Computational Linguistics

T2  - The 2020 Conference on Empirical Methods in Natural Language Processing

Y2  - 16 November 2020 through 20 November 2020

ER  - 
