Fine-Grained Grounding for Multimodal Speech Recognition
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Fine-Grained Grounding for Multimodal Speech Recognition. / Srinivasan, Tejas; Sanabria, Ramon; Metze, Florian; Elliott, Desmond.
Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 2020. p. 2667-2677.
Bibtex
@inproceedings{srinivasan2020finegrained,
  title     = {Fine-Grained Grounding for Multimodal Speech Recognition},
  author    = {Srinivasan, Tejas and Sanabria, Ramon and Metze, Florian and Elliott, Desmond},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  publisher = {Association for Computational Linguistics},
  pages     = {2667--2677},
  year      = {2020},
  doi       = {10.18653/v1/2020.findings-emnlp.242}
}
RIS
TY - GEN
T1 - Fine-Grained Grounding for Multimodal Speech Recognition
AU - Srinivasan, Tejas
AU - Sanabria, Ramon
AU - Metze, Florian
AU - Elliott, Desmond
PY - 2020
Y1 - 2020
N2 - Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it possible to recover a larger set of words, such as adjectives and verbs. In this paper, we propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals. In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.
AB - Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it possible to recover a larger set of words, such as adjectives and verbs. In this paper, we propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals. In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.
KW - cs.CL
U2 - 10.18653/v1/2020.findings-emnlp.242
DO - 10.18653/v1/2020.findings-emnlp.242
M3 - Article in proceedings
SP - 2667
EP - 2677
BT - Findings of the Association for Computational Linguistics: EMNLP 2020
PB - Association for Computational Linguistics
T2 - Findings of the Association for Computational Linguistics
Y2 - 16 November 2020 through 20 November 2020
ER -
ID: 305182727
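
The abstract above describes replacing a single global image vector with per-region features from automatic object proposals, which the recognizer attends over when predicting each word. The following is a minimal sketch of that kind of proposal-attention fusion, not the authors' implementation: the module name, the feature dimensions (e.g. 2048-d region features of the kind produced by a Faster R-CNN detector), and the dot-product fusion details are all illustrative assumptions.

# Minimal sketch (not the paper's released code): an ASR decoder state
# attends over per-proposal visual features instead of one global vector.
# All dimensions and names here are illustrative assumptions.
import torch
import torch.nn as nn


class ProposalAttentionFusion(nn.Module):
    """Attend over object-proposal features with the decoder state as query."""

    def __init__(self, audio_dim: int = 512, visual_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(audio_dim, hidden_dim)
        self.key_proj = nn.Linear(visual_dim, hidden_dim)
        self.value_proj = nn.Linear(visual_dim, hidden_dim)
        self.out_proj = nn.Linear(audio_dim + hidden_dim, hidden_dim)

    def forward(self, decoder_state: torch.Tensor, proposal_feats: torch.Tensor) -> torch.Tensor:
        # decoder_state:  (batch, audio_dim)           current ASR decoder state
        # proposal_feats: (batch, n_props, visual_dim) one feature per object proposal
        q = self.query_proj(decoder_state).unsqueeze(1)   # (batch, 1, hidden)
        k = self.key_proj(proposal_feats)                 # (batch, n_props, hidden)
        v = self.value_proj(proposal_feats)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5      # (batch, n_props)
        weights = scores.softmax(dim=-1)                  # which proposals ground this step
        visual_ctx = (weights.unsqueeze(-1) * v).sum(1)   # (batch, hidden)
        fused = torch.cat([decoder_state, visual_ctx], dim=-1)
        return self.out_proj(fused)                       # feeds the word classifier


if __name__ == "__main__":
    fusion = ProposalAttentionFusion()
    state = torch.randn(2, 512)        # dummy decoder states
    props = torch.randn(2, 36, 2048)   # e.g. 36 proposals per image
    print(fusion(state, props).shape)  # torch.Size([2, 512])

The attention weights are the point of the design: as the abstract notes, the reported improvements come from the model's ability to localize the correct proposals, so a per-word distribution over regions is what lets adjectives and verbs, not just masked entities, be recovered from the visual context.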