A Transformer-based Parser for Syriac Morphology
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
A Transformer-based Parser for Syriac Morphology. / Naaijer, Martijn; Sikkel, Constantijn; Coeckelbergs, Mathias; Attema, Jisk; Van Peursen, Willem.
Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing RANLP 2023. Varna, Bulgaria, 2023. p. 23-29.
RIS
TY - GEN
T1 - A Transformer-based Parser for Syriac Morphology
AU - Naaijer, Martijn
AU - Sikkel, Constantijn
AU - Coeckelbergs, Mathias
AU - Attema, Jisk
AU - Van Peursen, Willem
PY - 2023
Y1 - 2023
N2 - In this project we train a Transformer-based model from scratch, with the goal of parsing the morphology of Ancient Syriac texts as accurately as possible. Syriac is a low-resource language, and only a relatively small training set was available. Therefore, the training set was expanded by adding Biblical Hebrew data to it. Five different experiments were done: the model was trained on Syriac data only, it was trained with mixed Syriac and (un)vocalized Hebrew data, and it was trained first on (un)vocalized Hebrew data and then trained further on Syriac data. The models trained on Hebrew and Syriac data consistently outperform the models trained on Syriac data only. This shows that the differences between Syriac and Hebrew are small enough that it is worth adding Hebrew data to train the model for parsing Syriac morphology. Training models with data from multiple languages is an important trend in NLP; we show that this works well for relatively small datasets of Syriac and Hebrew.
AB - In this project we train a Transformer-based model from scratch, with the goal of parsing the morphology of Ancient Syriac texts as accurately as possible. Syriac is a low-resource language, and only a relatively small training set was available. Therefore, the training set was expanded by adding Biblical Hebrew data to it. Five different experiments were done: the model was trained on Syriac data only, it was trained with mixed Syriac and (un)vocalized Hebrew data, and it was trained first on (un)vocalized Hebrew data and then trained further on Syriac data. The models trained on Hebrew and Syriac data consistently outperform the models trained on Syriac data only. This shows that the differences between Syriac and Hebrew are small enough that it is worth adding Hebrew data to train the model for parsing Syriac morphology. Training models with data from multiple languages is an important trend in NLP; we show that this works well for relatively small datasets of Syriac and Hebrew.
M3 - Article in proceedings
SN - 978-954-452-087-8
SP - 23
EP - 29
BT - Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing RANLP 2023
CY - Varna, Bulgaria
ER -
ID: 366755414
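
Note: The abstract above describes the training protocol only, not an implementation. The Python sketch below is a purely illustrative, minimal rendering of the pretrain-then-fine-tune variant (train first on Hebrew, then continue training on Syriac), using a small character-level Transformer encoder that assigns one morphological tag per word. It is not the authors' code: the toy word lists, the one-tag-per-word scheme, the model size, and all hyperparameters are placeholders invented here, and the real task (full ETCBC-style morphological parsing) is richer than this.

# Illustrative sketch only; all data and settings are made up.
import torch
import torch.nn as nn

# Toy stand-ins for transliterated Hebrew and Syriac corpora.
hebrew_data = [("dbr", "noun"), ("qtl", "verb"), ("mlk", "noun")]
syriac_data = [("ktb", "verb"), ("mlk'", "noun")]

chars = sorted({c for w, _ in hebrew_data + syriac_data for c in w})
char2idx = {c: i + 1 for i, c in enumerate(chars)}   # index 0 is reserved for padding
tags = sorted({t for _, t in hebrew_data + syriac_data})
tag2idx = {t: i for i, t in enumerate(tags)}

class CharTagger(nn.Module):
    """Tiny Transformer encoder over characters; mean-pools to one tag per word."""
    def __init__(self, vocab_size, n_tags, d_model=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, n_tags)

    def forward(self, x):
        # No padding mask for brevity; a real setup would mask padded positions.
        h = self.enc(self.emb(x))
        return self.out(h.mean(dim=1))

def encode(word, max_len=8):
    ids = [char2idx[c] for c in word][:max_len]
    return ids + [0] * (max_len - len(ids))

def train(model, data, epochs=50):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.tensor([encode(w) for w, _ in data])
    y = torch.tensor([tag2idx[t] for _, t in data])
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

model = CharTagger(len(char2idx) + 1, len(tags))
train(model, hebrew_data)   # stage 1: train on Hebrew data first
train(model, syriac_data)   # stage 2: continue training on Syriac data

The mixed-data experiments from the abstract would correspond to a single call such as train(model, hebrew_data + syriac_data) instead of the two-stage schedule shown above.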