Efficient Subword Models for Low-Resource NLP
DOI: https://doi.org/10.15662/IJRAI.2019.0205002

Keywords: Subword Models, Low-Resource Languages, Byte Pair Encoding (BPE), WordPiece, Named Entity Recognition (NER), Machine Translation (MT), Morphological Complexity, Data Sparsity

Abstract
Efficient subword models have become pivotal in enhancing the performance of Natural Language Processing (NLP) tasks, especially for low-resource languages. These models address challenges such as data sparsity and morphological complexity by segmenting words into smaller, meaningful units. Techniques such as Byte Pair Encoding (BPE) and WordPiece have yielded significant improvements in tasks like Named Entity Recognition (NER) and Machine Translation (MT) for languages with limited annotated data. This paper explores various subword modeling approaches, evaluates their effectiveness in low-resource settings, and discusses their implications for future NLP research and applications.
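To make the segmentation idea concrete, the following is a minimal sketch of the BPE merge-learning loop in Python. It assumes a simple {word: frequency} representation of the corpus; the function names (learn_bpe, merge_pair, get_pair_counts) and the toy corpus frequencies are illustrative choices, not an implementation taken from this paper.

import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol in all words."""
    # Match the pair only at symbol boundaries (whitespace-delimited),
    # so e.g. the pair ('t', 'h') never matches inside the symbol 'at'.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn a list of BPE merges from a {word: frequency} dictionary."""
    # Start from single characters, with an end-of-word marker so that
    # word-final and word-internal symbols stay distinct.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair is merged next
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Hypothetical toy corpus; in a low-resource setting the same loop runs over
# whatever unannotated text is available.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(corpus, 10))

Because the merge list is learned from raw frequencies alone, the procedure needs no annotated data, which is what makes it attractive in the low-resource settings this paper studies: frequent stems and affixes are merged into reusable units, while rare words fall back to smaller subwords instead of an out-of-vocabulary token.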