ADVANCING NATURAL LANGUAGE UNDERSTANDING FOR LOW-RESOURCE LANGUAGES: CURRENT PROGRESS, APPLICATIONS, AND CHALLENGES

Authors

  • Abhi Ram Reddy Salammagari, 247 AI Inc., USA
  • Gaurava Srivastava, Oracle America Inc., USA

Keywords:

Low-resource Languages, Natural Language Understanding (NLU), Transfer Learning, Unsupervised Learning, Cross-lingual Embeddings

Abstract

Natural Language Understanding (NLU) technologies have made significant strides in recent years, but their benefits have not been equally distributed across all languages. Low-resource languages, characterized by limited digital resources and annotated datasets, face unique challenges that hinder the development of effective NLU systems. This article explores the importance of advancing NLU technologies for low-resource languages, highlighting their potential to promote linguistic diversity, enable global communication, and ensure equal access to information and technology. The article discusses the applications of NLU in areas such as automated translation, voice-activated assistants, and educational tools, emphasizing their role in fostering inclusivity and preserving cultural heritage. It also delves into the advancements made in transfer learning, unsupervised learning, and cross-lingual embeddings, which have shown promise in addressing the scarcity of resources and linguistic complexity of low-resource languages. The challenges posed by data scarcity, language diversity, and the need for language-agnostic algorithms are examined, along with innovative approaches to data collection, model training, and evaluation. The article concludes by highlighting the importance of collaboration between researchers, linguists, and community stakeholders in driving progress and innovation in this field, ultimately paving the way for more inclusive and equitable NLU technologies that serve the needs of all language communities.
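Of the techniques surveyed in this article, cross-lingual embeddings are the most compact to illustrate. The sketch below shows the orthogonal (Procrustes) mapping that underlies several of the alignment methods cited in the references (e.g. Xing et al., 2015; Smith et al., 2017; Artetxe et al., 2018). It is a minimal illustration on synthetic vectors, not real word embeddings; in practice the rows of `X` and `Y` would be monolingual word vectors paired through a seed dictionary.

```python
import numpy as np

# Synthetic "embeddings": rows are paired words in a source and a
# target language. Real pipelines would load monolingual vectors
# (e.g. fastText) and pair rows via a small bilingual dictionary.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                        # source-language vectors
R_true = np.linalg.qr(rng.normal(size=(50, 50)))[0]   # hidden orthogonal rotation
Y = X @ R_true                                        # target vectors = rotated source

def procrustes(X, Y):
    """Orthogonal W minimizing ||XW - Y||_F (the Procrustes solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

W = procrustes(X, Y)
# The learned map recovers the hidden rotation, so mapped source
# vectors line up with the target space.
print(np.allclose(X @ W, Y))  # expected: True
```

Constraining the map to be orthogonal preserves distances and dot products in the source space, which is why this family of methods remains usable even when only a few hundred dictionary pairs are available.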

References

T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55-75, 2018.

J. Hirschberg and C. D. Manning, "Advances in natural language processing," Science, vol. 349, no. 6245, pp. 261-266, 2015.

E. Strubell, A. Ganesh, and A. McCallum, "Energy and policy considerations for deep learning in NLP," arXiv preprint arXiv:1906.02243, 2019.

A. Joshi, S. Bhattacharyya, and R. J. Mooney, "Towards more inclusive machine learning: The case of underrepresented languages," arXiv preprint arXiv:2110.04383, 2021.

T. Gebru et al., "Datasheets for datasets," arXiv preprint arXiv:1803.09010, 2018.

M. Mitchell et al., "Model cards for model reporting," in Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019, pp. 220-229.

D. Crystal, Language Death. Cambridge University Press, 2000.

S. Bird, "Decolonising speech and language technology," in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 3504-3519.

A. Anastasopoulos et al., "TICO-19: The translation initiative for COVID-19," arXiv preprint arXiv:2007.01788, 2020.

J. Gu, H. Hassan, J. Devlin, and V. O. Li, "Universal neural machine translation for extremely low resource languages," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 344-354.

M. Artetxe, G. Labaka, and E. Agirre, "Unsupervised neural machine translation," arXiv preprint arXiv:1710.11041, 2017.

G. Lample, A. Conneau, L. Denoyer, and M. Ranzato, "Unsupervised machine translation using monolingual corpora only," arXiv preprint arXiv:1711.00043, 2017.

S. Gopalakrishnan et al., "Topical-Chat: Towards knowledge-grounded open-domain conversations," in Proc. Interspeech 2019, 2019, pp. 1891-1895.

J. Schalkwyk et al., "Scaling up neural transducer models for streaming speech recognition," in Proc. Interspeech 2020, 2020, pp. 3911-3915.

T. Tan et al., "Multi-task learning of multilingual BERT for predicting the language families," in Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 2847-2853.

X. Wan, "A novel document similarity measure based on earth mover's distance," Information Sciences, vol. 177, no. 18, pp. 3718-3730, 2007.

A. M. Ciobanu and L. P. Dinu, "Automatic detection of cognates using orthographic alignment," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014, pp. 99-105.

D. M. Eberhard, G. F. Simons, and C. D. Fennig, Eds., Ethnologue: Languages of the World, 24th ed. Dallas, Texas: SIL International, 2021.

S. Ruder, "Why you should do NLP beyond English," The Gradient, 2020.

J. Kaplan et al., "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020.

E. Grave et al., "Unsupervised hyper-alignment for multilingual word embeddings," in International Conference on Learning Representations, 2019.

S. Ruder, I. Vulić, and A. Søgaard, "A survey of cross-lingual word embedding models," Journal of Artificial Intelligence Research, vol. 65, pp. 569-631, 2019.

M. Artetxe, G. Labaka, and E. Agirre, "A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 789-798.

J.-P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," Transactions of the Association for Computational Linguistics, vol. 4, pp. 357-370, 2016.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171-4186.

A. Conneau et al., "Unsupervised cross-lingual representation learning at scale," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440-8451.

S. Ruder, "Neural transfer learning for natural language processing," Ph.D. dissertation, National University of Ireland, Galway, 2019.

M. Lewis et al., "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871-7880.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111-3119.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.

P. Jansen, "Unsupervised natural language processing for knowledge discovery in unstructured biomedical text," Ph.D. dissertation, The University of Arizona, 2017.

G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato, "Phrase-based & neural unsupervised machine translation," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 5039-5049.

K. Chaudhary, S. Kumar, S. Joshi, and R. Mundotiya, "A survey of unsupervised techniques for learning word representations," in Proceedings of the 2020 International Conference on Computational Linguistics and Intelligent Text Processing, 2020.

M. Faruqui and C. Dyer, "Improving vector space word representations using multilingual correlation," in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 462-471.

T. Mikolov, Q. V. Le, and I. Sutskever, "Exploiting similarities among languages for machine translation," arXiv preprint arXiv:1309.4168, 2013.

A. Joulin, P. Bojanowski, T. Mikolov, H. Jégou, and E. Grave, "Loss in translation: Learning bilingual word mapping with a retrieval criterion," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2979-2984.

Y. Hoshen and L. Wolf, "Non-adversarial unsupervised word translation," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 469-478.

S. L. Smith, D. H. Turban, S. Hamblin, and N. Y. Hammerla, "Offline bilingual word vectors, orthogonal transformations and the inverted softmax," arXiv preprint arXiv:1702.03859, 2017.

M. Artetxe and H. Schwenk, "Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond," Transactions of the Association for Computational Linguistics, vol. 7, pp. 597-610, 2019.

Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.

P. Koehn and R. Knowles, "Six challenges for neural machine translation," in Proceedings of the First Workshop on Neural Machine Translation, 2017, pp. 28-39.

M. Johnson et al., "Google's multilingual neural machine translation system: Enabling zero-shot translation," Transactions of the Association for Computational Linguistics, vol. 5, pp. 339-351, 2017.

T. Q. Nguyen and D. Chiang, "Transfer learning across low-resource, related languages for neural machine translation," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 296-301.

M. Artetxe, G. Labaka, E. Agirre, and K. Cho, "Unsupervised neural machine translation," in Proceedings of the Sixth International Conference on Learning Representations, 2018.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.

R. Sennrich and B. Zhang, "Revisiting low-resource neural machine translation: A case study," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 211-221.

T. Gao et al., "Toward evaluation of NLP systems for low-resource languages," arXiv preprint arXiv:2105.13756, 2021.

S. Ruder, "A survey of cross-lingual techniques for low-resource natural language processing," arXiv preprint arXiv:1806.04620, 2018.

R. Sennrich, B. Haddow, and A. Birch, "Improving neural machine translation models with monolingual data," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 86-96.

B. Settles, "From theories to queries: Active learning in practice," in Active Learning and Experimental Design workshop in conjunction with AISTATS 2010, 2011, pp. 1-18.

J. Mielke et al., "Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP," arXiv preprint arXiv:2112.10508, 2021.

E. L. Ablin, "Polysynthetic languages pose a challenge for natural language processing," Physics Today, 2021.

Z. Lin, X. Pan, M. Wang, X. Qiu, J. Feng, H. Zhou, and L. Li, "Pre-training multilingual neural machine translation by leveraging alignment information," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 2649-2663.

J. R. Finkel, T. Grenager, and C. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 2005, pp. 363-370.

R. Chowdhury et al., "Towards language agnostic universal representations," arXiv preprint arXiv:2107.04028, 2021.

C. Xing, D. Wang, C. Liu, and Y. Lin, "Normalized word embedding and orthogonal transform for bilingual word translation," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 1006-1011.

H. Suresh and J. V. Guttag, "A framework for understanding unintended consequences of machine learning," arXiv preprint arXiv:1901.10002, 2019.

X. Pan, B. Zhang, J. May, J. Nothman, K. Knight, and H. Ji, "Cross-lingual name tagging and linking for 282 languages," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1946-1958.

G. Neto et al., "Assessing the Impact of Contextual Information in Guiding the Annotation Procedure in Crowdsourcing Activities," in Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 2997-3006.

A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.

J. Pfeiffer, I. Vulić, I. Gurevych, and S. Ruder, "MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 7654-7673.

Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, "ERNIE: Enhanced language representation with informative entities," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1441-1451.

Y. Liu et al., "Multilingual denoising pre-training for neural machine translation," Transactions of the Association for Computational Linguistics, vol. 8, pp. 726-742, 2020.

Published

2024-05-31

How to Cite

Abhi Ram Reddy Salammagari, & Gaurava Srivastava. (2024). ADVANCING NATURAL LANGUAGE UNDERSTANDING FOR LOW-RESOURCE LANGUAGES: CURRENT PROGRESS, APPLICATIONS, AND CHALLENGES. INTERNATIONAL JOURNAL OF ADVANCED RESEARCH IN ENGINEERING AND TECHNOLOGY (IJARET), 15(3), 244-255. https://lib-index.com/index.php/IJARET/article/view/IJARET_15_03_021