For Machine Translation (MT) to and from English, Bahasa Indonesia has long been considered a low-resource language, and
therefore applying Neural Machine Translation (NMT), which typically requires a large training dataset, proves problematic. In this
paper, we show otherwise by collecting large, publicly available datasets from the Web, which we split into several domains (news,
religion, general, and conversation) and use to train and benchmark several variants of transformer-based NMT models across the domains. We
show using BLEU that our models perform well across these domains and are comparable with Google Translate. Our datasets (with a
standard split for training, validation, and testing), code, and models are available at https://github.com/gunnxx/indonesian-mt-data.
Keywords: Neural machine translation, parallel corpus, English-Indonesian, Indonesian
More on:
- https://aclanthology.org/2020.bucc-1.6.pdf
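The abstract reports model quality with corpus-level BLEU. As a rough illustration of what that metric computes, here is a minimal, stdlib-only sketch of unsmoothed corpus BLEU (clipped n-gram precisions up to 4-grams, geometric mean, brevity penalty). This is a simplified assumption of the metric's standard form, not the paper's evaluation code; published scores would normally come from a standard tool such as sacrebleu, which also handles tokenization and smoothing.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU (0-100) with a single reference per segment.

    Accumulates clipped n-gram matches over the whole corpus, then takes
    the geometric mean of the n-gram precisions times a brevity penalty.
    """
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # candidate n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngram_counts(h, n), ngram_counts(r, n)
            matches[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:  # any zero precision -> BLEU is 0 without smoothing
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * brevity * math.exp(log_precision)
```

A perfect hypothesis scores 100; any missing 4-gram overlap drives the unsmoothed score to 0, which is why real evaluation tools apply smoothing for short or difficult segments.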