This package includes scripts for training NMT models using MarianNMT and OPUS data. The recipes are implemented as Makefile targets, but the documentation still needs to be improved. The targets also require a specific environment and currently only work well on the CSC HPC cluster in Finland.
The Tatoeba translation challenge subdirectory is also covered by a CC-BY 4.0 license.
Setting up:
git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install
Look into lib/env.mk and adjust any settings that you need in your environment.
For CSC users: adjust lib/env/puhti.mk and lib/env/mahti.mk to match your setup (especially the locations where Marian-NMT and other tools are installed and the CSC project that you are using).
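Before adjusting the environment files, it can help to confirm which command-line tools are already on the PATH. The following is a minimal sketch, not part of the package: `check_tools` is a hypothetical helper, and the tool list passed to it is an assumption based on the tools named in this README rather than an official list of requirements.

```shell
#!/bin/sh
# Sketch: report which command-line tools are missing from PATH.
# check_tools is a hypothetical helper; it is not provided by OPUS-MT-train.

check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Exit status 0 if every tool was found, 1 otherwise.
    return "$missing"
}

# The tool list below is an assumption; extend it to match your setup.
check_tools git make pigz jq || echo "Install the missing tools before continuing."
```

The function only checks that a command exists. It does not verify versions or the install locations configured in lib/env.mk.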
Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):
make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
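The three commands above can be chained so that a later step only runs if the previous one succeeded. This is a minimal sketch under stated assumptions: `run_pipeline` and `DRY_RUN` are hypothetical names introduced here, and the sketch assumes each target runs in the foreground and returns a non-zero exit status on failure, which may not hold if jobs are submitted to a cluster queue.

```shell
#!/bin/sh
# Sketch: run the train, eval and release targets in order for one language set.
# run_pipeline and DRY_RUN are hypothetical names, not part of OPUS-MT-train.

run_pipeline() {
    src="$1"   # space-separated source languages, e.g. "fi et"
    trg="$2"   # space-separated target languages, e.g. "da sv en"
    for target in train eval release; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only print the command that would be executed.
            echo "make SRCLANGS=\"$src\" TRGLANGS=\"$trg\" $target"
        else
            # Stop at the first failing step so later steps are skipped.
            make SRCLANGS="$src" TRGLANGS="$trg" "$target" || return 1
        fi
    done
}

# Dry run: print the commands without starting any training.
DRY_RUN=1
run_pipeline "fi et" "da sv en"
```

Setting `DRY_RUN=0` (or leaving it unset) makes the function call make for real. Running the dry run first is a cheap way to check the language lists before committing compute time.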
More information is available in the documentation linked below.
- Installation and setup
- Details about tasks and recipes
- Information about back-translation
- Information about fine-tuning models
- How to generate pivot-language-based translations
Please cite the following papers if you use OPUS-MT software and models:
@article{tiedemann2023democratizing,
title={Democratizing neural machine translation with {OPUS-MT}},
author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
journal={Language Resources and Evaluation},
number={58},
pages={713--755},
year={2023},
publisher={Springer Nature},
issn={1574-0218},
doi={10.1007/s10579-023-09704-w}
}
@InProceedings{TiedemannThottingal:EAMT2020,
author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
year = {2020},
address = {Lisbon, Portugal}
}
None of this would be possible without all the great open source software including
- GNU/Linux tools
- Marian-NMT
- eflomal
... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...
We would also like to acknowledge the support from the FoTran, MeMAD and ELG projects and the contributors to OPUS, the open collection of parallel corpora.