This repository collects open source parallel aligned corpora between Catalan and several other languages.
We use these corpora to train the Softcatalà neural translation system for the following language pairs:
- English - Catalan
- German - Catalan
- French - Catalan
- Italian - Catalan
- Japanese - Catalan
- Dutch - Catalan
- Portuguese - Catalan
- Spanish - Catalan
- Occitan - Catalan
- Galician - Catalan
- Basque - Catalan
Corpus files with the .xz extension need to be decompressed with xz.
You can do this easily by typing:
make extract-corpus
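If `make` or the `xz` command-line tool is not available, the same decompression can be approximated with Python's standard `lzma` module. This is a minimal sketch, not a description of what the `extract-corpus` target actually does: the helper name `extract_xz` and the choice to write each output file next to its archive are assumptions made for illustration.

```python
import lzma
import shutil
from pathlib import Path
from typing import List


def extract_xz(root: str) -> List[Path]:
    """Decompress every *.xz file found under root, next to its archive.

    Hypothetical helper: the repository's own `make extract-corpus`
    target may behave differently (for example, it may delete archives).
    """
    extracted = []
    for archive in sorted(Path(root).rglob("*.xz")):
        target = archive.with_suffix("")  # drop only the trailing .xz
        # Stream the data so large corpora are never loaded into memory.
        with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(target)
    return extracted
```

Calling `extract_xz(".")` from the repository root would decompress every archive it finds and return the paths of the files it wrote; the original `.xz` files are left in place.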
For backtranslation you may be interested in a monolingual Catalan corpus. You can create a monolingual corpus by typing:
make build-monolingual
This creates a single Catalan file with all unique strings across all language pairs.
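The deduplication behind a monolingual corpus can be sketched in a few lines of Python. This is an illustrative sketch only, assuming that the Catalan side of each language pair is a UTF-8 text file with one segment per line; the function name `build_monolingual` and its arguments are hypothetical and are not taken from the repository's Makefile or scripts.

```python
from typing import Iterable


def build_monolingual(catalan_files: Iterable[str], output: str) -> int:
    """Write each distinct non-empty line from catalan_files to output.

    Lines are compared after stripping surrounding whitespace, and the
    first occurrence of each string keeps its position in the output.
    Returns the number of unique strings written.
    """
    seen = set()
    with open(output, "w", encoding="utf-8") as out:
        for name in catalan_files:
            with open(name, encoding="utf-8") as src:
                for line in src:
                    text = line.strip()
                    if text and text not in seen:
                        seen.add(text)
                        out.write(text + "\n")
    return len(seen)
```

The set keeps every unique string in memory, which is simple but may be too costly for very large corpora; storing a hash of each line instead would trade a small collision risk for lower memory use.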
We strongly recommend the following sources of aligned Catalan parallel corpora:
- https://opus.nlpl.eu/
- https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
On top of these previously available corpora, we have created the following corpora:
- Europarl-catalan
- Tilde-MODEL-catalan
- Open source corpora in several directions using Softcatalà translation tools
See here (In Catalan)
Contact Jordi Mas [email protected]
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
| property | value |
| --- | --- |
| name | Open source aligned text corpus English, German, French, Italian, Japanese, Portuguese, Spanish, Occitan, Galician, Basque, etc. to/from Catalan. |
| description | Open source aligned text corpus for building NLP applications (e.g. machine translation). Existing corpora have been cleaned up and new corpora have been introduced: Europarl Catalan, Tilde Catalan and open source translation memories. |
| sameAs | https://github.com/Softcatala/parallel-catalan-corpus/ |
| url | https://github.com/Softcatala/parallel-catalan-corpus/ |
| creator | Softcatalà |