Description

This repository collects open source parallel aligned corpora between Catalan and several other languages.

Parallel corpus

We use these corpora to train the Softcatalà neural translation system.

Corpus files with the .xz extension need to be decompressed with xz.

You can do this easily by typing:

make extract-corpus
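If make is not available, the same decompression step can be approximated in Python. This is a minimal sketch, not the Makefile's actual recipe: the function name extract_xz_files is hypothetical, and it assumes every file ending in .xz under the given directory should be unpacked next to its archive.

```python
import lzma
import shutil
from pathlib import Path


def extract_xz_files(root):
    """Decompress every *.xz file under root, keeping the compressed originals.

    Hypothetical helper: the directory layout is an assumption, not taken
    from the repository's Makefile.
    """
    extracted = []
    for archive in sorted(Path(root).rglob("*.xz")):
        # Drop only the final suffix: "corpus.txt.xz" becomes "corpus.txt".
        target = archive.with_suffix("")
        with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams, so large corpora fit in memory
        extracted.append(target)
    return extracted
```

Calling extract_xz_files(".") from the repository root would walk all language-pair directories and return the paths of the files it wrote.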

Catalan monolingual corpus

For backtranslation you may be interested in a monolingual Catalan corpus. You can create a monolingual corpus by typing:

make build-monolingual

This creates a single Catalan file with all unique strings across all language pairs.
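The deduplication idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of the build-monolingual target: the function names and the one-sentence-per-line UTF-8 input format are assumptions, and the caller must supply the paths of the Catalan side of each language pair.

```python
from typing import Iterable, Iterator


def unique_lines(paths: Iterable[str]) -> Iterator[str]:
    """Yield each distinct non-empty line once, in first-seen order."""
    seen = set()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                sentence = line.strip()
                if sentence and sentence not in seen:
                    seen.add(sentence)
                    yield sentence


def build_monolingual(paths: Iterable[str], output: str) -> int:
    """Write the unique sentences to output and return how many were written."""
    count = 0
    with open(output, "w", encoding="utf-8") as out:
        for sentence in unique_lines(paths):
            out.write(sentence + "\n")
            count += 1
    return count
```

Keeping every sentence in an in-memory set is simple but can be costly for very large corpora; sorting on disk is a common alternative when memory is tight.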

Sources of the corpus used

We strongly recommend the following sources of aligned Catalan parallel corpora:

https://opus.nlpl.eu/
https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix

On top of these previously available corpora, we have created the following corpora:

Europarl-catalan
Tilde-MODEL-catalan
Open source corpus in several directions using Softcatalà translation tools

Do you want to help?

See here (In Catalan)

Contact

Contact Jordi Mas [email protected]

Metadescription

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property	value
name	`Open source aligned text corpus English, German, French, Italian, Japanese, Portuguese, Spanish, Occitan, Galician, Basque, etc to/from Catalan.`
description	`Open source aligned text corpus for building NLP applications (e.g. machine translation). Existing corpora have been cleaned up and new corpora have been introduced: Europarl Catalan, Tilde Catalan and open source translation memories.`
sameAs	`https://github.com/Softcatala/parallel-catalan-corpus/`
url	`https://github.com/Softcatala/parallel-catalan-corpus/`
creator	`Softcatalà`
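For readers who want the same metadata in machine-readable form, the property/value rows map naturally onto a schema.org Dataset record. The sketch below is an assumption about how one might serialize them as JSON-LD; the helper to_json_ld is hypothetical, treating the creator as an Organization is a guess, and the repository itself only provides the table above.

```python
import json


def to_json_ld(properties: dict) -> str:
    """Serialize property/value rows as a schema.org Dataset JSON-LD string."""
    record = {"@context": "https://schema.org/", "@type": "Dataset"}
    for key, value in properties.items():
        if key == "creator":
            # Assumption: the creator is modelled as an Organization.
            record[key] = {"@type": "Organization", "name": value}
        else:
            record[key] = value
    return json.dumps(record, ensure_ascii=False, indent=2)


print(to_json_ld({
    "url": "https://github.com/Softcatala/parallel-catalan-corpus/",
    "sameAs": "https://github.com/Softcatala/parallel-catalan-corpus/",
    "creator": "Softcatalà",
}))
```

The name and description rows are left out of the example call for brevity; they can be passed in the same dictionary.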

