Open source bilingual Catalan corpus used to train machine learning systems

Softcatala/parallel-catalan-corpus

Description

This repository collects open source, sentence-aligned parallel corpora pairing Catalan with several other languages.

Parallel corpus

We use these corpora to train the Softcatalà neural translation system:

Corpus files with the .xz extension need to be decompressed with xz.

You can do this easily by typing:

make extract-corpus
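If make is not available, the decompression step can be approximated with Python's standard lzma module. This is a minimal sketch, not the repository's own tooling: the function name extract_xz and the assumption that every archive sits somewhere under a single root directory are hypothetical, and the actual make target may behave differently.

```python
import lzma
import shutil
from pathlib import Path


def extract_xz(root: Path) -> list[Path]:
    """Decompress every *.xz file under root, keeping the original archives.

    Hypothetical helper: the real `make extract-corpus` target may differ.
    """
    extracted = []
    for archive in sorted(root.rglob("*.xz")):
        # Drop only the trailing ".xz", so "corpus.ca.xz" becomes "corpus.ca".
        target = archive.with_suffix("")
        with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
            # Stream in chunks so large corpora are not loaded into memory.
            shutil.copyfileobj(src, dst)
        extracted.append(target)
    return extracted
```

Calling extract_xz(Path(".")) from the repository root would write each decompressed file next to its archive and return the list of paths it created.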

Catalan monolingual corpus

For backtranslation you may be interested in a monolingual Catalan corpus. You can create a monolingual corpus by typing:

make build-monolingual

This creates a single Catalan file with all unique strings across all language pairs.
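The deduplication described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the make target: the function name build_monolingual is hypothetical, and it assumes the Catalan side of each language pair is a UTF-8 text file with one sentence per line.

```python
from pathlib import Path
from typing import Iterable


def build_monolingual(catalan_files: Iterable[Path], output: Path) -> int:
    """Merge Catalan files into one file of unique, non-empty lines.

    Hypothetical helper: assumes one sentence per line, UTF-8 encoded.
    Returns the number of unique sentences written.
    """
    seen = set()
    with open(output, "w", encoding="utf-8") as out:
        for path in catalan_files:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    sentence = line.strip()
                    # Keep the first occurrence of each sentence, skip blanks.
                    if sentence and sentence not in seen:
                        seen.add(sentence)
                        out.write(sentence + "\n")
    return len(seen)
```

The set keeps every unique sentence in memory, which is simple but may be too costly for very large corpora; sorting the input on disk and dropping adjacent duplicates is a common alternative in that case.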

Sources of the corpus used

We strongly recommend the following sources of aligned Catalan parallel corpora:
