Fast Neural Machine Translation in C++
Version: v1.12.0 65bf82f 2023-02-21 09:56:29 -0800
The Marian toolkit provides the following tools: marian for training, marian-decoder for translation, marian-scorer for scoring and rescoring, marian-server for serving translations over a web socket, and marian-conv for model conversion.
The amun tool, which offers CPU and GPU translation with specific Marian and Nematus models and used to be a part of Marian, has been moved to its own repository and is available from:
Command-line options
Click on the tool name above for a list of command line options. See options
for previous releases.
Developer API
The developer documentation for Marian is generated using Doxygen and Sphinx. The newest version can be generated locally from the marian-dev/doc/ folder.
Installation
Clone a fresh copy from github:
git clone https://github.com/marian-nmt/marian
The project is a standard CMake out-of-source build, which on Linux can be compiled by executing the following commands:
mkdir marian/build
cd marian/build
cmake ..
make -j4
The complete list of compilation options in the form of CMake flags can be obtained by running cmake -LH -N or cmake -LAH -N from the build directory after running cmake .. first. For details on installation under Windows see the documentation below.
Compilation on Windows
Marian can be built on Windows using CMake or as a Visual Studio project. Both CPU and GPU builds are supported. Read more about this in:
Ubuntu packages
Assuming a fresh Ubuntu LTS installation with CUDA, the following packages need to be installed to compile with all features, including the web server, built-in SentencePiece and TCMalloc support.
# Ubuntu 20.04 + CUDA 10.1 (defaults are gcc 9.3.0, Boost 1.71):
sudo apt-get install git cmake build-essential libboost-system-dev libprotobuf17 protobuf-compiler libprotobuf-dev openssl libssl-dev libgoogle-perftools-dev
# Ubuntu 18.04 + CUDA 9.2 (gcc 7.3.0, Boost 1.65):
sudo apt-get install git cmake build-essential libboost-system-dev libprotobuf10 protobuf-compiler libprotobuf-dev openssl libssl-dev libgoogle-perftools-dev
# Ubuntu 16.04 + CUDA 9.2 (gcc 5.4.0, Boost 1.58):
sudo apt-get install git cmake build-essential libboost-system-dev zlib1g-dev libprotobuf9v5 protobuf-compiler libprotobuf-dev openssl libssl-dev libgoogle-perftools-dev
Refer to the GCC/CUDA compatibility table if you experience compilation issues with different versions of GCC and CUDA.
Static compilation
Marian will be compiled statically if the flag USE_STATIC_LIBS is set:
cd build
cmake .. -DUSE_STATIC_LIBS=on
make -j4
Custom Boost
Download, compile and install Boost:
wget https://dl.bintray.com/boostorg/release/1.67.0/source/boost_1_67_0.tar.gz
tar zxvf boost_1_67_0.tar.gz
cd boost_1_67_0
./bootstrap.sh
./b2 -j16 --prefix=$(pwd) --libdir=$(pwd)/lib64 --layout=system link=static install
If Boost cannot be compiled on your machine because an error like this occurs: boost error “none” is not a known value of feature <optimization>, you may try adding --ignore-site-config to the ./b2 command.
To compile the Marian training framework with your custom Boost installation:
cd /path/to/marian-dev
mkdir build
cd build
cmake .. -DBOOST_ROOT=/path/to/boost_1_67_0
make -j4
Tested on Ubuntu 16.04.3 LTS. Since 1.9.0, Boost is only required if you compile the web server tool, supplying -DCOMPILE_SERVER=on to the CMake command.
Non-default CUDA
Specify the path to your CUDA root directory via CMake:
cd /path/to/marian-dev
mkdir build
cd build
cmake .. -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.1
make -j4
CPU version
The Marian CPU version requires Intel MKL or OpenBLAS. Both are free, but MKL is not open-sourced. Intel MKL is strongly recommended as it is faster. On Ubuntu 16.04 and newer it can be installed from the APT repositories:
wget -qO- 'https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB' | sudo apt-key add -
sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list'
sudo apt-get update
sudo apt-get install intel-mkl-64bit-2020.0-088
For more details see the official instructions. A CPU build needs to be enabled by adding -DCOMPILE_CPU=on to the CMake command:
cd /path/to/marian-dev
mkdir -p build
cd build
cmake .. -DCOMPILE_CPU=on
make -j4
SentencePiece
Compilation with SentencePiece, which is built into Marian v1.6.2+, can be enabled by adding -DUSE_SENTENCEPIECE=on to the CMake command. It requires the Protobuf library; on Ubuntu, you would need to install a couple of packages:
# Ubuntu 20.04 (Focal Fossa):
sudo apt-get install libprotobuf17 protobuf-compiler libprotobuf-dev
# Ubuntu 18.04 (Bionic Beaver):
sudo apt-get install libprotobuf10 protobuf-compiler libprotobuf-dev
# Ubuntu 16.04 LTS (Xenial Xerus):
sudo apt-get install libprotobuf9v5 protobuf-compiler libprotobuf-dev
# Ubuntu 14.04 LTS (Trusty Tahr):
sudo apt-get install libprotobuf8 protobuf-compiler libprotobuf-dev
You may also compile Protobuf from source. For Ubuntu 16.04 LTS, version 2.6.1 (and possibly newer) works:
wget https://github.com/protocolbuffers/protobuf/releases/download/v2.6.1/protobuf-cpp-2.6.1.zip
unzip protobuf-cpp-2.6.1.zip
cd protobuf-2.6.1
./autogen.sh
./configure --prefix $(pwd)
make -j4
make install
Then set the following CMake flags in the Marian compilation:
mkdir build
cd build
cmake .. -DUSE_SENTENCEPIECE=on \
-DPROTOBUF_LIBRARY=/path/to/protobuf-2.6.1/lib/libprotobuf.so \
-DPROTOBUF_INCLUDE_DIR=/path/to/protobuf-2.6.1/include \
-DPROTOBUF_PROTOC_EXECUTABLE=/path/to/protobuf-2.6.1/bin/protoc
For more details see the documentation in the SentencePiece repo: https://github.com/marian-nmt/sentencepiece#c-from-source
Training
For training NMT models, you want to use the marian command. Assuming corpus.en and corpus.ro are corresponding and preprocessed files of an English-Romanian parallel corpus, the following command will create a Nematus-compatible neural machine translation model:
./build/marian \
--train-sets corpus.en corpus.ro \
--vocabs vocab.en vocab.ro \
--model model.npz
Command options can also be specified in a configuration file in YAML format:
# config.yml
train-sets:
- corpus.en
- corpus.ro
vocabs:
- vocab.en
- vocab.ro
model: model.npz
which simplifies the command to:
./build/marian -c config.yml
Command-line options overwrite options stored in the configuration file.
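Because command-line options take precedence, a base config.yml can be kept unchanged while individual values are overridden per run. A minimal illustration (the alternative model path is a placeholder):
./build/marian -c config.yml --model model2.npz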
Model types
s2s: An RNN-based encoder-decoder model with attention mechanism. The architecture is equivalent to the DL4MT or Nematus models (Sennrich et al., 2017).
transformer: A model originally proposed by Google (Vaswani et al., 2017) based solely on attention mechanisms.
multi-s2s: As s2s, but uses two or more encoders, allowing multi-source neural machine translation.
multi-transformer: As transformer, but uses multiple encoders.
amun: A model equivalent to Nematus models unless layer normalization is used. Can be decoded with Amun as the nematus model type.
nematus: A model type developed for decoding deep RNN-based encoder-decoder models created by the Edinburgh MT group for WMT 2017 using the Nematus toolkit. Can be decoded with Amun as the nematus2 model type.
lm: An RNN language model.
lm-transformer: A transformer-based language model.
Multi-GPU training
For multi-GPU training you only need to specify the device ids of the GPUs you want to use for training (this also works with most other binaries), e.g. --devices 0 1 2 3 for training on four GPUs. There is no automatic detection of GPUs for now.
By default, this will use asynchronous SGD (or rather ADAM). For the deeper models and the transformer model, we found async SGD to be unreliable, and you may want to use a synchronous SGD variant by setting --sync-sgd.
For asynchronous SGD, the mini-batch size is used locally, i.e. --mini-batch 64 means 64 sentences per GPU worker. For synchronous SGD, the mini-batch size is used globally and will be divided across the number of workers. This means that for synchronous SGD the effective mini-batch can be set N times larger for N GPUs: a mini-batch size of --mini-batch 256 will mean a mini-batch of 64 per worker if four GPUs are used. This choice makes sense when you realize that synchronous SGD essentially works like a single-GPU training process with N times more memory. Larger mini-batches in a synchronous setting result in quite stable training.
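As an illustration (not a command from the original text, but composed only of the options discussed above), a synchronous multi-GPU run could look like this:
# four GPUs, synchronous SGD, global mini-batch of 256 (64 per worker)
./build/marian -c config.yml --devices 0 1 2 3 --sync-sgd --mini-batch 256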
Workspace memory
The choice of workspace memory, mini-batch size and max-length is quite involved and depends on the type of model, the available GPU memory, the number of GPUs, a number of other parameters like the chosen optimization algorithm, and the average or maximum sentence length in your training corpus (which you should know!).
The option --workspace sets the size of the memory available for the forward and backward steps of the training procedure. This does not include the model size and optimizer parameters, which are allocated outside the workspace; hence you cannot allocate all GPU memory to the workspace. If you are not happy with the default values, this is a trial-and-error process.
Setting --mini-batch 64 --max-length 100
will generate batches that always contain 64 sentences (or fewer if the corpus is smaller) of up to 100 tokens in length. Sentences longer than that are filtered out. Marian will grow workspace memory if required and potentially exceed available memory, resulting in a crash. Workspace memory is always rounded to multiples of 512 MB.
--mini-batch-fit
overrides the specified mini-batch size and automatically
chooses the largest mini-batch for a given sentence length that fits the
specified memory. When --mini-batch-fit
is set, memory requirements are
guaranteed to fit into the specified workspace. Choosing a too small workspace
will result in small mini-batches which can prohibit learning.
My rules of thumb
For shallow models I usually set the working memory to values between 3500 and 6000 (MB), e.g. --workspace 5500
and then use --mini-batch-fit
which
automatically tries to make the best use of the specified memory size,
mini-batch size and sentence length.
For very deep models, I first set all other parameters like --max-length 100, model type, depth etc. Next I use --mini-batch-fit
and try to max out
--workspace
until I get a crash due to insufficient memory. I then revert to
the last workspace size that did not crash. Since setting --mini-batch-fit
guarantees that memory will not grow during training due to batch size, this should result in a stable training run and a maximal batch size.
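Putting these rules of thumb together, a sketch for a shallow model (the values are illustrative, taken from the ranges mentioned above):
# let Marian pick the largest mini-batch that fits 5500 MB of workspace
./build/marian -c config.yml --workspace 5500 --mini-batch-fit --max-length 100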
Validation
It is useful to monitor the performance of your model during training on held-out data. Just provide --valid-sets valid.src valid.trg for that. By default this provides sentence-wise normalized cross-entropy scores for the validation set every 10,000 iterations. You can change the validation frequency to, say, 5000 with --valid-freq 5000 and the display frequency to 500 with --disp-freq 500.
Attention: the validation set needs to have been preprocessed in exactly the same manner as your training data.
A minimum example of how to validate the model using cross-entropy and BLEU score:
./build/marian \
--train-sets corpus.en corpus.ro \
--vocabs vocab.en vocab.ro \
--model model.npz \
--valid-sets dev.en dev.ro \
--valid-metrics cross-entropy translation \
--valid-script-path validate.sh
where validate.sh
is a bash script, which takes the file with output
translation of dev.en
as the first argument (i.e. $1
) and returns the BLEU
score, for example:
# validate.sh
./postprocess.sh < $1 > file.out 2>/dev/null
./moses-scripts/scripts/generic/multi-bleu-detok.perl file.ref < file.out 2>/dev/null \
| sed -r 's/BLEU = ([0-9.]+),.*/\1/'
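Note that the validation script needs to be executable, e.g. (standard shell usage, not specific to Marian):
chmod +x validate.sh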
Metrics
cross-entropy - computes the sentence-wise normalized cross-entropy score.
ce-mean-words - computes the mean word cross-entropy score.
valid-script - executes the script specified with --valid-script-path. The script is expected to return a score as a floating-point number.
translation - executes the script specified with --valid-script-path, passing the name of the file with the translation of the source validation set as the first argument (e.g. $1 in a Bash script, sys.argv[1] in Python, etc.). The script is expected to return a score as a floating-point number.
bleu - computes the BLEU score on raw validation sets. Those are usually tokenized and BPE-segmented, so the score is overestimated and should never be used to report your BLEU scores in a research paper.
bleu-detok - computes the BLEU score on postprocessed validation sets. Requires SentencePiece and Marian v1.6.2+.
Early stopping
Early stopping is a common technique for deciding when to stop training the model based on a heuristic involving a validation set. By default we use early stopping with a patience of 10, i.e. --early-stopping 10. This means that training will finish if the first specified metric in --valid-metrics did not improve (stalled) for 10 consecutive validation steps. Usually this will signal convergence or, if the scores get worse with later validation steps, potential overfitting.
If using multiple metrics in validation, the stopping condition can be applied to any or all of these metrics. This is achieved using the flag --early-stopping-on. The default considers only the first listed metric.
Regularization
Marian has several regularization techniques implemented that help to prevent model overfitting, such as dropouts (Gal and Ghahramani, 2016), label smoothing (Vaswani et al., 2017), and exponential smoothing for network parameters.
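As a sketch of how such techniques combine in one training command (the flags --label-smoothing and --exponential-smoothing are assumed here; only the techniques themselves are named above, and the value is arbitrary):
# assumed flags; 0.1 is an arbitrary example value
./build/marian -c config.yml --label-smoothing 0.1 --exponential-smoothing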
Dropouts
Depending on the model type, Marian supports multiple types of dropout. For RNN-based models it supports the --dropout-rnn 0.2 option (the numeric value of 0.2 is only provided as an example), which uses variational dropout on all RNN inputs and recurrent states.
The options --dropout-src and --dropout-trg set the probability to drop out entire source or target word positions, respectively. These dropouts are useful for monolingual tasks.
For the transformer model the equivalent of --dropout-rnn 0.2 is --transformer-dropout 0.2. There are also two other dropouts for transformer attention and the transformer filter.
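For example, a hypothetical transformer training command using the dropout value from the example above:
./build/marian -c config.yml --type transformer --transformer-dropout 0.2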
Learning rate scheduling
Manipulation of the learning rate during training may result in better convergence and higher-quality translations. Marian supports various strategies for decaying the learning rate (the --lr-decay-strategy option). The decay factor can be specified with --lr-decay.
epoch: learning rate will be decayed after each epoch, starting from the epoch specified with --lr-decay-start
batches: learning rate will be decayed every --lr-decay-freq batches, starting after the batch specified with --lr-decay-start
stalled: learning rate will be decayed every time the first validation metric does not improve for --lr-decay-start consecutive validation steps
epoch+stalled: learning rate will be decayed after the specified number of epochs or stalled validation steps, whichever comes first. The option --lr-decay-start takes two numbers: for epochs and stalled validation steps, respectively
batches+stalled: as epoch+stalled, but the total number of batches is taken into account instead of epochs
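To make this concrete, a sketch using the options above (the values are arbitrary) that decays the learning rate by a factor of 0.9 whenever the first validation metric stalls:
# decay on stalled validation; one stalled validation step triggers the decay
./build/marian -c config.yml --lr-decay 0.9 --lr-decay-strategy stalled --lr-decay-start 1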
Other learning rate schedules supported by Marian:
--lr-warmup: learning rate will be increased linearly for the specified number of first updates. The start value for the learning rate warmup can be specified with --lr-warmup-start-rate.
--lr-decay-inv-sqrt: learning rate will be decreased at n / sqrt(no. updates) starting at the n-th update.
Data weighting
Data weighting is commonly used as a domain adaptation technique, which weights each data item according to its proximity to the in-domain data. Marian supports sentence-level and word-level data weighting strategies.
Data weighting requires providing a file with weights. In the sentence weighting strategy, each line of that file contains a real-valued weight:
./build/marian \
-t corpus.{en,de} -v vocab.{en,de} -m model.npz \
--data-weighting-type sentence --data-weighting weights.txt
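For reference, a weights.txt for the sentence-level strategy could look like this (hypothetical values; one weight per line of the training corpus):
1.0
0.5
2.5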
To use word weighting you should choose --data-weighting-type word, and each line of the weight file should contain as many real-valued weights as there are words in the corresponding target training sentence.
Tied embeddings
The tying of embedding matrices can help to reduce model size and memory footprint during training. Tying target embeddings and the last layer of the output does not decrease quality and helps save a significant number of parameters. Tying all embedding layers and output layers is a common practice for translation models between languages using the same scripts. Related options:
--tied-embeddings - tie target embeddings and output embeddings in the output layer,
--tied-embeddings-src - tie source and target embeddings,
--tied-embeddings-all - tie all embedding layers and the output layer.
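For instance (an illustrative sketch), a model trained with all embedding layers and the output layer tied:
./build/marian -c config.yml --tied-embeddings-all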
Custom embeddings
Marian can handle custom embedding vectors trained with word2vec or another tool:
./build/marian \
-t corpus.{en,de} -v vocab.{en,de} -m model.npz \
--embedding-vectors vectors.{en,de} --dim-emb 400
Embedding vectors should be provided in a file in a format similar to the word2vec format, with word tokens replaced with word IDs from the relevant vocabulary. Pre-trained vectors need to share the same vocabulary as your training data, and ideally should contain vectors for the <unk> and </s> tokens. The easiest way to achieve this is to prepare the training data for word2vec w.r.t. your vocabularies using marian-dev/scripts/embeddings/prepare_corpus.py. Vectors can be prepared or trained w.r.t. the vocabulary using marian-dev/scripts/embeddings/process_word2vec.py.
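For reference, a sketch of such an embedding file (hypothetical values; following the word2vec text format, the header line gives the vocabulary size and embedding dimension, and each row starts with a word ID from the Marian vocabulary):
2 4
0 0.1 -0.2 0.3 0.05
1 0.0 0.4 -0.1 0.2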
Other options for managing embedding vectors:
--embedding-fix-src - fixes source embeddings in all encoders,
--embedding-fix-trg - fixes target embeddings in all decoders,
--embedding-normalization - normalizes vector values into the [-1,1] range.
Fine-tuning
A common domain adaptation technique is continued training via fine-tuning of an existing model on new training data. You can start continued training by copying your model to a new folder and setting the
--model
option to point to that model. This will reload the model
from the path and also overwrite it during the next checkpoint saving. Note
that this overrides the model parameters with the model parameters from the
file, so the architectures cannot be changed between continued trainings.
This method also works well for normal continued training. You can interrupt your running training, change the training corpus, and run the same command you used before to resume the training. In the case where the training files change, the option --no-restore-corpus
should be added to not restore the
corpus positions. If your validation data change, consider adding
--valid-reset-stalled
to reset validation counters. You can also change other
training parameters like learning rate or early stopping criteria. If the new
training corpus is much smaller, it is usually recommended to decrease the
learning rate and validate the model more frequently. See also model pre-training.
Model pre-training
A transfer learning technique related to fine-tuning is initializing model weights from a pre-trained model. Marian provides the
--pretrained-model
model.npz
option that will load weight matrices from the pre-trained model
whose names match corresponding parameters in the model’s architecture.
Matrices that are not present in the pre-trained model are initialized randomly
by default. For instance, you can initialize the decoder of an encoder-decoder translation model with a pre-trained language model, or deep models with shallow models.
Right-to-left models
Marian provides an option for training on reversed input sequences via
--right-left
. Combining traditional left-to-right models and right-to-left
models may lead to an improved performance for some tasks. One such approach
would be to perform sequential decoding. However, combining left-to-right and
right-to-left models together in an ensemble is not possible.
Guided alignment
Training with guided alignment may improve alignments produced by RNN models (
--type amun
or s2s
) and is mandatory to obtain useful word alignments from
Transformers (--type transformer
). Guided alignment training requires
providing a file with pre-calculated word alignments for the entire training
corpus, for example:
./build/marian \
-t corpus.{en,de} -v vocab.{en,de} -m model.npz \
--guided-alignment corpus.align
The file corpus.align from the example can be generated using the fast_align word aligner (please refer to their repository for installation instructions):
paste corpus.en corpus.de | sed 's/\t/ ||| /g' > corpus.en-de
fast_align/build/fast_align -vdo -i corpus.en-de > forward.align
fast_align/build/fast_align -vdor -i corpus.en-de > reverse.align
fast_align/build/atools -c grow-diag-final -i forward.align -j reverse.align > corpus.align
or with an RNN model and marian-scorer, for example:
./build/marian-scorer -m model.npz -v vocab.{en,de} -t corpus.en corpus.de --alignment > corpus.align
Marian has a few more options related to guided alignment training:
--guided-alignment-cost - cost type for guided alignment,
--guided-alignment-weight - weight for the guided alignment cost,
--transformer-guided-alignment-layer - number of the layer to use for guided alignment training; only for training transformer models.
Pre-defined configurations
Marian provides the
--task
option, which is a handy shortcut for setting
model architecture and training options for common NMT model configurations.
The list of predefined configurations includes:
best-deep - the RNN BiDeep architecture proposed by Miceli Barone et al. (2017),
transformer-base and transformer-big - architectures and proposed training settings for a Transformer “base” model and a Transformer “big” model, respectively, both introduced in Vaswani et al. (2017),
transformer-base-prenorm and transformer-big-prenorm - variants of the two Transformer models with “prenorm”, i.e. the layer normalization is performed as the first block-wise preprocessing step.
Options that are automatically set via --task <arg>
can be overwritten by
separately specifying those options in the command line. For example, --task
transformer-base --dim-emb 1024
will train a transformer “base” but with the
embedding size of 1024 instead of 512.
Factored models
Marian supports training models with source and/or target side factors. To train a factored model, the training data needs to be in a specific format, and a special vocabulary is required. More information on using Marian with factors can be found in the documentation on factored models.
Mixed precision training
Marian supports mixed precision training, available on NVIDIA Volta and newer architectures. The option --fp16 provides a shortcut with default settings for mixed precision training with float16 and cost-scaling. Other options related to mixed precision training:
--precision - defines types for the forward/backward pass and optimization,
--cost-scaling - option values for dynamic cost scaling,
--gradient-norm-average - window size over which the exponential average of the gradient norm is recorded,
--dynamic-gradient-scaling - re-scale the gradient to the average gradient norm if the (log) gradient norm diverges from the average by the given number of sigmas,
--check-gradient-nan - skip the parameter update in case of NaNs in the gradient.
Training from stdin
Parallel training data can be provided to Marian in a tab-separated file, where commonly the first field corresponds to the source side and the second field corresponds to the target side of the parallel corpus. For example, instead of providing two files to --train-sets:
./build/marian -c config.yml -t file.src file.trg
a single file can be specified with the --tsv option:
./build/marian -c config.yml --tsv -t file.src-trg
The example can be further extended to train from a corpus provided directly on the standard input:
paste file.src file.trg | ./build/marian -c config.yml -t stdin --no-shuffle
This might be useful when using a custom tool for training data preparation. Note that the user takes responsibility for randomizing the input data; this is why --no-shuffle is added to the training command (alternatively, --shuffle batches can be used).
Logical epochs
The notion of an epoch is less clear when providing the training data on stdin, as the corpus cannot be easily rewound and shuffled by Marian. Thus, it is possible to define a logical epoch in terms of the number of updates or labels. For example,
--logical-epoch 1Gt
will re-define the epoch as 1 billion
target tokens, instead of the traditional one pass over the training data. This
is especially useful if the data can be provided in an infinite stream into
stdin.
Guided alignment and data weighting
Training with guided alignment and data weighting is supported when providing the corpus on stdin. Simply add new fields to the input TSV file and specify the indices of the fields with word alignments or weights. For example:
cat file.src-trg-aln-w | ./build/marian -t stdin --guided-alignment 2 --data-weighting 3
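For reference (hypothetical content), one line of such a TSV file, with source, target, word alignments and weights at field indices 0-3 as above, can be produced with:
echo -e "das ist ein Test\tthis is a test\t0-0 1-1 2-2 3-3\t1.0" >> file.src-trg-aln-w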
Translation
All models trained with marian can be decoded with the marian-decoder and marian-server commands. Only models of type amun and specific deep models of type nematus can be used with the amun tool.
Marian decoder
marian-decoder
supports translation on GPUs and CPUs. By default it
translates on the first available GPU, which can be changed with the
--devices
option. Basic usage:
./build/marian-decoder -m model.npz -v vocab.en vocab.ro --devices 0 1 < input.txt
Decoding on CPU(s) is performed if --cpu-threads N is added:
./build/marian-decoder -m model.npz -v vocab.en vocab.ro --cpu-threads 1 < input.txt
N-best lists
To generate an n-best list with, say, the 10 best translations for each input sentence, add --n-best and --beam-size 10 to the list of command-line arguments:
./build/marian-decoder -m model.npz -v vocab.en vocab.ro --beam-size 10 --n-best < input.txt
Ensembles
Models of different types and architectures can be ensembled as long as they use common vocabularies:
./build/marian-decoder \
--models model1.npz model2.npz model3.npz \
--weights 0.6 0.2 0.2 \
--vocabs vocab.en vocab.ro < input.txt
Weights are optional and set to 1.0 by default if omitted.
Batched translation
Batched translation generates translations for whole mini-batches and significantly increases translation speed (roughly by a factor of 10 or more). We recommend using the following options to enable batched translation:
./marian-decoder -m model.npz -v vocab.src.yml vocab.trg.yml -b 6 --normalize 0.6 \
--mini-batch 64 --maxi-batch-sort src --maxi-batch 100 -w 2500
This does a number of things: it sets the beam size to 6 (-b 6) with length normalization (--normalize 0.6), translates 64 sentences at a time (--mini-batch 64), pre-loads 100 mini-batches and sorts them by source length (--maxi-batch 100 --maxi-batch-sort src), and reserves 2500 MB of workspace memory (-w 2500).
To give you an idea how much faster batched translation is compared to sentence-by-sentence translation, we have collected a few numbers. Below we have compiled the time it takes to translate the English-German WMT2013 test set with 3000 sentences using 4 Volta GPUs on AWS.
System | Single | Batched
---|---|---
Nematus-style Shallow RNN | 82.7s | 4.3s
Nematus-style Deep RNN | 148.5s | 5.9s
Google Transformer | 201.9s | 19.2s
Word alignments
marian-decoder
and marian-scorer
can produce attention output or word
alignments when the --alignment
option is used with one of the following
values:
soft
: Alignment weights for all words including EOS tokens. Sets of source
token weights for target tokens are separated by whitespace, and source token
weights are separated by commas.
echo "now everyone knows" | ./marian-decoder -c config.yml --alignment soft
jetzt weiß jeder ||| 0.917065,0.0218936,0.0405725,0.0204688 0.00803049,0.0954254,0.853882,0.0426626 \
0.0294334,0.794184,0.00511072,0.171272 0.00743875,0.0147502,0.201069,0.776743
hard
or empty: Word alignments for each target token in the form of Moses
alignments, i.e. pairs of source and target tokens.
echo "now everyone knows" | ./marian-decoder -c config.yml --alignment
jetzt weiß jeder ||| 0-0 1-2 2-1 3-3
echo "now everyone knows" | ./marian-decoder -c config.yml --alignment 0.1
jetzt weiß jeder ||| 0-0 1-2 2-1 2-3 3-2 3-3
The transformer has basically 6x8 different alignment matrices, and in theory
none of these has to be very useful for word alignment purposes. We recommend
training the model with guided alignment first (--guided-alignment
) so that the
model can learn word alignments in one of its heads.
Lexical shortlists
With a lexical shortlist the output vocabulary is restricted to a small subset
of translation candidates, which can improve CPU-bound efficiency. A shortlist
file, say lex.s2t, can be passed to the decoder using the --shortlist
option, for example:
./build/marian-decoder -m model.npz -v vocab.en vocab.de \
--shortlist lex.s2t 100 75 < input.txt
The second and third arguments are optional, and mean that the output vocabulary will be restricted to the 100 most frequent target words and the 75 most probable translations for every source word in a batch.
Lexical shortlist files can be generated with marian-dev/scripts/shortlist/generate_shortlists.pl
, for example:
perl generate_shortlists.pl --bindir /path/to/bin -s corpus.en -t corpus.de
where corpus.en and corpus.de are preprocessed training data, and the bin
directory contains fast_align
and atools
from
fast_align and extract_lex
from
extract-lex.
Word-level scores
In addition to sentence-level scores, Marian can also output word-level scores.
The option --word-scores
prints one score per subword unit, for example:
echo "This is a test." | ./build/marian-decoder -c config.yml --word-scores
Tohle je test. ||| WordScores= -1.51923 -0.21951 -1.48668 -0.24813 -0.22176
Note that if you use the built-in SentencePiece subword segmentation, the
number of scores will not match the output tokens. Also, word scores are not
normalized even if --normalize
is used. You may want to normalize and map the
word scores into output tokens as a custom post-processing step. Adding
--no-spm-decode
or --alignment
will deliver all information that is needed
to do that:
echo "This is a test." | ./build/marian-decoder -c config.yml --word-scores --no-spm-decode --alignment
▁Tohle ▁je ▁test . </s> ||| 1-0 5-1 5-2 5-3 5-4 ||| WordScores= -1.51923 -0.21951 -1.48668 -0.24813 -0.22176
The option --word-scores
is also available in marian-scorer
.
Output sampling
The --output-sampling option in Marian allows one to noise the output layer with Gumbel noise, which can be used for generating noisy back-translations.
./build/marian-decoder -b 1 -i input.src --output-sampling
By default the sampling is from the full model distribution. Top-k sampling can
be achieved by providing topk N
as arguments, for example:
./build/marian-decoder -b 1 -i input.src --output-sampling topk 10
Note that output sampling and beam search are generally contradictory methods
and using them together is not recommended, so we advise setting --beam-size 1
when sampling.
Binary models
Marian has support for models in a custom binary format. This format supports
mmap loading as well as both normal and packed memory layouts. Binary models
offer decreased load times compared to .npz
, and are identifiable by their
.bin
extension.
The marian-conv
command is able to convert to and from npz
and bin
models. The memory layout of the binary model is influenced by the
--gemm-type
flag; by default this is retained as float32.
To generate a binary model from an npz
model:
./marian-conv --from model.npz --to model.bin
The basic usage is as simple as replacing model.npz
with model.bin
in your
command arguments. When decoding on CPU, it is possible to enable mmap loading
with the flag --model-mmap
.
Lexical shortlists also have a binary format. From a shortlist lex.s2t
the
binary version can be generated by
./marian-conv --shortlist lex.s2t 50 50 0 \
--dump lex.bin \
--vocabs vocab.l1.spm vocab.l2.spm
The --shortlist
argument points to the lexical shortlist file, and specifies
the first (50), best (50) and prune (0) options for the shortlist. Note that
these options are hardcoded into the binary shortlist at conversion! The
--dump
option gives the location for the binary shortlist and --vocabs
specifies the vocabulary files for the source (l1) and target (l2) languages.
To use the binary shortlist the --shortlist lex.s2t 50 50 0
argument in your
command should be replaced with
--shortlist lex.bin false
which provides the path to the binary shortlist lex.bin
, and the second
option false
(optional, true by default) specifies whether the contents
should be verified.
Web server
The marian-server
command starts a web-socket server providing CPU and GPU
translation service that can be requested by a client program written in Python
or any other programming language. The server uses the same command-line
options as marian-decoder
. The only addition is --port
option, which
specifies the port number:
./build/marian-server --port 8080 -m model.npz -v vocab.en vocab.ro
An example client written in Python is marian-dev/scripts/server/client_example.py
:
./scripts/server/client_example.py -p 8080 < input.txt
Note that marian-server
is not compiled by default. It requires Boost and adding
-DCOMPILE_SERVER=on
to the CMake compilation command.
Decoding Nematus models
Only specific types of models trained with Nematus, for example the Edinburgh WMT17
deep models can be decoded with
marian-decoder
. As such models do not include Marian-specific parameters,
all parameters related to the model architecture have to be set with
command-line options.
For example, for the de-en model this would be:
./build/marian-decoder \
--type nematus \
--models model/en-de/model.npz \
--vocabs model/en-de/vocab.de.json \
--dim-vocabs 51100 74383 \
--enc-depth 1 \
--enc-cell-depth 4 \
--enc-type bidirectional \
--dec-depth 1 \
--dec-cell-base-depth 8 \
--dec-cell-high-depth 1 \
--dec-cell gru-nematus --enc-cell gru-nematus \
--tied-embeddings true \
--layer-normalization true
Alternatively, the parameters can be added into the model .npz file based on
the Nematus .json file using the script: marian-dev/scripts/contrib/inject_model_params.py
, e.g.:
python inject_model_params.py -m model.npz -j model.npz.json
Some models released by Edinburgh might require setting other parameters as
well, for instance --dim-emb 500
.
We do not recommend training models of type nematus
with Marian. It is much
more efficient to train s2s
models, which provide the same model architecture
(except layer normalization), more features, and faster training.
Scoring
The marian-scorer
tool is used for scoring (or re-scoring) parallel sentences
provided as plain texts in two corresponding files:
./build/marian-scorer -m model.npz -v vocab.{en,de} -t file.en file.de
This will print log probabilities for each sentence pair.
N-best lists can be scored using the following command:
./build/marian-scorer -m model.npz -v vocab.{en,de} \
-t file.en.txt file.de.nbest --n-best --n-best-feature F0
which adds a new score into the n-best list under the feature named F0.
The scorer can be used as a word aligner that generates word alignments for a pair of sentences:
./build/marian-scorer -m model.npz -v vocab.{en,de} \
-t file.en.txt file.de.txt --alignment
The feature works out-of-the-box for RNN models, while Transformer models need to be trained with guided alignments (see this section).
The scorer can report a summarized score (cross-entropy or perplexity) for an entire test set with the option --summary.
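For example (an illustrative sketch; we assume here that --summary accepts a metric name such as perplexity):
# assumed usage of --summary with an explicit metric name
./build/marian-scorer -m model.npz -v vocab.{en,de} -t file.en file.de --summary perplexity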