Training embeddings
finalfrontier
finalfrontier is a program for training word embeddings. It currently has the following features:
- Noise contrastive estimation (Gutmann and Hyvärinen, 2012)
- Subword representations (Bojanowski et al., 2016; see the note after this list)
- Hogwild SGD (Recht et al., 2011)
- Models:
  - skip-gram (Mikolov et al., 2013)
  - structured skip-gram (Ling et al., 2015)
  - directional skip-gram (Song et al., 2018)
  - dependency (Levy and Goldberg, 2014)
- Output formats:
  - finalfusion
  - fastText
  - word2vec binary
  - word2vec text
  - GloVe
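A note on subword representations (Bojanowski et al., 2016): the embedding of a word is computed as the sum of the embeddings of its character n-grams (the word itself is included as a special n-gram), i.e. v(w) = Σ_{g ∈ G(w)} z(g), where G(w) is the set of n-grams of w. A useful consequence is that embeddings can be computed even for words that did not occur in the training corpus.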
Getting finalfrontier
We provide precompiled finalfrontier releases for Linux. If you use another platform, follow the build instructions.
Quickstart
Skip-gram models
Train a structured skip-gram model with 300-dimensional word embeddings for 10 epochs, discarding words that occur fewer than 10 times and using 16 threads:
$ finalfrontier skipgram --dims 300 --model structgram --epochs 10 --mincount 10 \
--threads 16 corpus.txt corpus-embeddings.fifu
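Here, --model structgram selects the structured skip-gram model, in which context representations depend on the position of a context word relative to the focus word (Ling et al., 2015).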
The format of the input file is simple: tokens are separated by spaces, sentences by newlines (\n).
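For example, a (hypothetical) toy corpus.txt could contain:

the quick brown fox jumps over the lazy dog
a lazy dog sleeps in the sun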
After training, you can use and query the embeddings with finalfusion and finalfusion-utils:
$ finalfusion similar corpus-embeddings.fifu
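To use the embeddings programmatically, the finalfusion Rust crate can load the trained model. A minimal sketch, assuming the crate's documented prelude API (check the finalfusion docs for the current interface):

use std::fs::File;
use std::io::BufReader;

use finalfusion::prelude::*;

fn main() {
    // Load the embeddings trained above (finalfusion format).
    let mut reader = BufReader::new(File::open("corpus-embeddings.fifu").unwrap());
    let embeddings: Embeddings<VocabWrap, StorageWrap> =
        Embeddings::read_embeddings(&mut reader).unwrap();

    // Look up a word; with a subword vocabulary, embeddings can also be
    // computed for words that were not seen during training.
    if let Some(embedding) = embeddings.embedding("fox") {
        println!("{:?}", embedding);
    }
}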
Dependency embeddings
Train embeddings with dimensionality 200 on corpus.conll, using the dependency model with contexts up to depth 2:
$ finalfrontier deps --dependency-depth 2 --normalize-context \
--dims 200 corpus.conll corpus-deps.fifu
The input file should be in CoNLL-U format.
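For reference, CoNLL-U is a tab-separated format with ten columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and an empty line between sentences; a minimal, hypothetical sentence:

1	The	the	DET	DT	_	2	det	_	_
2	fox	fox	NOUN	NN	_	3	nsubj	_	_
3	sleeps	sleep	VERB	VBZ	_	0	root	_	_

The dependency contexts are derived from the tree encoded in the HEAD and DEPREL columns.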