Training embeddings

finalfrontier

finalfrontier is a program for training word embeddings. It currently has the following features:

  • Noise contrastive estimation (Gutmann and Hyvärinen, 2012; see the sketch after this list)
  • Subword representations (Bojanowski et al., 2016)
  • Hogwild SGD (Recht et al., 2011)
  • Models:
    • skip-gram (Mikolov et al., 2013)
    • structured skip-gram (Ling et al., 2015)
    • directional skip-gram (Song et al., 2018)
    • dependency (Levy and Goldberg, 2014)
  • Output formats:
    • finalfusion
    • fastText
    • word2vec binary
    • word2vec text
    • GloVe
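
To make the training objective concrete, here is a minimal sketch of a single skip-gram update. It uses plain negative sampling (a simplified relative of noise contrastive estimation) and omits subwords, Hogwild parallelism, and the structured and directional variants; all names and hyperparameters here are illustrative, not finalfrontier's internals:

import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIMS, LR = 10_000, 300, 0.05

in_vecs = rng.normal(0.0, 0.1, (VOCAB, DIMS))  # word ("input") embeddings
out_vecs = np.zeros((VOCAB, DIMS))             # context ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_step(word, context, negatives):
    # Pull the observed (word, context) pair together and push the
    # word away from each sampled negative context.
    word_grad = np.zeros(DIMS)
    for ctx, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        score = sigmoid(in_vecs[word] @ out_vecs[ctx])
        g = LR * (label - score)
        word_grad += g * out_vecs[ctx]
        out_vecs[ctx] += g * in_vecs[word]
    in_vecs[word] += word_grad

# One update: centre word 17, observed context 42, five noise words.
skipgram_step(17, 42, rng.integers(0, VOCAB, size=5))

The negatives are drawn uniformly here for brevity; real trainers typically sample them from a smoothed unigram distribution.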

Getting finalfrontier

We provide precompiled finalfrontier releases for Linux. If you use another platform, follow the build instructions.

Quickstart

Skip-gram models

Train 300-dimensional word embeddings with the structured skip-gram model, discarding words that occur fewer than 10 times, for 10 epochs using 16 threads:

$ finalfrontier skipgram --dims 300 --model structgram --epochs 10 --mincount 10 \
  --threads 16 corpus.txt corpus-embeddings.fifu

The input format is simple: tokens are separated by spaces, sentences by newlines (\n).
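
For illustration, a minimal sketch that writes a corpus in this format; sentences.txt is a hypothetical file with one raw sentence per line, and the whitespace split stands in for a real tokenizer:

# Hypothetical preprocessing step: sentences.txt holds one raw sentence
# per line; split() is a stand-in for a proper tokenizer.
with open("sentences.txt", encoding="utf-8") as src, \
        open("corpus.txt", "w", encoding="utf-8") as dst:
    for sentence in src:
        tokens = sentence.split()
        if tokens:
            dst.write(" ".join(tokens) + "\n")  # space-separated tokens, one sentence per line

After training, you can use and query the embeddings with finalfusion and finalfusion-utils: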

$ finalfusion similar corpus-embeddings.fifu
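
finalfusion-utils is the intended query tool, but since finalfrontier can also write the word2vec text format (see the output formats above), a self-contained nearest-neighbour query is easy to sketch. The export name corpus-embeddings.txt is hypothetical, and the loader assumes the variant of the format with a "vocab_size dims" header line:

import numpy as np

def load_word2vec_text(path):
    # word2vec text format: a "vocab_size dims" header, then one
    # "word v1 v2 ..." line per word.
    with open(path, encoding="utf-8") as f:
        n, dims = map(int, f.readline().split())
        words, vecs = [], np.empty((n, dims), dtype=np.float32)
        for i in range(n):
            parts = f.readline().split()
            words.append(parts[0])
            vecs[i] = np.asarray(parts[1:], dtype=np.float32)
    return words, vecs

words, vecs = load_word2vec_text("corpus-embeddings.txt")  # hypothetical export
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)        # unit length, so dot = cosine

def similar(word, k=10):
    sims = vecs @ vecs[words.index(word)]
    return [words[i] for i in np.argsort(-sims)[1 : k + 1]]  # skip the word itself

print(similar("berlin"))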

Dependency embeddings

Train embeddings with dimensionality 200 on corpus.conll, using the dependency model with contexts up to depth 2:

$ finalfrontier deps --dependency-depth 2 --normalize-context \
  --dims 200 corpus.conll corpus-deps.fifu

The input file should be in CoNLL-U format.
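
To show what the dependency model trains on, here is a sketch of depth-1 context extraction in the style of Levy and Goldberg (2014). The tuples mimic CoNLL-U columns 1, 2, 7 and 8 (ID, FORM, HEAD, DEPREL), and the rel-1 inverse marker is illustrative rather than finalfrontier's exact notation; with --dependency-depth 2, contexts two arcs away are included as well:

# Hand-parsed example sentence; each tuple is (id, form, head, deprel),
# i.e. CoNLL-U columns 1, 2, 7 and 8.
sentence = [
    (1, "scientists", 2, "nsubj"),
    (2, "discover", 0, "root"),
    (3, "stars", 2, "obj"),
]

form = {tid: tok for tid, tok, _, _ in sentence}

contexts = []
for tid, tok, head, rel in sentence:
    if head != 0:  # the root has no head context
        contexts.append((tok, f"{rel}/{form[head]}"))    # dependent sees its head
        contexts.append((form[head], f"{rel}-1/{tok}"))  # head sees the dependent

print(contexts)
# [('scientists', 'nsubj/discover'), ('discover', 'nsubj-1/scientists'),
#  ('stars', 'obj/discover'), ('discover', 'obj-1/stars')]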