finalfusion specification
Version 0
Goals
finalfusion is a format for storing word embeddings. The goals of the
first version of the finalfusion format are:
- Easy to parse
- Fast to parse
- Extensible
- Support for:
    - Memory mapping
    - Tokens with spaces
    - Subword units
    - Quantized matrices
- Existing embeddings should be convertible
File format
Each finalfusion file consists of a header, followed by chunks. Currently,
a finalfusion file must contain the following chunks, in this order:
- Optional metadata chunk
- Vocabulary chunk
- Storage chunk
The permitted chunks may be extended in a future version of the specification. In particular, we would like to make it possible:
- To have multiple storage chunks per vocabulary.
- To have multiple vocab-storage pairs.
All data must be in little endian byte order.
Header
The header consists of:
- 4 bytes of magic: ['F', 'i', 'F', 'u']
- Format version number: u32
- Number of chunks: u32 (n_chunks)
- Chunk identifiers: [u32; n_chunks]
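The header layout above can be read with a few fixed-width little-endian fields. The following is a minimal sketch, not a reference implementation; the function name `read_header` and the in-memory example are illustrative assumptions.

```python
import io
import struct

MAGIC = b"FiFu"  # the 4 magic bytes ['F', 'i', 'F', 'u']


def read_header(f):
    """Read a finalfusion header and return (version, chunk_ids)."""
    magic = f.read(4)
    if magic != MAGIC:
        raise ValueError(f"bad magic: {magic!r}")
    # "<" selects little-endian byte order; "I" is a u32.
    version, n_chunks = struct.unpack("<II", f.read(8))
    chunk_ids = list(struct.unpack(f"<{n_chunks}I", f.read(4 * n_chunks)))
    return version, chunk_ids


# Hypothetical header: version 0, two chunks (vocab = 1, embedding matrix = 2).
example = MAGIC + struct.pack("<II", 0, 2) + struct.pack("<2I", 1, 2)
print(read_header(io.BytesIO(example)))  # (0, [1, 2])
```

Because the chunk identifiers are listed up front, a reader can check whether it supports every chunk in the file before reading any chunk data.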
Data types
0: i8
1: u8
2: i16
3: u16
4: i32
5: u32
6: i64
7: u64
8: i128
9: u128
10: f32
11: f64
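A reader needs the size of each data type to compute padding and data lengths. The table above can be sketched as a lookup from type identifier to size in bytes; the names `TYPE_SIZES` and `STRUCT_CODES` are illustrative assumptions. Python's `struct` module has no 128-bit integer format, so identifiers 8 and 9 only have a size here.

```python
import struct

# Size in bytes of each data type, keyed by its identifier.
TYPE_SIZES = {
    0: 1, 1: 1,     # i8, u8
    2: 2, 3: 2,     # i16, u16
    4: 4, 5: 4,     # i32, u32
    6: 8, 7: 8,     # i64, u64
    8: 16, 9: 16,   # i128, u128
    10: 4, 11: 8,   # f32, f64
}

# struct format characters for the types that the struct module supports.
STRUCT_CODES = {
    0: "b", 1: "B", 2: "h", 3: "H", 4: "i", 5: "I",
    6: "q", 7: "Q", 10: "f", 11: "d",
}

# Sanity check: the little-endian struct sizes agree with the size table.
for type_id, code in STRUCT_CODES.items():
    assert struct.calcsize("<" + code) == TYPE_SIZES[type_id]
```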
Chunks
Chunk format
The chunk format is as follows:
- Chunk identifier: u32
- Chunk data length: u64 (chunk_data_len)
- Chunk data: chunk_data_len bytes
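Since every chunk starts with its identifier and data length, a reader can walk over chunks and skip those it does not need. A minimal sketch, assuming the stream is positioned directly after the file header; `iter_chunks` is a hypothetical helper name.

```python
import io
import struct


def iter_chunks(f):
    """Yield (chunk_id, data_offset, data_len) for each chunk in the stream."""
    while True:
        head = f.read(12)  # u32 identifier + u64 data length
        if not head:
            return  # clean end of file
        if len(head) < 12:
            raise ValueError("truncated chunk header")
        chunk_id, data_len = struct.unpack("<IQ", head)
        data_offset = f.tell()
        yield chunk_id, data_offset, data_len
        # Skip the chunk data, wherever the caller left the stream.
        f.seek(data_offset + data_len)


# Two made-up chunks: identifier 5 with 3 data bytes, identifier 1 with 8.
example = (
    struct.pack("<IQ", 5, 3) + b"abc"
    + struct.pack("<IQ", 1, 8) + b"\x00" * 8
)
print(list(iter_chunks(io.BytesIO(example))))  # [(5, 12, 3), (1, 27, 8)]
```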
Simple Vocab
- Chunk identifier: 1
- Vocab length: u64 (vocab_len)
- vocab_len times:
    - word length in bytes: u32 (word_len)
    - word_len times u8.
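The length-prefixed word list can be read as sketched below. The sketch assumes that the stream is positioned at the vocab length field and that words are UTF-8 encoded; the specification only states that a word is a sequence of `u8`. The helper names are illustrative.

```python
import io
import struct


def write_words(f, words):
    """Write the vocab length followed by length-prefixed words."""
    f.write(struct.pack("<Q", len(words)))
    for word in words:
        data = word.encode("utf-8")  # assumption: words are UTF-8
        f.write(struct.pack("<I", len(data)))
        f.write(data)


def read_words(f):
    """Read the vocab length, then that many length-prefixed words."""
    (vocab_len,) = struct.unpack("<Q", f.read(8))
    words = []
    for _ in range(vocab_len):
        (word_len,) = struct.unpack("<I", f.read(4))
        words.append(f.read(word_len).decode("utf-8"))
    return words


buf = io.BytesIO()
write_words(buf, ["hello", "New York", "straße"])
buf.seek(0)
print(read_words(buf))  # ['hello', 'New York', 'straße']
```

Because each word carries an explicit byte length, tokens that contain spaces need no escaping.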
Finalfusion subword vocab
- Chunk identifier: 3
- Minimum n-gram length: u32
- Maximum n-gram length: u32
- Bucket exponent: u32
- Vocab length: u64 (vocab_len)
- vocab_len times:
    - word length in bytes: u32 (word_len)
    - word_len times u8.
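The fixed-width fields that precede the word list can be read in one call. A minimal sketch, assuming the stream is positioned at the minimum n-gram length field and that the bucket exponent `e` denotes `2^e` buckets (an interpretation of the field name, not something stated above); `read_subword_params` is a hypothetical name.

```python
import io
import struct


def read_subword_params(f):
    """Read the three u32 fields that precede the vocab length."""
    min_n, max_n, bucket_exp = struct.unpack("<III", f.read(12))
    return {
        "min_n": min_n,
        "max_n": max_n,
        "bucket_exp": bucket_exp,
        # Assumption: the exponent is a power of two.
        "n_buckets": 1 << bucket_exp,
    }


# Made-up parameters: n-grams of length 3 to 6, bucket exponent 21.
example = struct.pack("<III", 3, 6, 21)
params = read_subword_params(io.BytesIO(example))
print(params["n_buckets"])  # 2097152
```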
fastText subword vocab
- Chunk identifier: 7
- Minimum n-gram length: u32
- Maximum n-gram length: u32
- Number of buckets: u32
- Vocab length: u64 (vocab_len)
- vocab_len times:
    - word length in bytes: u32 (word_len)
    - word_len times u8.
Explicit n-gram vocab
- Chunk identifier: 8
- Minimum n-gram length: u32
- Maximum n-gram length: u32
- Vocab length: u64 (vocab_len)
- vocab_len times:
    - word length in bytes: u32 (word_len)
    - word_len times u8.
- N-gram vocab length: u64 (n_ngrams)
- n_ngrams times:
    - n-gram length in bytes: u32 (ngram_len)
    - ngram_len times u8.
Embedding matrix
- Chunk identifier: 2
- Shape:
    - Rows: u64 (n_rows)
    - Cols: u32 (n_cols)
- Data type: u32 (data_type)
- Padding, such that data is at a multiple of size_of::<data_type>().
- Data: n_rows * n_cols * sizeof(data_type)
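The padding rule can be sketched as follows. The sketch assumes that alignment is measured from the start of the file, which is what memory mapping would require, that the matrix is stored row-major, and that the stream is positioned at the Rows field. Only `f32` (data type 10) is handled, and the function names are illustrative.

```python
import io
import struct


def padding_len(offset, type_size):
    """Number of bytes needed to move offset up to a multiple of type_size."""
    return (type_size - offset % type_size) % type_size


def read_f32_matrix(f):
    """Read shape, data type, padding and data; return a list of rows."""
    n_rows, n_cols, data_type = struct.unpack("<QII", f.read(16))
    if data_type != 10:
        raise ValueError("this sketch only handles f32 (data type 10)")
    # Skip the padding so that the data starts at a multiple of 4 bytes.
    f.seek(padding_len(f.tell(), 4), io.SEEK_CUR)
    n = n_rows * n_cols
    flat = struct.unpack(f"<{n}f", f.read(4 * n))
    # Assumption: row-major layout.
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


# Two bytes of unrelated content first, so that two padding bytes are needed.
example = (
    b"\x00\x00"
    + struct.pack("<QII", 2, 3, 10)
    + b"\x00" * padding_len(2 + 16, 4)
    + struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
)
f = io.BytesIO(example)
f.seek(2)
print(read_f32_matrix(f))  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```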
Quantized embedding matrix
- Chunk identifier: 4
- Use projection (0 or 1): u32
- Use norms (0 or 1): u32
- Quantized embedding length: u32 (quantized_len)
- Reconstructed embedding length: u32 (reconstructed_len)
- Number of quantizer centroids: u32
- Quantized matrix rows: u64 (matrix_rows)
- Quantized matrix type: u32 (quantized_type)
- Reconstructed matrix type: u32 (reconstructed_type)
- Padding, such that data is at a multiple of the largest matrix data type.
- Projection matrix: reconstructed_len x reconstructed_len x sizeof(reconstructed_type)
- Subquantizers: quantized_len x (reconstructed_len / quantized_len) x sizeof(quantized_type)
- Norms: matrix_rows x sizeof(reconstructed_type)
- Quantized embedding matrix: matrix_rows x quantized_len x sizeof(reconstructed_type)
Metadata
- Chunk identifier: 5
- UTF-8 encoded metadata in TOML format: chunk_data_len times u8
NdNorms
- Chunk identifier: 6
- Number of norms: u64 (n_norms)
- Data type: u32 (data_type)
- Padding, such that data is at a multiple of size_of::<data_type>().
- Data: n_norms*sizeof(data_type)
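Writing a chunk with padding requires knowing where the data will start before the chunk header is written. A minimal sketch for `f32` norms, assuming that the chunk data length covers everything after the length field (number of norms, data type, padding and data) and that alignment is measured from the start of the file; `write_norms` is a hypothetical name.

```python
import io
import struct


def write_norms(f, norms):
    """Write an NdNorms chunk (identifier 6) holding f32 norms."""
    type_size = 4  # f32
    # Data begins after: identifier (4) + chunk length (8)
    # + number of norms (8) + data type (4).
    data_start = f.tell() + 4 + 8 + 8 + 4
    n_padding = (type_size - data_start % type_size) % type_size
    # Assumption: the length counts all bytes after the length field.
    chunk_len = 8 + 4 + n_padding + type_size * len(norms)
    f.write(struct.pack("<IQQI", 6, chunk_len, len(norms), 10))
    f.write(b"\x00" * n_padding)
    f.write(struct.pack(f"<{len(norms)}f", *norms))


buf = io.BytesIO()
buf.write(b"\x00" * 3)  # stand-in for 3 bytes of preceding content
write_norms(buf, [1.0, 0.5])
raw = buf.getvalue()
print(len(raw))                         # 36
print(struct.unpack("<2f", raw[28:]))   # (1.0, 0.5)
```

In this example the data would start at offset 27, so one padding byte moves it to offset 28, a multiple of 4.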