Data

You can browse the resources for all languages here. For links to specific languages, see the table below. The links will take you to a Google Drive folder where, for any given language, you will find…

a subfolder called fastText

This contains:

  1. fasttext_transform_[]wiki_[threshold].rds, which is the 300 x 300 transformation matrix for the fastText embeddings with the specified minimum frequency threshold
  2. fasttext_vectors_[]wiki.vec, which is the underlying fastText embedding matrix (of dimensions vocabulary size x 300)
  3. fasttext_model_[]wiki.bin, which is "our" fastText model, trained on the relevant language's Wikipedia (rather than Common Crawl). Specifically, this file contains the subword model information that can be used to obtain embeddings for out-of-sample terms (see the sketch after this list).

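The subword model can be loaded with the official fastText Python bindings. The sketch below is a minimal illustration rather than code distributed with these resources; the file name is a hypothetical placeholder for one language, and you would substitute the file you downloaded.

```python
# Minimal sketch (not part of these resources): query the pretrained fastText
# subword model for a vector, including for terms outside its vocabulary.
# "fasttext_model_enwiki.bin" is an illustrative placeholder file name.
import fasttext

model = fasttext.load_model("fasttext_model_enwiki.bin")

# fastText composes vectors from character n-grams, so this works even for
# words that never appeared in the Wikipedia training corpus.
vec = model.get_word_vector("representativeness")
print(vec.shape)  # (300,)
```
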
a subfolder called gloVe

This contains:

  1. glove_transform_[]wiki.rds, which is the 300 x 300 transformation matrix for the GloVe embeddings
  2. glove_vectors_[]wiki.txt, which is the underlying GloVe embedding matrix (of dimensions vocabulary size x 300); a usage sketch follows this list
         
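Outside of R, the GloVe files can be read as plain text. The sketch below is illustrative only: the file names are placeholders for one language, the .rds transformation matrix is assumed to have been exported to CSV from R first (e.g., write.table(readRDS("glove_transform_enwiki.rds"), "glove_transform_enwiki.csv", sep = ",", row.names = FALSE, col.names = FALSE)), and the final step, averaging context embeddings and premultiplying by the transformation matrix, is shown only as one common application; consult the paper for the intended workflow.

```python
# Minimal sketch (not part of these resources): load the GloVe vectors and
# apply the transformation matrix. File names are illustrative placeholders,
# and the transformation matrix is assumed to have been exported to CSV from R.
import numpy as np

def load_glove(path):
    """Read a GloVe text file ("word v1 ... v300" per line) into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove_vectors_enwiki.txt")
A = np.loadtxt("glove_transform_enwiki.csv", delimiter=",")  # 300 x 300

# Example: embed a term from its context by averaging the pretrained vectors
# of the context words, then premultiplying by the transformation matrix.
context = ["the", "central", "bank", "raised", "interest", "rates"]
avg = np.mean([glove[w] for w in context if w in glove], axis=0)
induced = A @ avg
print(induced.shape)  # (300,)
```
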
Arabic Bengali Bulgarian Catalan Chinese (traditional)
Czech Danish Dutch Egyptian Arabic English
Estonian Finnish French German Greek (modern)
Hebrew Hindi Hungarian Indonesian Irish
Italian Japanese Khmer Korean Latvian
Lithuanian Maltese Norwegian Polish Portuguese
Romanian Russian Slovak Slovenian Spanish
Swahili Swedish Ukrainian Urdu Vietnamese

References

If you use these resources, please cite this paper:

Pedro L. Rodriguez, Arthur Spirling, Brandon M. Stewart, and Elisa M. Wirsching. 2024. "Multilanguage Word Embeddings for Social Scientists: Estimation, Inference and Validation Resources for 157 Languages." Working paper.