Data

You can browse the resources for all languages here. For links to specific languages, see the table below. The links will take you to a Google Drive folder where, for any given language, you will find…

a subfolder called fastText

This contains:

  1. fasttext_transform_[]wiki_[threshold].rds, which is the 300 x 300 transformation matrix for the fastText embeddings with the specified minimum frequency threshold
  2. fasttext_vectors_[]wiki.vec, which is the underlying fastText embedding matrix (of dimensions vocabulary size x 300)
  3. fasttext_model_[]wiki.bin, which is "our" fastText model, trained on the relevant language's Wikipedia (rather than Common Crawl). Specifically, this file contains the subword model information that can be used to obtain embeddings for out-of-sample terms (see the sketch after this list).

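The subword model can be loaded with the official fastText Python bindings. The sketch below is a minimal illustration rather than code distributed with these resources; the file name is a hypothetical placeholder for one language, and you would substitute the file you downloaded.

```python
# Minimal sketch (not part of these resources): query the pretrained fastText
# subword model for a vector, including for terms outside its vocabulary.
# "fasttext_model_enwiki.bin" is an illustrative placeholder file name.
import fasttext

model = fasttext.load_model("fasttext_model_enwiki.bin")

# fastText composes vectors from character n-grams, so this works even for
# words that never appeared in the Wikipedia training corpus.
vec = model.get_word_vector("representativeness")
print(vec.shape)  # (300,)
```
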
a subfolder called gloVe

This contains:

  1. glove_transform_[]wiki.rds, which is the 300 x 300 transformation matrix for the GloVe embeddings
  2. glove_vectors_[]wiki.txt, which is the underlying GloVe embedding matrix (of dimensions vocabulary size x 300); a usage sketch follows this list
         
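Outside of R, the GloVe files can be read as plain text. The sketch below is illustrative only: the file names are placeholders for one language, the .rds transformation matrix is assumed to have been exported to CSV from R first (e.g., write.table(readRDS("glove_transform_enwiki.rds"), "glove_transform_enwiki.csv", sep = ",", row.names = FALSE, col.names = FALSE)), and the final step, averaging context embeddings and premultiplying by the transformation matrix, is shown only as one common application; consult the paper for the intended workflow.

```python
# Minimal sketch (not part of these resources): load the GloVe vectors and
# apply the transformation matrix. File names are illustrative placeholders,
# and the transformation matrix is assumed to have been exported to CSV from R.
import numpy as np

def load_glove(path):
    """Read a GloVe text file ("word v1 ... v300" per line) into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove_vectors_enwiki.txt")
A = np.loadtxt("glove_transform_enwiki.csv", delimiter=",")  # 300 x 300

# Example: embed a term from its context by averaging the pretrained vectors
# of the context words, then premultiplying by the transformation matrix.
context = ["the", "central", "bank", "raised", "interest", "rates"]
avg = np.mean([glove[w] for w in context if w in glove], axis=0)
induced = A @ avg
print(induced.shape)  # (300,)
```
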
Arabic Bengali Bulgarian Catalan Chinese (traditional)
Czech Danish Dutch Egyptian Arabic English
Estonian Finnish French German Greek (modern)
Hebrew Hindi Hungarian Indonesian Irish
Italian Japanese Khmer Korean Latvian
Lithuanian Maltese Norwegian Polish Portuguese
Romanian Russian Slovak Slovenian Spanish
Swahili Swedish Ukrainian Urdu Vietnamese

References

If you use these resources, please cite this paper:

Pedro L. Rodriguez, Arthur Spirling, Brandon M. Stewart, and Elisa M. Wirsching. 2024. "Multilanguage Word Embeddings for Social Scientists: Estimation, Inference and Validation Resources for 157 Languages." Working paper.