This website provides data and code for ALC ("à la carte") embeddings in multiple languages, from the project by Wirsching, Rodriguez, Spirling, and Stewart. For each language, we provide:
- A new version of fastText embeddings, fit to Wikipedia corpora (as opposed to the "original" fastText embeddings, which use Common Crawl data)
- The resulting fastText subword embedding models for post-estimation (useful for obtaining embeddings for terms that do not appear in the Wikipedia corpora, via subword information; see here for more information, and the fastText sketch after this list)
- GloVe embeddings, fit to Wikipedia corpora
- Two lightweight ALC transformation matrices (300 × 300), corresponding to these pre-trained embeddings and necessary to build ALC embeddings (see the ALC sketch after this list)
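
To make the role of the transformation matrices concrete, here is a minimal sketch of how an ALC embedding is built: average the pre-trained embeddings of a keyword's context words, then multiply by the transformation matrix. The file names (`embeddings.txt`, `transform.npy`) and the toy contexts are hypothetical placeholders, not the names of the released files.

```python
# Minimal ALC sketch; file names below are hypothetical placeholders.
import numpy as np

# Load pre-trained embeddings (word -> 300-d vector), e.g. GloVe in text format
embeddings = {}
with open("embeddings.txt", encoding="utf-8") as f:
    for line in f:
        token, *values = line.rstrip().split(" ")
        embeddings[token] = np.asarray(values, dtype=float)

# Load the matching 300 x 300 ALC transformation matrix
A = np.load("transform.npy")

def alc_embedding(contexts):
    """ALC embedding of a keyword: average the pre-trained embeddings of
    its context words, then apply the transformation matrix."""
    vecs = [embeddings[w] for ctx in contexts for w in ctx if w in embeddings]
    return A @ np.mean(vecs, axis=0)

# Two (toy) contexts observed around the keyword of interest
vec = alc_embedding([["central", "reserve", "raised", "rates"],
                     ["deposit", "money", "account"]])
print(vec.shape)  # (300,)
```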
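
Similarly, here is a sketch of post-estimation with the fastText subword models, using the `fasttext` Python bindings; the model file name is a placeholder for whichever language's `.bin` file you download. Because fastText composes vectors from character n-grams, it returns a vector even for terms absent from the Wikipedia corpora.

```python
# Sketch: embed an out-of-vocabulary term with a released fastText model.
# "wiki.xx.bin" is a placeholder for the downloaded model file.
import fasttext

model = fasttext.load_model("wiki.xx.bin")

# Composed from character n-grams, so this works even for terms
# that never appear in the Wikipedia corpus
vec = model.get_word_vector("cryptocurrencies")
print(vec.shape)  # (300,)
```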
See the Data page for more information on these files; under Code you can find code for fitting and using these models.
The accompanying paper is titled "Multilanguage Word Embeddings for Social Scientists: Estimation, Inference and Validation Resources for 157 Languages"; the current version is available here, and the supporting information is here.