
This website provides data and code on ALC embeddings in multiple languages from the project by Wirsching, Rodriguez, Spirling, and Stewart.

Specifically, this site gathers the quantities needed to construct ALC embeddings in various languages. For each language, we provide:


   - A new version of fastText embeddings, fit to Wikipedia corpora (as opposed to the "original" fastText embeddings, which use Common Crawl data).
   - The resulting fastText subword embedding models for post-estimation (helpful for obtaining embeddings for terms not in the Wikipedia corpora by using subword information; see here for more information)
   - GloVe embeddings, fit to Wikipedia corpora
   - Two lightweight ALC transformation matrices (300 x 300), one for each set of pre-trained embeddings, which are necessary to build ALC embeddings
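To illustrate how the pieces above fit together, here is a minimal sketch of the ALC recipe: average the pre-trained embeddings of the words surrounding a keyword, then apply the 300 x 300 transformation matrix. The function and variable names are illustrative placeholders, not part of the released code, and the toy data stand in for the actual pre-trained vectors and matrices distributed on this site.

```python
import numpy as np

def alc_embedding(contexts, vocab_index, V, A):
    """Compute an ALC embedding for a keyword.

    contexts    : list of token lists, each the context window around one
                  occurrence of the keyword
    vocab_index : dict mapping token -> row index in V
    V           : pre-trained embedding matrix, shape (|vocab|, 300)
    A           : ALC transformation matrix, shape (300, 300)
    """
    ctx_vecs = []
    for tokens in contexts:
        vecs = [V[vocab_index[t]] for t in tokens if t in vocab_index]
        if vecs:
            # average the embeddings of the context words for this occurrence
            ctx_vecs.append(np.mean(vecs, axis=0))
    u = np.mean(ctx_vecs, axis=0)  # average across all occurrences
    return A @ u                   # apply the ALC transformation

# Toy example with random stand-in data (real use would load the
# pre-trained embeddings and transformation matrix from this site).
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 300))
A = np.eye(300)
vocab = {"economy": 0, "growth": 1, "policy": 2}
emb = alc_embedding([["economy", "growth"], ["policy"]], vocab, V, A)
print(emb.shape)  # (300,)
```

In practice the transformation matrix corrects for the fact that a raw average of context vectors is biased toward high-frequency directions in the embedding space.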


The Data link has more information. Under Code you can find code for fitting and using these models.

The accompanying paper, "Multilanguage Word Embeddings for Social Scientists: Estimation, Inference, and Validation Resources for 157 Languages," is published in Political Analysis here. Supporting information is here. The current citation is:

Wirsching EM, Rodriguez PL, Spirling A, Stewart BM. Multilanguage Word Embeddings for Social Scientists: Estimation, Inference, and Validation Resources for 157 Languages. Political Analysis. Published online 2024:1-8. doi:10.1017/pan.2024.17