Our code examples refer to two related but different operations:
- Using the multilanguage embeddings we provide, especially with reference to the conText R package and embedding regression.
- Using the pipeline we have created to produce embeddings and the transformation matrix for any of the 157 fastText languages that you need.
1. Working with Multilanguage Embeddings
We rely on speeches in the Italian parliament (2013-2020) taken from ParlaMint to illustrate a possible use case of our quantities. By showing how government and opposition parties differentially adjusted their speeches around issues of immigration following the 2015 refugee crisis in Europe, we illustrate how our ALC resources can be used to make inferences about semantic differences across time and groups. Our resources are fully integrated with the conText R package. You can find more information on how to get started with conText here.
Load package
library(conText)
# other libraries used in this guide
library(quanteda)
library(dplyr)
library(data.table)
library(readr)
library(ggplot2)
Load quantities
# Transformation matrix
# --------------------------------
transform <- readRDS("data/fasttext_transform_itwiki_25.rds")
# fastText pretrained embeddings
# --------------------------------
not_all_na <- function(x) any(!is.na(x))
fasttext <- setDT(read_delim("data/fasttext_vectors_itwiki.vec",
                             delim = " ",
                             quote = "",
                             skip = 1,
                             col_names = FALSE,
                             col_types = cols())) %>%
  dplyr::select(where(not_all_na)) # remove last column, which is all NA
word_vectors <- as.matrix(fasttext, rownames = 1)
colnames(word_vectors) <- NULL
rm(fasttext)
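As a quick sanity check (a minimal sketch; the 300-dimensional embedding size follows the fastText vectors used throughout this guide), you can confirm that the pretrained embeddings and the transformation matrix are dimensionally compatible:
# the transformation matrix should be square with the same dimension as the word vectors
dim(word_vectors) # V x 300, with V the vocabulary size
dim(transform)    # 300 x 300
stopifnot(ncol(word_vectors) == nrow(transform))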
Load data
# ParlaMint data - some preparation
# --------------------------------
data <- readRDS("data/parlamint_it.rds")
# for preprocessing: trim documents by length, remove punctuation and symbols,
# and remove infrequent tokens and stops
corpus <- corpus(data) %>%
  corpus_trim(what = "documents",
              min_ntoken = 10)
toks <- tokens(corpus, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower()
# without stops
toks_nostop <- tokens_select(toks, pattern = stopwords("it"),
                             selection = "remove", min_nchar = 3)
# only use features that appear at least 10 times in the corpus
feats <- dfm(toks_nostop, tolower = TRUE, verbose = FALSE) %>%
  dfm_trim(min_termfreq = 10) %>% featnames()
toks_nostop <- tokens_select(toks_nostop, feats, padding = TRUE)
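The analyses below rely on a few document-level variables from the ParlaMint data, namely government (government vs. opposition indicator), Moy (year-month) and months (a half-year period indicator). As a small sketch (assuming these docvars ship with parlamint_it.rds, as their use below implies), you can verify that they are present before proceeding:
# the downstream analyses assume these docvars are available
stopifnot(all(c("government", "Moy", "months") %in% names(docvars(toks_nostop))))
head(docvars(toks_nostop)[, c("government", "Moy", "months")])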
Discussion of immigration
We now analyze how the large and unexpected influx of refugees to Southern Europe starting in September 2015 affected discussions around immigration in the Italian parliament, and how this semantic shift differed across government and opposition parties at the time.
Nearest neighbors of immigr* across government and opposition parties
A good first exploratory step is to analyze the nearest neighbors of the ALC embeddings by group, i.e. the features with the highest cosine similarity to each group embedding, using conText::get_nns() (a wrapper function for conText::nns()). In our example, we are interested in the nearest neighbors of the ALC embedding of the word stem immigr across government and opposition parties and across time. We use the candidates argument to limit the set of features that get_nns() considers as candidate nearest neighbors. In our case we limit candidates to those features that appear in the context window around the target term immigr (we could also allow this set to incorporate the entire corpus or all features in the pretrained embeddings).
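To make that parenthetical concrete, the two alternative candidate sets could be built as follows (a sketch; either vector would simply be passed to the candidates argument of get_nns() below):
# candidates drawn from the entire corpus rather than the immigr* context windows
feats_corpus <- featnames(dfm(toks_nostop))
# or all features available in the pretrained fastText embeddings
feats_pretrained <- rownames(word_vectors)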
The results suggest that government and opposition parties differ little in their connotation of “immigration” across the entire sample period. Yet, speakers in the Italian parliament were more prone to speak of “immigration” in the context of “sustainability”, “incentives” or “social security” right after the refugee shock, whereas they were more likely to speak of general immigration issues (“applicants”, “workers”, “immigration”) in other months.
target_toks <- tokens_context(x = toks_nostop, pattern = "immigr*", window = 5L)
## 377 instances of "immigrati" found.
## 19 instances of "immigrato" found.
## 280 instances of "immigrazione" found.
## 15 instances of "immigrazioni" found.
feats <- featnames(dfm(target_toks))
# nearest neighbors
# ---------------------------------
# by government vs. opposition
target_nns <- get_nns(x = target_toks, N = 10,
                      groups = docvars(target_toks, 'government'),
                      candidates = feats,
                      pre_trained = word_vectors,
                      transform = TRUE,
                      transform_matrix = transform,
                      bootstrap = FALSE) %>%
  lapply(., "[[", 2) %>%
  do.call(rbind, .) %>%
  as.data.frame()
target_nns[, 1:5]
## V1 V2 V3 V4 V5
## 1 dell'immigrazione richiedenti all'immigrazione immigrazione emergenziale
## 0 dell'immigrazione richiedenti all'immigrazione immigrazione l'immigrazione
# by month
target_nns <- get_nns(x = target_toks, N = 10,
                      groups = docvars(target_toks, 'Moy'),
                      candidates = feats,
                      pre_trained = word_vectors,
                      transform = TRUE,
                      transform_matrix = transform,
                      bootstrap = FALSE) %>%
  lapply(., "[[", 2) %>%
  do.call(rbind, .) %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "Moy") %>%
  arrange(lubridate::ym(Moy))
target_nns[19:25, 1:5]
## Moy V1 V2 V3 V4
## 19 2015-07 richiedenti emergenziale richiedente chiediamo
## 20 2015-08 richiedenti richiedente lavoratori migranti
## 21 2015-09 richiedenti emergenziale richiedente pregiudiziale
## 22 2015-10 dell'immigrazione immigrazione all'immigrazione l'immigrazione
## 23 2015-11 ventimiglia francia invadere respingere
## 24 2015-12 incentiva previdenziali emergenziale sostenibile
## 25 2016-01 richiedenti richiedente migranti immigrazione
Nearest neighbors cosine similarity ratios across parties and time
Another exploratory exercise is to compute the cosine similarity ratio between group embeddings and features using conText::get_nns_ratio() (a wrapper function for conText::nns_ratio()). Given ALC embeddings for two groups, get_nns_ratio() first computes, for any given feature, the similarity between that feature and each group embedding, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group. Values larger (smaller) than 1 mean the feature is more (less) discriminant of the group in the numerator (denominator). Use the numerator argument to define which group represents the numerator in this ratio. If N is defined, the ratio is computed for the union of the top N nearest neighbors.
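As a toy illustration of this interpretation (the numbers are invented, not taken from the data):
# suppose a feature has these cosine similarities with the two group embeddings
cos_gov <- 0.62   # similarity with the "Government" ALC embedding (numerator)
cos_opp <- 0.48   # similarity with the "Opposition" ALC embedding (denominator)
cos_gov / cos_opp # ~1.29 > 1: the feature is more discriminant of government speech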
In this example we look at period-specific cosine similarity ratios for immigr* across government and opposition parties, particularly in the months right before and right after the 2015 refugee crisis reached Italy. The results suggest that both parliamentary camps discussed issues of immigration in similar ways in early 2015, often sharing nearest neighbors such as emergency (emergenziale) or applicants (richiedenti). In the later months of 2015, in contrast, the vocabularies of government and opposition parties differ radically. While opposition parties still seem to talk about immigration in more general terms (e.g. invoking terms lexically related to immigrazione), government parties now mention normative challenges of immigration as well as legal constraints, e.g. the Schengen area or the "Bossi-Fini law".
plotfun <- function(period){
  # subset the tokens object to the period we want
  temp <- tokens_subset(target_toks, months == period)
  # features as candidates
  feats <- featnames(dfm(target_toks))
  # adjust docvars for plotting purposes
  docvars(temp)$Government <- ifelse(docvars(temp)$government == 1, "Government", "Opposition")
  set.seed(111)
  target_nns_ratio <- get_nns_ratio(x = temp,
                                    N = 10,
                                    groups = docvars(temp, 'Government'),
                                    numerator = "Government",
                                    candidates = feats,
                                    pre_trained = word_vectors,
                                    transform = TRUE,
                                    transform_matrix = transform,
                                    bootstrap = TRUE,
                                    num_bootstraps = 100,
                                    permute = TRUE,
                                    num_permutations = 100,
                                    verbose = FALSE)
  plot <- plot_nns_ratio(x = target_nns_ratio, alpha = 0.05, horizontal = TRUE)
  return(plot)
}
(plot_20150106 <- plotfun(period = 2))
## starting bootstraps
## done with bootstraps
## starting permutations
## done with permutations
(plot_20150912 <- plotfun(period = 3))
## starting bootstraps
## done with bootstraps
## starting permutations
## done with permutations
Embedding regression
Finally, we evaluate the trend in semantic differences across government and opposition parties around the 2015 refugee crisis using embedding regression. conText::conText() uses ALC embeddings within a regression-style framework, i.e. it allows us to examine covariate effects on embeddings beyond discrete group variables, or while controlling for other covariates.
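For illustration, controlling for an additional covariate only requires extending the formula. The sketch below uses a hypothetical docvar party_family that is not created in this guide; everything else mirrors the call used next.
# hypothetical sketch: add a second covariate to the embedding regression
# ("party_family" is an illustrative docvar name, not part of the data preparation above)
model_multi <- conText(formula = . ~ government + party_family,
                       data = target_toks,
                       pre_trained = word_vectors,
                       transform = TRUE,
                       transform_matrix = transform,
                       bootstrap = TRUE,
                       num_bootstraps = 100,
                       permute = TRUE,
                       num_permutations = 100,
                       window = 5,
                       verbose = FALSE)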
In our example we estimate semantic differences around issues of immigration across government and opposition parties and by periods. In line with the earlier results on cosine similarity ratios, the normed regression estimates shown below clearly indicate that speakers from different parliamentary camps differ throughout the entire period, and most strongly in the months between September and December 2015—a period with large and unexpected waves of refugees arriving in Southern Europe.
# estimate embedding regression by period
set.seed(2021L)
models <- lapply(unique(docvars(target_toks, 'months')), function(j){
  conText(formula = . ~ government,
          data = tokens_subset(target_toks, months == j),
          pre_trained = word_vectors,
          transform = TRUE,
          transform_matrix = transform,
          stratify = TRUE,
          bootstrap = TRUE,
          num_bootstraps = 100,
          permute = TRUE,
          num_permutations = 100,
          hard_cut = FALSE,
          window = 5,
          case_insensitive = TRUE,
          verbose = TRUE)
})
## total observations included in regression: 174
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 1.77527 0.1794097 1.492813 2.046208 0.1
## total observations included in regression: 66
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.761237 0.2458782 2.395422 3.136497 0.03
## total observations included in regression: 130
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 1.793092 0.1682247 1.546029 2.08536 0.07
## total observations included in regression: 34
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 4.046238 0.5333392 3.244508 4.941884 0
## total observations included in regression: 77
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.292316 0.2156901 1.989034 2.698041 0
## total observations included in regression: 44
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.802802 0.3245191 2.299766 3.357747 0.07
## total observations included in regression: 103
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.191998 0.1770688 1.918852 2.519986 0
## total observations included in regression: 63
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 3.05894 0.3204981 2.535519 3.622745 0.15
# plot these normed beta estimates
plot_tibble <- lapply(models, function(i) i@normed_coefficients) %>%
  do.call(rbind, .) %>%
  mutate(period = factor(seq(1, 8), labels = c("2014-01/06", "2014-07/12",
                                               "2015-01/08", "2015-09/12",
                                               "2016-01/06", "2016-07/12",
                                               "2017-01/06", "2017-07/12")))
ggplot(data = plot_tibble,
       aes(x = period,
           y = normed.estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower.ci,
                    ymax = upper.ci),
                width = 0.5) +
  geom_vline(xintercept = 3.5, linetype = "dashed") +
  labs(x = "",
       title = "Norm of Difference between Government\nand Opposition ALC embeddings of 'immigr*'",
       y = latex2exp::TeX("Norm of $\\hat{\\beta}$")) +
  theme_bw()
2. Working with the Training Pipeline
All training code is available here. The relevant code files are:
- The 000_master.sbatch file is the master file that executes the training for all languages. For each language, it specifies the exact Wikipedia corpus used (the month of the Wikipedia file) and the minimum frequency cutoff enforced when training the underlying embeddings as well as the transformation matrix.
- The 000_all_latin.sbatch file is used for all languages written in the Latin, Cyrillic, Hebrew or Greek scripts.
- The 000_all_ja.sbatch file is used for Japanese and uses the MeCab tokenizer.
- The 000_all_zh.sbatch file is used for (traditional) Chinese and uses the Stanford word segmenter.
- The 000_all_icu.sbatch file is used for all remaining languages and uses the ICU tokenizer.
When training the models (fastText, GloVe and their respective ALC embeddings), we apply a hard minimum frequency threshold to the respective vocabulary. This helps to clean out noisy parts of the corpus and thus significantly improves the fit of all models. We base our choice of the language-specific threshold on the size of the Wikipedia corpus and vocabulary for each language. Specifically, we impose a minimum frequency cutoff of 50 for English, 25 for medium-sized languages (i.e. German, Spanish, Italian, French, Russian, Swedish and Dutch), 15 for small-to-medium-sized languages (i.e. Czech, Finnish, Hungarian, Portuguese, Korean and Arabic) and 10 for all smaller languages. As this step turned out to be crucial for the out-of-sample performance of our quantities, scholars who use our code pipeline to train resources from Wikipedia for their language may want to experiment with the size of the threshold in their particular case.
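One quick way to gauge how different cutoffs affect the vocabulary before committing to a full training run is sketched below (wiki_pp_sample.txt is a placeholder for a sample of the preprocessed Wikipedia text; the cutoffs are those mentioned above):
# sketch: vocabulary size under different minimum-frequency cutoffs
library(quanteda)
wiki_dfm <- dfm(tokens(readLines("wiki_pp_sample.txt"))) # placeholder file
sapply(c(10, 15, 25, 50), function(cutoff) {
  nfeat(dfm_trim(wiki_dfm, min_termfreq = cutoff))
})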
The logic of the training files is as follows:
- We first download and extract the Wikipedia corpus using the WikiExtractor. This provides us with the raw, unprocessed text of the relevant Wikipedia corpus, saved in the wiki.txt file.
- We then preprocess the raw text by removing punctuation (outside of tokens), setting it to lower case and removing extra white space. We further tokenize the raw text (procedures depend on the language). This preprocessed text is saved in ${LAN}wiki_ppfinal.txt.
- We now train the GloVe model using this preprocessed and tokenized text. We set the language-specific minimum word frequency described in our manuscript, a vector size of 300 and a context size of 5. We further impose parameters similar to those in Pennington, Socher and Manning (2014) for the weighting function, together with a fixed maximum number of iterations.
- We then train fastText models on our preprocessed and tokenized text using a context window of 5 and setting the dimension of the word vectors to 300. For the dictionary, we impose the minimum frequency of occurrence in the entire corpus described in our manuscript, and use negative sampling of size 10.
- Finally, for both fastText and GloVe embeddings, we train ALC embeddings to obtain the relevant transformation matrices using trainA_chunks_wiki.R. To handle the large size of the respective corpora, we use a chunk-based learning approach. That is, we read in the relevant preprocessed corpus chunk by chunk and perform the following operations on each chunk (a schematic code sketch follows after the next paragraph):
  1. Retain the vocabulary with a minimum term frequency of the language-specific threshold.
  2. Create a feature co-occurrence matrix (FCM) using conText, with a window size of 5 and equal weighting.
  3. Obtain a corresponding feature embedding matrix that provides additive context-specific feature embeddings, averaged over all embedding instances in the given chunk.
To obtain the un-transformed additive embeddings for all features across the entire corpus, we then simply average the chunk-specific additive embeddings for each feature across chunks. This is possible because the additive context embeddings from step 3 are themselves simple averages of the respective instance-specific additive context embeddings within a given chunk. We do this for all features that appear with a frequency of at least the language-specific threshold across the entire corpus. Finally, we train the corresponding transformation matrix with log-weighting.
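The following is a schematic sketch of the chunk-level operations described above, not the trainA_chunks_wiki.R script itself. The file path and the pretrained embedding matrix wv (GloVe or fastText, with features as row names) are placeholders; the quanteda and conText calls (fcm(), fem(), compute_transform()) are the functions the steps above rely on.
library(quanteda)
library(conText)

# placeholder: one chunk of the preprocessed, tokenized Wikipedia text
chunk_toks <- tokens(readLines("chunks/chunk_01.txt"))

# 1. retain vocabulary above the language-specific frequency threshold (here 25, as for Italian)
feats <- featnames(dfm_trim(dfm(chunk_toks), min_termfreq = 25))
chunk_toks <- tokens_select(chunk_toks, feats, padding = TRUE)

# 2. feature co-occurrence matrix with a window of 5 and equal weighting
chunk_fcm <- fcm(chunk_toks, context = "window", window = 5,
                 count = "frequency", tri = FALSE)

# 3. additive (un-transformed) context-specific feature embeddings for this chunk
chunk_fem <- fem(chunk_fcm, pre_trained = wv, transform = FALSE)

# across chunks, these chunk-specific feature embeddings are averaged feature by feature;
# the transformation matrix is then estimated with log-weighting, e.g.
transform_matrix <- compute_transform(x = chunk_fcm, pre_trained = wv, weighting = "log")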
We highly recommend training the resources on a high-performance server with sufficient RAM and support for parallelization. When using the training pipeline, remember to adjust all directories in the training code.