Our code examples refer to two related but different operations:
- Using the multilanguage embeddings we provide, especially with reference to the conText R package and embedding regression.
- Using the pipeline we have created to produce embeddings and the transformation matrix for any of the 157 fastText languages that you need.
1. Working with Multilanguage Embeddings
We rely on speeches in the Italian parliament (2013-2020) taken from ParlaMint to illustrate a possible use case of our quantities. By showing how government and opposition parties differentially adjusted their speeches around issues of immigration following the 2015 refugee crisis in Europe, we illustrate how our ALC resources can be used to make inferences about semantic differences across time and groups. Our resources are fully integrated with the conText R package. You can find more information on how to get started with conText here.
Load package
library(conText)
# other libraries used in this guide
library(quanteda)
library(dplyr)
library(data.table)
library(readr)
library(ggplot2)
Load quantities
# Transformation matrix
# --------------------------------
transform <- readRDS("data/fasttext_transform_itwiki_25.rds")
# fastText pretrained embeddings
# --------------------------------
not_all_na <- function(x) any(!is.na(x))
fasttext <- setDT(read_delim("data/fasttext_vectors_itwiki.vec",
                             delim = " ",
                             quote = "",
                             skip = 1,
                             col_names = FALSE,
                             col_types = cols())) %>%
  dplyr::select(where(not_all_na)) # remove last column, which is all NA
word_vectors <- as.matrix(fasttext, rownames = 1)
colnames(word_vectors) <- NULL
rm(fasttext)
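As a quick sanity check (a minimal sketch; the 300-dimensional embedding size follows the fastText vectors used throughout this guide), you can confirm that the pretrained embeddings and the transformation matrix are dimensionally compatible:
# the transformation matrix should be square with the same dimension as the word vectors
dim(word_vectors) # V x 300, with V the vocabulary size
dim(transform)    # 300 x 300
stopifnot(ncol(word_vectors) == nrow(transform))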
Load data
# ParlaMint data - some preparation
# --------------------------------
data <- readRDS("data/parlamint_it.rds")
# for preprocessing: trim documents by length, remove punctuation and symbols,
# and remove infrequent tokens and stops
corpus <- corpus(data) %>%
  corpus_trim(what = "documents",
              min_ntoken = 10)
toks <- tokens(corpus, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower()
# without stops
toks_nostop <- tokens_select(toks, pattern = stopwords("it"),
                             selection = "remove", min_nchar = 3)
# only use features that appear at least 10 times in the corpus
feats <- dfm(toks_nostop, tolower = TRUE, verbose = FALSE) %>%
  dfm_trim(min_termfreq = 10) %>% featnames()
toks_nostop <- tokens_select(toks_nostop, feats, padding = TRUE)
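The analyses below rely on a few document-level variables from the ParlaMint data, namely government (government vs. opposition indicator), Moy (year-month) and months (a half-year period indicator). As a small sketch (assuming these docvars ship with parlamint_it.rds, as their use below implies), you can verify that they are present before proceeding:
# the downstream analyses assume these docvars are available
stopifnot(all(c("government", "Moy", "months") %in% names(docvars(toks_nostop))))
head(docvars(toks_nostop)[, c("government", "Moy", "months")])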
Discussion of immigration
We now analyze how the large and unexpected influx of refugees to Southern Europe starting in September 2015 affected discussions around immigration in the Italian parliament, and how this semantic shift differed across government and opposition parties at the time.
Nearest neighbors of immigr* across government and opposition parties
A good first exploratory step is to analyze the nearest neighbors of the ALC embeddings by group, i.e. the features with the highest cosine similarity to each group embedding, using conText::get_nns() (a wrapper function for conText::nns()). In our example, we are interested in the nearest neighbors of the ALC embedding of the word stem immigr across government and opposition parties and across time. We use the candidates argument to limit the set of features that get_nns() considers as candidate nearest neighbors. In our case we limit candidates to those features that appear in the context window around the target term immigr (we could also allow this set to incorporate the entire corpus or all features in the pretrained embeddings).
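To make that parenthetical concrete, the two alternative candidate sets could be built as follows (a sketch; either vector would simply be passed to the candidates argument of get_nns() below):
# candidates drawn from the entire corpus rather than the immigr* context windows
feats_corpus <- featnames(dfm(toks_nostop))
# or all features available in the pretrained fastText embeddings
feats_pretrained <- rownames(word_vectors)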
The results suggest that government and opposition parties differ little in their connotation of “immigration” across the entire sample period. Yet, speakers in the Italian parliament were more prone to speak of “immigration” in the context of “sustainability”, “incentives” or “social security” right after the refugee shock, whereas they were more likely to speak of general immigration issues (“applicants”, “workers”, “immigration”) in other months.
target_toks <- tokens_context(x = toks_nostop, pattern = "immigr*", window = 5L)
## 377 instances of "immigrati" found.
## 19 instances of "immigrato" found.
## 280 instances of "immigrazione" found.
## 15 instances of "immigrazioni" found.
feats <- featnames(dfm(target_toks))
# nearest neighbors
# ---------------------------------
# by government vs. opposition
target_nns <- get_nns(x = target_toks, N = 10,
                      groups = docvars(target_toks, 'government'),
                      candidates = feats,
                      pre_trained = word_vectors,
                      transform = TRUE,
                      transform_matrix = transform,
                      bootstrap = FALSE) %>%
  lapply(., "[[", 2) %>%
  do.call(rbind, .) %>%
  as.data.frame()
target_nns[, 1:5]
## V1 V2 V3 V4 V5
## 1 dell'immigrazione richiedenti all'immigrazione immigrazione emergenziale
## 0 dell'immigrazione richiedenti all'immigrazione immigrazione l'immigrazione
# by month
target_nns <- get_nns(x = target_toks, N = 10,
                      groups = docvars(target_toks, 'Moy'),
                      candidates = feats,
                      pre_trained = word_vectors,
                      transform = TRUE,
                      transform_matrix = transform,
                      bootstrap = FALSE) %>%
  lapply(., "[[", 2) %>%
  do.call(rbind, .) %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "Moy") %>%
  arrange(lubridate::ym(Moy))
target_nns[19:25, 1:5]
## Moy V1 V2 V3 V4
## 19 2015-07 richiedenti emergenziale richiedente chiediamo
## 20 2015-08 richiedenti richiedente lavoratori migranti
## 21 2015-09 richiedenti emergenziale richiedente pregiudiziale
## 22 2015-10 dell'immigrazione immigrazione all'immigrazione l'immigrazione
## 23 2015-11 ventimiglia francia invadere respingere
## 24 2015-12 incentiva previdenziali emergenziale sostenibile
## 25 2016-01 richiedenti richiedente migranti immigrazione
Nearest neighbors cosine similarity ratios across parties and time
Another exploratory exercise is to compute the cosine similarity ratio between group embeddings and features using conText::get_nns_ratio() (a wrapper function for conText::nns_ratio()). Given ALC embeddings for two groups, get_nns_ratio() first computes, for any given feature, the similarity between that feature and each group embedding, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group. Values larger (smaller) than 1 mean the feature is more (less) discriminant of the group in the numerator (denominator). Use the numerator argument to define which group represents the numerator in this ratio. If N is defined, the ratio is computed for the union of the top N nearest neighbors.
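As a toy illustration of this interpretation (the numbers are invented, not taken from the data):
# suppose a feature has these cosine similarities with the two group embeddings
cos_gov <- 0.62   # similarity with the "Government" ALC embedding (numerator)
cos_opp <- 0.48   # similarity with the "Opposition" ALC embedding (denominator)
cos_gov / cos_opp # ~1.29 > 1: the feature is more discriminant of government speech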
In this example we look at period-specific cosine similarity ratios for immigr* across government and opposition parties, particularly in the months right before and right after the 2015 refugee crisis reached Italy. The results suggest that both parliamentary camps discussed issues of immigration in similar ways in early 2015, often sharing nearest neighbors such as emergency (emergenziale) or applicants (richiedenti). In the later months of 2015, in contrast, the vocabularies of government and opposition parties differ radically. While opposition parties still seem to talk about immigration in more general terms (e.g. invoking terms lexically related to immigrazione), government parties now mention normative challenges of immigration as well as legal constraints, e.g. the Schengen area or the "Bossi-Fini law".
plotfun <- function(period){
  # subset the tokens object to the period we want
  temp <- tokens_subset(target_toks, months == period)
  # features as candidates
  feats <- featnames(dfm(target_toks))
  # adjust docvars for plotting purposes
  docvars(temp)$Government <- ifelse(docvars(temp)$government == 1, "Government", "Opposition")
  set.seed(111)
  target_nns_ratio <- get_nns_ratio(x = temp,
                                    N = 10,
                                    groups = docvars(temp, 'Government'),
                                    numerator = "Government",
                                    candidates = feats,
                                    pre_trained = word_vectors,
                                    transform = TRUE,
                                    transform_matrix = transform,
                                    bootstrap = TRUE,
                                    num_bootstraps = 100,
                                    permute = TRUE,
                                    num_permutations = 100,
                                    verbose = FALSE)
  plot <- plot_nns_ratio(x = target_nns_ratio, alpha = 0.05, horizontal = TRUE)
  return(plot)
}
(plot_20150106 <- plotfun(period = 2))
## starting bootstraps
## done with bootstraps
## starting permutations
## done with permutations
(plot_20150912 <- plotfun(period = 3))
## starting bootstraps
## done with bootstraps
## starting permutations
## done with permutations
Embedding regression
Finally, we evaluate the trend in semantic differences across government and opposition parties around the 2015 refugee crisis using embedding regression. conText::conText() uses ALC embeddings within a regression-style framework, i.e. it allows us to examine covariate effects on embeddings beyond discrete group variables, or while controlling for other covariates.
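For illustration, controlling for an additional covariate only requires extending the formula. The sketch below uses a hypothetical docvar party_family that is not created in this guide; everything else mirrors the call used next.
# hypothetical sketch: add a second covariate to the embedding regression
# ("party_family" is an illustrative docvar name, not part of the data preparation above)
model_multi <- conText(formula = . ~ government + party_family,
                       data = target_toks,
                       pre_trained = word_vectors,
                       transform = TRUE,
                       transform_matrix = transform,
                       bootstrap = TRUE,
                       num_bootstraps = 100,
                       permute = TRUE,
                       num_permutations = 100,
                       window = 5,
                       verbose = FALSE)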
In our example we estimate semantic differences around issues of immigration across government and opposition parties and by periods. In line with the earlier results on cosine similarity ratios, the normed regression estimates shown below clearly indicate that speakers from different parliamentary camps differ throughout the entire period, and most strongly in the months between September and December 2015—a period with large and unexpected waves of refugees arriving in Southern Europe.
# estimate embedding regression by period
set.seed(2021L)
models <- lapply(unique(docvars(target_toks, 'months')), function(j){
  conText(formula = . ~ government,
          data = tokens_subset(target_toks, months == j),
          pre_trained = word_vectors,
          transform = TRUE,
          transform_matrix = transform,
          stratify = TRUE,
          bootstrap = TRUE,
          num_bootstraps = 100,
          permute = TRUE,
          num_permutations = 100,
          hard_cut = FALSE,
          window = 5,
          case_insensitive = TRUE,
          verbose = TRUE)
})
## total observations included in regression: 174
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 1.77527 0.1794097 1.492813 2.046208 0.1
## total observations included in regression: 66
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.761237 0.2458782 2.395422 3.136497 0.03
## total observations included in regression: 130
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 1.793092 0.1682247 1.546029 2.08536 0.07
## total observations included in regression: 34
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 4.046238 0.5333392 3.244508 4.941884 0
## total observations included in regression: 77
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.292316 0.2156901 1.989034 2.698041 0
## total observations included in regression: 44
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.802802 0.3245191 2.299766 3.357747 0.07
## total observations included in regression: 103
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.191998 0.1770688 1.918852 2.519986 0
## total observations included in regression: 63
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 3.05894 0.3204981 2.535519 3.622745 0.15
# plot these normed beta estimates
plot_tibble <- lapply(models, function(i) i@normed_coefficients) %>%
  do.call(rbind, .) %>%
  mutate(period = factor(seq(1, 8), labels = c("2014-01/06", "2014-07/12",
                                               "2015-01/08", "2015-09/12",
                                               "2016-01/06", "2016-07/12",
                                               "2017-01/06", "2017-07/12")))
ggplot(data = plot_tibble,
       aes(x = period,
           y = normed.estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower.ci,
                    ymax = upper.ci),
                width = 0.5) +
  geom_vline(xintercept = 3.5, linetype = "dashed") +
  labs(x = "",
       title = "Norm of Difference between Government\nand Opposition ALC embeddings of 'immigr*'",
       y = latex2exp::TeX("Norm of $\\hat{\\beta}$")) +
  theme_bw()
2. Working with the Training Pipeline
All training code is available here. The relevant code files are:
- The 000_master.sbatch file is the master file that executes the training for all languages. For each language, it specifies the exact Wikipedia corpus used (the month of the Wikipedia file) and the minimum frequency cutoff enforced when training the underlying embeddings as well as the transformation matrix.
- The 000_all_latin.sbatch file is used for all languages written in the Latin, Cyrillic, Hebrew or Greek scripts.
- The 000_all_ja.sbatch file is used for Japanese and uses the MeCab tokenizer.
- The 000_all_zh.sbatch file is used for (traditional) Chinese and uses the Stanford word segmenter.
- The 000_all_icu.sbatch file is used for all remaining languages and uses the ICU tokenizer.
When training the models (fastText, GloVe and their respective ALC embeddings), we apply a hard minimum frequency threshold to the respective vocabulary. This helps to clean out noisy parts of the corpus and thus significantly improves the fit of all models. We base our choice of the language-specific threshold on the size of the Wikipedia corpus and vocabulary for each language. Specifically, we impose a minimum frequency cutoff of 50 for English, 25 for medium-sized languages (i.e. German, Spanish, Italian, French, Russian, Swedish and Dutch), 15 for small-to-medium-sized languages (i.e. Czech, Finnish, Hungarian, Portuguese, Korean and Arabic) and 10 for all smaller languages. As this step turned out to be crucial for the out-of-sample performance of our quantities, scholars who use our code pipeline to train resources from Wikipedia for their language may want to experiment with the size of the threshold in their particular case.
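One quick way to gauge how different cutoffs affect the vocabulary before committing to a full training run is sketched below (wiki_pp_sample.txt is a placeholder for a sample of the preprocessed Wikipedia text; the cutoffs are those mentioned above):
# sketch: vocabulary size under different minimum-frequency cutoffs
library(quanteda)
wiki_dfm <- dfm(tokens(readLines("wiki_pp_sample.txt"))) # placeholder file
sapply(c(10, 15, 25, 50), function(cutoff) {
  nfeat(dfm_trim(wiki_dfm, min_termfreq = cutoff))
})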
The logic of the training files is as follows:
- We first download and extract the Wikipedia corpus using the WikiExtractor. This provides us with the raw, unprocessed text of the relevant Wikipedia corpus, saved in the wiki.txt file.
- We then preprocess the raw text by removing punctuation (outside of tokens), setting it to lower case and removing extra white space. We further tokenize the raw text (procedures depend on the language). This preprocessed text is saved in ${LAN}wiki_ppfinal.txt.
- We now train the GloVe model using this preprocessed and tokenized text. We set the language-specific minimum word frequency described in our manuscript, a vector size of 300 and a context size of 5. We further impose parameters similar to those in Pennington, Socher and Manning (2014) for the weighting function, together with a fixed maximum number of iterations.
- We then train fastText models on our preprocessed and tokenized text using a context window of 5 and setting the dimension of the word vectors to 300. For the dictionary, we impose the minimum frequency of occurrence in the entire corpus described in our manuscript, and use negative sampling of size 10.
- Finally, for both fastText and GloVe embeddings, we train ALC embeddings to obtain the relevant transformation matrices using trainA_chunks_wiki.R. To handle the large size of the respective corpora, we use a chunk-based learning approach. That is, we read in the relevant preprocessed corpus chunk by chunk and perform the following operations on each chunk (a schematic code sketch follows after the next paragraph):
  1. Retain the vocabulary with a minimum term frequency of the language-specific threshold.
  2. Create a feature co-occurrence matrix (FCM) using conText, with a window size of 5 and equal weighting.
  3. Obtain a corresponding feature embedding matrix that provides additive context-specific feature embeddings, averaged over all embedding instances in the given chunk.
To obtain the un-transformed additive embeddings for all features across the entire corpus, we then simply average the chunk-specific additive embeddings for each feature across chunks. This is possible because the additive context embeddings from step 3 are themselves simple averages of the respective instance-specific additive context embeddings within a given chunk. We do this for all features that appear with a frequency of at least the language-specific threshold across the entire corpus. Finally, we train the corresponding transformation matrix with log-weighting.
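The following is a schematic sketch of the chunk-level operations described above, not the trainA_chunks_wiki.R script itself. The file path and the pretrained embedding matrix wv (GloVe or fastText, with features as row names) are placeholders; the quanteda and conText calls (fcm(), fem(), compute_transform()) are the functions the steps above rely on.
library(quanteda)
library(conText)

# placeholder: one chunk of the preprocessed, tokenized Wikipedia text
chunk_toks <- tokens(readLines("chunks/chunk_01.txt"))

# 1. retain vocabulary above the language-specific frequency threshold (here 25, as for Italian)
feats <- featnames(dfm_trim(dfm(chunk_toks), min_termfreq = 25))
chunk_toks <- tokens_select(chunk_toks, feats, padding = TRUE)

# 2. feature co-occurrence matrix with a window of 5 and equal weighting
chunk_fcm <- fcm(chunk_toks, context = "window", window = 5,
                 count = "frequency", tri = FALSE)

# 3. additive (un-transformed) context-specific feature embeddings for this chunk
chunk_fem <- fem(chunk_fcm, pre_trained = wv, transform = FALSE)

# across chunks, these chunk-specific feature embeddings are averaged feature by feature;
# the transformation matrix is then estimated with log-weighting, e.g.
transform_matrix <- compute_transform(x = chunk_fcm, pre_trained = wv, weighting = "log")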
We highly recommend training the resources on a high-performance server with sufficient RAM and support for parallelization. When using the training pipeline, remember to adjust all directories in the training code.