Our code examples refer to two related but different operations:
1. Using the multilanguage embeddings we provide, especially with reference to the conText R package and embedding regression.
2. Using the pipeline we have created to produce embeddings and the transformation matrix for any of the 157 fastText languages that you need.
We rely on speeches in the Italian parliament (2013-2020) taken from ParlaMint to illustrate a possible use case of our quantities. By showing how government and opposition parties differentially adjusted their speeches around issues of immigration following the 2015 refugee crisis in Europe, we illustrate how our ALC resources can be used to make inferences about semantic differences across time and groups. Our resources are fully integrated with the conText R package. You can find more information on how to get started with conText here.
library(conText)
# other libraries used in this guide
library(quanteda)
library(dplyr)
library(data.table)
library(readr)
library(ggplot2)
# Transformation matrix
# --------------------------------
transform <- readRDS("data/fasttext_transform_itwiki_25.rds")
# fastText pretrained embeddings
# --------------------------------
not_all_na <- function(x) any(!is.na(x))
fasttext <- setDT(read_delim("data/fasttext_vectors_itwiki.vec",
                             delim = " ",
                             quote = "",
                             skip = 1,
                             col_names = FALSE,
                             col_types = cols())) %>%
  dplyr::select(where(not_all_na)) # remove last column, which is all NA
word_vectors <- as.matrix(fasttext, rownames = 1)
colnames(word_vectors) = NULL
rm(fasttext)
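As a quick, optional sanity check (this snippet is not part of the original pipeline), the dimensionality of the pretrained vectors should match the transformation matrix that conText expects; here both are 300-dimensional, matching the vector size used to train these resources:
# optional check: embedding dimension must match the transformation matrix
dim(word_vectors)  # features x embedding dimensions
dim(transform)     # embedding dimensions x embedding dimensions
stopifnot(ncol(word_vectors) == nrow(transform))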
# ParlaMint data - some preparation
# --------------------------------
data <- readRDS("data/parlamint_it.rds")
# for preprocessing: trim documents by length, remove punctuation and symbols,
# and remove infrequent tokens and stops
corpus <- corpus(data) %>%
  corpus_trim(what = "documents",
              min_ntoken = 10)
toks <- tokens(corpus, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower()
# without stops
toks_nostop <- tokens_select(toks, pattern = stopwords("it"),
                             selection = "remove", min_nchar = 3)
# only use features that appear at least 10 times in the corpus
feats <- dfm(toks_nostop, tolower = TRUE, verbose = FALSE) %>%
  dfm_trim(min_termfreq = 10) %>%
  featnames()
toks_nostop <- tokens_select(toks_nostop, feats, padding = TRUE)
We now analyze how the large and unexpected influx of refugees to Southern Europe starting in September 2015 affected discussions around immigration in the Italian parliament, and how this semantic shift differed across government and opposition parties at the time.
immigr* across government and opposition parties

A good first exploratory step is to analyze the nearest neighbors of the ALC embeddings by group, i.e. the features with the highest cosine similarity to each group embedding, using conText::get_nns() (a wrapper function for conText::nns()). In our example, we are interested in the nearest neighbors of the ALC embedding of the word stem immigr across government and opposition parties and across time. We use the candidates argument to limit the set of features we want get_nns() to consider as candidate nearest neighbors. In our case we limit candidates to those features that appear in the context window around the target term immigr (we could also allow this set to include the entire corpus or all features in the pretrained embeddings).
The results suggest that government and opposition parties differ little in their connotation of “immigration” across the entire sample period. Yet, speakers in the Italian parliament were more prone to speak of “immigration” in the context of “sustainability”, “incentives” or “social security” right after the refugee shock, whereas they were more likely to speak of general immigration issues (“applicants”, “workers”, “immigration”) in other months.
target_toks <- tokens_context(x = toks_nostop, pattern = "immigr*", window = 5L)
## 377 instances of "immigrati" found.
## 19 instances of "immigrato" found.
## 280 instances of "immigrazione" found.
## 15 instances of "immigrazioni" found.
feats <- featnames(dfm(target_toks))
# nearest neighbors
# ---------------------------------
# by government vs. opposition
target_nns <- get_nns(x = target_toks, N = 10,
                      groups = docvars(target_toks, 'government'),
                      candidates = feats,
                      pre_trained = word_vectors,
                      transform = TRUE,
                      transform_matrix = transform,
                      bootstrap = FALSE) %>%
  lapply(., "[[", 2) %>%
  do.call(rbind, .) %>%
  as.data.frame()
target_nns[, 1:5]
## V1 V2 V3 V4 V5
## 1 dell'immigrazione richiedenti all'immigrazione immigrazione emergenziale
## 0 dell'immigrazione richiedenti all'immigrazione immigrazione l'immigrazione
# by month
target_nns <- get_nns(x = target_toks, N = 10,
                      groups = docvars(target_toks, 'Moy'),
                      candidates = feats,
                      pre_trained = word_vectors,
                      transform = TRUE,
                      transform_matrix = transform,
                      bootstrap = FALSE) %>%
  lapply(., "[[", 2) %>%
  do.call(rbind, .) %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "Moy") %>%
  arrange(lubridate::ym(Moy))
target_nns[19:25, 1:5]
## Moy V1 V2 V3 V4
## 19 2015-07 richiedenti emergenziale richiedente chiediamo
## 20 2015-08 richiedenti richiedente lavoratori migranti
## 21 2015-09 richiedenti emergenziale richiedente pregiudiziale
## 22 2015-10 dell'immigrazione immigrazione all'immigrazione l'immigrazione
## 23 2015-11 ventimiglia francia invadere respingere
## 24 2015-12 incentiva previdenziali emergenziale sostenibile
## 25 2016-01 richiedenti richiedente migranti immigrazione
Another exploratory exercise is to compute the cosine similarity ratio between group embeddings and features using conText::get_nns_ratio() (a wrapper function for conText::nns_ratio()). Given ALC embeddings for two groups, get_nns_ratio() computes, for each candidate feature, the similarity between that feature and each group embedding, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group: values larger (smaller) than 1 mean the feature is more (less) discriminant of the group in the numerator (denominator). Use the numerator argument to define which group represents the numerator in this ratio. If N is defined, this ratio is computed for the union of the top N nearest neighbors.
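To make the ratio concrete, here is a minimal sketch with hypothetical toy vectors (cos_sim and the three vectors below are illustrative and are not objects created elsewhere in this guide):
# toy illustration of the cosine similarity ratio (hypothetical vectors)
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
feature_vec    <- c(0.2, 0.5, 0.1)  # embedding of a candidate feature
government_vec <- c(0.3, 0.4, 0.0)  # ALC embedding of the "Government" group
opposition_vec <- c(0.1, 0.2, 0.6)  # ALC embedding of the "Opposition" group
# a ratio > 1 means the feature is more discriminant of the numerator group
cos_sim(feature_vec, government_vec) / cos_sim(feature_vec, opposition_vec)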
In this example we look at period-specific cosine similarity ratios for immigr* across government and opposition parties, particularly in the months right before and right after the 2015 refugee crisis started in Italy. The results suggest that both parliamentary camps discussed issues of immigration in similar ways in early 2015, often sharing nearest neighbors such as "emergency" (emergenziale) or "applicants" (richiedenti). In the later months of 2015, in contrast, the vocabularies of government and opposition parties are radically different. While opposition parties still seem to talk about immigration in more general terms (e.g. invoking terms lexically related to immigrazione), government parties now mention normative challenges of immigration as well as legal constraints, e.g. the Schengen area or the "Bossi-Fini law".
plotfun <- function(period){
  # subset the tokens object to the period we want
  temp <- tokens_subset(target_toks, months == period)
  # features as candidates
  feats <- featnames(dfm(target_toks))
  # adjust docvars for plotting purposes
  docvars(temp)$Government <- ifelse(docvars(temp)$government == 1, "Government", "Opposition")
  set.seed(111)
  target_nns_ratio <- get_nns_ratio(x = temp,
                                    N = 10,
                                    groups = docvars(temp, 'Government'),
                                    numerator = "Government",
                                    candidates = feats,
                                    pre_trained = word_vectors,
                                    transform = TRUE,
                                    transform_matrix = transform,
                                    bootstrap = TRUE,
                                    num_bootstraps = 100,
                                    permute = TRUE,
                                    num_permutations = 100,
                                    verbose = FALSE)
  plot <- plot_nns_ratio(x = target_nns_ratio, alpha = 0.05, horizontal = TRUE)
  return(plot)
}
(plot_20150106 <- plotfun(period = 2))
## starting bootstraps
## done with bootstraps
## starting permutations
## done with permutations
(plot_20150912 <- plotfun(period = 3))
## starting bootstraps
## done with bootstraps
## starting permutations
## done with permutations
Finally, we evaluate the trend in semantic differences between government and opposition parties around the 2015 refugee crisis using embedding regression. conText::conText() uses ALC embeddings within a regression-style framework, i.e. it allows us to examine covariate effects on embeddings beyond discrete group variables, or while controlling for other covariates.
In our example we estimate semantic differences around issues of immigration across government and opposition parties and by period. In line with the earlier results on cosine similarity ratios, the normed regression estimates shown below clearly indicate that speakers from the two parliamentary camps differ throughout the entire period, and most strongly in the months between September and December 2015, a period with large and unexpected waves of refugees arriving in Southern Europe.
# estimate embedding regression by period
set.seed(2021L)
models <- lapply(unique(docvars(target_toks, 'months')), function(j){
  conText(formula = . ~ government,
          data = tokens_subset(target_toks, months == j),
          pre_trained = word_vectors,
          transform = TRUE,
          transform_matrix = transform,
          stratify = TRUE,
          bootstrap = TRUE,
          num_bootstraps = 100,
          permute = TRUE,
          num_permutations = 100,
          hard_cut = FALSE,
          window = 5,
          case_insensitive = TRUE,
          verbose = TRUE)
})
## total observations included in regression: 174
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 1.77527 0.1794097 1.492813 2.046208 0.1
## total observations included in regression: 66
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.761237 0.2458782 2.395422 3.136497 0.03
## total observations included in regression: 130
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 1.793092 0.1682247 1.546029 2.08536 0.07
## total observations included in regression: 34
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 4.046238 0.5333392 3.244508 4.941884 0
## total observations included in regression: 77
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.292316 0.2156901 1.989034 2.698041 0
## total observations included in regression: 44
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.802802 0.3245191 2.299766 3.357747 0.07
## total observations included in regression: 103
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 2.191998 0.1770688 1.918852 2.519986 0
## total observations included in regression: 63
## starting bootstrapping
## done with bootstrapping
## starting permutations
## done with permutations
## coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1 government 3.05894 0.3204981 2.535519 3.622745 0.15
# plot these normed beta estimates
plot_tibble <- lapply(models, function(i) i@normed_coefficients) %>%
  do.call(rbind, .) %>%
  mutate(period = factor(seq(1, 8), labels = c("2014-01/06", "2014-07/12",
                                               "2015-01/08", "2015-09/12",
                                               "2016-01/06", "2016-07/12",
                                               "2017-01/06", "2017-07/12")))

ggplot(data = plot_tibble,
       aes(x = period,
           y = normed.estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower.ci,
                    ymax = upper.ci),
                width = 0.5) +
  geom_vline(xintercept = 3.5, linetype = "dashed") +
  labs(x = "",
       title = "Norm of Difference between Government\nand Opposition ALC embeddings of 'immigr*'",
       y = latex2exp::TeX("Norm of $\\hat{\\beta}$")) +
  theme_bw()
All training code is available here. Here is a list of the relevant code files:

The 000_master.sbatch file is the master file that executes the training for all languages. For each language it specifies the exact Wikipedia corpus used (the month of the Wikipedia file) and the minimum frequency cutoff enforced when training the underlying embeddings as well as the transformation matrix.
The 000_all_latin.sbatch file is used for all languages written using the Latin, Cyrillic, Hebrew or Greek scripts.
The 000_all_ja.sbatch file is used for Japanese and uses the MeCab tokenizer.
The 000_all_zh.sbatch file is used for (traditional) Chinese and uses the Stanford word segmenter.
The 000_all_icu.sbatch file is used for all remaining languages and uses the ICU tokenizer.

When training the models (fastText, GloVe and their respective ALC embeddings), we apply a hard minimal frequency threshold to the respective vocabulary. This helps to clean out noisy parts of the corpus and thus significantly improves the fit of all models. We base our choice of the language-specific threshold on the size of the Wikipedia corpus and vocabulary for each language. Specifically, we impose a minimal frequency cutoff of 50 for English, 25 for medium-sized languages (i.e. German, Spanish, Italian, French, Russian, Swedish and Dutch), 15 for small-to-medium-sized languages (i.e. Czech, Finnish, Hungarian, Portuguese, Korean, Arabic) and 10 for all smaller languages. As this step turned out to be crucial for the out-of-sample performance of our quantities, scholars who use our code pipeline to train resources from Wikipedia for their language might want to experiment with the size of the threshold in their particular case.
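As a rough illustration of what such a cutoff does (a minimal sketch, not part of our pipeline; my_toks and the value of 25 are hypothetical placeholders), the same quanteda functions used in the preprocessing steps above can be used to trim a tokenized corpus to features that clear the threshold:
# hypothetical example: keep only features appearing at least min_count times
min_count <- 25  # e.g. 25 for a medium-sized language; experiment with this value
vocab <- dfm(my_toks) %>%
  dfm_trim(min_termfreq = min_count) %>%
  featnames()
my_toks_trimmed <- tokens_select(my_toks, vocab, padding = TRUE)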
The logic of the training files is as follows:
1. The raw Wikipedia dump is first extracted into a plain-text wiki.txt file.
2. This text is preprocessed and tokenized, producing a ${LAN}wiki_ppfinal.txt file.
3. We train a GloVe model using this preprocessed and tokenized text. We set the language-specific minimal word frequency described in our manuscript, a vector size of 300 and a context size of 5, and we further impose similar parameters as in Pennington, Socher and Manning (2014).
4. Using the fastText and GloVe embeddings, we then train ALC embeddings to obtain the relevant transformation matrices using trainA_chunks_wiki.R. To handle the large size of the respective corpora, we use a chunk-based learning approach. That is, we read in the relevant preprocessed corpus chunk by chunk and compute, for each chunk, additive context embeddings with conText, with a window size of 5 and equal weighting.

To obtain the un-transformed additive embeddings for all features across the entire corpus, we then simply average the chunk-specific additive embeddings for each feature across the chunks. This is possible because the additive context embeddings computed per chunk are themselves simple averages of the respective instance-specific additive context embeddings in that chunk. We do this for all features appearing with a frequency of at least the language-specific threshold across the entire corpus. Finally, we train the corresponding transformation matrix with log-weighting.
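A minimal sketch of this averaging step, under assumed data structures: chunk_embeds is a hypothetical list with one feature-by-dimension matrix of chunk-specific additive context embeddings per chunk (features as row names); it is not created by any code in this guide.
# hypothetical sketch: average chunk-specific additive embeddings per feature
all_feats <- unique(unlist(lapply(chunk_embeds, rownames)))
avg_embeds <- t(sapply(all_feats, function(f) {
  # collect the feature's embedding from every chunk in which it appears
  rows <- lapply(chunk_embeds, function(m) if (f %in% rownames(m)) m[f, ] else NULL)
  rows <- Filter(Negate(is.null), rows)
  Reduce(`+`, rows) / length(rows)  # simple average across chunks
}))
# avg_embeds now has features in rows and embedding dimensions in columns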
We highly recommend training the resources on a high-performance server with sufficient RAM and support for parallelization. When using the training pipeline, remember to adjust all directories in the training code.