language lexica, we tried whenever possible to have the annotation performed by a third party with no
knowledge of the analyses we were undertaking.
III.3. Controls
To confirm the quality of our data in the English language, we sought positive controls in the form of
words that should exhibit very strong peaks around a date of interest. We used three categories of such
words: heads of state (‘President Truman’), treaties (‘Treaty of Versailles’), and geographical name
changes (‘Byelorussia’ to ‘Belarus’). We used Wikipedia as a primary source of such words, and manually
curated the lists as described below. We computed the time series of each n-gram, centered it on the date
of interest (the year when the person became president, for instance), and normalized the time series by
its overall frequency. Then, we took the mean trajectory for each of the three cohorts, and plotted it in
Figure S5.
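The centering, normalization and averaging steps described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function names are hypothetical, each time series is assumed to be a mapping from year to frequency, and "overall frequency" is assumed here to mean the mean frequency of the n-gram over all years in its series.

```python
def center_and_normalize(series, event_year, offsets):
    """Return the series sampled at event_year + offset, divided by its overall mean.

    series: dict mapping year -> frequency (assumed data layout).
    Missing years are treated as frequency 0.0 (an assumption).
    """
    if not series:
        raise ValueError("empty time series")
    overall = sum(series.values()) / len(series)
    if overall == 0:
        raise ValueError("time series is identically zero")
    return [series.get(event_year + k, 0.0) / overall for k in offsets]


def mean_trajectory(cohort, offsets):
    """Average the centered, normalized trajectories of a cohort.

    cohort: list of (series, event_year) pairs, one per n-gram.
    """
    rows = [center_and_normalize(s, year, offsets) for s, year in cohort]
    # zip(*rows) groups the values of all n-grams at the same offset.
    return [sum(col) / len(rows) for col in zip(*rows)]


# Toy data only; these are not measured frequencies.
toy_cohort = [
    ({1900: 1.0, 1901: 3.0}, 1901),
    ({1950: 2.0, 1951: 2.0}, 1951),
]
toy_mean = mean_trajectory(toy_cohort, offsets=[-1, 0])
```

Dividing by the overall mean keeps very frequent n-grams from dominating the cohort average; other normalizations (for example, by the peak value) would also be consistent with the description.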
The list of heads of state includes all US presidents and British monarchs who gained power in the 19th or
20th centuries (we removed ambiguous names, such as ‘President Roosevelt’). The list of treaties is taken
from the list of 198 treaties signed in the 19th or 20th centuries (S7); but we kept only the 121 names that
referred to only one known treaty, and that have nonzero time series. The list of country name changes is
taken from Ref S8. The lists are given in the Appendix.
The correspondence between the expected and observed presence of peaks was excellent. 42 out of 44
heads of state had a frequency increase of over 10-fold in the decade after they took office (expected if
the year of interest was random: 1). Similarly, 85 out of 92 treaties had a frequency increase of over 10-
fold in the decade after they were signed (expected: 2). Last, 23 out of 28 new country names became
more frequent than the country name they replaced within 3 years of the name change; exceptions
include Kampuchea/Cambodia (the name Cambodia was later reinstated), Iran/Persia (Iran is still today
referred to as Persia in many contexts) and Sri Lanka/Ceylon (Ceylon is also a popular tea).
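The two peak criteria above can be sketched in code. This is a hedged illustration under stated assumptions: the function names are hypothetical, the fold increase is assumed to compare the mean frequency in the ten years after the event with the mean in the ten years before it, and the name-change test is assumed to check each year from the change year up to three years later. The exact definitions used for the reported counts are not spelled out in the text.

```python
def fold_increase(series, event_year, span=10):
    """Ratio of mean frequency in the `span` years after vs. before the event.

    series: dict mapping year -> frequency; missing years count as 0.0.
    The event year itself is excluded from both windows (an assumption).
    """
    before = sum(series.get(event_year - k, 0.0) for k in range(1, span + 1)) / span
    after = sum(series.get(event_year + k, 0.0) for k in range(1, span + 1)) / span
    if before == 0:
        # No prior usage: treat any later usage as an unbounded increase.
        return float("inf") if after > 0 else 0.0
    return after / before


def overtakes_within(new_series, old_series, change_year, max_lag=3):
    """True if the new name is more frequent than the old one in any year
    from change_year to change_year + max_lag inclusive."""
    return any(
        new_series.get(change_year + k, 0.0) > old_series.get(change_year + k, 0.0)
        for k in range(max_lag + 1)
    )


# Toy data only; these are not measured frequencies.
toy_series = {year: 1.0 for year in range(1935, 1945)}
toy_series.update({year: 20.0 for year in range(1946, 1956)})
toy_fold = fold_increase(toy_series, 1945)

toy_new = {1991: 1.0, 1992: 5.0}
toy_old = {1991: 4.0, 1992: 3.0}
toy_overtakes = overtakes_within(toy_new, toy_old, 1991)
```

An n-gram would then count as a positive control hit when `fold_increase(...) > 10`, and a name change when `overtakes_within(...)` is true.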
III.4. Lexicon Analysis
III.4A. Estimation of the number of 1-grams defined in leading
dictionaries of the English language.
(a) American Heritage Dictionary of the English Language, 4th Edition (2000)
We are indebted to the editorial staff of AHD4 for providing us with the list of the 153,459 headwords that
make up the entries of AHD4. However, many headwords are not single words (“preferential voting” or
“men’s room”), and others are listed as many times as there are grammatical categories (“to console”, the
verb; “console”, the piece of furniture).
Among those entries, we find 116,156 unique 1-grams (such as “materialism” or “extravagate”).
(b) Webster's Third New International Dictionary (2002)
The editorial staff communicated to us the number of “boldface entries” of the dictionary, which is taken
to be the number of n-grams defined: 476,330.
The editorial staff also communicated the number of multi-word entries: 74,000 out of a total of
275,000 entries. They estimate a lower bound of multi-word entries at 27% of the entries.
Therefore, we estimate an upper bound of unique 1-grams defined by this dictionary as (1 - 0.27)*476,330,
which is approximately 348,000.
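The arithmetic behind this estimate can be checked directly. The sketch below uses only the figures quoted above; the variable names are illustrative. Because 27% is a lower bound on the multi-word share, the remaining 73% is an upper bound on the single-word share, and applying it to the boldface count gives the figure of roughly 348,000.

```python
# Figures quoted in the text for Webster's Third (2002).
boldface_entries = 476_330
multiword_entries = 74_000
total_entries = 275_000

# Multi-word share of entries, about 0.269, quoted as a 27% lower bound.
multiword_fraction = multiword_entries / total_entries

# Upper bound on single-word entries (1-grams) among the boldface entries.
upper_bound_1grams = (1 - 0.27) * boldface_entries

print(f"multi-word fraction: {multiword_fraction:.3f}")
print(f"upper bound on 1-grams: {upper_bound_1grams:,.0f}")
```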