Case File
d-16874 | House Oversight | Other

Methodology for Lexicon Controls and Frequency Analysis of Historical Terms

Date: November 11, 2025
Source: House Oversight
Reference: House Oversight #017023
Pages: 1
Persons: 0
Integrity: No Hash Available

Summary

The passage describes technical validation methods for a linguistic study, mentioning heads of state, treaties, and country name changes, but provides no actionable leads, allegations, or connections. Uses known historical terms as positive controls for frequency spikes. Analyzes n‑gram frequency around dates of presidential terms, treaty signings, and country name chan Reports high correlation be

Tags

methodology, data-validation, house-oversight, linguistics, historical-ngrams


Extracted Text (OCR)

EFTA Disclosure
Text extracted via OCR from the original document. May contain errors from the scanning process.
language lexica, we tried whenever possible to have the annotation performed by a third party with no knowledge of the analyses we were undertaking.

III.3. Controls

To confirm the quality of our data in the English language, we sought positive controls in the form of words that should exhibit very strong peaks around a date of interest. We used three categories of such words: heads of state (‘President Truman’), treaties (‘Treaty of Versailles’), and geographical name changes (‘Byelorussia’ to ‘Belarus’). We used Wikipedia as a primary source of such words, and manually curated the lists as described below. We computed the time series of each n-gram, centered it on the date of interest (the year when the person became president, for instance), and normalized the time series by overall frequency. Then, we took the mean trajectory for each of the three cohorts and plotted it in Figure S5.

The list of heads of state includes all US presidents and British monarchs who gained power in the 19th or 20th centuries (we removed ambiguous names, such as ‘President Roosevelt’). The list of treaties is taken from the list of 198 treaties signed in the 19th or 20th centuries (S7), but we kept only the 121 names that referred to only one known treaty and that have nonzero time series. The list of country name changes is taken from Ref S8. The lists are given in APPENDIX.

The correspondence between the expected and observed presence of peaks was excellent. 42 out of 44 heads of state had a frequency increase of over 10-fold in the decade after they took office (expected if the year of interest was random: 1). Similarly, 85 out of 92 treaties had a frequency increase of over 10-fold in the decade after they were signed (expected: 2).
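The centering, normalization, and fold-increase check described for heads of state and treaties can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy cohort, and its frequencies are all made up, "overall frequency" is assumed to mean the n-gram's mean frequency over all years in its series, and the fold increase is assumed to be the mean frequency in the decade after the year of interest divided by the mean in the decade before.

```python
from statistics import mean


def center_and_normalize(series, event_year, window=10):
    """Return {offset: frequency / overall mean} for offsets in [-window, window].

    `series` maps year -> frequency. Dividing by the mean over all years
    is an assumed reading of "normalized by overall frequency".
    """
    overall = mean(series.values())
    if overall == 0:
        raise ValueError("all-zero time series")
    return {
        offset: series.get(event_year + offset, 0.0) / overall
        for offset in range(-window, window + 1)
    }


def mean_trajectory(centered):
    """Average several centered trajectories offset by offset."""
    offsets = sorted(centered[0])
    return {k: mean(c[k] for c in centered) for k in offsets}


def fold_increase(series, event_year, window=10):
    """Mean frequency in the decade after the event over the decade before (assumed definition)."""
    before = mean(series.get(event_year - k, 0.0) for k in range(1, window + 1))
    after = mean(series.get(event_year + k, 0.0) for k in range(1, window + 1))
    return float("inf") if before == 0 else after / before


# Hypothetical cohort: name -> (yearly frequency series, year of interest).
cohort = {
    "term A": ({y: 1.0 if y < 1945 else 50.0 for y in range(1930, 1961)}, 1945),
    "term B": ({y: 2.0 if y < 1919 else 4.0 for y in range(1900, 1931)}, 1919),
}

trajectory = mean_trajectory(
    [center_and_normalize(s, year) for s, year in cohort.values()]
)
passing = [name for name, (s, year) in cohort.items() if fold_increase(s, year) > 10]
print(passing)
```

In this toy cohort only "term A" clears the 10-fold threshold. Running the same count with randomly chosen years of interest would give a baseline comparable to the "expected" figures quoted in the text.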
Last, 23 out of 28 new country names became more frequent than the country name they replaced within 3 years of the name change; exceptions include Kampuchea/Cambodia (the name Cambodia was later reinstated), Iran/Persia (Iran is still today referred to as Persia in many contexts) and Sri Lanka/Ceylon (Ceylon is also a popular tea).

III.4. Lexicon Analysis

III.4A. Estimation of the number of 1-grams defined in leading dictionaries of the English language.

(a) American Heritage Dictionary of the English Language, 4th Edition (2000)

We are indebted to the editorial staff of AHD4 for providing us the list of the 153,459 headwords that make up the entries of AHD4. However, many headwords are not single words (“preferential voting” or “men’s room”), and others are listed as many times as there are grammatical categories (“to console”, the verb; “console”, the piece of furniture). Among those entries, we find 116,156 unique 1-grams (such as “materialism” or “extravagate”).

(b) Webster's Third New International Dictionary (2002)

The editorial staff communicated to us the number of “boldface entries” of the dictionary, which are taken to be the number of n-grams defined: 476,330. The editorial staff also communicated the number of multi-word entries, 74,000, out of a total number of entries of 275,000. They estimate a lower bound of multi-word entries at 27% of the entries. Therefore, we estimate an upper bound of unique 1-grams defined by this dictionary as (1 − 0.27) × 476,330, which is approximately 348,000.
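The two dictionary estimates can be checked with a short sketch. The numbers are those quoted in the text; the helper `unique_1grams`, its whitespace-based definition of a 1-gram, and the sample headword list are illustrative assumptions. The ≈348,000 figure corresponds to the single-word share, (1 − 0.27), of the 476,330 boldface entries.

```python
def unique_1grams(headwords):
    """Keep single-token headwords and collapse repeated listings.

    Treating "no internal whitespace" as the 1-gram test is an assumption.
    """
    return {h for h in headwords if len(h.split()) == 1}


# Hypothetical headword sample: two multi-word entries, one headword
# listed twice (verb and noun), two ordinary single words.
sample = [
    "preferential voting", "men's room",
    "console", "console",
    "materialism", "extravagate",
]
print(len(unique_1grams(sample)))

# Webster's Third: figures as quoted in the text.
boldface_entries = 476_330
multi_word_entries = 74_000
total_entries = 275_000

multi_word_share = round(multi_word_entries / total_entries, 2)  # 0.27
upper_bound = (1 - multi_word_share) * boldface_entries
print(round(upper_bound, -3))
```

Because 27% is a lower bound on the multi-word share, the remaining 73% of boldface entries is an upper bound on the number of 1-grams.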
