c. Using the online Wikipedia website, find all Wikipedia articles which are listed in the body
of the chosen Wikipedia lists.
d. Intersect the set of all articles belonging to the relevant Lists and Categories with the set
of people born 1800-1980. For people in both sets, append the occupation information.
e. Associate the records of these articles with the occupation.
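
Steps d and e can be sketched as a set intersection. This is a minimal illustration only, not the authors' implementation: the function name `attach_occupations`, the dictionary layout (article title mapped to a record with a `birth_year` field), and the `occupations` field are all assumptions made for the example.

```python
def attach_occupations(people, members_by_occupation, first_year=1800, last_year=1980):
    """Append occupation labels to people who also appear in a List or Category.

    people: dict mapping article title -> record dict with a "birth_year" key
            (hypothetical layout).
    members_by_occupation: dict mapping occupation label -> set of article
            titles collected from the chosen Lists and Categories.
    """
    # Keep only people born in the target range; copy records so the input is untouched.
    result = {
        title: dict(record, occupations=[])
        for title, record in people.items()
        if first_year <= record["birth_year"] <= last_year
    }
    for occupation, titles in members_by_occupation.items():
        # Intersect the List/Category members with the set of retained people.
        for title in result.keys() & titles:
            result[title]["occupations"].append(occupation)
    return result
```

People who appear in no List or Category keep an empty `occupations` list in this sketch; dropping them instead would be an equally plausible reading of the step.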
III.7.A.3 - Extraction of individuals appearing in Encyclopedia Britannica.
Encyclopedia Britannica is a hand-curated, high quality encyclopedic dataset with many detailed
biographical entries. We obtained, in a private communication, structured datasets from Encyclopedia
Britannica Inc. These datasets contain a complete record of all entries relating to individuals in the
Encyclopedia Britannica. Each record contains the birth and death of the person at hand, as well as a set of
information snippets summarizing the most critical biographical information available within the
encyclopedia.
For the analysis of fame, we extract, from the dataset provided by Encyclopedia Britannica Inc.,
records of individuals born between 1800 and 1980. For every person, we retain, as a measure of their
notability, a count of the number of biographical snippets present in the dataset. Figure S10b outlines the
number of records parsed from the Encyclopedia Britannica dataset, as well as the number of these
records ultimately retained for final analysis. Table S8 displays examples of records parsed in this step of
the analysis procedure.
3) Create a database of records referring to people born 1800-1980 in Encyclopedia
Britannica.
a. Using the internal database records provided by Encyclopedia Britannica Inc., find all
entries referring to individuals born 1700-1980. Only people born in 1800-1980 are used
for the purposes of fame analysis. People born in 1700-1799 are used to identify naming
ambiguities as described in section III.7.A.7 of this Supplementary Material.
b. For these entries, create a record identified by a unique integer containing the individual’s
full name, as listed in the encyclopedia, and the individual's birth year.
c. For every record, find the number of encyclopedic informational snippets present in the
Encyclopedia Britannica dataset. Append this count to the record.
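
Steps a through c can be sketched as follows. The format of the Britannica dataset is not described here, so the input layout (a list of dictionaries with `name`, `birth_year` and `snippets` keys) and the names `PersonRecord` and `build_records` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class PersonRecord:
    record_id: int        # unique integer identifier
    full_name: str        # full name as listed in the encyclopedia
    birth_year: int
    snippet_count: int    # number of informational snippets (notability measure)
    used_for_fame: bool   # False for 1700-1799 births, kept only for name disambiguation


def build_records(entries):
    """Build one record per entry born 1700-1980 (input layout is hypothetical)."""
    ids = count(1)
    records = []
    for entry in entries:
        year = entry["birth_year"]
        if not 1700 <= year <= 1980:
            continue  # outside the range considered in step a
        records.append(
            PersonRecord(
                record_id=next(ids),
                full_name=entry["name"],
                birth_year=year,
                snippet_count=len(entry["snippets"]),
                used_for_fame=year >= 1800,
            )
        )
    return records
```

The `used_for_fame` flag is one way to keep the 1700-1799 cohort available for the ambiguity check while excluding it from the fame analysis; two separate tables would work equally well.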
III.7.A.4 - Produce spelling variants of the full names of individuals.
We ultimately wish to identify the most relevant name commonly used to refer to an individual. Given the
limits of OCR and the specificities of the method used to create the word frequency database, certain
typographic elements such as accents, hyphens or quotation marks can complicate this process. As
such, for every full name present in our database of people, we append variants of the full names where
these typographic elements have been removed or, when possible, replaced. Table S9 presents
examples of spelling variants for multiple names.
4) In both databases, for every record, create a set of raw name variants. To create the set:
a. Include the original raw name.
b. If the name includes apostrophes or quotation marks, include a variant where these
elements are removed.
c. If the first word in the name contains a hyphen, include a name where this hyphen is
replaced with a whitespace.
d. If the last word of the name is a numeral, include a name where this numeral has been
removed.
e. For every element in the set which contains non-Latin characters, include a variant where
these characters have been replaced with their closest Latin equivalents.
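
The variant rules a through e can be sketched as below. This is one possible reading, not the authors' code: rules b to d are applied to the original name only, rule e to every element of the set. Two further assumptions are made: a "numeral" is taken to be a trailing Roman numeral or digit string, and the "closest Latin equivalent" is approximated by Unicode decomposition, which strips accents but leaves letters without a decomposition (such as "ø") unchanged. The names `name_variants` and `strip_accents` are hypothetical.

```python
import re
import unicodedata

# Straight and curly apostrophes and quotation marks (rule b).
QUOTE_CHARS = "'\"\u2018\u2019\u201c\u201d"
# Assumption: a "numeral" is an uppercase Roman numeral or a run of digits (rule d).
NUMERAL = re.compile(r"^(?:[IVXLCDM]+|\d+)$")


def strip_accents(text):
    """Approximate 'closest Latin equivalent' by dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(full_name):
    """Return the set of raw name variants for one full name."""
    variants = {full_name}  # rule a: the original raw name

    # Rule b: remove apostrophes and quotation marks.
    variants.add("".join(ch for ch in full_name if ch not in QUOTE_CHARS))

    words = full_name.split()
    # Rule c: hyphen in the first word replaced by a whitespace.
    if words and "-" in words[0]:
        variants.add(" ".join([words[0].replace("-", " ")] + words[1:]))

    # Rule d: drop a trailing numeral.
    if len(words) > 1 and NUMERAL.match(words[-1]):
        variants.add(" ".join(words[:-1]))

    # Rule e: for every element so far, add an accent-stripped variant.
    variants |= {strip_accents(v) for v in variants}
    return variants
```

Because the result is a set, rules that leave a name unchanged add nothing; for example, `name_variants("Jean-Luc Doe")` (an invented name) contains only the original and "Jean Luc Doe".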