
Technical assessment of metadata and OCR quality in Google Books corpus

The document details internal quality metrics and filtering thresholds for Google Books metadata and OCR. It contains no references to influential actors, financial flows, or misconduct, and offers no actionable investigative leads. Key insights: metadata date errors fell from 27% to 6.2% after filtering; OCR quality scores (0-100) are assigned per volume using a PPM-based model; OCR quality thresholds differ by language (e.g., 80% for Latin alphabets).
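To make the per-language thresholding concrete, here is a minimal sketch of how volumes might be filtered by OCR quality score. Only the 0-100 score range and the 80 threshold for Latin alphabets come from the summary; the `Volume` record, the `script` field, the fallback threshold, and the inclusive comparison are all assumptions made for illustration, not the method the document describes.

```python
from dataclasses import dataclass

# Only the Latin-alphabet value (80) is taken from the summary.
# Any other entry added here would be hypothetical.
OCR_THRESHOLDS = {"latin": 80}

# Assumption: scripts without a listed threshold fall back to this value.
DEFAULT_THRESHOLD = 80


@dataclass
class Volume:
    """Hypothetical record for one scanned volume."""
    volume_id: str
    script: str        # e.g. "latin"
    ocr_quality: int   # per-volume score on a 0-100 scale


def passes_ocr_filter(volume, thresholds=OCR_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Return True if the volume's OCR score meets the threshold for its script.

    Assumption: the comparison is inclusive (a score equal to the
    threshold passes); the summary does not say which way this goes.
    """
    if not 0 <= volume.ocr_quality <= 100:
        raise ValueError(f"OCR score out of range: {volume.ocr_quality}")
    return volume.ocr_quality >= thresholds.get(volume.script, default)


def filter_volumes(volumes):
    """Keep only the volumes that pass the OCR quality filter."""
    return [v for v in volumes if passes_ocr_filter(v)]


if __name__ == "__main__":
    sample = [
        Volume("vol-a", "latin", 92),
        Volume("vol-b", "latin", 80),
        Volume("vol-c", "latin", 61),
    ]
    print([v.volume_id for v in filter_volumes(sample)])
```

Keeping the thresholds in a dictionary keyed by script means a new language group only needs a new entry, not a new code path.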

Date: Unknown
Source: House Oversight
Reference: kaggle-ho-017014
Pages: 1
Persons: 0
Integrity: No Hash Available


Tags

kaggle, house-oversight, metadata, ocr-quality, google-books, data-accuracy
