What can we learn from 100,000 novels, over 300 years?
Ted Underwood, David Bamman, and Sabrina Lee have done something incredible: they’ve analysed 100,000 novels spanning the past 300 years. By analysing both authors and fictional characters, they provide key insights into historical patriarchy.
Their sample comprises English-language texts from the HathiTrust Digital Library and the Chicago Text Lab, reflecting the collection practices of academic libraries (with additional contributions from the Library of Congress and the New York Public Library). While not a complete picture of every book ever written, it probably does reflect demand.
Methodologically, Underwood and colleagues use a pipeline called ‘BookNLP’, which identifies character names and then clusters them (so ‘Elizabeth’ and ‘Elizabeth Bennet’ become a single person). Descriptive words are then linked to each character: the actions they perform, and the adjectives and nouns applied to them. BookNLP also appears highly accurate at inferring gender; women are identified with 95% precision.
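To get an intuition for the name-clustering step, here is a toy sketch in Python. This is my own simplified illustration, not BookNLP’s actual coreference algorithm (which is far more sophisticated): it simply merges mentions that share a name token, which is enough to show why ‘Elizabeth’ and ‘Elizabeth Bennet’ collapse into one character.

```python
def cluster_mentions(mentions):
    """Toy name clustering: group mentions that share a name token.

    Caveat: shared titles like 'Mr' would wrongly merge distinct
    characters; real coreference resolution handles such cases.
    """
    clusters = []  # each cluster: {"mentions": [...], "tokens": set(...)}
    for mention in mentions:
        tokens = set(mention.split())
        # attach to the first existing cluster sharing a token
        for cluster in clusters:
            if tokens & cluster["tokens"]:
                cluster["mentions"].append(mention)
                cluster["tokens"] |= tokens
                break
        else:
            clusters.append({"mentions": [mention], "tokens": tokens})
    return [c["mentions"] for c in clusters]

names = ["Elizabeth", "Elizabeth Bennet", "Darcy", "Fitzwilliam Darcy"]
print(cluster_mentions(names))
# → [['Elizabeth', 'Elizabeth Bennet'], ['Darcy', 'Fitzwilliam Darcy']]
```

Once mentions are clustered like this, every verb, adjective, or noun attached to any mention can be credited to the same underlying character.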
Harnessing this natural-language-processing pipeline, Underwood and colleagues explore how the characters of popular stories have changed over the past three centuries. The scale is a real advance: it guards against the cherry-picking that smaller, hand-selected samples invite.
So what do they find?