Nabokov’s Favorite Word is Mauve

This is an unusual book. I’m not aware of any other attempt to analyze literature using big data and statistics.

For example: Ernest Hemingway, together with the world’s English teachers, warns against the use of adverbs. He argues that adverbs, especially those that end in –ly, are a sign of lazy writing. The reader should be able to tell that a character is sleepy without the writer having to specify that they are doing something sleepily.

Did Hemingway follow his own advice? In Nabokov’s Favorite Word is Mauve, data journalist Ben Blatt analyzes Hemingway’s novels and finds that he did indeed use –ly adverbs less than other authors. In his novels, they appear at a rate of 80 per 10,000 words. Other writers vary widely; J. K. Rowling’s Harry Potter novels, for example, contain 140 per 10,000.
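To make the "per 10,000 words" metric concrete, here is a minimal sketch of how such a rate could be computed. It is not Blatt's actual method, which this review does not describe in detail: the function name `ly_rate_per_10k` and the small exclusion list are my own assumptions, and simply matching words that end in "ly" is a crude stand-in for real part-of-speech tagging.

```python
import re

# Assumed, deliberately incomplete list of "-ly" words that are not adverbs.
NOT_ADVERBS = {"family", "ugly", "holy", "reply", "supply", "belly",
               "jelly", "bully", "fly", "lily", "july", "italy"}


def ly_rate_per_10k(text, exclude=NOT_ADVERBS):
    """Return the number of -ly words per 10,000 words of text.

    Heuristic only: any word ending in "ly" that is not in `exclude`
    counts as an adverb.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.endswith("ly") and w not in exclude)
    return 10_000 * hits / len(words)


sample = "He walked slowly and quietly to the family home."
rate = ly_rate_per_10k(sample)  # 2 hits in 9 words, scaled to 10,000
```

On a single sentence the number is meaningless, of course. The rate only becomes comparable across authors when it is computed over whole novels.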

Blatt also analyzes other stylistic choices across hundreds of novels: thought verbs (thinks, knows, understands, realizes, believes, wants, remembers, imagines, desires, loves, hates), exclamation points, the word suddenly, and qualifiers like rather, very or little. Hemingway, Mark Twain, Toni Morrison and Chuck Palahniuk appear near the top of several of the resulting style rankings, confirming what those who care already knew: They are good at what they do.

Unlike other popular science books, Blatt’s consists of original research. The bibliography lists 1,500 literary fiction or bestselling popular fiction novels, but they’re only used as the raw material for Blatt’s data analysis. One side effect of presenting original research is that there are more figures than in other books. Most pages contain at least one graph or table. While they’re clear and informative, the text that links them together sometimes appears redundant.

Nabokov’s Favorite Word is Mauve left me with enough questions to hope for a sequel. For example, Blatt describes an approach that can determine a book’s author by comparing the frequency of common words like and, the, then or these. It works well on novels, but what about shorter texts? How much better would a machine learning approach perform if it used not only word frequencies but also word co-occurrence, sentence structure and other metrics? Could it tell this blog post was written by me when given my previous posts?
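The common-word idea is simple enough to sketch in a few lines. What follows is my own toy version, not the procedure from the book: the word list `FUNCTION_WORDS`, the helper names and the choice of Manhattan distance are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed word list; a real analysis would use many more common words.
FUNCTION_WORDS = ["and", "the", "then", "these", "of", "but", "a", "to"]


def profile(text, vocab=FUNCTION_WORDS):
    """Relative frequency of each common word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in vocab]


def distance(p, q):
    """Manhattan distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def guess_author(unknown_text, known_texts):
    """Return the author whose known text has the closest profile."""
    target = profile(unknown_text)
    return min(
        known_texts,
        key=lambda author: distance(target, profile(known_texts[author])),
    )


known = {
    "A": "the cat and the dog and the bird and the fish",
    "B": "then these came but then these left but then these stayed",
}
guess = guess_author("the fox and the hen and the owl", known)
```

With texts this short the profiles are extremely noisy, which is exactly why I wonder how well the approach holds up on something like a blog post.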

But what I really want – data journalists take note – is a website where, for any given text, I can explore different metrics like adverb abuse, Flesch-Kincaid readability or gender-specific pronoun use in interactive graphs.