What the Coronavirus Can Teach Us About Data Visualization: How to Avoid Common Biases

Information = Data + Bias. This formula is another mathematical nugget of data visualization wisdom from my professor, Dr. Harold Kurstedt. You met Dr. Kurstedt in my Platinum Rule post. This formula came back to me when I was stress-watching cable news about the novel coronavirus pandemic recently.

I first saw this MSNBC map and noticed that my home state of Virginia was in the category with the highest number of cases (31+). At that moment, I did not feel safe living in a state with so many cases. Note: These numbers were from March 15, and they now seem quaint!

Then, when the commercial hit, I flipped over to CNN. Their map had Virginia as a “yellow” state, the second-lowest category. Phew! It’s not so bad, after all. But wait, the number of cases in Virginia didn’t change. The only thing that changed was the graphic designer’s choice in bucketing the numbers. This is bias in action.
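To see how much the bucketing choice matters, here is a minimal Python sketch. The case count and both sets of bin edges are hypothetical, chosen only for illustration; they are not the values either network actually used.

```python
from bisect import bisect_right


def bucket_label(count, edges, labels):
    """Return the label of the bucket that `count` falls into.

    `edges` holds the lower bound of every bucket after the first,
    so len(labels) must equal len(edges) + 1.
    """
    return labels[bisect_right(edges, count)]


# Hypothetical case count for one state (not a real figure).
cases = 45

# Hypothetical scheme A: narrow buckets that top out at "31+".
scheme_a_edges = [1, 11, 21, 31]
scheme_a_labels = ["0", "1-10", "11-20", "21-30", "31+"]

# Hypothetical scheme B: wide buckets, so the same count lands near the bottom.
scheme_b_edges = [1, 51, 201, 501]
scheme_b_labels = ["0", "1-50", "51-200", "201-500", "501+"]

label_a = bucket_label(cases, scheme_a_edges, scheme_a_labels)
label_b = bucket_label(cases, scheme_b_edges, scheme_b_labels)

print(label_a)  # top bucket under scheme A
print(label_b)  # second-lowest bucket under scheme B
```

The data point is identical in both calls. Only the edges differ, and that alone moves the state from the scariest color on the map to one of the calmest.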

Today, “bias” is almost universally a negative term. I checked multiple dictionaries, and even the most neutral definitions ended with something like, “usually in a way considered to be unfair.” The word “bias” is derived from the French term for “a slant.” In lawn bowling, it referred to a ball that was weighted to one side, causing it to curve when thrown. Hmm, I guess bias always referred to cheating. So, let’s update the formula:

Information = Data + Interpretation.

I do an exercise in my workshops where I ask participants to write a sentence giving the main point of a graph. In every session, smart participants come up with valid sentences that have different points. The example graph is not that complex, yet various participants focus on at least five different points. If you throw a chart on the screen without an interpretation, your audience will come up with multiple interpretations, and chaos will ensue. You’re talking about your point, but your audience is busy interpreting the graph for themselves.

There’s lots of bad bias.

When you use data to support your ideas, you are biasing (sorry, interpreting) data to convert it to information. There’s no shame in that … unless you cheat. Here are my favorite ways people cheat with data:

Confirmation Bias: Ignoring data that contradicts your hypothesis. A doctor who suspects a patient has cancer may ask questions that confirm that diagnosis while overlooking evidence that would point to a different diagnosis.
Availability Bias: Only using easy-to-get data. The old joke about the drunk looking for his keys illustrates this bias. He stumbles out of the bar, realizes he doesn’t have his keys, and starts looking for them under the streetlight. His friend asks him where he last had them.
- “Over by my car.”
- “So, why are you looking here?”
- “Because the light is better.”

Getting data is hard. Resist the temptation to use the data you have when it won’t do the job.

Confounding Variables: Failing to understand all drivers of a relationship. It is a statistical fact that ice cream sales and violent crime are correlated. Does that mean ice cream causes violent crime? While brain freezes drive me crazy, the answer is no; this is a spurious correlation. Both ice cream sales and violent crime increase in the summer with warmer temperatures, so temperature is the confounding variable. Make sure you understand your actual root cause.
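A small simulation makes the ice cream example concrete. This is a sketch on made-up data: the temperature range, the coefficients, and the noise levels are all assumptions picked for illustration, not real sales or crime figures. Temperature drives both series, and neither series influences the other.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical daily temperatures, and two series that depend only on them.
temps = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [10 * t + rng.gauss(0, 30) for t in temps]
crime = [2 * t + rng.gauss(0, 10) for t in temps]

raw = pearson(ice_cream, crime)
controlled = partial_corr(ice_cream, crime, temps)

print(f"raw correlation: {raw:.2f}")
print(f"controlling for temperature: {controlled:.2f}")
```

The raw correlation comes out strong, while the partial correlation that holds temperature fixed sits near zero. That gap is the signature of a confounder.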

These biases are just the tip of the bias iceberg. There’s also selection bias, over/underfitting, interpretation bias, prediction bias, information bias, anchoring, framing, sunk costs, the halo effect, status quo anchoring, prospect theory, loss aversion, overconfidence, and the Moses illusion. Ask your local psychology major for more information.

But, Wait, There’s More

While we’re critiquing cable news maps, let’s make a couple more observations:

Time: One hundred more cases were diagnosed in six minutes?! If true, that’s exponential growth in action. More likely, the graph builders had different vintages of data. Putting the “as of” time for the data sample in the footnote solves this problem.
Language: The MSNBC map highlighted “fatalities” in a summary box, while CNN called out the number of “dead” in each state. This highlights two interpretation choices. First, “fatalities” is a euphemism for death. Euphemisms create a softer tone. That’s not inherently wrong or right, but it’s an interpretation choice. Second, to paraphrase a politics cliché, “all coronavirus is local.” As we observe the spread of the virus, we have different reactions based on the locations of confirmed cases:
- China: Wow, that sucks, glad it’s not here.
- Europe: Crud, that could impact my summer vacation.
- US: Glad I live in a small city.
- Virginia: Am I a “red” state or a “yellow” state?
- Richmond, Virginia: Wait, I moved here to avoid big-city problems.

MSNBC chose to focus on the total deaths, giving a larger number. CNN showed deaths by state, highlighting how close the impact of the virus is to you.

Data Availability: Focusing on confirmed cases is an admirable commitment to the facts. However, these maps are ignoring two other facts:
- We know that not everyone with the virus is showing symptoms.
- We know that not everyone with symptoms can get tested.

In this case, a desire to stick to the facts leads us to portray the problem as smaller than it is.

How to Avoid These Issues

To avoid these issues, follow these best practices:

Know your biases. Understand the natural biases that humans face and be honest with yourself about how those biases might affect your data visualization.
Be clear about your interpretation. The best way to make your interpretation clear is to put it on the page. Use a “so what” sentence title or include interpretive text on your visualization. If you’re not used to writing sentence titles, see this blog post: Death to the Category Header.
Qualify your visualization. Show the source, the date, and the scope of your data. Footnotes are a great place for these.
Use precise language. Avoid euphemisms. Choose descriptive language.
Highlight where your data is incomplete. Be clear that you are giving an incomplete picture because you don’t have the complete data. Or forecast/interpolate the data based on what you have, but be clear about what assumptions you are making.
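If you build your charts in code, the “qualify your visualization” practice can be as simple as generating the footnote from the data’s metadata. Here is one possible sketch in Python. The helper name build_footnote, the source text, and the 18:00 UTC timestamp are hypothetical placeholders, not details from the maps discussed above.

```python
from datetime import datetime, timezone


def build_footnote(source, as_of, scope, caveat=None):
    """Assemble a one-line chart footnote from the data's metadata.

    `as_of` should be a timezone-aware datetime so the zone name prints.
    """
    parts = [
        f"Source: {source}",
        f"Data as of {as_of.strftime('%Y-%m-%d %H:%M %Z')}",
        f"Scope: {scope}",
    ]
    if caveat:
        parts.append(f"Note: {caveat}")
    return ". ".join(parts) + "."


# Hypothetical metadata for illustration only.
footnote = build_footnote(
    source="Hypothetical state health department reports",
    as_of=datetime(2020, 3, 15, 18, 0, tzinfo=timezone.utc),
    scope="confirmed cases only",
    caveat="excludes untested and asymptomatic cases",
)
print(footnote)
```

Generating the footnote from the same record that feeds the chart means the “as of” time cannot drift out of sync with the numbers on screen.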

Data visualizations are playing an influential role in helping us understand this rapidly changing pandemic. “Flatten the Curve” has entered the vernacular thanks to this CDC graphic. And this long-form series of visualizations from Medium does an excellent job of addressing the incomplete data problem to highlight the need to act quickly.