
Data Isn’t Objective: Dealing With Shady Data in Web Analytics

We like to pretend data has no biases and should be the incontestable arbiter of our disputes. But we need to be careful. Data is subjective. And it’s subjective because we’re subjective.

Most people involved in web analytics know the fundamentals of interpreting data: watch your sample size; look at the whole distribution, not just the average; compare against benchmarks; and so on. But what I don’t think enough people are aware of is the human subjectivity that goes into creating that data in the first place.

Sometimes people assume the data they’re working with is perfect, and that the only thing they need to be wary of is not screwing up the way they interpret it. But that’s not true!

Data has a context and a history just like humans, because it’s created by humans. Even the most basic metrics are shaped by the decisions people made while defining and collecting them. Data can be shady!

So, where should you be on the lookout for shady data? And how should you deal with it when (not if, trust me) you stumble upon it?

Where Might You Find Shady Data?

In short, anywhere. Just about any metric in your web analytics could be mysteriously distorted. But for brevity’s sake, let’s take a closer look at just one common example — one of the most basic statistics in web analytics — pageviews of the home page.

You may think it’s really easy just to count how many times the home page was viewed — and in some sense it is — but when you dig deeper, that number in your reports is the result of a long list of decisions that humans made (or didn’t make).

For example, is there a redirect between the two duplicate versions of your home page, one with a “www” and one without? If not, then your home page data is being split between those two URLs. (Oh, and you should probably start redirecting from one version of the page to the other.)
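
If you need that redirect and your site happens to run on Node, here is a minimal sketch of the idea as Express middleware. The framework, port, and HTTPS assumption are all illustrative; in practice this rule usually lives in your web server or CDN configuration.

```typescript
// Hypothetical Express middleware that 301-redirects www.<host> to the bare
// host, so pageviews stop splitting across two duplicate URLs.
// Assumes HTTPS and an Express-based Node app; adapt to your own stack.
import express, { NextFunction, Request, Response } from "express";

const app = express();

app.use((req: Request, res: Response, next: NextFunction) => {
  const host = req.headers.host ?? "";
  if (host.startsWith("www.")) {
    // A permanent (301) redirect keeps analytics and search engines pointed
    // at a single canonical version of every page.
    return res.redirect(301, `https://${host.slice(4)}${req.originalUrl}`);
  }
  next();
});

app.listen(3000);
```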

Or maybe you’re tracking more than one subdomain or domain together. In those page reports, are you displaying the entire URL instead of just the pathname? If not, then the data showing for “/home” in your page reports might be the combined data for www.example.com/home and sub.example.com/home.
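
The usual fix on the reporting side is a view filter that prepends the hostname to the request URI, but you can get a similar effect at collection time. Here is a minimal analytics.js (Universal Analytics) sketch; it assumes the standard ga command queue snippet is already on the page, and the property ID is a placeholder.

```typescript
// analytics.js (Universal Analytics) sketch: send hostname + pathname as the
// page field so www.example.com/home and sub.example.com/home show up as
// separate rows in page reports. "UA-XXXXX-Y" is a placeholder property ID.
// Assumes the standard ga command queue snippet is already loaded.
declare function ga(...args: unknown[]): void;

ga("create", "UA-XXXXX-Y", "auto");
ga("send", "pageview", window.location.hostname + window.location.pathname);
```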

And even if you are showing the full URL when you should be and redirecting duplicate pages when you should be, were you always doing that? If not, then whenever you look at past data, you need to be aware of how things worked back then. That 30% year-over-year increase in new visits to the home page might just be because last year example.com/index.html didn’t redirect to example.com. Or that bounce rate increase you’re so worried about might just be because last month you started accidentally combining the data for your main site with the data for your blog.

And that’s not all. We haven’t even started talking about bigger-picture issues like tracking across different domains and subdomains, filtering out certain visitors, or advanced features like event tracking and custom dimensions. All of these, depending on how they’re implemented, can have drastic effects on your data, even the simplest data you assume is straightforward.

A bit scary? Yes. But don’t worry. It just means you’ll have to do a little more work. And we’re here to help.

[Illustration: “Finding shady data can be the difference between a sad or happy analytics experience!”]

So How Do You Deal With Shady Data?

Start by doing these four things:

1. Check The Annotations

Google Analytics has a very useful feature called Annotations that allows you to make notes in your analytics. People use these to note when changes were made to the website or to its analytics configuration. So, one of the first things you should do when you start digging through GA is to look at the annotations. Learn as much as you can from them about the history of your site’s analytics. Even if a note doesn’t seem relevant at first, it may explain something weird you discover later. And don’t limit yourself to just the annotations in GA. Maybe someone used to keep notes on the account in a separate file somewhere.

2. Look For Signs of Trouble

As you’re analyzing the data, watch for weird statistical anomalies or sudden variations that might be a sign that something changed about the site or how it was tracked. Here are a few examples:

  • Huge decreases in bounce rate. Maybe someone added event tracking code without designating the events as non-interactions (see the sketch after this list).
  
  • 1,000 different pages with recorded pageviews in one month, but only 500 in the next month. Maybe someone started redirecting pages that end with a slash to the duplicate page that doesn’t end with a slash.
  • Big drops in referral traffic. Maybe someone added a key URL to the Referral Exclusion List or enabled subdomain or cross-domain tracking.
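
To make that first bullet concrete, here is a minimal analytics.js sketch. The event category, action, and label are made-up names, and it assumes the standard ga command queue is already loaded on the page.

```typescript
// analytics.js sketch of the bounce-rate pitfall: an event sent without the
// nonInteraction flag counts as engagement, so a single-page visit that fires
// it is no longer recorded as a bounce. Category/action/label are made up.
declare function ga(...args: unknown[]): void;

// This call quietly lowers your bounce rate:
ga("send", "event", "Video", "play", "homepage-hero");

// Flagging it as a non-interaction hit leaves bounce rate untouched:
ga("send", "event", "Video", "play", "homepage-hero", { nonInteraction: true });
```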

3. Familiarize Yourself With the Old Data

Before you start making comparisons to past months or years, dig into that old data by itself. Look at just the data for 2012 before you start comparing it to the data for 2013. Once you understand the nature of that old data, then you can start making comparisons and seeing how things change over time. Don’t waste your time chasing trends that turn out to only be the result of a different tracking environment.

4. Analyze Skeptically

The less information you have, the more skeptical you should be. The less you know about your analytics setup two years ago, the bigger the grain of salt that should come with your conclusions about that time period.

So those are some ways to deal with the messy, unchangeable past. But what about the future? How do you avoid creating shady data in the first place?

More answers to these questions coming soon in Part 2.
