Tuesday, August 4, 2009

How Does Google Count Absolute Unique Visitors?

As a test of the Answer service, Mahalo, I posed the following question:


How does Google Analytics calculate Absolute Unique Visitors?

I know that Google claims that they can report on the number of Absolute Unique Visitors over any time period. What I can't figure out is how they can be calculating this without doing very expensive database queries. I feel they must be making an approximation of some sort.

Otherwise, they would have to query the unique set of users who visited the site across a large time span, and remove duplicates in real time. They could not afford to do this for a site with millions of unique users.

I will reward the tip to the person who best answers this question by providing a feasible solution to the technical problem or explaining how the reported value is approximated. Even better, if it is backed by an authoritative explanation from Google developers.

Note that the crux of the problem is to avoid double-counting Returning Visitors who visit more than once within the time span of a report.


Unfortunately, even the best answerer did not understand the question. Perhaps the site did not have enough users, or none with the expertise needed to figure out what I was asking.

I rescinded my $5 "tip", and actually got some Mahalo users mad at me for doing so. After giving the problem some more thought, this is what I came up with:


Here's how I would calculate Absolute Unique Visitors:

Data Collection

On each user's first visit of a given day, I record how many days have passed since their previous visit. This applies to the "returning" visitors; the "new" visitors have no previous visit and are simply recorded as new.

Data Aggregation

When Analytics is processing the raw data, it can keep, for each day, a set of counters for the total number of visitors who:

  • Are new (never visited before)
  • Last visited 1 day ago or more (aka all "returning visitors")
  • Last visited 2 days ago or more
  • Last visited 3 days ago or more
  • etc. (they may choose to cut off the number of buckets at some reasonable maximum - which would set a max on the length of the date range they could report accurately).

Note that these are cumulative numbers - each bucket has no more users than the previous one, since anyone who last visited 3 or more days ago also last visited 2 or more days ago.
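
The aggregation step above can be sketched in Python. This is only an illustration of the idea, not how Google Analytics actually works; the function name aggregate_day, the max_bucket cutoff, and the use of None to mark a new visitor are all my own assumptions.

```python
from collections import Counter


def aggregate_day(hiatuses, max_bucket=30):
    """Build one day's cumulative counters.

    hiatuses: one entry per visitor's first visit that day. None marks a
    new visitor; otherwise the value is the number of days since the
    visitor's previous visit (1 or more). Hypothetical input format.
    """
    new = sum(1 for h in hiatuses if h is None)

    # Exact counts per hiatus length, clamped to the largest bucket kept.
    exact = Counter(min(h, max_bucket) for h in hiatuses if h is not None)

    # returning[n] = visitors whose hiatus was n days or more.
    # Index 0 is unused; the extra trailing slot simplifies the suffix sum.
    returning = [0] * (max_bucket + 2)
    for n in range(max_bucket, 0, -1):
        returning[n] = returning[n + 1] + exact[n]

    return {"new": new, "returning": returning[: max_bucket + 1]}


# Two new visitors, two back after 1 day, one after 3 days, one after 7 days.
print(aggregate_day([None, 1, 1, 3, None, 7], max_bucket=5))
# {'new': 2, 'returning': [0, 4, 2, 2, 1, 1]}
```

Because a hiatus longer than max_bucket is clamped into the last bucket, these counters can only answer questions about date ranges of at most max_bucket days.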

Reporting

When the site owner asks for the Absolute Unique Visitors across a date range, the reporting engine can scan all the dates in the period and accumulate a sum as follows (pseudo-code):

Assumes Data[DAY] for each DAY of the reporting period (DAY 1 is the first day, DAY N the last), containing values:
  NEW - Number of new users who arrived that day
  RETURNING[K] - Number of users who arrived that day after a hiatus of K days or more.

UNIQUE = 0
for DAY from 1 to N:
  UNIQUE += Data[DAY].NEW
  UNIQUE += Data[DAY].RETURNING[DAY]

UNIQUE is thus the sum of all NEW users reported on each day (who are always unique), plus only those returning users who were not counted on a prior day. A returning user seen on day DAY after a hiatus of DAY days or more was last on the site before the beginning of the reporting period, so they cannot have been counted earlier in the period.
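
Here is a minimal runnable Python sketch of the reporting loop, to make the indexing concrete. The function name absolute_unique, the dictionary layout, and the sample data are assumptions I made up for illustration; the sample corresponds to four days of visits by users {a, b}, {a, c}, {b, c, d}, and {a}.

```python
def absolute_unique(daily, start, end):
    """Count unique visitors over days start..end inclusive (0-based).

    daily[d]["new"] is the number of new visitors on day d, and
    daily[d]["returning"][k] is the number of visitors on day d whose
    previous visit was k or more days earlier (index 0 is unused).
    """
    unique = 0
    # offset is the 1-based position of the day within the report.
    for offset, day in enumerate(range(start, end + 1), start=1):
        returning = daily[day]["returning"]
        if offset >= len(returning):
            raise ValueError("report is longer than the stored buckets allow")
        unique += daily[day]["new"]
        # Only returning visitors last seen before the report began.
        unique += returning[offset]
    return unique


# Hypothetical counters for visits {a, b}, {a, c}, {b, c, d}, {a}.
SAMPLE = [
    {"new": 2, "returning": [0, 0, 0, 0, 0]},
    {"new": 1, "returning": [0, 1, 0, 0, 0]},
    {"new": 1, "returning": [0, 2, 1, 0, 0]},
    {"new": 0, "returning": [0, 1, 1, 0, 0]},
]

print(absolute_unique(SAMPLE, 1, 3))  # 4  (a, c, b, d)
print(absolute_unique(SAMPLE, 3, 3))  # 1  (a)
```

The report costs one pass over the days in the range, with no per-user deduplication at query time, which is the property the original question was after.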