Posted by: zyxo | January 13, 2010

Customers, website visitors, passing cars and forest birds: how many?


How many customers does the grocer around the corner have?
How many customers do you have on your website?
How many cars use a particular crossroads?
How many birds of a particular species live in a given patch of forest?

How many? Is there a way of finding out?
(Whether or not the question is meaningful, I find it an intriguing one.)

A first step is simple: count the unique visitors in a given period of time.

– Watch the grocery for an afternoon and count everyone who enters. Make sure you do not count twice the customer who comes back because he forgot the sugar!
– Get the unique visitor count for your website from Google Analytics or whatever web analytics tool you use.
– Get a mojito from the bar at the corner and write down every license plate number you see for an hour or two. Afterwards, eliminate all the repeats.
– Take a walk that covers the entire forest patch and count each individual of that particular bird species.

Simple, is it not?
But did that get you the actual TOTAL number? NO.

What about the loyal grocery customers who came yesterday and will return tomorrow? You missed them.
Not everyone visits your website all the time.
Some people leave their car at home and take a walk … to the forest, where not every bird will show itself or sing.

Realise that you only counted a fraction of the total.

In biology they use a method called capture-recapture.
1. First step: capture some birds and put a ring on one of their legs (identify customers, drop a cookie when they visit your website, write down the license plate numbers of passing cars).
2. Second step: capture some birds again and count how many carry a ring and how many do not (count returning vs. first-time customers, visitors or cars).
3. Do the simple math: ringed_in_second_capture / total_in_second_capture = ringed_in_first_capture / total_population
If you captured 100 birds the first time, captured the same number the second time, and 25 of those were already ringed, you can say that 1/4 of the birds in your forest are ringed. Since you ringed 100, the forest contains about 400 birds in total.
The same goes for your website: if 50% of your visitors are returning ones, the total visitor population is about twice the number of cookies you dropped.
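The arithmetic above can be sketched in a few lines of Python. This is only a minimal illustration of the two-sample estimate (usually called the Lincoln-Petersen estimator); the function name and argument names are my own invention, and the sketch silently assumes that every individual is equally likely to be caught both times.

```python
def estimate_population(marked_first, caught_second, recaptured):
    """Two-sample capture-recapture estimate of the total population.

    marked_first  : individuals marked in the first capture
    caught_second : individuals caught in the second capture
    recaptured    : individuals in the second capture that were already marked
    """
    if recaptured <= 0:
        # Without any recapture the marked proportion is zero and
        # the total cannot be estimated from these two samples.
        raise ValueError("no recaptures: the population size cannot be estimated")
    # recaptured / caught_second = marked_first / total  =>  solve for total
    return marked_first * caught_second / recaptured


# The bird example from the text: 100 ringed, 100 caught again, 25 already ringed.
print(estimate_population(100, 100, 25))  # 400.0
```

The same call works for cookies or license plates: only the meaning of "marked" changes.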

These figures are correct … if we accept some assumptions that we cannot really accept:

– Birds that have been captured once become much more shy. They will be under-captured the second time.
– Not all people are equally active on the internet or on the road, so they do not all have the same probability of showing up. And some delete their cookies!
– Perhaps on the first day there was a tourist or two in the grocery who will never return!
– Migration: some come, some go …
– Births and deaths, which have the same effect as migration.

Perhaps there is more to learn when we take the figures for a number of successive periods, for example by capturing birds or monitoring the number of sign-ons on your website for two weeks in a row.

Two things we can learn:
– an estimate of the total population?
– the total number of identified individuals (ringed birds, website visitors who signed up)?

We can assume that what we get should lie between two extremes:
1) No migration, births or deaths
This first extreme should show us the actual, stable situation.

The chart shows two lines: the highly fluctuating line is the percentage of identified individuals day by day, whereas the more stable line shows the cumulative data. These cumulative figures tend towards the real percentage of identified individuals in the population. Here we see that 24% carry an ID. Since we know how many IDs we handed out, we can easily calculate the total number of individuals from this proportion.
OK, for birds in a forest it is simple, but for website visitors it is a bit more complicated. Not every visitor with a username and password for your website will sign on at every visit, but let us assume this is the case anyway, for now.

The second chart shows the theoretical cumulative proportion of ID’d individuals. If each day 24% of the individuals are ID’d, then after 14 days nearly 100% of them will have been captured or seen, or will have visited your site.
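To see where that "nearly 100%" comes from, here is a small sketch of the theoretical curve. It assumes (my assumption, not something measured) that every individual has the same chance of showing up each day, independently of the other days, so the proportion seen at least once after d days is 1 - (1 - p)^d. The function name is hypothetical.

```python
def cumulative_seen_fraction(daily_fraction, days):
    """Proportion of a stable population seen at least once after 1..days days.

    Assumes each individual shows up with probability `daily_fraction`
    on any given day, independently of all other days.
    """
    return [1 - (1 - daily_fraction) ** d for d in range(1, days + 1)]


curve = cumulative_seen_fraction(0.24, 14)
for day, seen in enumerate(curve, start=1):
    print(day, f"{seen:.1%}")
```

With 24% per day the last value lands at roughly 98%, which is the flattening curve of the second chart.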

2) Only migration / one-day flies
What about this second extreme? It means that you do not have any returning visitors, or that each bird you capture is a migrant passing through.
So you never see ID’d or returning individuals. There is not much information to show; only the average number per day is interesting.
Although …
Let us take our second graph and add the line that corresponds to our second extreme:

The straight line shows the total number after the first, second, third, etc. day. Each day about the same number of individuals comes along, but each day these are new ones, so they simply add up.

In real life you will find something in between the two lines (the yellow dots in the following chart). The closer your real-life line is to the curved one, the more stable your population of birds or website visitors. The closer to the straight line, the more volatile your population.

“Something between the two lines” actually means that you are dealing with two populations: your loyal returning customers (or your sedentary birds in the forest) and your one-time customers (or birds accidentally passing through).
Considering this, it should be possible to reconstruct your actual, intermediate line by combining the lines of these two populations … if only you knew them.
Fortunately there is something like Excel, OpenOffice Calc, or whatever spreadsheet you may use. Instead of finding some complicated equation I made something simple and played a bit with the numbers to come up with the following chart:

The blue squares are the “actual” observations, the red diamonds represent the theoretical line for the “sedentary” population and the blue triangles show the one-pass-and-never-return population.
The latter two sum up neatly to match the actual observations.

What are the columns in my spreadsheet?
1. Day number (1, 2, 3, 4 …), shown on the horizontal axis.
2. Volatile population (straight line) = one number, let’s call it A, multiplied by the day number from column 1.
3. Stable population (“theoretical” curve). This is more complicated. The first cell is a fraction (let’s say 25%) of the total number we will have seen by, for example, the 14th day. The second cell equals the first cell plus the same fraction (25%) of the remainder of the total number, and so on.
4. The actual data = the cumulative number of individuals observed on the first day, the first two days, the first three days, etc.
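For those who prefer code to spreadsheets, the columns can be sketched like this. It is only an illustration: the function name is hypothetical and the numbers (A = 10, a stable total of 200, a fraction of 25%) are made up. Instead of column 4, which has to come from your own observations, the last value in each row is the sum of columns 2 and 3, which is what you would compare with your data.

```python
def model_columns(days, volatile_per_day, stable_total, fraction):
    """Rebuild spreadsheet columns 1 to 3 plus the sum of columns 2 and 3.

    Returns one tuple per day: (day, volatile, stable, volatile + stable).
    """
    rows = []
    stable_seen = 0.0
    for day in range(1, days + 1):
        # Column 2: straight line, A multiplied by the day number.
        volatile = volatile_per_day * day
        # Column 3: previous cell plus a fixed fraction of what is left.
        stable_seen += fraction * (stable_total - stable_seen)
        rows.append((day, volatile, stable_seen, volatile + stable_seen))
    return rows


# Made-up values, chosen only to show the shape of the table.
for row in model_columns(days=14, volatile_per_day=10, stable_total=200, fraction=0.25):
    print(row)
```

The stable column climbs quickly and then flattens towards the total, while the volatile column keeps growing by the same amount every day.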

Now you just have to tweak (play with) the three numbers: the one for the volatile population and the two (total number and fraction) for the stable population, until the two populations sum up to (nearly) exactly the same data as column 4.
OK, this tweaking is not very scientific. You could do the necessary programming to obtain the desired result automatically or, if you are good at math, derive some equation to reach your goal more efficiently. In the end the result will be the same.
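One possible way to do that programming is sketched below. It is not what I did in the spreadsheet: it tries every fraction from 1% to 99% and, for each one, computes the A and the stable total that fit best in the least-squares sense (the model is linear in those two numbers once the fraction is fixed). The function name is hypothetical, the "observations" are synthetic numbers generated from known values just to check the idea, and nothing prevents the fit from returning a negative A or total on messy real data.

```python
def fit_two_populations(observed):
    """Fit observed[d-1] ~ A*d + total*(1 - (1 - p)**d) by grid search on p.

    observed: cumulative number of distinct individuals seen by day 1, 2, 3, ...
    Returns (A, total, p) with the smallest sum of squared errors.
    """
    days = range(1, len(observed) + 1)
    best = None
    for step in range(1, 100):
        p = step / 100
        x1 = [float(d) for d in days]            # volatile shape: day number
        x2 = [1 - (1 - p) ** d for d in days]    # stable shape: flattening curve
        # Normal equations of the 2-parameter linear least-squares problem.
        s11 = sum(a * a for a in x1)
        s12 = sum(a * b for a, b in zip(x1, x2))
        s22 = sum(b * b for b in x2)
        t1 = sum(a * y for a, y in zip(x1, observed))
        t2 = sum(b * y for b, y in zip(x2, observed))
        det = s11 * s22 - s12 * s12
        if abs(det) < 1e-12:
            continue  # the two shapes are indistinguishable for this p
        A = (t1 * s22 - s12 * t2) / det
        total = (s11 * t2 - s12 * t1) / det
        sse = sum((A * a + total * b - y) ** 2 for a, b, y in zip(x1, x2, observed))
        if best is None or sse < best[0]:
            best = (sse, A, total, p)
    if best is None:
        raise ValueError("need at least a few days of observations")
    return best[1], best[2], best[3]


# Synthetic check: data generated with A = 10, total = 200, p = 0.25.
synthetic = [10 * d + 200 * (1 - 0.75 ** d) for d in range(1, 15)]
A, total, p = fit_two_populations(synthetic)
print(round(A, 3), round(total, 3), p)
```

On the synthetic data the fit should hand back values very close to the ones used to generate it; on real observations expect a rougher match, just like with the manual tweaking.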
