Analog 6.0: How the web works
This section is about what happens when somebody connects to your web site, and
what statistics you can and can't calculate. There is a lot of confusion
about this. It's not helped by statistics programs which claim to calculate
things which cannot really be calculated, only estimated. The simple fact is
that certain data which we would like to know and which we expect to know are
simply not available. And the estimates used by other programs are not just a
bit off, but can be very, very wrong. For example (you'll see why below),
if your home page has 10 graphics on it, and an AOL user visits it, most
programs will count that as 11 different visitors!
This section is fairly long, but it's worth reading carefully. If you
understand the basics of how the web works, you will understand what your web
statistics are really telling you.
1. The basic model. Let's suppose I visit your web site. I follow a
link from somewhere else to your front page, read some pages, and then follow
one of your links out of your site.
So, what do you know about it? First, I make one request for your front
page. You know the date and time of the request and which page I asked for
(of course), and the internet address of my computer (my host). I also
usually tell you which page referred me to your site, and the make and model
of my browser. I do not tell you my username or my email address.
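To make this concrete, here is a rough sketch of pulling those pieces of information out of one logfile line. It assumes the server writes the common NCSA "combined" format, which this section doesn't actually specify; the sample line, the host names and the function name parse_line are all invented for illustration.

```python
import re
from datetime import datetime

# Assumption: NCSA "combined" log format. Other formats need another pattern.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)

def parse_line(line):
    """Return the fields one request reveals, or None if the line doesn't match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["when"] = datetime.strptime(fields["when"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    return fields

# Invented sample line: note the two "-" fields where a username would go.
sample = ('host1.example.net - - [12/Mar/2004:10:15:02 +0000] '
          '"GET /index.html HTTP/1.0" 200 5120 '
          '"http://www.example.org/links.html" "Mozilla/4.0 (compatible)"')

entry = parse_line(sample)
print(entry["host"], entry["when"], entry["path"])
print(entry["referrer"], "|", entry["browser"])
```

Everything in the result is either a fact about the request (time, file, host) or something my browser chose to say (referrer, browser). There is no field for who I am.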
Next, I look at the page (or rather my browser does) to see if it's got any
graphics on it. If so, and if I've got image loading turned on in my browser,
I make a separate request to retrieve each of these graphics. I never log
into your site: I just make a sequence of requests, one for each new file I
want to download. The referring page for each of these graphics is your front
page. Maybe there are 10 graphics on your front page. Then so far I've made 11
requests to your server.
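The arithmetic can be sketched in a few lines. This is only an illustration: the page below is made up, the names ImageCounter and requests_for_page are hypothetical, and it assumes that a graphic used twice on the same page is only fetched once.

```python
from html.parser import HTMLParser

class ImageCounter(HTMLParser):
    """Collect the distinct src of every <img> tag in a page."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src not in self.images:
                self.images.append(src)

def requests_for_page(html, image_loading=True):
    """One request for the page itself, plus one per distinct graphic."""
    counter = ImageCounter()
    counter.feed(html)
    return 1 + (len(counter.images) if image_loading else 0)

# Made-up front page with 10 graphics on it.
front_page = "<html><body>" + "".join(
    f'<img src="/images/pic{i}.gif">' for i in range(10)
) + "</body></html>"

print(requests_for_page(front_page))                       # page + 10 graphics
print(requests_for_page(front_page, image_loading=False))  # page only
```

So the same single reader shows up as either 11 lines or 1 line in your logfile, depending on a setting in the browser.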
After that, I go and visit some of your other pages, making a new request for
each page and graphic that I want. Finally, I follow a link out of your site.
You never know about that at all. I just connect to the next site without
telling you.
2. Caches. It's not always quite as simple as that. One major problem
is caching. There are two major types of caching. First, my browser
automatically caches files when I download them. This means that if I visit
a page again, the next day say, I don't need to download the whole page
again. Depending on the settings on my browser, I might check with you that
the page hasn't changed: in that case, you do know about it, and analog will
count it as a new request for the page. But I might set my browser not to
check with you: then I will read the page again without you ever knowing about
it.
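In HTTP terms, the "check" is a conditional request: the browser sends an If-Modified-Since header, and the server answers 304 (Not Modified) instead of sending the page again. The following is a simplified sketch of the server's side of that exchange, not how any particular server implements it; the function respond, the dates and the list server_log are invented for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(last_modified, if_modified_since=None):
    """Return the status a server would send for a (possibly conditional) GET."""
    if if_modified_since is not None:
        cached_version = parsedate_to_datetime(if_modified_since)
        if last_modified <= cached_version:
            return 304  # Not Modified: the browser reuses its cached copy
    return 200          # the whole page is sent

# Invented dates: the page was last edited on 1 March 2004.
last_modified = datetime(2004, 3, 1, 9, 0, 0, tzinfo=timezone.utc)
cached_header = format_datetime(last_modified, usegmt=True)

server_log = []
server_log.append(respond(last_modified))                 # first visit
server_log.append(respond(last_modified, cached_header))  # next day, browser checks
# A browser set not to check sends nothing at all, so nothing is logged.
print(server_log)  # [200, 304]
```

Both entries reach the logfile, so both can be counted. The third reading, from a browser that doesn't check, leaves no trace.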
The other sort of cache is on a larger scale. Almost all ISPs now have their
own cache. This means that if I try to look at one of your pages and
anyone else from the same ISP has looked at that page recently, the
cache will have saved it, and will give it out to me without ever telling
you about it. (This applies whatever my browser settings.) So hundreds of
people could read one of your pages, even though you'd only sent it out once.
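A toy model shows the effect. This is a sketch only: the classes OriginServer and SharedCache and the host name are hypothetical, and a real cache would also expire pages and sometimes check back with you, which is left out here.

```python
class OriginServer:
    """Stands in for your web server; it only knows what reaches its log."""
    def __init__(self):
        self.log = []

    def get(self, host, path):
        self.log.append((host, path))
        return f"contents of {path}"

class SharedCache:
    """Stands in for an ISP's cache sitting between its users and your server."""
    def __init__(self, origin, hostname):
        self.origin = origin
        self.hostname = hostname
        self.store = {}

    def get(self, path):
        if path not in self.store:
            # Only a cache miss is passed on, and it comes from the cache's host.
            self.store[path] = self.origin.get(self.hostname, path)
        return self.store[path]

origin = OriginServer()
isp_cache = SharedCache(origin, "cache.isp.example")

# 300 different people at the same ISP read your front page.
for _ in range(300):
    isp_cache.get("/index.html")

print(len(origin.log), origin.log)
```

Your logfile ends up with one request, from the cache's host, for 300 readings of the page.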
3. What you can know. The only things you can know for certain are the
number of requests made to your server, when they were made, which files were
asked for, and which host asked you for them.
You can also know what people told you their browsers were, and what the
referring pages were. You should be aware, though, that many browsers lie
deliberately about what sort of browser they are, or even let users configure
the browser name.
And some people use "anonymizers" which deliberately send false
browser names and referrers.
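As a small illustration, these are the kinds of counts that follow directly from the logfile. The entries below are hand-made, not real data, and the variable names are just for this sketch.

```python
from collections import Counter

# Hand-made entries: (host, time, file, browser name as claimed)
entries = [
    ("a.example.net", "10:15:02", "/index.html", "Mozilla/4.0 (compatible)"),
    ("a.example.net", "10:15:03", "/logo.gif", "Mozilla/4.0 (compatible)"),
    ("b.example.com", "10:20:41", "/index.html", "Lynx/2.8"),
    ("cache.isp.example", "11:02:10", "/index.html", ""),
]

# Things you know for certain.
total_requests = len(entries)
requests_by_file = Counter(file for _, _, file, _ in entries)
requests_by_host = Counter(host for host, _, _, _ in entries)

# Things you were told, which may or may not be true.
claimed_browsers = Counter(browser or "(none given)" for *_, browser in entries)

print(total_requests)
print(requests_by_file.most_common())
print(requests_by_host.most_common())
print(claimed_browsers.most_common())
```

Note that every count here is a count of requests. None of them is a count of people.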
4. What you can't know.
- You can't tell the identity of your readers.
Unless you explicitly require users to provide a password, you don't
know who connected or what their email addresses are.
- You can't tell how many visitors you've had.
You can guess by looking at the number of distinct hosts that have
requested things from you. Indeed this is what many programs mean when
they report "visitors". But this is not always a good estimate for
three reasons. First, if users get your pages from a local cache server,
you will never know about it. Secondly, sometimes many users appear to
connect from the same host: either users from the same company or ISP,
or users using the same cache server. Finally, sometimes one user
appears to connect from many different hosts. AOL now allocates users a
different hostname for every request. So if your home page has
10 graphics on it, and an AOL user visits it, most programs will count that
as 11 different visitors!
- You can't tell how many visits you've had.
Many programs, under pressure from advertisers' organisations, define a
"visit" (or "session") as a sequence of requests
from the same host until there is a half-hour gap. This is an unsound
method for several reasons. First, it assumes that each host corresponds
to a separate person and vice versa. This is simply not true in the real
world, as discussed in the previous point. Secondly, it assumes that
there is never a half-hour gap in a genuine visit. This is also untrue.