The problem with log files is that they track an enormous amount of
information -- not all of it much good
to the people that pay your salary.
In the first sections of this
series, I've talked about what goes into the
standard log files, and how you can change the
contents of those files.
This week, we're
looking at how to get meaningful information back
out of those log files.
The problem is that although there is an enormous amount of information in the log files, it's not much good to the people that pay your salary. They want to know how many people visited your site, what they looked at, how long they stayed, and where they found out about your site. All of that information is (or might be) in your log files. They also want to know the names,
addresses, and shoe sizes of those people, and,
hopefully, their credit card numbers. That
information is not in there, and you need to know
how to explain to your employer that not only is
it not in there, but the only way to get this
information is to explicitly ask your visitors
for it, and be willing to be told no.

There is a lot of information available to put in your log files, including the following:

- Address of the remote machine - This is almost the same as "who is visiting my web site," but not quite. More specifically, it tells you where that visitor is from. This will be discussed in a little more detail below.

- Time of visit - When did this person
come to my web site? This can tell you something
about your visitors. If most of your visits come
between the hours of 9 a.m. and 4 p.m., then
you're probably getting visits from people at
work. If it's mostly 7 p.m. through midnight,
people are looking at your site from home.
Single records, of course, give you very
little useful information, but across several
thousand 'hits', you can start to gather useful information. (A small Perl sketch just after this list shows one way to pull numbers like these out of a log file.)

- Pages requested - What parts of
your site are most popular? Those are the parts
that you should expand. Which parts of the site
are completely neglected? Perhaps those parts of
the site are just really hard to get to. Or,
perhaps they are genuinely uninteresting, in
which case you should spice them up a little. Of
course, some parts of your site, such as your
legal statements, are boring and there's nothing
you can do about it, but they need to stay on the
site for the two or three people that want to see them.

- Errors - And, of course,
your logs tell you when things are not working as
they should be. Do you have broken links? Do
other sites have links to your site that are not
correct? Are some of your CGI programs
malfunctioning? Is a robot overwhelming your site
with thousands of requests per second? (Yes, this
has happened to me. In fact, it's the reason that
I did not get this article in on time last week!)
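Here's the sketch promised above. It's a minimal bit of Perl -- assuming your log is in the Common Log Format, and using an example path you would replace with your own -- that tallies hits by hour of the day, so you can see the work-hours versus evening pattern for yourself.

    #!/usr/bin/perl
    # Tally hits by hour of day from a Common Log Format access log.
    # The default log path is only an example; pass your own as an argument.
    use strict;
    use warnings;

    my $log = shift @ARGV || '/var/log/httpd/access_log';
    open my $fh, '<', $log or die "Can't open $log: $!\n";

    my %hits_by_hour;
    while (my $line = <$fh>) {
        # CLF timestamps look like [01/Jan/2001:13:55:36 -0500];
        # the two digits after the first colon are the hour.
        next unless $line =~ m{\[\d+/\w+/\d+:(\d\d):};
        $hits_by_hour{$1}++;
    }
    close $fh;

    for my $hour (sort keys %hits_by_hour) {
        printf "%s:00  %d hits\n", $hour, $hits_by_hour{$hour};
    }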
HTTP is a stateless, anonymous
protocol. This is by design, and is not, at least
in my opinion, a shortcoming of the protocol. If
you want to know more about your visitors, you
have to be polite, and actually ask them. And be
prepared to not get reliable answers. This is
amazingly frustrating for marketing types. They
want to know the average income, number of kids,
and hair color of their target demographic. Or
something like that. And they don't like to be
told that that information is not available in
the log files. However, it is quite beyond your
control to get this information out of the log
files. Explain to them that HTTP is anonymous by design.

And even what the log files do
tell you is occasionally suspect. For example, I
have numerous entries in my log files, generated today, all coming from one particular machine. I can tell from its name that it is a
machine that is on the AOL network. But because
of the way that AOL works, this might be one
person visiting my site many times, or it might
be many people visiting my site one time each.
AOL does something called proxying, and
you can see from the machine address that it is a
proxy server. A proxy server is one that one or
more people sit behind. When they type an address into their browser, the browser sends that request to the proxy
server. The proxy server gets the page
(generating the log file entry on my web site).
It then passes that page back to the requesting
machine. This means that I never see the request
from the originating machine, but only the
request from the proxy.
A further implication of this is that if, 10 minutes later,
someone else sitting behind that same proxy
requests the same page, they don't generate a log
file entry at all. They type in the address, and
that request goes to the proxy server. The proxy
sees the request and thinks "I already have that
document in memory. There's no point asking the
web site for it again." And so instead of asking
my web site for the page, it gives the copy that
it already has to the client. So, not only is the
address field suspect, but the number of requests
is also suspect.
It might sound like the
data that you receive is so suspect as to be
useless. This is in fact not the case. It should
just be taken with a grain of salt. The number of
hits that your site receives is almost certainly
not really the number of visitors that came to
your site. But it's a good indication. And it
still gives you some useful information. Just
don't rely on it for exact numbers.
Now for the real meat of all of this: how do you actually generate statistics from your web server logs?

There are two main approaches that
you can take here. You can either do it yourself,
or you can get one of the existing applications
that is available to do it for you.
Unless you have custom log files that don't look anything like the Common Log Format,
you should probably get one of the available apps
out there. There are some excellent commercial
products, and some really good free ones, so you
just need to decide what features you are looking for.

So, without further ado, here are some of the great apps out there that can help you
with this task.
The Analog web site
claims that about 29 percent of all web sites
that use any log analysis tool at all use Analog.
That, they say, makes it the most popular
log analysis tool in the world. This fascinated
me in particular, because until last week, I had
never heard of it. I suppose that this is because
I was happy with what I was using, and had never
really looked for anything else.
The sample report, which you can see on the Analog web site, seemed very thorough, and contained all of the
stats that I might want. In addition to the pages
and pages of detailed statistics, there was a
very useful executive summary, which will
probably be the only part that your boss will
really care about.
Another log analysis tool that I have been
introduced to in the past few months is
WebTrends. WebTrends provides astoundingly
detailed reports on your log files, giving you
all sorts of information that you did not know
you could get out of these files. And there are
lots of pretty graphs generated in the report.
WebTrends has, in my opinion, two counts against it. The first is that it is really
expensive. You can look up the actual price on
their web site.
The other is that it is painfully slow. A 50MB log
file from one site for which I am responsible
(one month's traffic) took about 3 hours to grind
through to generate the report. Admittedly, it's
doing a heck of a lot of stuff. But, for the sake
of comparison, the same log file took about 10
minutes using WWWStat. Some of this is just the
difference between Perl's ability to grind
through text files and C's ability. But 3 hours
seemed a little excessive.
Now that I've mentioned it, WWWStat is the
package that I've been using for about 6 years
now. It's fast, full-featured, and free. What more could you want? You can get it at http://www.ics.uci.edu/pub/websoft/wwwstat/ and there is a companion package (linked from that same page) that generates pretty graphs.
It is very easy to automate WWWStat so that it generates your log statistics every night at midnight, and then generates monthly reports at the end of each month.
It may not be as full-featured as WebTrends, but it has given me all the stats that I've ever needed.
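For what it's worth, the nightly automation I mentioned is nothing more than a cron job. The paths below are placeholders, and the exact wwwstat invocation depends on how you have configured it (check its documentation), but the general shape looks something like this:

    # m h dom mon dow   command
    # Every night at midnight, grind through the access log and
    # publish the report where the web server can serve it.
    0 0 * * * /usr/local/bin/wwwstat /var/log/httpd/access_log > /var/www/stats/index.html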
Another fine product from Boutell.com, Wusage is now in version 7. I've used it on and off through the years, and have always been impressed by not only the quality of the software but also the amazing responsiveness of the technical support staff.
You can get Wusage at http://www.boutell.com/wusage/
If you want to do your own log parsing and reporting, the best tool for the task is going to be Perl. In fact, Perl's name (Practical Extraction and Report Language) is a tribute to its ability to extract useful information from logs and generate reports. (In reality, the name ''Perl'' came before the expansion of it, but I suppose that does not detract from my point.)
The Apache::ParseLog module, available from your favorite CPAN mirror, makes parsing log files simple, and so takes all the work out of generating useful reports from those logs.
For detailed information about how to use this module, install it and read the documentation. Once you have installed the module, you can get at the documentation by typing perldoc Apache::ParseLog at the command line.
Trolling through the source code for WWWStat is another good way to learn about Perl log file parsing.
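If you'd rather see the do-it-yourself approach in action, here is a rough sketch of the kind of report you can hand-roll in a few lines of Perl. It doesn't use Apache::ParseLog at all -- it's just a regular expression over Common Log Format lines -- and the default path and the exact pattern are assumptions you'd adjust for your own logs.

    #!/usr/bin/perl
    # A quick do-it-yourself report: the ten most requested documents,
    # plus anything that drew a 404, from a Common Log Format file.
    use strict;
    use warnings;

    my $log = shift @ARGV || '/var/log/httpd/access_log';
    open my $fh, '<', $log or die "Can't open $log: $!\n";

    my (%requests, %missing);
    while (my $line = <$fh>) {
        # host ident user [date] "METHOD /path HTTP/1.x" status bytes
        next unless $line =~ /"(?:GET|POST|HEAD) (\S+)[^"]*" (\d{3})/;
        my ($doc, $status) = ($1, $2);
        $requests{$doc}++;
        $missing{$doc}++ if $status eq '404';
    }
    close $fh;

    print "Top ten documents:\n";
    my @by_count = sort { $requests{$b} <=> $requests{$a} } keys %requests;
    splice @by_count, 10 if @by_count > 10;
    printf "  %6d  %s\n", $requests{$_}, $_ for @by_count;

    print "\nBroken links (404s):\n";
    printf "  %6d  %s\n", $missing{$_}, $_ for sort keys %missing;

From there it's a short step to referrer breakdowns, per-day totals, and so on -- which is, of course, exactly the sort of thing the packages above already do for you.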
Not much more to say here. I'm sure that I've left out someone's favorite log parsing tool, and that's to be expected; there are hundreds of them on the market. It's really a question of how much you want to pay, and what sort of reports you need.
Thanks for reading. Let me
know if there are any subjects that you'd like to
see articles on in the future. You can contact me