Log Analysis Basics

by Martin Brown

Server logs can provide volumes of meaningful data about what the server is doing — if you know how to read them. We discuss the key points and techniques of log analysis to help you get the most out of your log files.

You'd be amazed at how much information your machine, operating systems, and applications generate during their normal course of operation. One of my relatively quiet Unix servers, for example, generates about 2 MB of syslog information every week. But that information is completely useless unless it is converted into meaningful data about what is going on on the server. To do this, I need to know about errors, any potential problems, and any failures that could cause the machine to go down or fail at a critical time. In other words, I have to analyze the logs.

Log Types
Log Contents
Converting Logs Into Useful Information
Tracking Rather than Analysis

This article covers some of the basics of log analysis, hitting on what we believe are the key points and techniques, so you too can analyze your voluminous server logs.

Log Types

Logs fall into a number of different categories, based on their format, source, and typical contents. I'm not going to list them all here; doing so would take up the rest of the article and probably the rest of the year. We can, however, generalize into a few key types.

  • Content: Log content can be information, alerts and warnings, or fatal errors. The access logs in Apache and IIS are examples of informational logs. Alerts, warnings, and fatal errors are typically lumped together into a single 'error' log (which is essentially what the syslog is within Unix), or may be further split into specific types of errors or sources (in the style of the Event Log system in Windows). In some cases, all log information is dumped together into a single file, and it's the file content that helps describe what a particular entry is referring to.

  • Source: Logs come from everything — from applications and the system to drivers and libraries. The source is used as a method of classification. For example, the security subsystems may have their own log, or their log can define where and how information is updated. System logs are generated and handled by the operating system; application logs may be with the application, in a central location with the system logs, or in a temporary location.

  • Format: Logs can be in either a text or binary format. Not surprisingly, text is the more popular format because, from the developer's and reader's perspectives, it is the easiest to work with. Binary is generally impossible to read without some form of processing, but information in a binary log may be better formatted and can use specific, structured data types for elements such as dates, times, and classification. This makes it easier to parse (provided the format is known) because complex assumptions or judgements about the content don't need to be made. Dates and times are examples of binary-friendly data; in a text file, they must be processed to identify them as recognizable, usable dates.
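To make the date-handling point concrete: in a text log, a timestamp is just a run of characters until it is parsed. The fragment below is a minimal Python sketch, assuming an Apache-style `%d/%b/%Y:%H:%M:%S` stamp (the timezone offset that follows it in a real log is ignored here, and the function name is my own):

```python
import time

def parse_clf_time(stamp):
    # Convert an Apache-style timestamp string into a structured
    # time value that can be compared, sorted, and used in arithmetic.
    return time.strptime(stamp, "%d/%b/%Y:%H:%M:%S")

parsed = parse_clf_time("11/Feb/2004:12:21:57")
print(parsed.tm_year, parsed.tm_hour)  # 2004 12
```

Once the stamp is structured data rather than text, grouping entries by hour or day becomes a simple comparison rather than string surgery.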

Regardless of the originating format, location, and content, to get any useful information out of the logs, they must be processed so that each log entry, and each of the fields that make up the entry, can be identified.

Often, the log format is predetermined. There are standards for syslog, HTTP logs, and many others. However, if you are lucky, you can also change the format of the output log within certain applications. This lets you customize the contents of the log and the format of the contents, making it easier to post-process the contents. For example, the standard access log format within Apache 2.0 is configured with the following line in the configuration file:

LogFormat "%h %l %u %t \"%r\" %>s %b" common

However, it's completely configurable, so Apache can be set to create XML-like output by changing the above line to read:

LogFormat "<logitem><host>%h</host><ident>%l</ident><user>%u</user> \
    <time>%t</time><request>%r</request><status>%>s</status> \
    <bytes>%b</bytes></logitem>" xmlstyle
Note that these lines (and many of the examples throughout this article) have been formatted for clarity and should each be on a single physical line. A similar level of log customization, by the way, is available within IIS 6 on Windows Server 2003.


Log Contents

The first step in analyzing the contents of your log files is picking the real data out of the log. To do this, you must understand the format. With text files, the information is normally formatted in a specific way with defined fields, using either a single-character delimiter such as a space or a colon, or using fixed-width fields. In addition, individual fields may also be delimited or formatted according to their content. The block below is an example from an Apache Web server:

 - - [11/Feb/2004:12:21:57 +0000] "GET / HTTP/1.1" 200 11669
 - - [11/Feb/2004:12:21:59 +0000] "GET /mcslp.css HTTP/1.1" 200 4828
 - - [11/Feb/2004:12:21:59 +0000] "GET /weather/images/3.gif HTTP/1.1" 200 566
 - - [11/Feb/2004:12:22:21 +0000] "GET /mail/index.cgi?m=v&mbox=com-mcslp-lbt&id=2532 HTTP/1.1" 200 20656
 - - [11/Feb/2004:12:22:22 +0000] "GET /mcslp.css HTTP/1.1" 304 0

This example shows a mixture of text delimiters for the fields in the form of spaces as well as field delimiters to signify the date/time and URL components of the log. Here's another example, this time from syslog:

May 16 18:14:30 twinsol sm-mta[22012]: [ID 801593 mail.info] i4GHEQxG022012: 
  from=<lwmeditors-bounces@shetland.sys-con.com>, size=20913, class=-30, nrcpts=1,
  proto=ESMTP, daemon=MTA, relay=postfix@plunder.dreamhost.com []
May 16 18:14:30 twinsol sm-mta[22017]: [ID 801593 mail.info] i4GHEQxG022012: 
  to=<com-mcslp-lbt@gendarme.mcslp.com>, delay=00:00:01, 
  xdelay=00:00:00, mailer=cyrusv2, pri=194913, relay=localhost, dsn=2.0.0, stat=Sent

Being able to read and understand these logs helps focus your approach and provides a basis to analyze the data.
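As a rough illustration, the sm-mta entry above can be picked apart with a regular expression. This is a minimal Python sketch; the pattern is an assumption built from the sample line shown, not a general syslog parser:

```python
import re

# One syslog line from the earlier example, wrapped for readability.
LINE = ('May 16 18:14:30 twinsol sm-mta[22012]: [ID 801593 mail.info] '
        'i4GHEQxG022012: from=<lwmeditors-bounces@shetland.sys-con.com>, '
        'size=20913, class=-30, nrcpts=1')

# Capture the timestamp, the program name before the PID, and the sender.
match = re.search(r'(\w+\s+\d+\s+[\d:]+).*?\s(\S+)\[\d+\]:.*?from=<(.*?)>', LINE)
if match:
    stamp, program, sender = match.groups()
    print(program, sender)  # sm-mta lwmeditors-bounces@shetland.sys-con.com
```

The same pattern-first approach scales from a one-off extraction like this up to a full report generator.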

Most log analysis tools will provide a range of information, but the most common information to be reported are the basic statistics of the log information. For example, from a Web log you can obtain a list of URLs visited and a count of the number of times they were accessed. This provides useful information about the popularity of a particular page or area of your site.

If your logs provide a range of information, particularly something like the date and time of each entry, you can also use this information to generate statistics. You can, for example, monitor the access to a particular page or area of your site over a period of time, perhaps to determine the most popular times for visiting different pages of the site. Over the longer term, you can use this information to get usage statistics for the site, watching how access grows or how different parts of the site gain popularity.
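As a sketch of that kind of time-based statistic, the fragment below counts hits per hour from Apache-style lines. The sample lines are invented for illustration, with the host field elided:

```python
import re

# Hypothetical Apache common-log lines (host column elided as '-').
LINES = [
    '- - - [11/Feb/2004:12:21:57 +0000] "GET / HTTP/1.1" 200 11669',
    '- - - [11/Feb/2004:12:21:59 +0000] "GET /mcslp.css HTTP/1.1" 200 4828',
    '- - - [11/Feb/2004:13:05:01 +0000] "GET / HTTP/1.1" 200 11669',
]

hits_per_hour = {}
for line in LINES:
    # Pull the hour out of the [dd/Mon/yyyy:hh:mm:ss ...] timestamp.
    match = re.search(r'\[\d+/\w+/\d+:(\d+):', line)
    if match:
        hour = match.group(1)
        hits_per_hour[hour] = hits_per_hour.get(hour, 0) + 1

print(hits_per_hour)  # {'12': 2, '13': 1}
```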

Other logs provide alternative types of information and statistics. For example, I use a log processor on my syslog to generate a list of e-mail messages transferred through the machine, recording the date/time, source, and destination address. I'm not as concerned with actual statistics as I am about extracting the salient information from the log.


Converting Logs Into Useful Information

A host of log analysis tools will translate the content of various logs into useful statistics, but it is actually relatively straightforward to build your own. Although you can achieve this in any language, a scripting solution such as Perl or Python is the most practical way to go because of its flexible data handling and built-in data types, such as the Perl hash and Python dictionary.

Once you know the log format, pulling out the information is relatively easy. For example, here's a very simple parser, in Perl, for a standard Apache access log:

while (<INLOG>) {
    ($host,$ident,$user,$time,$url,$success,$bytes) =
        m/^(\S+)\s+(\S+)\s+(\S+)\s+\[(.*)\]\s+"(.*)"\s+(\S+)\s+(\S+)\s*$/;
    ($day,$mon,$year,$hour,$min,$sec) =
        ($time =~ m%(..)/(...)/(....):(..):(..):(..)%);
}

Here's a similar solution, in Python:

while 1:
    line = file.readline()
    if not line:
        break
    splitline = string.split(line)
    if len(splitline) != 10:
        continue
    (host, ident, user, time, tz,
     method, loc, httpver, success, bytes) = splitline

Now that we have the basic fields, we can build counters and cross-referencing systems to track and report on different elements. For example, to get a list of the unique URLs accessed we can use a hash or dictionary to count them up. In Perl this looks like:

$urlaccesses{$url} += 1;

while in Python we have to embed it in a try statement to set the initial value:

try:
    urlaccess[loc] = urlaccess[loc] + 1
except KeyError:
    urlaccess[loc] = 1

You can repeat the same basic process with any of the other values in the log that we've picked out through the field information. Once you've processed the log, simply output the summary information generated as the log was processed.
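That final reporting step might look like the following in Python. The urlaccess dictionary here is a hypothetical stand-in for the counts built up while parsing the log:

```python
# Hypothetical per-URL counts accumulated during log parsing.
urlaccess = {'/': 42, '/mcslp.css': 17, '/weather/images/3.gif': 5}

# Report the URLs in descending order of popularity.
for url, count in sorted(urlaccess.items(),
                         key=lambda item: item[1], reverse=True):
    print("%6d %s" % (count, url))
```

This prints the most frequently accessed URLs first, one per line, with the count right-aligned ahead of the URL.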

If you just want to extract specific fields of information from a log to report on the contents and ignore the unnecessary parts, you can ignore the statistical gathering and use the parser as a reformatting tool. The syslog extraction tool mentioned earlier, which extracts the mail source and destination, is written in Perl and looks like this:

my (%from, %to, %time, @id);
while (<>) {
    if (m/(\S+\s+\d+\s+[\d:]+).*?mail.info\]\s+(\S+):.*?from=<(.*?)>/) {
        push @id, $2;
        $from{$2} = $3;
        $time{$2} = $1;
    }
    if (m/mail.info\]\s+(\S+):.*?to=(.*?),/) {
        $to{$1} = $2;
    }
}

We use a regular expression to pick out the necessary information, and in the process create a number of hashes that map the unique ID for each e-mail with its sender, destination, and date/time. To report on the information, we process through one of the hashes and print out the corresponding data:

foreach my $id (@id) {
    if (exists($to{$id})) {
        $to{$id} =~ s/[<>]//g;
        my ($pre,$post) = split /@/,$to{$id};
        next if ($pre =~ /ESMTP/);
        next if ($from{$id} =~ /admin\@mcslp.com/);
        my ($frombat,$fromaat) = split /\@/,$from{$id};
        $frombat = substr($frombat,0,8);
        printf("%s %-40s => %s\n",$time{$id},"$frombat\@$fromaat",$pre);
    }
}

This example also demonstrates how to filter out information or parts of the log we aren't interested in. In the above example, e-mail destined for the main administration account of the domain is ignored because we are interested only in user-related e-mails.

Finally, again as shown in the previous example, there is nothing to stop you from adjusting or reformatting source data to suit your requirements. In this case, we've removed the domain information in the destination address. In Weblogs, you might want to filter the report to work only on a particular part of the site or perhaps ignore URLs other than those relating to HTML or CGI content (e.g., images and movies).
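A sketch of that kind of URL filtering in Python follows; the extension list and helper name are my own choices for illustration, not part of any standard tool:

```python
# Only report on HTML and CGI content (plus bare directory requests);
# skip images, stylesheets, and other assets.
WANTED = ('.html', '.cgi', '/')

def is_reportable(url):
    # Strip any query string first, then test the extension.
    path = url.split('?')[0]
    return path.endswith(WANTED)

urls = ['/', '/index.html', '/mail/index.cgi?m=v',
        '/weather/images/3.gif', '/mcslp.css']
print([u for u in urls if is_reportable(u)])
# ['/', '/index.html', '/mail/index.cgi?m=v']
```

Applying a filter like this before counting keeps asset fetches from inflating the page-view statistics.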

Tracking Rather Than Analysis

Summarizing logs and generating statistics is relatively easy — we're talking about counting information based on a specific field's content, such as the URL or host. This information is fine if all you want is basic counts and statistics, but it may not provide for all of your information needs, as occasionally you may want to trace the progress of a user, issue, or element through the history of the log.

If you are tracking an individual user through the system, for example, you would want to identify which pages he or she viewed. This type of analysis goes beyond the basic statistical systems highlighted here. The basic processing and parsing of the log into an internal data structure remains the same, but how you later analyse that information differs.
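One simple way to sketch that difference in Python: rather than counting a field, group the parsed entries by host so the sequence of pages each visitor viewed is preserved in log order. The (host, url) tuples below stand in for fields parsed from a real log:

```python
# Hypothetical (host, url) pairs extracted from a parsed access log,
# already in chronological order.
requests = [
    ('10.0.0.1', '/'),
    ('10.0.0.2', '/'),
    ('10.0.0.1', '/mcslp.css'),
    ('10.0.0.1', '/mail/index.cgi'),
]

trail = {}
for host, url in requests:
    # Append each request to that visitor's page trail.
    trail.setdefault(host, []).append(url)

print(trail['10.0.0.1'])  # ['/', '/mcslp.css', '/mail/index.cgi']
```

The parsing stage is unchanged; only the data structure differs, keeping an ordered list per visitor instead of a single counter per URL.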

If you want to know more about tracking and more complex log analysis techniques, let me know, and I'll cover the topic in an upcoming article. Please include any specific examples that you might want covered and provide details on the type of information you want to track and report.

This article was originally published on Thursday Jun 10th 2004