HTTP Compression Speeds up the Web

by ServerWatch Staff

A longer version of this appeared on WebReference.

by Peter Cranstone

The volume of traffic on the Web is forecast to more than triple over the next three years, and the category expected to grow fastest is data. Data and content will remain the largest percentage of Web traffic, and the majority of this information is dynamic, so it does not lend itself to conventional caching technologies. Issues range from business-to-consumer response and order confirmation times, to the time required to deliver business information to a road warrior using a wireless device, to the download time for rich media such as music or video. Not surprisingly, the number one complaint among Web users is lack of speed. That's where compression, using mod_gzip, can help.

The Solution: Compression

The idea is to compress data being sent out from your Web server and have the browser decompress it on the fly, thus reducing the amount of data sent and increasing the page display speed. There are two ways to compress data coming from a Web server: dynamically and pre-compressed. Dynamic content acceleration compresses the data on the fly as it is transmitted (useful for e-commerce apps, database-driven sites, etc.). Pre-compressed text-based data is generated beforehand and stored on the server (.html.gz files, etc.).

The goal is to send less data. To do this, the data must be analyzed and compressed in real time and decompressed with no user interaction at the other end. Since smaller amounts of data (fewer packets) are being sent, they consume less bandwidth and arrive significantly faster. Network acceleration solutions need to focus on the formats used for data and content, including HTML, XML, SQL, Java, WML and all other text-based languages. Both types of compression use HTTP compression and shrink HTML files to about a third of their original size.
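To get a rough feel for how well text compresses, here is a minimal sketch using Python's standard gzip module. The sample page is a hypothetical, highly repetitive table generated by the script itself, so the ratio it prints is illustrative only and will differ for real pages.

```python
import gzip


def build_sample_html(rows: int = 500) -> bytes:
    """Build a hypothetical, repetitive HTML table to stand in for a real page."""
    body = "".join(
        f"<tr><td>Track {i}</td><td>Artist {i % 20}</td><td>3:45</td></tr>\n"
        for i in range(rows)
    )
    return f"<html><body><table>\n{body}</table></body></html>".encode("ascii")


def compression_report(data: bytes) -> tuple:
    """Return (original size, gzip size, fraction of bytes saved)."""
    packed = gzip.compress(data)
    return len(data), len(packed), 1 - len(packed) / len(data)


if __name__ == "__main__":
    original, packed, saved = compression_report(build_sample_html())
    print(f"{original} bytes -> {packed} bytes ({saved:.1%} saved)")
```

Because the compression is lossless, gzip.decompress() on the compressed bytes returns the original page byte for byte.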

To get an idea of the improvement in speed involved, here's a live demonstration:

Real time Web server content acceleration test:

Why Compress HTML?

HTML is used in most Web pages, and forms the framework where the rest of the page appears (images, objects, etc). Unlike images (GIF, JPEG, PNG) which are already compressed, HTML is just ASCII text, which is highly compressible. Compressing HTML can have a major impact on the performance of HTTP especially as PPP lines are being filled up with data and the only way to obtain higher performance is to reduce the number of bytes transmitted. A compressed HTML page appears to pop onto the screen, especially over slower modems.

The Last Mile Problem

The Web is only as strong as its weakest link. This has always been, and always will be, the last mile to the consumer's desktop. Even with the rapid growth of residential broadband, the growth in narrowband users and data far exceeds broadband's limited reach. Jakob Nielsen expects the standard data transmission speed to remain at 56K until at least 2003, so there is a distinct need to do something to reduce download times. Caching data has its benefits, but only content reduction can make a significant difference in response time. It's always going to be faster to download a smaller file than a larger one.

Is Compression Built into the Browser?

Yes. Most newer browsers since 1998/1999 have been equipped to support the HTTP 1.1 standard known as "content-encoding." Essentially the browser indicates to the server that it can accept "content encoding" and if the server is capable it will then compress the data and transmit it. The browser decompresses it and then renders the page.
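The client side of this exchange can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular browser is implemented: REQUEST_HEADERS and decode_body are hypothetical names, and only the gzip coding (plus its x-gzip alias) is handled.

```python
import gzip

# What a compression-aware client advertises with each request.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}


def decode_body(response_headers: dict, body: bytes) -> bytes:
    """Undo the content coding named in the response before rendering the page."""
    encoding = response_headers.get("Content-Encoding", "identity").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "identity":
        return body  # the server sent the page uncompressed
    raise ValueError(f"unsupported Content-Encoding: {encoding}")


if __name__ == "__main__":
    page = b"<html><body>Hello</body></html>"
    print(decode_body({"Content-Encoding": "gzip"}, gzip.compress(page)))
```

If the server is not capable of compressing, it simply omits the Content-Encoding header and the same function passes the body through unchanged.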

Only HTTP 1.1 compliant clients request compressed files. Clients that are not HTTP 1.1 compliant request and receive the files un-compressed, thereby not benefiting from the improved download times that HTTP 1.1 compliant clients offer. Internet Explorer versions 4 and above, Netscape 4.5 and above, Windows Explorer, and My Computer are all HTTP 1.1 compliant clients by default.

To test your browser, click on this link (works if you are outside a proxy server):

And you'll get a chart like this:


To verify that Internet Explorer is configured to use the HTTP 1.1 protocol:

  1. Open the Internet Options property sheet
    • If using IE 4, this is located under the View menu
    • If using IE 5, this is located under the Tools menu
  2. Select the Advanced tab
  3. Under HTTP 1.1 settings, verify that Use HTTP 1.1 is selected (see Figure 1 below).

IE4/5 Setting HTTP 1.1

What is IETF Content-Encoding (or HTTP Compression)?

In a nutshell... it is simply a publicly defined way to compress HTTP content being transferred from Web Servers down to Browsers using nothing more than public domain compression algorithms that are freely available.

"Content-Encoding" and "Transfer-Encoding" are both clearly defined in the public IETF Internet RFCs that govern the development and improvement of the HTTP protocol, which is the "language" of the World Wide Web. "Content-Encoding" applies to methods of encoding and/or compression that have already been applied to documents before they are requested. This is also known as "pre-compressing pages." The concept never really caught on because of the complex file maintenance burden it represents, and there are few Internet sites that use pre-compressed pages of any description. "Transfer-Encoding" applies to methods of encoding and/or compression used DURING the actual transmission of the data itself.

In modern practice, however, the two are now one and the same. Since most HTTP content from major online sites is now dynamically generated, the line has blurred between what is happening before a document is requested and while it is being transmitted. Essentially, a dynamically generated HTML page doesn't even exist until someone asks for it. The original concept of all pages being "static" and already present on the disk has quickly become an 'older' concept and the originally well defined separation between "Content-Encoding" and "Transfer-Encoding" has simply turned into a rather pale shade of gray. Unfortunately, the ability for any modern Web or Proxy Server to supply "Transfer-Encoding" in the form of compression is even less available than the spotty support for "Content-Encoding."

Suffice it to say that regardless of the two different publicly defined "Encoding" specifications, if the goal is to compress the requested content (static or dynamic) it really doesn't matter which of the two publicly defined "Encoding" methods is used... the result is still the same. The user receives far fewer bytes than normal and everything happens much faster on the client side. The publicly defined exchange goes like this....

A Browser that is capable of receiving compressed content indicates this in all of its requests for documents by supplying the following request header field when it asks for something:

    Accept-Encoding: gzip, compress

  • When the Web Server sees that request field, it knows that the browser is able to receive compressed data in one of only two formats: either standard GZIP or the UNIX "compress" format. It is up to the Server to compress the response data using either one of those methods (if it is capable of doing so).

  • If a compressed static version of the requested document is found on the Web Server's hard drive which matches one of the formats the browser says it can handle then the Server can simply choose to send the pre-compressed version of the document instead of the much larger uncompressed original.

  • If no static document is found on the disk which matches any of the compressed formats the browser is saying it can "Accept" then the Server can now either choose to just send the original uncompressed version of the document or make an attempt to compress it in "real-time" and send the newly compressed and much smaller version back to the browser.

    Most popular Web Servers are still unable to do this final step.

    • The Apache Web Server which has 61 percent of the Web Server market is still incapable of providing any real-time compression of requested documents even though all modern browsers have been requesting them and capable of receiving them for more than two years.

    • Microsoft's Internet Information Server is equally deficient. If it finds a pre-compressed version of a requested document it might send it but has no real-time compression capability.

      IIS 5.0 uses an ISAPI filter to support GZIP compression. It works as follows. The user requests a page, the server sends the page and then stores a copy of it "compressed" in a temporary folder. The next time a user requests the page it sends the one stored in the temp directory.

      What it then tries to do is constantly check that the pages in the temp directory are always current, and if not gets a current page and then compresses it.

    • IBM's WebSphere Server has some limited support for real-time compression but it has "appeared" and "disappeared" from the marketplace through various release versions of WebSphere.

    • The very popular Squid proxy server from NLANR also has no dynamic compression capabilities even though it is the de-facto standard proxy-caching software used just about everywhere on the Internet.
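The cache-and-recheck behavior described above for the IIS 5.0 filter can be approximated with the following sketch. It is not IIS code: cached_gzip and its parameters are hypothetical, staleness is judged by file modification time (an assumption), and for brevity the first request already returns the compressed copy rather than the uncompressed page.

```python
import gzip
from pathlib import Path


def cached_gzip(source: Path, cache_dir: Path) -> bytes:
    """Return gzip bytes for source, recompressing only when the cached copy is stale."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / (source.name + ".gz")
    # Treat the cached copy as stale if it is missing or not newer than the source.
    if not cached.exists() or cached.stat().st_mtime <= source.stat().st_mtime:
        cached.write_bytes(gzip.compress(source.read_bytes()))
    return cached.read_bytes()
```

This approach only helps when the same document is requested more than once, which is why it does little for pages that are generated dynamically on every request.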

    The original designers of the HTTP protocol really did not foresee the current reality with so many people using the protocol that every single byte would count. The heavy use of pre-compressed graphics formats such as .GIF and the relative difficulty to further reduce the graphics content makes it even more important that all other exchange formats be optimized as much as possible. The same designers also did not foresee that most HTTP content from major online vendors would be generated dynamically and so there really is no real chance for there to ever be a "static" compressed version of the requested document(s). Public IETF Content-Encoding is still not a "complete" specification for the reduction of Internet content but it does work and the performance benefits achieved by using it are both obvious and dramatic.
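The three-step exchange described earlier (check the request header, look for a pre-compressed copy, otherwise compress in real time or fall back to the original) can be sketched as follows. This is a simplified illustration rather than the logic of any real server: accepts_gzip and build_response are hypothetical names, only gzip is considered, and the ".gz" suffix convention for pre-compressed files is an assumption.

```python
import gzip
from pathlib import Path


def accepts_gzip(accept_encoding: str) -> bool:
    """True if the Accept-Encoding value lists gzip with a non-zero q-value."""
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        if token.strip().lower() in ("gzip", "x-gzip"):
            q = params.strip().lower()
            if q.startswith("q="):
                try:
                    return float(q[2:]) > 0
                except ValueError:
                    return False
            return True
    return False


def build_response(path: Path, accept_encoding: str) -> tuple:
    """Return (headers, body) for a static file."""
    if accepts_gzip(accept_encoding):
        precompressed = path.with_name(path.name + ".gz")
        if precompressed.is_file():
            body = precompressed.read_bytes()        # pre-compressed copy on disk
        else:
            body = gzip.compress(path.read_bytes())  # compress in real time
        return {"Content-Encoding": "gzip"}, body
    return {}, path.read_bytes()                     # client cannot decompress
```

A client that sends no Accept-Encoding header, or one that excludes gzip, always receives the uncompressed original, so older browsers keep working.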

    What is GZIP?

  • It's a lossless compressed data format. The deflation algorithm used by GZIP (also zip and zlib) is an open-source, patent-free variation of LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the previous string, in the form of a pair (distance, length). Distances are limited to 32K bytes, and lengths are limited to 258 bytes. When a string does not occur anywhere in the previous 32K bytes, it is emitted as a sequence of literal bytes. (In this description, "string" must be taken as an arbitrary sequence of bytes, and is not restricted to printable characters.)
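The (distance, length) idea can be illustrated with a toy tokenizer. This is a brute-force teaching sketch, not the gzip implementation: real deflate finds matches far more efficiently and then Huffman-codes the result. The function names are hypothetical, and the minimum match length of 3 is the one deflate uses.

```python
WINDOW = 32 * 1024  # maximum back-reference distance
MAX_LEN = 258       # maximum match length
MIN_LEN = 3         # shorter matches are emitted as literal bytes


def lz77_tokens(data: bytes) -> list:
    """Greedy tokenizer: ints are literal bytes, tuples are (distance, length)."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - WINDOW), i):
            length = 0
            while (length < MAX_LEN and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= MIN_LEN:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_expand(tokens: list) -> bytes:
    """Rebuild the original bytes from literals and back-references."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # byte by byte, so overlapping copies work
        else:
            out.append(token)
    return bytes(out)
```

Running lz77_tokens on markup such as b"<td>a</td><td>b</td>" shows why HTML compresses so well: once a tag has been seen, later occurrences collapse into short back-references.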

    Technical Overview

    HTML/XML/JavaScript/text compression: Does it make sense?

    The short answer is "only if it can get there quicker." In 99% of all cases it makes sense to compress the data. However there are several problems that need to be solved to enable seamless transmission from the server to the consumer.

    • Compression should not conflict with MIME types
    • Dynamic compression should not affect server performance
    • Server should be smart enough to know whether the user's browser can decompress the content
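One way to keep compression from conflicting with MIME types is to compress only text-based responses and leave already-compressed formats such as images alone. The sketch below is an assumption about how such a check might look; the list of types is illustrative, not exhaustive, and is_compressible is a hypothetical name.

```python
# Text-based media types worth compressing (illustrative, not exhaustive).
COMPRESSIBLE_PREFIXES = ("text/",)
COMPRESSIBLE_TYPES = {
    "application/xml",
    "application/javascript",
    "application/x-javascript",
}


def is_compressible(content_type: str) -> bool:
    """Decide from the Content-Type header whether a response is worth compressing."""
    # Drop parameters such as "; charset=..." and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith(COMPRESSIBLE_PREFIXES) or media_type in COMPRESSIBLE_TYPES
```

With this check, "text/html; charset=ISO-8859-1" is compressed while "image/gif" is sent as-is.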

    Let's create a simple scenario. An HTML file which contains a large music listing in the form of a table. This file is 679,188 bytes in length.

    Let's track this download over a 28K modem and compare the results before and after compression. The theoretical throughput of a 28K modem is 3,600 bytes per second. Reality is more like 2,400 bytes per second, but for the sake of this article we will work at the theoretical maximum. With no modem compression, the file would download in 188.66 seconds. On average, with modem compression running, we can expect a download time of about 90 seconds, which indicates a compression factor of about 2:1. Modem compression effectively "halved" the number of bytes sent from modem to modem. But note that the server still had to keep its TCP/IP subsystem open to "send" all 679,188 bytes to the modem for transmission.

    What happens if we compress the data prior to transmission from the server? The file is 679,188 bytes in length. If we compress it using standard techniques (which are not optimized for HTML), we can expect the file to shrink to 48,951 bytes. That is a 92.79% reduction. We are now transmitting only 48,951 bytes (plus some header information, which should also be compressed, but that's another story). Modem compression is no longer a factor because the data is already compressed.

    Where are the performance improvements?

    • Bandwidth is conserved
    • Compression consumes only a few milliseconds of CPU time
    • The server's TCP/IP subsystem only has to serve 48,951 bytes to the modem
    • At a transfer rate of 3,600 bytes per second the file arrives in 13.6 seconds instead of 90 seconds
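The arithmetic behind these figures is easy to reproduce. The short script below uses only the numbers from the scenario above; the function names are illustrative.

```python
FILE_BYTES = 679_188          # uncompressed HTML file
COMPRESSED_BYTES = 48_951     # same file after compression on the server
MODEM_BYTES_PER_SEC = 3_600   # theoretical 28K modem throughput


def download_seconds(size_bytes: int, rate: int = MODEM_BYTES_PER_SEC) -> float:
    """Transfer time at a constant byte rate."""
    return size_bytes / rate


def savings(original: int, compressed: int) -> float:
    """Fraction of bytes removed by compression."""
    return 1 - compressed / original


if __name__ == "__main__":
    print(f"uncompressed: {download_seconds(FILE_BYTES):.2f} s")
    print(f"compressed:   {download_seconds(COMPRESSED_BYTES):.1f} s")
    print(f"reduction:    {savings(FILE_BYTES, COMPRESSED_BYTES):.2%}")
```

Substituting the more realistic 2,400 bytes per second for the rate lengthens both times in proportion and leaves the relative saving unchanged.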

    Compression clearly makes sense as long as it's seamless and doesn't kill server performance.

    What else remains to be done?

    A lot! Better algorithms need to be invented that compress the data stream more efficiently than gzip. Remember, gzip was designed before HTML came along. Any technique that adds a new compression algorithm will require a thin client to decode it, and possibly tunneling techniques to make it "firewall friendly." To sum up, we need:

    1. Improved compression algorithms optimized specifically for HTML/XML.
    2. Header compression. Every time a browser requests a page it sends a header file. In the case of WAP browsers, header information can be as high as 900 bytes. With compression this can be reduced to fewer than 100 bytes.
    3. Compression for WAP. (Currently WAP/WML does not support a true entropy encoding technique. It uses binary encoding to compress the tags while ignoring the content.)
    4. Dynamic compression for caching servers.
    5. Real-time compression/encryption with tunneling.

  • This article was originally published on Friday Oct 13th 2000