The volume on the Web is forecast to more than triple over the next three years, and the category expecting the fastest growth is data. The solution: compression.
A longer version of this appeared on WebReference.
by Peter Cranstone
The volume on the Web is forecast to more than triple over the next three years, and the category expecting the fastest growth is data. Data and content will remain the largest share of Web traffic, and because the majority of this information is dynamic, it does not lend itself to conventional caching technologies. The issues range from business-to-consumer response and order-confirmation times, to the time required to deliver business information to a road warrior using a wireless device, to the download time for rich media such as music or video. Not surprisingly, the number one complaint among Web users is lack of speed. That's where compression, in the form of mod_gzip, can help.
The Solution: Compression
The idea is to compress data being sent out from your Web server and have the browser decompress it on the fly, reducing the amount of data transmitted and increasing page display speed. There are two ways to compress data coming from a Web server: dynamically and pre-compressed. Dynamic content acceleration compresses the data on the fly as it is transmitted (useful for e-commerce applications, database-driven sites, and so on). Pre-compressed text-based data is generated beforehand and stored on the server (.html.gz files, etc.).
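As a rough illustration of the pre-compressed approach, a short Python script can generate a .html.gz twin for every page under a document root (the path here is a hypothetical example, not part of the original article):

    import gzip
    import shutil
    from pathlib import Path

    # Hypothetical document root -- adjust to your own server layout.
    DOCROOT = Path("/var/www/html")

    # Write a pre-compressed .html.gz file next to every .html file,
    # so the server can hand the smaller twin to capable browsers.
    for page in DOCROOT.rglob("*.html"):
        target = page.with_name(page.name + ".gz")
        with open(page, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)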
The goal is to send less data. To do this, the data must be analyzed and compressed in real time on the server and decompressed with no user interaction at the other end. Since smaller amounts of data (fewer packets) are being sent, they consume less bandwidth and arrive significantly faster. Network acceleration solutions need to focus on the formats used for data and content, including HTML, XML, SQL, Java, WML, and all other text-based languages. Both types of compression rely on HTTP compression and can shrink HTML files to a third of their original size or less.
Why Compress HTML?
HTML is used in most Web pages and forms the framework in which the rest of the page (images, objects, etc.) appears. Unlike images (GIF, JPEG, PNG), which are already compressed, HTML is just ASCII text, and ASCII text is highly compressible. Compressing HTML can have a major impact on the performance of HTTP, especially as PPP lines fill up with data and the only way to obtain higher performance is to reduce the number of bytes transmitted. A compressed HTML page appears to pop onto the screen, especially over slower modems.
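To see how compressible plain markup is, a few lines of Python make the point; the repetitive sample string below is just a stand-in for a real page:

    import gzip

    # A trivial stand-in for a real page: repetitive markup, like a
    # large HTML table, compresses extremely well.
    html = "<tr><td>Artist</td><td>Title</td><td>Price</td></tr>\n" * 1000

    raw = html.encode("ascii")
    packed = gzip.compress(raw)
    print(len(raw), "bytes raw,", len(packed), "bytes gzipped")
    print("reduction: %.1f%%" % (100.0 * (1 - len(packed) / len(raw))))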
The Last Mile Problem
The Web is only as strong as its weakest link, and that has always been, and will remain, the last mile to the consumer's desktop. Even with the rapid growth of residential broadband, the growth in narrowband users and in data far exceeds broadband's limited reach. Jakob Nielsen expects the standard data transmission speed to remain at 56K until at least 2003, so there is a distinct need to do something to reduce download times. Caching data has its benefits, but only content reduction can make a significant difference in response time. It's always going to be faster to download a smaller file than a larger one.
Is Compression Built into the Browser?
Yes. Most browsers released since 1998/1999 support the HTTP 1.1 feature known as "content encoding." Essentially, the browser indicates to the server that it can accept encoded content, and if the server is capable, it compresses the data and transmits it. The browser decompresses the data and then renders the page.
Only HTTP 1.1 compliant clients request compressed files. Clients that are not HTTP 1.1 compliant request and receive the files uncompressed, and so do not benefit from the improved download times. Internet Explorer versions 4 and above, Netscape 4.5 and above, Windows Explorer, and My Computer are all HTTP 1.1 compliant clients by default.
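What such a client does is easy to sketch with Python's standard library (the URL below is a placeholder):

    import gzip
    import urllib.request

    # Advertise gzip support the same way an HTTP 1.1 browser does.
    req = urllib.request.Request(
        "http://www.example.com/",          # placeholder URL
        headers={"Accept-Encoding": "gzip"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        # If the server chose to compress, undo it before rendering.
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    print(len(body), "bytes after decompression")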
To verify that Internet Explorer is configured to use the HTTP 1.1 protocol:
Open the Internet Options property sheet
If using IE 4, this is located under the View menu
If using IE 5, this is located under the Tools menu
Select the Advanced tab
Under HTTP 1.1 settings, verify that Use HTTP 1.1 is selected (see Figure 1 below).
Figure 1. Setting HTTP 1.1 in IE 4/5
What is IETF Content-Encoding (or HTTP Compression)?
In a nutshell, it is simply a publicly defined way to compress HTTP content being transferred from Web servers down to browsers, using nothing more than public domain compression algorithms that are freely available. "Content-Encoding" and "Transfer-Encoding" are both clearly defined in the public IETF RFCs that govern the development and improvement of the HTTP protocol, the "language" of the World Wide Web. "Content-Encoding" applies to methods of encoding and/or compression that have already been applied to documents before they are requested. This is also known as "pre-compressing pages." The concept never really caught on, because of the complex file-maintenance burden it represents, and few Internet sites use pre-compressed pages of any description. "Transfer-Encoding" applies to methods of encoding and/or compression used during the actual transmission of the data.
In modern practice, however, the two are now one and the same. Since most HTTP content from major online sites is now dynamically generated, the line has blurred between what happens before a document is requested and what happens while it is being transmitted. Essentially, a dynamically generated HTML page doesn't even exist until someone asks for it. The original concept of all pages being "static" and already present on the disk has quickly become dated, and the once well-defined separation between "Content-Encoding" and "Transfer-Encoding" has turned into a rather pale shade of gray. Unfortunately, the ability of any modern Web or proxy server to supply "Transfer-Encoding" in the form of compression is even less common than the spotty support for "Content-Encoding."
Suffice it to say that regardless of the two different publicly defined
"Encoding" specifications, if the goal is to compress the requested content
(static or dynamic) it really doesn't matter which of the two publicly defined
"Encoding" methods is used... the result is still the same. The user receives
far fewer bytes than normal and everything happens much faster on the client side. The publicly defined exchange
goes like this....
A browser that is capable of receiving compressed content indicates this in all of its requests for documents by supplying the following request header field when it asks for something:

    Accept-Encoding: gzip, compress
When the Web server sees that request field, it knows that the browser is able to receive compressed data in one of only two formats: either standard GZIP or the UNIX "compress" format. It is up to the server to compress the response data using one of those methods, if it is capable of doing so.
If a compressed static version of the requested document is found on the Web server's hard drive that matches one of the formats the browser says it can handle, the server can simply choose to send the pre-compressed version of the document instead of the much larger original.
If no static document is found on the disk that matches any of the compressed formats the browser says it can "Accept," the server can either send the original uncompressed version of the document, or attempt to compress it in "real time" and send the newly compressed, much smaller version back to the browser.
Most popular Web servers are still unable to do this final step.
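The negotiation itself is simple enough to sketch. This is a minimal, hypothetical helper, not any particular server's code; it assumes the request path has already been mapped to a file on disk:

    import gzip
    import os

    def choose_response(path, accept_encoding):
        """Return (bytes, content_encoding) for a requested file."""
        if "gzip" in accept_encoding:
            # 1. A static pre-compressed twin on disk wins outright.
            if os.path.exists(path + ".gz"):
                with open(path + ".gz", "rb") as f:
                    return f.read(), "gzip"
            # 2. Otherwise compress in real time.
            with open(path, "rb") as f:
                return gzip.compress(f.read()), "gzip"
        # 3. Client did not advertise gzip support -- send as-is.
        with open(path, "rb") as f:
            return f.read(), None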
The Apache Web server, which has 61 percent of the Web server market, is still incapable of providing any real-time compression of requested documents, even though all modern browsers have been requesting compressed documents, and have been capable of receiving them, for more than two years.
Microsoft's Internet Information Server is equally deficient. If it finds a pre-compressed version of a requested document it might send it, but it has no real-time compression capability.
IIS 5.0 uses an ISAPI filter to support GZIP compression. It works as follows: the user requests a page, and the server sends the page and then stores a compressed copy of it in a temporary folder. The next time a user requests the page, the server sends the copy stored in the temporary directory. It then tries to keep the pages in that directory current; when one goes stale, it fetches a current copy of the page and compresses it again.
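The general mechanism is easy to sketch in Python (hypothetical paths and names; this is an illustration of the idea, not the ISAPI filter itself): compress on first request, cache the result, and reuse it while the original file is unchanged.

    import gzip
    import os

    CACHE_DIR = "/tmp/gzcache"  # hypothetical temporary folder

    def cached_gzip(path):
        """Serve a gzipped copy, regenerating it when the source changes."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        cached = os.path.join(CACHE_DIR, os.path.basename(path) + ".gz")
        # Rebuild the cached copy if it is missing or older than the source.
        if (not os.path.exists(cached)
                or os.path.getmtime(cached) < os.path.getmtime(path)):
            with open(path, "rb") as f:
                data = gzip.compress(f.read())
            with open(cached, "wb") as f:
                f.write(data)
        with open(cached, "rb") as f:
            return f.read()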
IBM's WebSphere server has some limited support for real-time compression, but the feature has appeared and disappeared through various releases of WebSphere.
The very popular Squid proxy server from NLANR also has no dynamic compression capabilities, even though it is the de facto standard proxy-caching software used just about everywhere on the Internet.
The original designers of the HTTP protocol did not foresee the current reality: with so many people using the protocol, every single byte counts. The heavy use of pre-compressed graphics formats such as GIF, and the relative difficulty of further reducing graphics content, make it even more important that all other exchange formats be optimized as much as possible. The same designers also did not foresee that most HTTP content from major online vendors would be generated dynamically, so there is often no chance for a "static" compressed version of the requested document to exist. Public IETF Content-Encoding is still not a "complete" specification for the reduction of Internet content, but it does work, and the performance benefits achieved by using it are both obvious and dramatic.
What is GZIP?
GZIP is a lossless compressed data format. The deflation algorithm used by GZIP (and by zip and zlib) is an open-source, patent-free variation of LZ77 (Lempel-Ziv 1977). It finds duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the previous occurrence, in the form of a (distance, length) pair; distances are limited to 32K bytes, and lengths are limited to 258 bytes. When a string does not occur anywhere in the previous 32K bytes, it is emitted as a sequence of literal bytes. (In this description, "string" means an arbitrary sequence of bytes, and is not restricted to printable characters.)
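To make the (distance, length) idea concrete, here is a toy sketch in Python. It is illustrative only: real DEFLATE uses hash chains to find matches efficiently and adds Huffman coding on top.

    def lz77_tokens(data, window=32 * 1024, max_len=258, min_len=3):
        """Turn bytes into literals and (distance, length) back-references."""
        i, out = 0, []
        while i < len(data):
            best_len, best_dist = 0, 0
            # Search the previous 32K window for the longest match.
            for j in range(max(0, i - window), i):
                k = 0
                while (k < max_len and i + k < len(data)
                       and data[j + k] == data[i + k]):
                    k += 1
                if k > best_len:
                    best_len, best_dist = k, i - j
            if best_len >= min_len:
                out.append(("ref", best_dist, best_len))
                i += best_len
            else:
                out.append(("lit", data[i]))
                i += 1
        return out

    def lz77_expand(tokens):
        """Invert lz77_tokens: copy earlier output for each back-reference."""
        out = bytearray()
        for t in tokens:
            if t[0] == "lit":
                out.append(t[1])
            else:
                _, dist, length = t
                for _ in range(length):
                    out.append(out[-dist])
        return bytes(out)

    # "blah blah blah!" becomes 5 literals, one ("ref", 5, 9), and "!".
    tokens = lz77_tokens(b"blah blah blah!")
    assert lz77_expand(tokens) == b"blah blah blah!"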
Should All Data Be Compressed?
The short answer is "only if it can get there quicker." In 99% of all cases it makes sense to compress the data. However, there are several problems that need to be solved to enable seamless transmission from the server to the consumer:
Compression should not conflict with MIME types
Dynamic compression should not affect server performance
The server should be smart enough to know whether the user's browser can decompress the content
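In pseudocode terms, those three checks might look like the sketch below. This is an illustration of the checklist, not mod_gzip's actual rule set; the type list and size threshold are assumptions:

    COMPRESSIBLE = ("text/html", "text/plain", "text/xml")
    MIN_SIZE = 1024  # assumption: skip bodies too small to benefit

    def should_compress(mime_type, body, accept_encoding):
        """Compress only text-like MIME types, only for clients that
        advertised gzip support, and only when the body is big enough
        for the CPU cost to pay off."""
        return (mime_type in COMPRESSIBLE
                and "gzip" in accept_encoding
                and len(body) >= MIN_SIZE)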
Let's create a simple scenario: an HTML file containing a large music listing in the form of a table. The file is 679,188 bytes in length.
Let's track this download over a 28.8K modem and compare the results before and after compression. The theoretical throughput of a 28.8K modem is 3,600 bytes per second. Reality is more like 2,400 bytes per second, but for the sake of this article we will work at the theoretical maximum. With no modem compression, the file would download in 188.66 seconds. On average, with modem compression running, we can expect a download time of about 90 seconds, which indicates about a 2:1 compression factor: the packets transmitted from modem to modem effectively "halved" the file size. Note, however, that the server still had to keep the TCP/IP subsystem open to send all 679,188 bytes to the modem for transmission. What happens if we compress the data before it leaves the server? If we compress the 679,188-byte file using standard techniques (which are not optimized for HTML), we can expect it to shrink to 48,951 bytes, a 92.79% reduction. We are now transmitting only 48,951 bytes (plus some header information, which should also be compressed, but that's another story). Modem compression no longer plays a factor, because the data is already compressed.
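The arithmetic is easy to reproduce:

    SPEED = 3600             # theoretical bytes/second over a 28.8K modem
    RAW, PACKED = 679188, 48951

    print(RAW / SPEED)               # 188.66 s uncompressed, no modem compression
    print(RAW / 2 / SPEED)           # ~94 s at an exact 2:1 modem compression factor
    print(PACKED / SPEED)            # 13.6 s when gzipped before transmission
    print(100 * (1 - PACKED / RAW))  # 92.79% size reduction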
Where are the performance improvements?
Bandwidth is conserved
Compression consumes only a few milliseconds of CPU time
The server's TCP/IP subsystem only has to serve 48,951 bytes to the modem
At a transfer rate of 3,600 bytes per second the file arrives in 13.6 seconds instead of 90 seconds
Compression clearly makes sense as long as it's seamless and doesn't kill server performance.
What else remains to be done?
A lot! Better algorithms need to be invented that compress the data stream more efficiently than gzip; remember, gzip was designed before HTML came along. Any technique that adds a new compression algorithm will require a thin client to decode it, and possibly tunneling techniques to make it "firewall friendly." To sum up, we need:
Improved compression algorithms optimized specifically for HTML/XML
Header compression. Every time a browser requests a page it sends request headers. In the case of WAP browsers, header information can be as large as 900 bytes. With compression this can be reduced to less than 100.
Compression for WAP. (Currently WAP/WML does not
support a true entropy encoding technique. It uses binary encoding to compress
the tags while ignoring the content.)
Dynamic compression for caching servers.
Real time compression/encryption with tunneling.