Imagine a trailer hooked on to the back of an Indy car the effect on performance would be similar to what happens when you hook up hard drives to a server. That's because a hard drive is a slow, delicate, mechanical component attached to a computer system that is otherwise made up of solid-state electronics built for speed.
To get data on or off a disk, a drive head has to swing into the correct position and wait for the right part of the disk to come around. This can take hundredths of a second many times longer than it takes to access data stored in RAM, for example. As a result, a disk I/O subsystem can be a huge data bottleneck, and significant improvements in the overall performance of the server can be achieved if the effects of this bottleneck can be minimized.
To understand how this might be done, let's take a look at how data is moved to and from a disk.
Put simply, the I/O subsystem accepts a stream of requests to read or write data to and from a disk which it holds in a queue. To help speed things up, it usually merges read or write requests together if they are close to each other in the queue, and if they involve the same area of the disk.
Read requests are generally given higher priority than write requests because a given process will have to wait for the read data it has requested before it can continue, while it won't have to wait for the result of a write.
The subsystem will also usually detect when data is being read sequentially from the disk and use a technique called "read-ahead" to read and cache the disk block following the one it has been asked to read. This can reduce seek time during sequential reads, although it does nothing to speed up reads to other random parts of the disk, and it is switched off when the subsystem detects random (i.e., non-sequential) reads.
After this simple merging has been carried out, the I/O subsystem could process requests in the queue in the order they are arrive at the head of the queue, but usually it doesn't. Instead it tries to speed things up by using a scheduler algorithm to decide the order in which requests should be processed. At the most basic level a scheduler will reorder requests to minimize the traveling the disk heads have to carry out, as this mechanical process is relatively time consuming. So reads in one area of the disk will be grouped, then reads in the next area, and so on.
To illustrate what scheduler algorithms do, it's helpful to think of elevators as an analogy. When an elevator arrives at the ground floor of a busy office block, many people get in and press the buttons for the floors to which they want to go, in a random order. Clearly it would make no sense to travel to the fifth floor, then the 40th, then the second, and then the 39th, just because that was the order in which the buttons were pressed. A far more efficient strategy is to stop at the floors in ascending order: the second, fifth, 39th and finally the 40th. By traveling in a single direction instead of backing up on itself, the elevator can drop everyone off at their floors in the minimum time.
But it turns out that this type of scheduling is not always optimal for an I/O subsystem (or for an elevator); it actually has a number of flaws. Going back to the elevator analogy, people working on the very highest floors would have to waste a lot of time, stopping at tens or potentially hundreds of intermediate floors, every time they traveled from the lobby to their office. For that reason many large buildings have some elevators that don't serve the lower floors at all, instead they go straight to the upper levels before stopping. The upper-floor workers benefit, but at the cost of the lower-floor workers, who now have fewer elevators that go to their floors.
So what is the best scheduler algorithm? In fact there is no such thing as a "best" one. Instead, there are a number of different schedulers, each one suited to different types of servers. The four principal schedulers common to most Linux systems are:
- completely fair queuing
This is the simplest scheduler, and it does nothing beyond merging requests and placing them in a queue to be processed in the order they arrive at the head of the queue.
The noop scheduler is useful in a system which uses a modern RAID controller which can schedule I/O better than the operating system. It is also useful on systems with solid state drives (including RAM disks,) which have a fixed seek time for all data, regardless of where it is stored on the device.
The deadline scheduler is designed to overcome the problem of a read request for a remote part of a disk getting serviced too slowly because it is constantly shunted to the back of the queue by requests arriving for closer parts of the disk. This is analogous to trying to get to the top floor of a building in an elevator that keeps stopping to let people on and off at intermediate floors. To overcome this, each request is given a waiting time or "deadline." If the request is not serviced before the deadline expires, it is then put to the front of the queue, merged with any requests for nearby disk locations and serviced immediately.
The deadline scheduler is useful for real-time applications because in most cases it keeps latency low since all requests are serviced within a relatively short time frame dictated by the deadline.
The anticipatory scheduler works by anticipating that the next request will be for the disk block that follows the one that has just been serviced. After carrying out a read or write, the scheduler enforces a slight delay to give an application time to submit a request for the next disk block. If such a request arrives (as anticipated), the disk head will be in the right place to service the request, instead of having moved away to service another request.
Using the elevator analogy, imagine a situation where an office block is host to a number of companies, each of which occupies, say, two floors. If an office worker gets on at a given floor, it's a good bet she will want to get off at the next floor, which also belongs to her company. So it is sensible for the elevator to wait until the office worker has pressed the button for the floor she wants before it shoots past the next floor.
These slight delays add up to a small amount of latency added to the system, but in some circumstances this can easily be outweighed by the efficiency gained from having the head anticipate where it next has to go.
The anticipatory scheduler is suited to tasks that don't get regularly interrupted by external requests. It tends to be useful for web servers and workstations that often access data in a linear fashion, but not for database servers that access data in a more random fashion.
cfq stands for "completely fair queuing." In this system the scheduler maintains a number of internal request queues, and each process running on the system is assigned to one of these queues. The schedule then takes the requests from the front of each queue and places them into a dispatch queue, where they are ordered to minimize seek time and serviced. The scheduler then goes back to the internal request queues and brings more requests from the front of these queues to refill the dispatch queue.
This is analogous to a building with a single elevator, with separate queues for each company that has offices in the building. Each time the lift arrives in the lobby it takes staff from the front of the queue for each company, so no individual company can hog the elevator excessively. Cfq works well for medium and large multi-processor systems, and is the default scheduler for Linux distros such as Red Hat Enterprise Linux.
In the concluding part of this article I'll be looking at how to change I/O schedulers and how to change some of their parameters to tune a system.
Paul Rubens is an IT consultant and journalist based in Marlow on Thames, England. He has been programming, tinkering and generally sitting in front of computer screens since his first encounter with a DEC PDP-11 in 1979.