Modern operating systems give the impression of running more processes than there are processors by giving each process that wants to run, such as your web browser or text editor, a tiny slice of time in which it can execute. Because the running process is switched out very quickly, it looks as though all processes are running at once even though only a few are at any given moment. Other processes that want to run are placed in a queue, where they wait for their slice of CPU time. The number of processes waiting in this queue, averaged over time, is called the load average.
Most people who run a flavor of Unix inevitably come across load average. It shows up in quite a few tools, such as uptime, top, and w, usually as a 1, 5, and 15 minute average. Load average changes with the number of processes wanting to run on a machine. It is a deceptively evil monitoring metric because it’s very difficult to define what a good or bad load average is.
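The same three numbers those tools print can be read from a script. Here is a minimal sketch, assuming a Unix-like system where Python's `os.getloadavg()` is available; the function name and dictionary keys are just illustrative choices.

```python
import os


def read_load_average():
    """Return the 1, 5, and 15 minute load averages as a dict.

    os.getloadavg() raises OSError if the platform cannot supply the values.
    """
    one, five, fifteen = os.getloadavg()
    return {"1min": one, "5min": five, "15min": fifteen}


if __name__ == "__main__":
    load = read_load_average()
    print(f"load average: {load['1min']:.2f}, {load['5min']:.2f}, {load['15min']:.2f}")
```

Note that the numbers come back with no indication of whether they are good or bad, which is exactly the problem.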
Load average is often used as a comparison metric when trying to diagnose machines that are overloaded. While participating in discussions about machine load I often hear things like, “But the load average is OK” or, “The load average is way high.” The problem with this is that an OK load average on one server may be 10, while on a server that’s overloaded it may also be 10, or 5, or 200. For situations like this it’s much easier to use response time, because it can be easily compared between different types of servers or software. A response time of one second is a response time of one second no matter where you are, and it’s very easy to judge whether that’s good or bad for the application. Two identical servers could both have a response time of one second while one shows a load average of 200 and the other a load average of 50.
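Measuring response time does not need much machinery. The sketch below times a callable a few times and reports the median. `fake_request` is a hypothetical stand-in that just sleeps; in practice you would swap in whatever request matters for your application.

```python
import statistics
import time


def measure_response_time(request, samples=5):
    """Call `request` several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        request()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def fake_request():
    # Hypothetical stand-in for a real request, such as fetching a page.
    time.sleep(0.01)


if __name__ == "__main__":
    print(f"median response time: {measure_response_time(fake_request):.3f}s")
```

The result is in seconds, so it means the same thing on any server you run it against.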
Using load average as a comparison metric is so difficult because a higher load average doesn’t necessarily mean a server is more loaded or that it isn’t responding appropriately. The reason is that the time a process spends waiting in the queue can be affected by several forces, including the number of processes in line before it, how long those processes run for, and whether your process gets an interrupt that lets it jump the queue. This is why a server with a load average of 10 may be almost impossible to work with while a server on the same hardware right next to it with a load average of 200 may be responding just fine.
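A toy model makes the point concrete. The sketch below assumes a single CPU and a plain round-robin queue in which each process ahead of you runs for at most one time slice. Real schedulers are far more involved, and the burst lengths and the 100 ms slice are made-up numbers chosen only for illustration.

```python
def wait_before_first_run(bursts_ms_ahead, time_slice_ms):
    """Milliseconds a newly queued process waits before it first gets the CPU.

    Simplified model: one CPU, round-robin order, and every process ahead
    runs for one time slice or until it finishes, whichever is shorter.
    """
    return sum(min(burst, time_slice_ms) for burst in bursts_ms_ahead)


# Hypothetical workloads: a deep queue of tiny jobs and a shallow queue of heavy ones.
deep_queue = [1] * 200      # 200 processes that each need about 1 ms
shallow_queue = [100] * 10  # 10 processes that each use a full 100 ms slice

print(wait_before_first_run(deep_queue, time_slice_ms=100))     # 200
print(wait_before_first_run(shallow_queue, time_slice_ms=100))  # 1000
```

In this model the queue that is 200 deep makes a newcomer wait 200 ms, while the queue that is only 10 deep makes it wait a full second.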
The moral of all of this is to use load average for exactly what it is: the depth of the run queue. Don’t use it for comparison between different types of servers, and don’t quote it as proof that a server is overloaded.