In last month's newsletter, Jack wrote about how he spends most of his time looking for latency. I could be wrong, but I think he got his inspiration for the piece from an earlier discussion here where we both said that execution profiling doesn't seem to be as important as it had been. After reading what Jack has written, tempered by my experiences, both old and recent, I've come to the conclusion that latency is a bug that, instead of being exposed as an incorrect result, is exposed as a long pause.
Systems that we build today differ from those that we've built in the past in that there are many more opportunities for our threads of execution to be doing nothing. The systems we used to build tended to run on one machine, and that machine tended to have one CPU. Even if an application was running on a mainframe that contained many CPUs, the process model was such that each application would run in a single thread, and that effectively confined it to a single CPU. Sure, you could fork. But a fork created an entirely new single-threaded process.
Though fork was useful, it didn't change the model. It was only with the introduction of POSIX threads that we started to see applications break out of this single-threaded mode. Once that happened, applications started putting pressure on the operating systems. Now operating systems needed to be threaded or, at least at first, thread-safe. One example of threading maturity back then (or lack of it) was an apparent decision by the Solaris team to put a single lock around the entire kernel. While you'd guess that application threads would be tripping over themselves fighting for the OS lock, the reality was that it didn't seem to make that much of a difference. We were still very much bound by processor speeds and very much focused on execution hotspots, not the lack of them.
Fast forward and today we have plenty of CPUs and plenty of machines, and they are all fast enough for most problems. So we should be sitting pretty. But as is so elegantly expressed in Heinz's "Law of Sudden Riches", more isn't always helpful and in some cases can be harmful. At the heart of the matter is our well-known friend, Amdahl's Law. To get useful work done, we must often cooperate. It is this cooperation that reduces our ability to scale. When we cooperate, one party is invariably waiting on the other party. In other words, the amount of time our threads spend doing nothing is a function of how long it takes the other cooperating thread to come to the table. Since threads are still necessarily a limited resource, a thread doing nothing means that some other piece of useful work isn't getting done. The longer the thread has to wait, the less gets done.
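To put rough numbers on that, here is a minimal sketch of Amdahl's Law in Java. The class name, the method name, and the 10% serial fraction are my own assumptions for illustration, not measurements from any real system. The point it shows is that once some fraction of the work is spent waiting on cooperation, adding CPUs stops paying off quickly.

```java
public class AmdahlSketch {

    /**
     * Speedup predicted by Amdahl's Law: 1 / (s + (1 - s) / n),
     * where s is the fraction of work that cannot run in parallel
     * and n is the number of processors.
     */
    static double speedup(double serialFraction, int processors) {
        if (serialFraction < 0.0 || serialFraction > 1.0 || processors < 1) {
            throw new IllegalArgumentException("need 0 <= serialFraction <= 1 and processors >= 1");
        }
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }

    public static void main(String[] args) {
        // Assumed value for illustration: 10% of the work is serialized by cooperation.
        double serialFraction = 0.10;
        for (int n : new int[] {1, 2, 4, 8, 16, 64, 1024}) {
            System.out.printf("%5d CPUs -> speedup %.2f%n", n, speedup(serialFraction, n));
        }
    }
}
```

With a serial fraction of 0.10 the speedup can never reach 10, no matter how many CPUs are thrown at the problem, which is why shrinking the waiting matters more than adding hardware.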
Threads that are doing nothing are not using the CPU. When that translates into threads not making forward progress fast enough, you've got a problem that execution profiling isn't designed to handle. Instead, one needs to look at thread dumps. Threads that are parked waiting for something to happen will be clearly visible in a thread dump.
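As a rough illustration of what "looking for nothing" can mean in practice, the sketch below takes an in-process snapshot of thread states using the standard `Thread.getAllStackTraces()` call. It is not a replacement for a real thread dump; the class name, the helper methods, and the `demo-waiter` thread are hypothetical and exist only for the example.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.locks.LockSupport;

public class ParkedThreadSketch {

    /** Counts the live threads in this JVM by their current Thread.State. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    /** True for the states in which a thread is alive but not running. */
    static boolean isDoingNothing(Thread.State state) {
        return state == Thread.State.WAITING
                || state == Thread.State.TIMED_WAITING
                || state == Thread.State.BLOCKED;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical worker that parks, standing in for a thread waiting on a collaborator.
        Thread waiter = new Thread(LockSupport::park, "demo-waiter");
        waiter.setDaemon(true); // do not keep the JVM alive for the demo thread
        waiter.start();
        Thread.sleep(100);      // give the worker time to reach park()

        System.out.println("Threads by state: " + countByState());

        // Print the top stack frame of every thread that is not running.
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            StackTraceElement[] frames = e.getValue();
            if (isDoingNothing(t.getState()) && frames.length > 0) {
                System.out.println(t.getName() + " [" + t.getState() + "] at " + frames[0]);
            }
        }
    }
}
```

A single snapshot like this only says where threads were stopped at one instant. Taking several snapshots a few seconds apart and looking for threads that stay in the same waiting frame gives a much better picture.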
Another set of very useful tools in the war on latency is Firebug and YSlow. I have found Firebug invaluable in tracing latency. Results from this tool have led clients to cancel scheduled work that we determined was unnecessary, saving hundreds of thousands of dollars. Moreover, it freed resources to work on tasks that drove the business forward rather than sideways. I can only assume that others who have used this tool have experienced very similar results.
What I'm not saying is that execution profiling is not useful. Nor am I saying that counting instructions in order to minimize execution time is a waste of time. Each of these techniques is still valid in today's world. What I am saying is that they are no longer near the top of my list of things that I do when I go to performance tune an application. What I am suggesting is that profiling for nothing is almost always the first activity I find myself engaged in.
I am curious how one goes about "profiling for nothing" on a production
server without any prior knowledge of the software and system execution
models and the daily/hourly workload and activity patterns. It would be rare
to have every thread "parked" in a thread dump, as in production systems it
is typically the small but frequent service waits that kill the overall
throughput and possibly the performance. Also, thread dumps are pretty much
useless in their current format because one cannot easily determine whether
a series of call frames repeated across each dump belongs to the same
request or to another one following the same execution path (very common
with business applications).
William,
Naturally JXInsight does this "profiling nothing" and so much more, ;-).