Monday, January 21, 2008

Workloads: Standards? Customer? Proxy?

So what's the "Best Workload" to measure performance with? Industry standard workloads? Real-life customer applications? Proxy workloads developed to mimic customer applications? Micro-benchmarks? Marketing inspired metrics?

Well, it depends. Each has a purpose. Each has its drawbacks.

Let's review.

Industry standard workloads, typified by the wide breadth of SPEC.org workloads, are benchmarks painstakingly agreed to by industry committees. The workloads are specifically intended for cross-industry comparisons of systems and software stacks, which can put the committee members in an "interesting" position. I'm not sure I would have the patience and tact needed to function effectively in that realm.

In the Linux community, I have occasionally found some resistance to the world of industry standard workloads, with quick examples being things like SPECcpu2006, SPECjbb2005, SPECweb2005, and many others. Trying not to over-generalize, but the community sometimes views performance improvements targeted narrowly at these workloads as "benchmark specials". Ok, I have seen some members of the community get more than a little excited about proposed changes that were presented as "we gotta have this for xxx benchmark".

I find this attitude frustrating at times, since the workloads were usually developed with the specific intent of reflecting how customers use their systems. But, note to self: I've discovered that when I can find a real-life customer example of the same workload, the proposed change (fix or improvement) is enthusiastically embraced. With that small change in how a problem is presented, I've been very impressed with how readily the broader Linux community wants to improve performance for real customers.
  • (by the way, quick aside: some industry standard benchmarks do not pan out because really clever "benchmark specials" DO emerge, and the comparisons can then lose any customer-relevant importance)
Real-life customer applications. So why not test with "real" applications more often? While some customers do have specialized example applications that they can share with vendors, usually the customer application represents the "jewels" of their business, so it's not at all surprising that they have no interest in sharing. Instead, some time and energy is spent working to understand the workload's characteristics, which can then be leveraged in identifying proxy workloads. Most performance teams have a good set of sample workloads which can serve as examples for looking at improvements and bottlenecks.

On occasion, a team can run into a customer application which is unique and not nicely covered by their standard suite. Valerie Henson ran into this a while back, and created a nice little program called eBizzy, which eventually ended up on SourceForge with a third release just made available in January 2008. Numerous performance problems have been uncovered with programs like this, with fixes flowing back into the mainline development trees for Linux and the software stacks. There are many other examples floating around the web.

Proxy Workloads. Creating workloads to mimic customer applications is what I loosely call proxy workloads. These are handy when they are available as source code and unencumbered by specialized licenses. The workloads can be modified to address what the performance team is focusing on, and more importantly, they're available to the programmers in the community who are trying to reproduce the problem and prototype the fix. Using proxy workloads is a good example of how the combination of customers, performance teams, and programmers can effectively address performance issues with application programs.

Micro-benchmarks. Some people use very small, simplistic programs for performance assessments. These targeted programs are often packaged together and are fairly quick and easy to execute, giving quick and simple answers. In practice they are most useful for digging into specific areas of a system implementation, but when used to generalize an overall system's performance, they can cause headaches and misunderstandings.

I have seen over-zealous system admins at some companies take a simple single-thread, single-process integer micro-benchmark, run it on their new 1U x86 server with 1GB of memory, then run it on a new very large system (anyone's, this is not vendor exclusive), and draw the obvious conclusion that the little 1U server is the new preferred server for the massively complex workloads planned for the datacenter. Somehow they completely miss the point that only one thread is running on the larger system with many cores and lots of memory, leaving it roughly 98% idle. It'd be funny if these people were just sitting in labs, but it's amazing how an authoritatively written email inside a company can create so much extra work in refuting and explaining why the person who wrote it, perhaps, just perhaps, was an idiot.
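
Just to make the arithmetic behind that anecdote concrete, here's a minimal sketch in Python (purely illustrative, not any real benchmark suite, and the integer kernel and iteration count are made up for the example). A single-threaded integer kernel keeps exactly one core busy, so the bigger the machine, the larger the fraction of it that sits idle for the entire run.

    # A toy single-thread integer micro-benchmark, plus the idle-capacity
    # arithmetic. Illustrative only; the kernel and iteration count are
    # arbitrary choices for the example.
    import os
    import time

    def integer_kernel(iterations=20_000_000):
        """A simplistic single-threaded integer loop, the sort of thing a
        micro-benchmark suite might time."""
        total = 0
        for i in range(iterations):
            total += i ^ (i >> 3)
        return total

    start = time.time()
    integer_kernel()
    elapsed = time.time() - start

    cores = os.cpu_count() or 1
    # Only one thread runs, so every other core is idle for the whole run.
    idle_pct = (1 - 1 / cores) * 100
    print(f"single-thread time: {elapsed:.2f}s on a {cores}-core system")
    print(f"idle capacity during the run: roughly {idle_pct:.0f}%")
    # On a 64-core server that works out to about 98% idle, which is why
    # the "little 1U box wins" conclusion misses the point.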

That said, micro-benchmarks do play a critical role with performance teams, simply because they do in fact highlight particular aspects of a system, which allows some very focused testing and improvement activity. Improvements have to be made in a bigger context though, since you may tweak one part of the system, only to hurt another part.

Marketing inspired metrics. Finally, I'll close with a quick discussion of what I term "marketing inspired metrics". While some companies in the marketplace are really good at this, I have to chuckle and admit I'm not convinced this is one of IBM's core strengths, or, for that matter, Linux's.

The ability to make up new performance metrics, highlight vague workloads in the absence of any concrete comparable system criteria, and formulate an entire marketing campaign around new "performance terminology" is an art form. I shudder when marketing teams get that gleam in their eye and start talking about performance. Uh-oh.

It's so much safer when we tell them "This is Leadership", or they simply task us with "Make it Leadership". Leadership, of course, is always about how you phrase it and what you choose to compare against, but at least that's usually a fact-based discussion.

Focus on repeatable data. So performance teams focus on real data, supported by analysis and assessment of that data, watching for and correcting the common mistakes made in gathering data from a system. In the Linux performance teams I've been involved with, there is a fanatical dedication to automating the process of gathering and scrubbing the data, building in safeguards to catch the common mistakes, so that when work is done on a workload, be it an industry standard benchmark, a customer application, a proxy workload, or a micro-benchmark, we're focused on finding and fixing the problems, not on the steps of gathering the data.
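
To give a flavor of what those safeguards can look like, here's a minimal sketch in Python. The placeholder workload command, the repeat count, and the specific checks are my own illustrative assumptions, not any particular team's tooling; the point is simply that the configuration snapshot and the sanity checks travel with every result.

    # A minimal sketch of an automated data-gathering harness: record the
    # system configuration alongside every run and flag common mistakes
    # before anyone stares at the numbers. Workload and checks are
    # illustrative placeholders.
    import json
    import os
    import subprocess
    import time

    WORKLOAD = ["sleep", "1"]          # stand-in for the real workload command
    EXPECTED_CPUS = os.cpu_count()     # what we believe the machine should have

    def snapshot():
        """Capture the basics that explain (or invalidate) a result later."""
        return {
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "kernel": os.uname().release,
            "online_cpus": os.cpu_count(),
            "loadavg": os.getloadavg(),
        }

    def run_once():
        before = snapshot()
        start = time.time()
        subprocess.run(WORKLOAD, check=True)
        elapsed = time.time() - start

        # Safeguards against the common mistakes: someone else using the
        # box, or CPUs offlined since the last run.
        warnings = []
        if before["loadavg"][0] > 1.0:
            warnings.append("system was not idle before the run")
        if before["online_cpus"] != EXPECTED_CPUS:
            warnings.append("online CPU count changed")

        return {"elapsed_s": elapsed, "config": before, "warnings": warnings}

    if __name__ == "__main__":
        results = [run_once() for _ in range(3)]   # repeat runs, never trust one
        print(json.dumps(results, indent=2))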

And I always try to avoid those marketing inspired metrics, that's for the marketing teams.



Bill Buros

Bill leads an IBM Linux performance team in Austin, TX (the only place really to live in Texas). The team is focused on IBM's Power offerings (old, new, and future), working with IBM's Linux Technology Center (the LTC). While the focus is primarily on Power systems, the team also analyzes and improves overall Linux performance for IBM's xSeries products (both Intel and AMD), driving performance improvements which are both common to Linux and occasionally unique to the hardware offerings.

Performance analysis techniques, tools, and approaches are nicely common across Linux. Having worked for years in performance, there are still daily reminders of how much there is to learn in this space, so in many ways this blog is simply another vehicle in the continuing journey to becoming a more experienced "performance professional". One of several journeys in life.

The Usual Notice

The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions, try as I might to influence them.