Wednesday, December 12, 2007

servicelog and hardware performance

Mike Strosaker put up an interesting blog post over the weekend covering servicelog, which I caught on Planet-LTC. As I was reading it, it reminded me of just how critically important easy serviceability is for performance work.

When we do performance work in the Linux labs, we usually have an interesting mix of systems: some newer than new (also known as engineering models), some customer brand-new, some fairly ancient. For the systems we use directly, we have a good feel for their general ongoing performance characteristics. We regularly run performance regression tests of standard workloads, watching for changes in the Linux software stack. Since we do this on a regular basis, any performance blip (sometimes an unexpected software regression, but usually a nice Linux improvement working its way through the process) shows up and we can go look at it. A side effect is that we can also catch the rare occasion when something has happened on the hardware side.

One of the interesting aspects of the POWER line is the continued strong focus on and evolution of the RAS (reliability, availability, and serviceability) characteristics of the systems. "RAS" can often get lost in the marketing collateral of new systems, and is usually ignored by the programmer community, but it really is pretty important to customers. For us, what this means on a practical level is that on rare occasions, the POWER systems will automagically detect possibly failing components and will disable those components if and when necessary, logging the errors.
  • For those who are constantly searching for more materials to read, a comprehensive description of the latest POWER RAS characteristics is available at http://www-03.ibm.com/systems/p/hardware/whitepapers/ras.html.
  • Appendix A of the white paper lists the Linux support provided by the SUSE and Red Hat distributions. A lot of people in the Linux teams around the world have worked on RAS support and have tested it thoroughly. The breadth of the coverage is pretty impressive and continues to improve.
  • This coverage is key because it nicely highlights the "enterprise readiness" of Linux for customers.
So why is this important to performance?

Hardware failures are rare, so they're usually something we discover after spending way too much time doing software analysis, when we should have checked the service/error logs first. Eventually we get the right performance data which highlights the problem, and THEN we go check the service logs, which confirm what we determined. After we check the service log, we usually look at each other and state the obvious: "Hey, YOU should've checked the service log."

An example from the last year was an older Power 5 system which was being used by a software vendor for software performance testing. Unbeknownst to us, the system had cleverly detected that one of the L3 caches was failing, logged it appropriately, and then disabled the failing L3 cache so that the system could continue operating. The performance results were REALLY strange. We were running a multi-process, multi-threaded workload that max'ed out the whole system, but the results were not balanced, which made it look like a scheduler bug.

It's a long story, but eventually we checked the L3 cache sizes and confirmed that one of the L3 caches was "missing". We should've seen the following, which shows the two 36MB L3 caches on this system...

cd /proc/device-tree/cpus/
find . -wholename "./l3-cache*/i-cache-size" -exec hexdump '{}' \;
0000000 0240 0000
0000004
0000000 0240 0000
0000004


but instead we saw something like the following (this being from memory), where the size of one of the L3 caches was zero...

cd /proc/device-tree/cpus/
find . -wholename "./l3-cache*/i-cache-size" -exec hexdump '{}' \;
0000000 0240 0000
0000004
0000000 0000 0000
0000004


The system configuration confirmed the behavior we were seeing. Once the failing part was replaced, the strange results went away.
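
For what it's worth, a quick check along these lines (a rough sketch; the device-tree paths and property names can vary by system and firmware level) would flag a "missing" cache without staring at raw hexdump output:

cd /proc/device-tree/cpus/
# walk each L3 cache node and convert the 4-byte big-endian size property
for f in $(find . -wholename "./l3-cache*/i-cache-size"); do
    size=$(( 0x$(hexdump -v -e '1/1 "%02x"' "$f") ))
    if [ "$size" -eq 0 ]; then
        echo "WARNING: $f reports a cache size of zero"
    else
        echo "$f: $(( size / 1024 / 1024 )) MB"
    fi
done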

Anyway, circling back to Mike's servicelog post, it looks like an easy piece of software to try out. I popped it onto one of our SLES 10 SP1 servers in the lab, but it needs a Berkeley DB library to link against - the website blithely says to copy libdb-4.2.a into the source directory - and I don't have time this week to hunt that part down. I've got the .so file, and there's probably some easy way to bridge the two, but I'll try it out early in the new year. If the serviceable-event process works as advertised, this would be nice to put on our systems to highlight serviceable events. Even better, it would be really cool if it could send me a polite email informing me about the error, but I suspect that needs some wrapper systems-management software.
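
One possible way around the libdb-4.2.a requirement (an untested sketch; the version number, paths, and archive name here are my assumptions, not anything from the servicelog documentation) would be to build Berkeley DB 4.2 from source and drop the resulting static archive where the build expects it:

# fetch the Berkeley DB 4.2 source (db-4.2.52) from your distributor or the upstream site
tar xzf db-4.2.52.tar.gz
cd db-4.2.52/build_unix
../dist/configure --disable-shared
make
# copy the static archive into the servicelog source directory, as their site suggests
cp libdb-4.2.a /path/to/servicelog/source/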

The key message for emerging performance analysts is to make sure the hardware is doing what it's supposed to be doing. A related future topic will cover the occasional disconnects between what someone tells you was ordered for the system you're testing on and what is actually in it. Same story... you need to be sure you know what you're really working with.

Monday, November 26, 2007

Green 500 ?!

Interesting. Flippantly remarked in my last post that I wondered what the energy/thermal ratings were for the Top 500 clustered systems, and then this morning stumbled on the Green 500 list at http://green500.org.

My first thought was "how in the world do you measure power consumption across so many clustered systems??" It appears they estimate total power consumption from whatever can actually be measured, and then calculate the overall efficiency metric. The web site even nicely provides a tutorial paper on how to determine your power consumption.

http://green500.org/docs/tutorials/tutorial.pdf

More stuff to read and try this week. Time to pull the power meter over to a small Linpack test system running the latest RHEL 5.1 release, where we recently pushed out a number of published results, and see what's happening. We regularly play with Linpack on single SMP servers to make sure they scale fairly linearly (i.e., 4-core to 8-core to 16-core), and this may be a good way to pair power consumption metrics with the performance metrics. Linpack is good because it's fairly steady-state for a relatively long period (depending, of course, on how big you define the problem size to be solved).
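
The arithmetic itself is simple; a quick sketch (the numbers below are made up purely for illustration) shows the kind of efficiency figure the Green 500 ranks on:

gflops=95.2        # hypothetical sustained Linpack (HPL) result for the run
avg_watts=1450     # hypothetical average reading from the power meter
echo "scale=1; $gflops * 1000 / $avg_watts" | bc    # prints 65.6, i.e. MFLOPS per watt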

Naturally... what, when, and how to measure watts, thermals, and energy consumption for systems under test is the subject of many debates and discussions in the industry these days. It's a whole new dimension of being able to say "well, it depends" when asked about performance and the trade-offs. If you get a chance, watch what happens with SPEC.org's initial SPECpower benchmark (at http://www.spec.org/specpower/ ). This first benchmark is focused on CPU-centric workloads, but more dimensions are undoubtedly coming.

Sunday, November 25, 2007

The World's Fastest Computers

Caught this post from Linux Watch last week as we were getting ready for the Thanksgiving weekend.
The post stemmed from the twice-annual posting of the Top 500 supercomputers as compiled on the Top500.org web site, which tracks the rankings of the world's supercomputers based on Linpack benchmark results. We use Linpack on a day-to-day basis on single systems to keep track of changes being made across a number of pieces of the software stack, but playing on a single system is nothing compared to what these companies and partnerships are doing.

These top computers are amazing; they are a testament to the "scale-out" architectural approach of replicated computing horsepower. Take a look at the number 1 system in the list. They describe the system on the Lawrence Livermore National Laboratory's web site at https://asc.llnl.gov/computing_resources/bluegenel/. According to the Top 500 web site, this "system" has 212,992 processors and 73,728 GB of memory. (I wonder what the power/watts/thermal measurements are for systems like this - couldn't find any energy star ratings on the BlueGene web site.)
  • Poking around some more, it turns out that the memory is referred to in "tebibytes". I probably should've known this, but the tebi prefix is short for "tera binary" and is intended for the powers-of-two numbers. So a tebibyte is 1024 to the 4th (or 2 to the 40th), whereas the more familiar terabyte is 1000 to the 4th (or 10 to the 12th). There's a nice table out on Wikipedia which has a good, concise way of looking at things. It certainly is more precise - we often have clarification discussions about the difference between 1024-based and 1000-based numbers (the quick arithmetic below shows the gap).
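
Bash arithmetic is handy enough for the comparison:

echo $(( 1024 ** 4 ))    # 1 tebibyte = 1099511627776 bytes
echo $(( 1000 ** 4 ))    # 1 terabyte = 1000000000000 bytes, roughly 10% smaller
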
Back to the fastest computers running Linux. The Linux Watch post states that 426 of the top 500 computers rely on Linux. The Linux operating system base can readily be optimized for the various pieces of these software stacks. One of the surprises in the statistics list was how pervasive Gigabit Ethernet is for the interconnect. I had heard that InfiniBand was the preferred interconnect technology and had assumed it would be the hands-down favorite. In fact, Gigabit Ethernet was listed for 54% of the Top 500, with InfiniBand the second most pervasive at 24%. The remaining systems use a variety of specialized interconnect technologies. It's interesting to see these off-the-shelf technologies being leveraged in the Top 500 supercomputer configurations.

These Top 500 supercomputer systems are a world in and of themselves. They really represent the top of the stack for HPC workloads, and have fairly unique configuration challenges which in many cases dive into the "research" world and some amazing latency, shared-file, memory-access, and CPU interconnect technologies. For Linux customers, some of these technologies are bleeding edge and get productized and rolled into the commercially supported distros over time; others are already shipping in today's customer-available distros.

A good example of technology which is pervasive and mature for commercial, research, and academic use is the Open MPI project. Over the last couple of months we've started looking more into Open MPI and related MPI products and have found that Open MPI is very competitive. The Open MPI organization (at http://www.open-mpi.org/ ) is very active and keeps the MPI implementation at the leading edge across a number of offerings. The feature set of Open MPI v1.2.4 is impressive and provides good flexibility across networks, interconnects, and system implementations. Our work on small clusters is based on Open MPI, which allows us to focus on other system performance issues and concerns.
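
As a taste of how low the barrier to entry is, here's a minimal sketch (the process count and hostfile name are arbitrary placeholders) of building and running a trivial MPI program with Open MPI's wrapper compiler:

cat > hello.c << 'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc hello.c -o hello
mpirun -np 4 --hostfile myhosts ./hello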

It'll be interesting to see what we find over the coming months. It'll be fun to start learning more about these large system scaled-out configurations and see what we can apply to real-life customers today.

Sunday, November 4, 2007

Oprofile - a Visual interface

So in performance, we profile systems a lot. In our performance testing and analysis, we usually rely on normal Linux command-line interfaces and a tailored automated analysis framework for running workloads, micro-benchmarks, and industry standard benchmarks. The test framework is similar to several derivative branches worked on in the Linux community by people like Martin Bligh[1], Andy Whitcroft, and many others. For example, in the community, AutoTest is one of the latest incarnations of automated test environments[2]. Our current autobench framework is driven primarily these days by a member of our Austin analysis team, Karl Rister.

For our performance work, we wrap and re-wrap things with simplifying scripts and strive for parse'able text output. The key focus is to be able to crank the same workload across different OS and software levels, and to use and compare the varying IBM hardware platforms that Linux supports (x86, Power, blades, servers, etc). This helps us more quickly understand where Linux needs to be improved, or whether there is something to be improved in the various hardware/firmware combinations and configurations that customers would want to use. Or, in some cases when working with new workloads, it helps in figuring out what exactly we're running and understanding what we're really measuring.

But what about the more elaborate graphical user interfaces to performance tools? Not so much. The two user interface assists we do deploy are wrapping things with web interfaces (easy to crank out, easy to adjust), and graphing the generated data - more simply put: "a picture is worth a thousand words".
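
As a trivial (and entirely hypothetical) illustration of that habit: each run appends a line of key=value pairs, and a one-liner turns the accumulated file into a picture:

echo "run=$(date +%s) threads=8 tput_mb_s=742.5" >> results.txt
# pull the threads and throughput fields out into a plottable two-column file
awk '{ split($2,t,"="); split($3,v,"="); print t[2], v[2] }' results.txt > plot.dat
gnuplot -e "set terminal png; set output 'tput.png'; plot 'plot.dat' using 1:2 with linespoints title 'throughput'"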

That said, someone recently asked about the latest Visual Performance Analyzer (VPA) tool, which can be used with oprofile on Linux. The last time I played with it was several years ago, so it was time to give it another shot. And I found it was pretty nice. It won't replace our automation framework, but it was a nice alternative way to get and view profile information, especially from a remote system looking "into" a system under test. And sometime over the coming weeks, I'd like to see if we can tap into our pool of previously run and saved profiled workloads. Not sure if that's possible, but we'll see.

So here's what I found, with some quick context definitions as needed.

Using a profiler[3] is simply the process of measuring a program's or system's behavior while things are running. The most common approach is to sample what each CPU is doing on a regular basis - say, every 100,000 processor cycles - and keep track of where in the program (or the system) the time is being spent. Profilers can do this "automatically", and can be tailored to do some pretty clever tricks using different hardware counters.

Application programmers use profiling to figure out how to optimize their programs, usually improving data structures, loops, and program structure. Each programming language is unique, and has unique performance challenges. In our world though, we're focused on a hierarchy of performance challenges. With new hardware or new operating system code, we are generally focused more on how to optimize Linux for the hardware, looking for the typical "performance hazards" which can be improved in software, firmware, or hardware.

For VPA, I downloaded the x86 rpm version of VPA from IBM's alphaWorks[4] and installed it on a recently installed RHEL 5.1 Linux laptop, and also tried the zipped version for Windows. While the files were downloading, I did read on the alphaWorks website:
What is Visual Performance Analyzer?

Visual Performance Analyzer (VPA) is an Eclipse-based performance visualization toolkit. It consists of six major components: Profile Analyzer, Code Analyzer, Pipeline Analyzer, Counter Analyzer, Trace Analyzer, and Control Flow Analyzer.

Interesting. Eclipse-based? So do I need to install the full Eclipse environment? Apparently not. Maynard Johnson (our dependable oprofile answer-man) told me that Eclipse now supports a "Rich Client Platform" (RCP) which allows GUI applications to be built with Eclipse widgets without a full Eclipse platform. And sure enough, the two files I downloaded from alphaWorks were both named "vpa-rcp-6.0.0*".

VPA comes with several tools included. The one I was interested in was the Profile Analyzer. I tried examples from both the Linux client and the Windows client.

These screen shot images are from the Windows client, captured with SnagIt [5]. I really wish we had this tool available on Linux... I recommend going to the TechSmith website and requesting a SnagIt port for Linux clients, especially since they're already considering one for the Mac (per their website).

Invoking the Profile Analyzer, I simply defined the connection to the server I was testing. Plenty of options. Lots of flexibility. I simply wanted to see if I could connect to the system, and profile a couple of seconds of the system sitting there idling along. By the way, oprofile was previously installed on the system being tested - but that's easy to do as well.
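
For reference, the command-line steps the GUI is driving look roughly like this (a sketch; the exact options and event names vary by processor family and oprofile version):

opcontrol --no-vmlinux       # or --vmlinux=/path/to/vmlinux for kernel symbols
opcontrol --start
sleep 10                     # let the (mostly idle) system run for a bit
opcontrol --dump
opreport --symbols | head -20
opcontrol --shutdown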


Worked easily. Connected cleanly. Push-buttons to start and stop profiling. Then I could click on the output, and poke around to see where things were running.

(Quick aside: for those with good eyes, the system being tested was named xbox.ltc.austin.ibm.com - it's just a normal Power 5 based system running SLES 10. The analyst who named a set of three related systems has caused no end of raised eyebrows as we report on work we've done on systems named xbox, ps3, and wii.... clever... but mis-leading at times....)

The system under test wasn't doing anything (just idle), but I was impressed with how easy this was. In the coming weeks, we'll play more. But you should consider downloading VPA as something easy to try, play with, and get results.

Bill B

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
References:

[1] Martin Bligh - http://www.linuxsymposium.org/2006/view_bio.php?id=1697
[2] Autotest - http://test.kernel.org/autotest/
[3] Profilers - http://en.wikipedia.org/wiki/Profiler_%28computer_science%29
[4] Visual Performance Analyzer - http://www.alphaworks.ibm.com/tech/vpa
[5] SnagIt - http://www.techsmith.com/screen-capture.asp

Sunday, October 28, 2007

Linux is ready - Linux scalability example

Saw an interesting blog post late last week from Paul Murphy over on ZDNet[1]. While comparing AIX and Solaris offerings, he shifts to a quick assessment of the possibilities of Linux running on Power systems.

He seems to miss some of the flexibility of Linux - we believe it's pretty trivial to re-build applications from one Linux base or platform to another - when he makes the observation:
  • "Right now getting Linux code to run on Linux for PPC isn’t very difficult - it’s very inefficient, because of the endian change and the inappropriateness of the x86 specific ideas embedded in just about everything Linux, but it’s relatively easy to do and it works."
With today's software vendors, most applications are designed quite reasonably and thoughtfully with respect to cross-platform issues, which of course includes endian issues. So code built on Linux should be efficient on its target hardware, and in fact, a lot of what Linux operating system programmers focus on is exactly the notion of being efficient and effective across the platforms, be it x86 (Intel and AMD), zSeries, or Power, to name a few. There are of course more platforms, which highlights the advantages of leveraging the Linux base.

In working on Power systems, and seeing work on other platforms, it's clear that non-x86 design points and flexible approaches are encouraged in the operating system. This allows programmers across the industry to easily adapt and leverage Linux to provide leadership performance or scalability for that hardware platform.

In some of the "talkback" comments on Paul's post, there were several which were flashbacks to the Linux of many years ago, where people would refer to experiences where Linux didn't scale, didn't perform, or wasn't ready. There have been examples for years now which consistently demonstrate that Linux is scalable and does perform well.

Two and a half years ago, Sandor Szabo [2] took the two major Linux distributions of the time, SLES 9 and RHEL 4, and showed what features could be exploited in commercial-grade Linux. Both SUSE and Red Hat have since updated their distributions, moving up to SLES 10 and RHEL 5, both of which continue to significantly improve on scalability and performance.

In the summer of 2006, just 15 months ago, the Kernel Summit specifically covered Linux scalability [3]. The article nicely summarized the state of Linux scaling across the systems emerging in the industry. I found it interesting that one of the focus items was shifting to scaling issues for systems with 1024 to 4096 processors.

On a practical level, we see Linux scaling and performing fine on the 4-core to 16-core to 64-core systems that we test. For example, we tried SPEC.org's SPECompM2001 workload on a recent Power 6 16-core system [4] and were able to get the OpenMP threads performing where we wanted them to. The breadth of software support on Linux extends to platform-specific compilers, and in that publish the libhugetlbfs library [5] was used to transparently take advantage of larger memory pages on the system. This library is supported across several architectures and is being actively updated and extended. Each workload could be compiled in 32-bit or 64-bit mode, with varying optimization levels.
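
For the curious, using libhugetlbfs transparently looks roughly like the sketch below (the page count, mount point, and application name are placeholders, and details vary by distro and library version):

echo 64 > /proc/sys/vm/nr_hugepages         # reserve some large pages
mkdir -p /mnt/hugetlbfs
mount -t hugetlbfs none /mnt/hugetlbfs
# preload the library so the application's heap is backed by large pages
HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./my_openmp_app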

Just a couple of examples of Linux being ready to perform and to scale today.

- - - - - - - - - - - - - -
References...

[1] Linux: it’s the last word on AIX vs. Solaris by ZDNet's Paul Murphy -- With Linux on Power you get open source, you get a hot chipset, you get that IBM relationship, you get a clear future direction, you get a solid development community, you get access to lots of applications, you get a free any time escape to the cheaper Lintel world.

[2] Exploit the power of Linux with Informix Dynamic Server 10.0

[3] Kernel Summit 2006: Scalability

[4] SPECompM2001 - 16 core publish

[5] Sourceforge.net - libhugetlbfs project

Monday, October 22, 2007

So what performance can I expect?

While most days at work are busy with Linux performance analysis and improvement projects, I regularly get performance questions from customers, from our sales and support teams, and from other seekers of information. In these discussions, the prevailing quick question is invariably:
  • Say.... "My user has this application running on this system with this software. If we change one or more of these pieces, would the application run much faster?" That's usually the whole question. Then there's the expectant long pause as they wait for "The Response."
The answer of course is always: Well, it depends....

So we start with the two basic questions.
  1. What specifically are you running today?
  2. What are you really trying to improve?
I know it's going to be fun when the first question turns into a lengthy information-gathering exercise. What system, specifically? What software, specifically? Is the software pre-built, or do you build it yourself? What system configuration, specifically? What are you measuring, specifically? I don't really keep score, but several of these information requests never come back. I suspect they were looking for the Silver Bullet. It happens.

Assuming we make it past the first information scrub, we begin focusing on what they really want to improve. The choices are quite varied... usually they just want faster execution time. Sometimes better (quicker) responsiveness. Sometimes there are other considerations: scaling up, scaling out, or server consolidation with various virtualization techniques. Whatever it is, understanding specifically what they are measuring is critical. In the realm of performance, we measure, change one thing, measure again, and assess the change. And repeat.

I have observed that in some performance engagements, little is known, many things are changed, and conclusions are quickly reached. Worse, these conclusions have an almost mystical ability to end up on an executive's desk.

So while we often get involved in helping to select the appropriate hardware platform, the focus of this new blog is where and how Linux can make a difference. These entries are intended to help me keep track of a journey of learning to be a better performance analyst. The three cases to cover are (a) when I think I already have the answer, (b) more often, when I have some of the answer and am hunting down those pesky details, or (c) in some cases, when I really haven't a clue and am off on another learning tangent again. As I said, it's a journey.

There are a lot of cool things happening in the world of Linux performance. In the coming weeks, I'll talk about some of the areas that are active, emerging, or already in use. Some will be on Power, where I spend a lot of time, but a lot is generic and usually available across the platforms. Whether this will help you improve the performance of your applications... well, it depends. We shall see.
One of many bloggers.

Bill Buros

Bill leads an IBM Linux performance team in Austin, TX (the only place, really, to live in Texas). The team is focused on IBM's Power offerings (old, new, and future), working with IBM's Linux Technology Center (the LTC). While the focus is primarily on Power systems, the team also analyzes and improves overall Linux performance for IBM's xSeries products (both Intel and AMD), driving performance improvements which are both common for Linux and occasionally unique to the hardware offerings.

Performance analysis techniques, tools, and approaches are nicely common across Linux. Having worked for years in performance, there are still daily reminders of how much there is to learn in this space, so in many ways this blog is simply another vehicle in the continuing journey to becoming a more experienced "performance professional". One of several journeys in life.

The Usual Notice

The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions, try as I might to influence them.