Monday, March 8, 2010

IBM donates Linux-based POWER7 super-computer

Rice University has deployed a sweet POWER7-based cluster running Linux. The brand-new IBM 750 systems in the cluster run Linux alongside Maui, Torque, openmpi, InfiniBand, 10Gb Ethernet, the Advance Toolchain, the IBM compilers, and IBM's ESSL math libraries.

Check out http://bluebiou.rice.edu/.

Over the coming weeks, we'll describe numerous collaborative efforts underway to develop, deploy, leverage, and execute workloads on the cluster.

One of the exciting aspects of the project is that the team at Rice University is well versed in deploying and managing open-source-based, production-level HPC clusters used by hundreds of students and researchers.

The initial goal is to integrate the cluster into the Maui/Torque infrastructure already deployed and in use at Rice University, extending it so researchers can control the POWER7 SMT hardware threads on each node, the number of 16MB huge pages if desired, energy-optimization settings, and other POWER tuning techniques.
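As a flavor of what that per-node control can look like, here's a minimal sketch (ppc64_cpu ships with the powerpc-utils package, and the huge page count of 64 is just an illustrative number):

# toggle the SMT hardware threads on a node (run as root)
ppc64_cpu --smt=off       # schedule on whole cores only
ppc64_cpu --smt=on        # bring the SMT threads back online
# reserve a pool of the default-sized (16MB on POWER) huge pages
echo 64 > /proc/sys/vm/nr_hugepages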

Thursday, February 18, 2010

And here comes POWER7!

It's been some time since I last posted; time flies when you're working on a new generation of hardware and systems.

The latest POWER7-based systems were just announced. Naturally, Linux is supported. Linux already exploits the POWER7 technologies, and more is coming. For example, SLES 11 has the POWER7 enablement in place and was used for many of the standard benchmarks we run when launching new systems and operating system updates. For some quick performance data, check out this link.

In the days and weeks to come, several of us will be collaborating to post insights, hints, and tips on using Linux to exploit the capabilities of IBM's latest POWER-based systems. What's being delivered now, and what's coming down the pipeline, is pretty impressive.

For more links, see Linux Performance.

For a taste of what's coming, see IBM's Statement of Direction on IBM Power Systems high-end servers.

Wednesday, October 22, 2008

Hey! Who's stealing my CPU cycles?!

I hear this every now and then about Power systems from customers, programmers, and even peers. In the more recent distro versions, there's an "st" column in the CPU metrics which tracks "stolen" CPU cycles from the perspective of the CPU being measured. This "steal" column has been around for a while, but the latest updates (RHEL 5.2 and SLES 10 SP2) carry the fixes which display the intended values - so the numbers are getting noticed more.

I believe this "cpu cycle stealing" all came into being when things like Xen were being developed and the programmers wanted a way to account for the CPU cycles which were allocated to another partition. I suspect the programmers were looking at it from the perspective of "my partition", where something devious and nefarious was daring to steal my CPU cycles. Thus the term "stolen CPU cycles". Just guessing though.

This "steal" term is a tad unfortunate. It's been suggested that a more gentle term of "sharing" would be preferred for customers. But digging around the source code I found the term "steal" is fairly pervasive. And what's in the code, tends to end up in the man pages. Ah well.

With Power hardware, there's a mode where two hardware threads per core are juggled by the Linux scheduler. This shows up as CPU pairs (for example, cpu0 and cpu1) which represent the individually schedulable hardware threads running on a single processor core. This is SMT (simultaneous multi-threading) mode on Power.
  • The term "hardware thread" is with respect to the processor core. Each processor core can have two active hardware threads. Software threads and processes are scheduled onto the cores by the operating system via the schedulable CPUs which correspond to those two hardware threads.
In the SMT realm, the two hardware threads running on a processor core can be considered siblings of each other. So if both hardware threads are flat-out busy with work from the operating system and evenly balanced, each of the corresponding CPUs being scheduled generally gets 50% of the processor core's cycles.
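A quick way to see this on a live system (a small sketch; ppc64_cpu comes from the powerpc-utils package, and the sysfs topology files assume a reasonably recent kernel):

# is SMT on, and which logical CPUs are siblings on the same core?
ppc64_cpu --smt
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # e.g. "0-1" for an SMT pair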

From a performance perspective, this has tremendous advantages because the processor core can flip between the hardware threads as soon as one thread hits a short wait for something like a memory access. Essentially the core can overlap instruction fetches and memory accesses for the two hardware threads, which improves its overall efficiency.

In days of old, each CPU's metrics were generally based on the premise that a CPU could get to 100% user busy. Now the steal column can account for the processor cycles consumed by the two SMT sibling threads sharing a core, not to mention additional CPU cycles being shared with other partitions. It's still possible for an individual CPU to go to 100% user busy while its SMT sibling thread is idle.
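For the curious, the raw counter behind that "st" column lives in /proc/stat; here's a rough sketch of peeking at it (the field position assumes a kernel new enough to report steal time):

# the aggregate "cpu" line in /proc/stat: user nice system idle iowait irq softirq steal ...
# vmstat and top derive their "st" column from the steal field
grep '^cpu ' /proc/stat | awk '{print "steal ticks so far:", $9}'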

For example, in the vmstat output below, the rightmost CPU column is the steal column. On an idle system, this value isn't very meaningful.

# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 14578432 408768 943616 0 0 0 0 2 5 0 0 100 0 0
0 0 0 14578368 408768 943616 0 0 0 0 25 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 32 12 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 0 21 45 0 0 100 0 0

In the next example, we push do-nothing work onto every CPU (in this case a four-core system with SMT on, so 8 CPUs were available), and the vmstat "st" column quickly gets to the point where the CPU cycles on average are 50% user and 50% steal.
  • Try using "top" and pressing the "1" key to see more easily what's happening on a per-CPU basis.

while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 11 34 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574656 408704 943488 0 0 0 0 10 34 50 0 0 0 50
For customers and technical people who were used to seeing their CPUs run up to 100% user busy, this can be... disconcerting... but it's now perfectly normal, even expected.

I just wish we could distinguish the SMT sharing of CPU cycles from the CPU cycles being shared with other partitions.

For more details on the process of sharing the CPU cycles, especially when the CPU cycles are being shared between partitions, check out this page where we dive into more (but not yet all) of the gory details...

Wednesday, October 15, 2008

Linux on Power.. links and portals..

Several of us were playing around recently counting up how many "portals" we could find for information in and around Linux for Power systems. On the performance side, we were specifically interested in seeing whether there was information "out there" that we weren't really aware of and could leverage. We actually found a lot of performance information, which I'll try to highlight more in the coming weeks.

For the portals, we did find an amazing assortment of web pages available. We hit classic marketing portals, hardware and performance information, generic Linux information, technical portals, a couple of old and outdated portals, various download sites for added-value items, lots of IBM forums on developerWorks, IBM Redbooks (always good information), and pointers to wiki pages spanning a number of subjects.

The marketing and customer teams generally point to the IBM Linux page as the primary entry point (portal). The five web tabs out there (Overview, Getting Started, Solutions, About Linux, and Resources) can get the reader to all sorts of official information.

For our list of web sites and pages, rather than just file the list in another email bucket never to be seen again, we created an index page called Quick Links to keep track of what we wanted to hunt down and get updated and more current. We naturally didn't want to call it another portal. 'Course, now we're hunting down subject-matter experts (aka volunteers) to help update the various wiki pages, especially under the developerWorks Linux for Power architecture wiki. We're particularly interested in providing more of the practical details, one example being the HPC Central - Red Hat page where a series of technical wiki pages are available.

Another interesting observation is IBM's classic reliance on the developerWorks forums, which we listed on our Quick Links index page. The Linux community is far more used to mailing lists for interaction, questions, and development issues. Forums are fine for questions and answers, but in our minds many of the forums are rarely used, even when the technology or product each one covers is helpful and useful. I expect we'll start seeding the forums with answers to questions we get from customers, developers, and peers to help nudge things along, which will also give us more places to link to for getting practical questions answered.

[edited 10/30/2008 - we made the Quick Links page the LinuxP home page]

Thursday, October 9, 2008

SystemTap SIGUSR2 tracing

A story of how people can get all excited about "performance issues". And how simple and easy tools can help figure out the root cause.

I was involved recently in an issue with Java performance where a system had "flipped out" and gone to 100% CPU busy across all of the processor cores after a software upgrade. Numerous technical people got involved in a flurry of activity, and of course performance teams were involved because systems tend to perform poorly when the CPUs are busy 100% of the time.

Various people dug into things and determined that something was generating way-too-many SIGUSR2 signals to Java, which apparently was driving the Java engine into constant garbage collection. Many many many Java threads running. It was a sight to behold.

Naturally, everyone involved claimed that *they* were not responsible for the spurious SIGUSR2 signals. So clearly this was a hardware issue. As hardware engineers were being brought in, some peers on the SystemTap team quietly suggested that we use SystemTap to help figure out who was sending the signals. Ahh, a voice of reason.

It turns out there's an off-the-shelf SystemTap script available which snags the signal.send entry point and is already instrumented to track who's triggering the signal and count up the number of times each source does so.
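For a flavor of what such a script can look like, here's a minimal sketch built on the signal.send probe point from SystemTap's signal tapset (an illustration, not the exact off-the-shelf script):

cat > sigusr2.stp <<'EOF'
global sends
probe signal.send {
    if (sig_name == "SIGUSR2")
        sends[execname(), pid_name] <<< 1    # sending process -> receiving process
}
probe end {
    foreach ([sender, target] in sends)
        printf("%s sent SIGUSR2 to %s %d times\n",
               sender, target, @count(sends[sender, target]))
}
EOF
stap -v sigusr2.stp     # Ctrl-C when you've seen enough; the counts print on exit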

The script showed who was triggering the signals, which was exactly the "ah-hah!" clue the technical teams wanted and needed. Shortly thereafter a fix was found and applied, and the 100% CPU busy condition was solved. With the CPUs no longer 100% busy for no particularly good reason, our performance work was done. If only they were all this easy.

We documented the SystemTap steps over on a wiki page in developerWorks.
My only wish out of this exercise was that the SystemTap community had even more "off-the-shelf" scripts certified as "safe to use" on real-life customer systems. I have to admit I get a little leery of popping scripts like this onto a customer system - so it's something we carefully tested first on a local system. A quick matrix of scripts which are tested and safe across distro version / platform combinations would be very nice to have.

Thursday, July 17, 2008

RHEL 5.2 and HPC performance hints

Building on the SLES 10 SP2 kernel-build post from a couple of weeks ago, we got the equivalent RHEL 5.2 page posted under the developerWorks umbrella. It's mostly the same conceptual steps, just a little different in the specifics. And of course, in the RHEL 5.2 example we reverse the SLES 10 example by building a 4KB-page kernel, where "normally" the RHEL 5.2 kernel is built with 64KB pages. It's a good experiment to play with when you want to see the performance gains that come from leveraging larger page sizes.
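A couple of quick commands make it easy to confirm which page size a given kernel is running with (a small sketch; the config file path is the standard RHEL location):

getconf PAGESIZE                                      # 65536 on the stock RHEL 5.2 ppc64 kernel, 4096 on the rebuilt one
grep CONFIG_PPC_64K_PAGES /boot/config-$(uname -r)    # the kernel config option that selects 64KB pages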
We linked this in under the HPC Central wiki page where several of us are playing around with adding descriptive how-to's for HPC workloads based on practical experience.

See HPC Central, follow the link to the Red Hat Enterprise Linux page, which is where the kernel page is linked in. We plan to replicate these pieces for SUSE Linux Enterprise Server next month.

Tuesday, July 15, 2008

The building blocks of HPC

Top 500 again. Linpack HPL. Hitting half a teraflop on a single system.

Using RHEL 5.2 on a single IBM Power 575 system, Linux was able to hit half a teraflop with Linpack. These water-cooled systems are pretty nice. Thirty-two POWER6 cores packed into a fairly dense 2U rack form factor. These systems are designed for clusters, so 14 nodes (14 single systems) can be loaded into a single rack. Water piping winds its way into each system and over the cores (we of course had to pop one open to see how things looked and worked). The systems can be loaded with 128GB or 256GB of memory. A colleague provided a nice summary of the Linpack result over on IBM's developerWorks.

For Linux, there are several interesting pieces, especially as we look at Linpack as one of the key workloads that takes advantage of easy HPC building blocks. RHEL 5.2 comes with 64KB pages, which provide easy performance gains out of the box. The commercial compilers and math libraries provide tailored, easy exploitation of the POWER6 systems. Running Linpack on clusters is the whole basis of the Top 500 workloads.
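Kicking off a single-node run is then straightforward (a hedged sketch; the xhpl binary and the HPL.dat input file follow the standard netlib HPL conventions, and the process count matches the 32 cores of a 575):

# launch the HPL benchmark across the 32 cores of one node
# HPL.dat in the current directory sets the problem size, block size, and the P x Q process grid
mpirun -np 32 ./xhpl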

It's easy to take advantage of the building blocks in RHEL 5.2. OpenMPI in particular, the InfiniBand stack, and libraries tuned for the POWER hardware are all included. When we fire up a cluster node, we automatically install these components (a sample install command follows the list):
  • openmpi including -devel packages
  • openib
  • libehca
  • libibverbs-utils
  • openssl
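On a RHEL 5.2 node, that amounts to something like the following (a hedged sketch; exact package names can shift a bit between updates):

# pull in the MPI, InfiniBand, and crypto building blocks on a fresh node
yum install openmpi openmpi-devel openib libehca libibverbs-utils openssl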
These building blocks allow us to take the half-a-teraflop single-system Linpack result and begin running it "out of the box" on multiple nodes. There are cluster experts around that I'm learning from, and there are lots of interesting new challenges in the interconnect technologies and configurations. In this realm, I'm learning that one of the emerging technology shifts is the 10GbE (10Gb Ethernet) interconnect vs. InfiniBand. InfiniBand has all sorts of learning curves associated with it; every time I try to do something with InfiniBand, I find another thing to learn. It'll be interesting to see whether the 10GbE technology will be more like simply plugging in an Ethernet cable and off we go. A good summer project...
One of many bloggers.

Bill Buros

Bill leads an IBM Linux performance team in Austin, TX (the only place really to live in Texas). The team is focused on IBM's Power offerings (old, new, and future), working with IBM's Linux Technology Center (the LTC). While the focus is primarily on Power systems, the team also analyzes and improves overall Linux performance for IBM's xSeries products (both Intel and AMD), driving performance improvements which are both common across Linux and occasionally unique to the hardware offerings.

Performance analysis techniques, tools, and approaches are nicely common across Linux. Having worked in performance for years, I still get daily reminders of how much there is to learn in this space, so in many ways this blog is simply another vehicle in the continuing journey to becoming a more experienced "performance professional". One of several journeys in life.

The Usual Notice

The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions, try as I might to influence them.