Monday, June 6, 2011

gcc performance - getting better all of the time

There's quite a bit going on in the world of Linux on Power, where teams have been focused on improving gcc performance. Lately, a series of articles has been published on DeveloperWorks which nicely highlights the performance gains that gcc (packaged in the Advance Toolchain) provides over the gcc packaged with the Linux operating system.

Two articles are available which dive into performance gains across a number of workloads embedded in the SPECcpu2006 suite. The approach is simple: measure performance using the gcc bundled with the operating system release, then install the Advance Toolchain (a couple of rpms), change the path to pick up its gcc, rebuild, re-run, and compare.
Naturally, your mileage will vary.

(this is a re-post of an entry from a new "Think Power Linux" community being developed under IBM's DeveloperWorks realm)

Friday, November 19, 2010

SC10 - supercomputing conference

Just got back from Supercomputing 2010 (SC10) in New Orleans.

The annual conference provides a good perspective on the progress of "super computers" across universities, research labs, and industry. In recent years, supercomputer implementations have been slowly evolving toward a mix of compute core technologies, trending more and more strongly toward heterogeneous computing.

There was of course too much to see and participate in each day. The two areas where I spent most of my time were catching up on the progress of GPUs (and their variants) and noting the surprising emerging focus on HPC clouds. I'll post more on those two areas in the coming days.

There was quite a bit of discussion around Exascale computing. Naturally, with so many researchers at the conference, there were numerous perspectives on what that even meant. Exaflops? 1000 times faster? Bigger? Number of cores/GPUs? Power consumption? Etc. If nothing else, more good fodder for research and areas of discussion. Which is one of the reasons everyone gets together. So all good. Clearly though, there are a lot of very serious challenges coming down the road for the vision of super-computers 8-10 years out.

As usual, the latest Top 500 list was announced at the conference. The announcement was interesting given a pervasive feeling across the conference that measuring supercomputers with a small and relatively trivial program like Linpack is clearly outdated. It was funny how often researchers and speakers would voice annoyance or frustration at Linpack's continued featured role. I suspect this is not a new sentiment, but I was struck by the number of specific references at this conference.

A while back, HPC Challenge (as a suite of benchmarks) was created to provide a more comprehensive measurement of the many aspects of a super computer. It also grants awards across the varying aspects of super-computing, essentially gold, silver, and bronze in each category. I may not have been paying close enough attention, but I didn't get the impression that HPC Challenge was a focus across the conference.

At the conference a new "Top 500" benchmark suite was launched. The new suite, graph500, was introduced with the first results listed at graph500.org. This new benchmark suite is intended to focus on data intensive computing, another critical aspect of supercomputing. Indeed, it's often the amount of data being processed, consumed, and in many cases transformed into visualization which represents one of the several awe-inspiring sides of supercomputing. In fact, this is the killer side of HPC cloud computing - getting data to and from the compute resources in the cloud.

Anyway, came back with over 50 papers from the Proceedings and over 30 Tutorials. Plenty to dig through, discuss, and get more insights. Will work to post more thoughts as interesting pieces are uncovered.

Friday, November 12, 2010

So really, does the Advance Toolchain help performance?


We're often asked whether - and by how much - the Advance Toolchain actually helps performance for applications running on the various distros (RHEL and SLES) on POWER systems.

The Advance Toolchain of course is a set of updated rpms which provide an updated gcc compiler, processor-tuned libraries, and a number of more current tools over what is standard in the distro itself. Check out one of the README files available for more details.

The ability to easily flip to a newer "toolchain" for POWER7 systems has been particularly helpful for many applications. An article was recently completed which demonstrates the relative performance gains when leveraging the latest Advance Toolchain (version 3.0-1) from the University of Illinois over the gcc which comes packaged with each distro release.

The article "Advance Toolchain performance improvements" provides the details and graphs of component-by-component breakdowns of engineering runs of SPECcpu2006® for integer and floating point workloads. The relative performance gains and losses are graphed, which provides a quick view of the possibilities of using these libraries and the newer gcc.

One of the key highlights is that the graphs very nicely demonstrate the repeating performance perspective of "Well, it depends...". Not all workloads will benefit from the libraries, but many do!

Indeed, a key advantage of the Advance Toolchain is the continuing focus, updates, optimizations, and fixes, which provide users with the latest technologies in a form which can be serviced and supported by IBM.


SPEC® and the benchmark names SPECint® and SPECfp® are registered trademarks of the Standard Performance Evaluation Corporation.

Tuesday, June 15, 2010

IBM and the Jeopardy Challenge

Check out this YouTube video. It's a nice introduction to having a super-computer compete in a series of example Jeopardy! games.

This is just one of numerous fun and still seriously challenging projects being worked these days. The IBM Research teams are amazing, and they have found some very interesting performance challenges. The project drives advanced technologies, product improvements, system improvements, and performance tool improvements, but most of the work is in the realm of demonstrating complex natural language processing in a time-constrained answer-and-question world.

Friday, May 21, 2010

Busy time for the 2.6.32 kernel on POWER7

As things move forward with the launches of IBM's POWER7 systems this year, there's a flurry of activity on many fronts. In particular, we've been pretty busy with the 2.6.32 kernel in many places. Red Hat has RHEL 6 in beta, the BlueBioU project of course is using 2.6.32, and Novell has announced the latest service pack for SLES 11, which is based on 2.6.32.

Red Hat's RHEL 6 is already in beta (from last month):

* http://press.redhat.com/2010/04/21/red-hat-enterprise-linux-6-beta-available-today-for-public-download/

Novell just announced their latest service pack for SLES 11 which among other things upgrades the kernel to 2.6.32:

* http://www.novell.com/promo/suse/sle11sp1.html

A key proof point at BlueBioU continues to be worked on collaboratively across many teams. Being based on 2.6.32 enables the easy availability of a number of the latest Linux technologies.

* http://bluebiou.rice.edu/

For some example performance FAQs emerging with the work on the 2.6.32 kernel base, check out:

* http://www.ibm.com/developerworks/wikis/display/LinuxP/Performance+FAQs

To check on current questions being posed see the Linux on Power architecture forum at:

* http://www.ibm.com/developerworks/forums/forum.jspa?forumID=375

Some recent questions address the holes in CPU numbering on POWER7 when SMT=2 or SMT=1 is used, how to control the DSCR settings with the ppc64_cpu command, various tools questions, how to dynamically control the SMT settings on a system, page sizes on Linux, etc. Some of the questions are driving functional updates to the commands or approaches in development, so asking a leading question there is always a good thing.
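For reference, the SMT and DSCR controls mentioned above look roughly like this. The invocations assume the ppc64_cpu command from powerpc-utils, so they only do real work on a POWER box; elsewhere the sketch just reports that the tool isn't there:

```shell
# Query (and, commented out, change) SMT mode and DSCR via ppc64_cpu
if command -v ppc64_cpu >/dev/null 2>&1; then
    ppc64_cpu --smt          # show the current SMT setting
    # ppc64_cpu --smt=2      # two hardware threads per core (needs root)
    # ppc64_cpu --smt=off    # single-threaded cores
    # ppc64_cpu --dscr       # show the Data Stream Control Register setting
else
    echo "ppc64_cpu not available on this system"
fi
```

With SMT=2 or SMT=1, the "holes" people ask about show up because the CPU numbers of the unused hardware threads simply disappear from the online CPU list.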

Wednesday, April 7, 2010

open-source - patents - and what about performance?

Interesting debates arise around us every day. The latest "patent pledge" excitement on the web is fascinating to watch. I tried to dig through and read the details of the various lists of patents, but the eyes glaze and I have to wonder who's pulling whose chain.

With some mild interest, I saw the Linux Foundation blog post this evening re-iterating the pledge from 2005 - quoting a statement from Dan Frye.

Now, I probably don't really count as an unbiased observer, working directly in Dan Frye's organization, but I will observe that in our day-to-day interactions with Linux and customers, our focus is promoting open-source solutions every day. Some of us even take some quiet personal delight in catching up and passing classic IBM proprietary solutions, but our real focus is getting customers up, running, and happy.

Admit'ably, my world is focused primarily on helping customers tune and improve the mostly open-source based deployments of fairly complex applications on Linux on POWER systems. In day to day work, I've been most impressed with the varied partners (both open-source and proprietary) that we implicitly and explicitly work with. The dedication to making things "just work" and "then work nicely" strikes me as the path that customers expect us to embrace.

Improving performance in open-source is usually a challenging process. We've got lists of cool performance things I'd love to see implemented. The gate to getting these pieces implemented is not the patents; it's the process of building consensus and convincing the "community" to adopt something that'll work smoothly across the platforms. Once you have that, customers have the pieces they need to implement, tune, understand, and optimize their applications and software stacks. And all in all, it's reassuring to see IBM's commitments and continued "work nice" community approach being reinforced. It's why many of us greatly prefer working in the Linux space.

Now, back to ganglia and CPU utilization. Something's not quite right there. More on that next week.

Wednesday, March 24, 2010

Busy month for Linux - NCSA picks Linux for Blue Waters

Along the lines of interesting - but not particularly surprising - trends in the industry, we notice that the NCSA has recently endorsed the continuing evolution of HPC workloads towards a Linux base.

Check out the NCSA article titled: "Linux selected as operating system for Blue Waters".

This is particularly encouraging for not only the operating system and in our case the POWER7 platform, but also for the various software stack products that are being developed, improved, enhanced, and deployed in high-demand HPC environments across many industries.

Massive HPC clusters on the scale of projects like Blue Waters are exciting just for the breadth of technologies being worked on. There's a nice Blue Waters project newsletter available which shows the range of activities happening around this project.

One of our real-life challenges is to improve the whole food-chain of software products which enable easier deployment of HPC applications on the Linux base and the POWER7 platform. Collaborative projects like the Blue BioU project can provide an expanding community of Linux HPC users access to open-source based POWER7 clusters.

Monday, March 8, 2010

IBM donates Linux-based POWER7 super-computer

Rice University has deployed a sweet Linux-based POWER7 cluster. The brand new IBM 750 systems in the cluster run Linux with Maui, Torque, openmpi, Infiniband, 10Gb Ethernet, the Advance Toolchain, the IBM compilers, and IBM's ESSL math libraries.

Check out http://bluebiou.rice.edu/.

Over the coming weeks, we'll describe numerous collaborative efforts underway to develop, deploy, leverage, and execute workloads on the cluster.

One of the exciting aspects of the project is that the team at Rice University is well versed in managing and deploying open-source-based production-level HPC clusters used by hundreds of students and researchers.

The initial goal is to integrate the cluster into an existing Maui/Torque infrastructure deployed and in use at Rice University. The infrastructure is being extended to allow researchers control of the POWER7 SMT hardware threads on each node, the number of 16MB huge pages if desired, energy optimization techniques, and POWER tuning techniques.
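As a quick check on one of those knobs, the huge page configuration on a node is visible straight from /proc/meminfo (on POWER the huge page size is the 16MB mentioned above; on x86 you'll typically see 2048 kB instead):

```shell
# Show huge page counts and the platform's huge page size
grep -i '^huge' /proc/meminfo
```

HugePages_Total/HugePages_Free report the pool a researcher's job could draw from, and Hugepagesize confirms the platform's huge page size.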

Thursday, February 18, 2010

And here comes POWER7 !

While it's been some time since I last posted much, time flies when you're working on a new generation of hardware and systems.

The latest POWER7 based systems were just recently announced. Naturally, Linux is supported. Linux already exploits the POWER7 technologies and more is coming. For example, SLES 11 has the POWER7 enabling available and was used for numerous standard benchmarks used when we launch systems and operating system updates. For some quick performance data, check out this link.

In the days and weeks to come, several of us will be collaborating to post insights, hints, and tips on using Linux to exploit the capabilities of IBM's latest POWER-based systems. The capabilities of what's being delivered and what's coming down the pipeline are pretty impressive.

For more links, see Linux Performance.

For a taste of what's coming, see IBM's Statement of Direction on IBM Power Systems high-end servers.

Wednesday, October 22, 2008

Hey! Who's stealing my CPU cycles?!

I hear this every now and then on the Power systems from customers, programmers, and even peers. In the more recent distro versions, there's a new "st" column in the CPU metrics which tracks the usage of "stolen" CPU cycles, from the perspective of the CPU being measured. This "steal" column has been around for a while, but the most recent service packs of RHEL 5.2 and SLES 10 sp2 have the latest fixes which display the intended values - so the values are getting noticed more.

I believe this "cpu cycle stealing" all came into being when things like Xen were being developed and the programmers wanted a way to account for the CPU cycles which were allocated to another partition. I suspect the programmers were looking at it from the perspective of "my partition", where something devious and nefarious was daring to steal my CPU cycles. Thus the term "stolen CPU cycles". Just guessing though.

This "steal" term is a tad unfortunate. It's been suggested that a gentler term like "sharing" would be preferred for customers. But digging around the source code, I found the term "steal" is fairly pervasive. And what's in the code tends to end up in the man pages. Ah well.

With Power hardware, there's a mode where the two hardware threads are juggled by the Linux scheduler. This is implemented via cpu pairs (for example, cpu0 and cpu1) which represent the schedule'able individual hardware threads running on the single processor core. This is the SMT mode (simultaneous multi-threaded) on Power.
  • The term "hardware thread" is with respect to the processor core. Each processor core can have two active hardware threads. Software threads and software processes are scheduled on the processor cores by the operating system via the schedule'able CPUs which correspond to the two hardware threads.
In the SMT realm, the two hardware threads running on a processor core can be considered siblings of each other. So if the two hardware threads are flat-out busy with work from the operating system and evenly balanced, then each of the corresponding CPUs being scheduled is generally getting 50% of the processor core's cycles.
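The CPU pairing is visible in sysfs. A quick way to see which CPUs are SMT siblings of cpu0 (the topology files exist on any reasonably recent kernel; on a non-SMT box the list simply contains cpu0 alone):

```shell
# List the hardware-thread siblings that share cpu0's core, e.g. "0-1" or "0,1"
f=/sys/devices/system/cpu/cpu0/topology/thread_siblings_list
if [ -r "$f" ]; then
    cat "$f"
else
    echo "CPU topology not exported by this kernel"
fi
```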

From a performance perspective, this has tremendous advantages because the processor core can flip between the hardware threads as soon as one thread hits a short-wait for things like memory accesses. Essentially the processor core can fetch the instructions and memory accesses simultaneously for the two hardware threads which improves the efficiency of the core.

In days of old, each CPU's metrics were generally based on the premise that a CPU could get to 100% user busy. Now, the new steal column can account for the processor cycles being shared by the two SMT sibling threads, not to mention additional CPU cycles being shared with other partitions. It's still possible for an individual CPU to go to 100% user busy, while the SMT sibling thread is idle.

For example, in the vmstat output below, the rightmost CPU column is the steal column. On an idle system, this value isn't very meaningful.

# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 14578432 408768 943616 0 0 0 0 2 5 0 0 100 0 0
0 0 0 14578368 408768 943616 0 0 0 0 25 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 32 12 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 0 21 45 0 0 100 0 0

In the next example, pushing do-nothing work on every CPU... (in this case a four-core system, SMT was on, so 8 CPUs were available...), we'll see the vmstat "st" column quickly get to the point where the CPU cycles on average are 50% user and 50% steal.
  • Try using "top", then press the "1" key to see what's happening on a per-CPU basis more easily.

while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 11 34 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574656 408704 943488 0 0 0 0 10 34 50 0 0 0 50
For customers and technical people who were used to seeing their CPUs at up to 100% user busy, this can be... disconcerting... but it's now perfectly normal... even expected...
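The eight copy-pasted loops above can be generalized. Here's a sketch that spawns one busy worker per schedulable CPU, whatever the count is; the workers do a fixed chunk of do-nothing arithmetic so the experiment terminates on its own instead of needing a kill:

```shell
# One short-lived busy worker per CPU; watch vmstat 1 in another window while they run
n=$(grep -c '^processor' /proc/cpuinfo)
i=0
while [ "$i" -lt "$n" ]; do
    # each worker spins through a bounded counter, then exits
    ( j=0; while [ "$j" -lt 500000 ]; do j=$((j+1)); done ) &
    i=$((i+1))
done
echo "spawned $n busy workers"
wait
```

On a four-core SMT system this spawns 8 workers, which is exactly the hand-written example above.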

I just wish we could distinguish the SMT sharing of CPU cycles, and the CPU cycles being shared with other partitions.

For more details on the process of sharing the CPU cycles, especially when the CPU cycles are being shared between partitions, check out this page where we dive into more (but not yet all) of the gory details...

Wednesday, October 15, 2008

Linux on Power.. links and portals..

Several of us were playing around recently counting up how many "portals" we could find for information in and around Linux for Power systems. On the performance side, we were specifically interested in seeing whether there was information "out there" that we could leverage that we weren't really aware of. We actually found a lot of performance information which I'll try to highlight more in the weeks coming.

For the portals, we did find an amazing assortment of web pages available. We hit classic marketing portals, hardware and performance information, generic Linux information, technical portals, a couple of old and outdated portals, various download sites for added-value items, lots of IBM forums on developerWorks, IBM Redbooks (always good information), and pointers to wiki pages spanning a number of subjects.

The marketing and customer teams generally point to the IBM Linux page as the primary entry point (portal). The five web tabs out there (Overview, Getting Started, Solutions, About Linux, and Resources) can get the reader to all sorts of official information.

For our list of web sites and pages, rather than just file the list in another email bucket never to be seen again, we created an index page called Quick Links to keep track of what we wanted to hunt down and get updated and more current. We naturally didn't want to call it another portal. 'Course, now we're hunting down subject-matter experts (aka volunteers) to help update the various wiki pages, especially under the developerWorks Linux for Power architecture wiki. We're particularly interested in providing more of the practical details, one example being the HPC Central - Red Hat page where a series of technical wiki pages are available.

Another interesting observation is seeing IBM's classic reliance on the developerWorks forums which we listed on our Quick Links index page. The Linux community is far more used to mailing lists for interactions, questions, and development issues. Forums are fine for questions and answers, but in our mind many of the forums are rarely used, even if the technology or product covered by each forum is helpful and useful. I would expect that we'll start seeding the forums with answers to questions we get from customers, developers, and peers. Help nudge things along. Which will give us more places to link to and get the practical questions answered.

[edit'ed 10/30/2008 - we made the Quick Links page the LinuxP home page]

Thursday, October 9, 2008

SystemTap SIGUSR2 tracing

A story of how people can get all excited about "performance issues". And how simple and easy tools can help figure out the root cause.

I was involved recently in an issue with Java performance where a system had "flipped out" and gone to 100% CPU busy across all of the processor cores after a software upgrade. Numerous technical people got involved in a flurry of activity, and of course performance teams were involved because systems tend to perform poorly when the CPUs are busy 100% of the time.

Various people dug into things and determined that something was generating way-too-many SIGUSR2 signals to Java, which apparently was driving the Java engine into constant garbage collection. Many many many Java threads running. It was a sight to behold.

Naturally, everyone involved claimed that *they* were not responsible for the spurious SIGUSR2 signals. So clearly this was a hardware issue. As hardware engineers were being brought in, some peers in the SystemTap team quietly suggested that we use the SystemTap tool to help figure out who was sending the signals. Ahh, a note of reason.

Turns out there's an off-the-shelf SystemTap script available which snags the signal.send probe point and is already instrumented to track who's triggering each signal, counting up the number of times each source triggers one.

The script showed who was triggering the signals, which was exactly the "ah-hah!" clue the technical teams wanted and needed. Shortly thereafter a fix was found and applied, and the 100% CPU busy condition was solved. With the CPUs no longer 100% busy for no particular good reason, our performance work was done. If only they were all this easy.
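The counting idea is simple enough to sketch inline. This is not the exact script from the wiki page, just a minimal stand-alone equivalent built on the signal.send probe point; it needs SystemTap plus kernel debug info (and usually root), so the snippet degrades to a notice elsewhere:

```shell
# Count who sends SIGUSR2 to whom, sampled for 10 seconds
if command -v stap >/dev/null 2>&1; then
    stap -e '
        global sends
        probe signal.send {
            if (sig_name == "SIGUSR2")
                sends[execname(), pid_name] <<< 1   # sender name, target name
        }
        probe timer.s(10) {
            foreach ([sender, target] in sends)
                printf("%s -> %s : %d\n", sender, target,
                       @count(sends[sender, target]))
            exit()
        }' || echo "stap run failed (missing kernel debuginfo or privileges?)"
else
    echo "SystemTap not installed"
fi
```

Pointing the output at the one process name dominating the counts is the "ah-hah!" moment described above.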

We documented the SystemTap steps over on a wiki page in DeveloperWorks.
My only wish out of this exercise is that the SystemTap community had even more "off-the-shelf" scripts which were certified as "safe to use" on real-life customer systems. I have to admit I get a little leery of popping scripts like this onto a customer system - so it's something we carefully tested first on a local system. A quick matrix of scripts which are tested and safe across distro version / platform combinations would be very nice to have.

Thursday, July 17, 2008

RHEL 5.2 and HPC performance hints

Building on the SLES 10 sp2 kernel build post from a couple of weeks ago, we got the equivalent RHEL 5.2 page posted under the developerWorks umbrella. Mostly the same conceptual steps, but a little different in the specifics. And of course, in the RHEL 5.2 example, we reverse the SLES 10 example by building a 4KB kernel, where "normally" the RHEL 5.2 kernel is based on 64KB pages. It's a good experiment to play with when you want to see the performance gains that emerge from leveraging larger page sizes.
We linked this in under the HPC Central wiki page where several of us are playing around with adding descriptive how-to's for HPC workloads based on practical experience.

See HPC Central, follow the link to the Red Hat Enterprise Linux page, which is where the kernel page is linked in. We plan to replicate these pieces for SUSE Linux Enterprise Server next month.
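A quick way to confirm which base page size a given kernel build is actually using (4096 for the 4KB build described above, 65536 for the default 64KB RHEL 5.2 kernel):

```shell
# Report the running kernel's base page size in bytes
getconf PAGESIZE
```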

Tuesday, July 15, 2008

The building blocks of HPC

Top 500 again. Linpack HPL. Hitting half a teraflop on a single system.

Using RHEL 5.2 on a single IBM Power 575 system, Linux was able to hit half a teraflop with Linpack. These water-cooled systems are pretty nice. Thirty-two POWER6 cores packed into a fairly dense 2U rack form factor. These systems are designed for clusters, so 14 nodes (14 single systems) can be loaded into a single rack. Water piping winds its way into each system and over the cores (we of course had to pop one open to see how things looked and worked). The systems can be loaded with 128GB or 256GB of memory. A colleague provided a nice summary of the Linpack result over on IBM's developerWorks.

For Linux, there are several interesting pieces, especially as we look at Linpack as one of the key workloads that takes advantage of easy HPC building blocks. RHEL 5.2 comes with 64KB pages, which provide easy performance gains out of the box. The commercial compilers and math libraries provide tailored and easy exploitation of the POWER6 systems. Running Linpack on clusters is the whole basis for the Top 500 workloads.

It's easy to take advantage of the building blocks in RHEL 5.2. OpenMPI in particular, along with the Infiniband stack and libraries tuned for the POWER hardware, are all included. When we fire up a cluster node, we automatically install these components:
  • openmpi including -devel packages
  • openib
  • libehca
  • libibverbs-utils
  • openssl
These building blocks allow us to take the half-a-teraflop single system Linpack result and begin running it "out-of-the-box" on multiple nodes. There are cluster experts around that I'm learning from. Lots of interesting new challenges in the interconnect technologies and configurations. In this realm, I'm learning that one of the emerging technology shifts is the 10GbE (10Gb Ethernet) interconnect vs Infiniband. Infiniband has all sorts of learning curves associated with it. Every time I try to do something with Infiniband, I find another thing to learn. It'll be interesting to see whether the 10GbE technology will be more like simply plugging in an Ethernet cable and off we go. A good summer project...
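As a sanity check when bringing up a node, the presence of those building blocks can be verified in one pass. This assumes an rpm-based distro and uses the package names listed above:

```shell
# Verify the cluster building blocks are installed on an rpm-based node
if command -v rpm >/dev/null 2>&1; then
    for pkg in openmpi openmpi-devel openib libehca libibverbs-utils openssl; do
        rpm -q "$pkg" >/dev/null 2>&1 && echo "$pkg: installed" || echo "$pkg: MISSING"
    done
else
    echo "not an rpm-based system"
fi
```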

Sunday, June 29, 2008

Building a distro kernel on Power - not so bad

This should be simple. And when you know all the steps, it is. But I was surprised how challenging it's been to find easy examples of the steps to re-build a commercially shipping "distro" kernel, in this case the SLES 10 sp1 kernel.

It's probably documented cleverly in the end user documentation - but I'm far too addicted to the ease of googling compared to the inevitable drudgery of digging through user documentation. I always wonder when the "documentation community" will simply shift to wiki pages to document, but more importantly, maintain the correctness and accessibility, of end user documentation.

For this exercise, turns out we wanted to do something simple to a SLES 10 kernel shipping on Power. In our case, we wanted to see if we could re-build the distro kernel to support the 64KB pages available in the Power6 hardware systems. For the performance angle, 64KB pages can often significantly improve the performance of applications. Normally, when working with the Linux community, we simply snag the latest mainline kernel and work with that, but in this case, we were really interested in the specific performance differences between 4KB today and the expected performance of 64KB pages on the same base.

Out of that exercise, we created a new wiki page which documents the steps to re-build the SLES 10 kernel. A peer, Peter Wong, has already documented the RHEL 5.2 steps; we're just waiting for some web site maintenance to complete on the IBM developerWorks infrastructure to get that page posted as well.

For the SLES 10 sp1 (and sp2) kernel re-build instructions, see that wiki page.

Recently Jon Tollefson was playing around with the SLES 10 sp2 kernel and found that there's a missing file in the SLES 10 sp2 kernel package, so we had to comment out a line in the kernel-ppc64.spec file (modprobe.d-qeth).

One interesting aspect is that I had thought the kernel re-build process would be precise and seamless. But a few tricks were needed to make it work.

One of them was adding control to be able to run the make on all of the CPUs seen by Linux.
%define jobs %(cat /proc/cpuinfo | grep processor | wc -l)
We've been playing recently on some of the sweet top-of-the-line POWER6 systems, in one case the Power 575 system with 32 cores. When running with SMT enabled, that's 64 CPUs that Linux controls. The kernel build goes very fast on that system.
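The %jobs macro above simply counts the schedulable CPUs so rpmbuild runs one make job per CPU. The interactive equivalent when building by hand looks like this (the vmlinux target is just illustrative):

```shell
# Count schedulable CPUs exactly as the spec macro does, then use it for a parallel build
jobs=$(grep -c '^processor' /proc/cpuinfo)
echo "building with $jobs parallel jobs"
# make -j"$jobs" vmlinux    # run inside the unpacked kernel tree
```

On the 64-CPU Power 575 mentioned above, that one macro is the difference between a leisurely serial build and a very fast one.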

Second, and there's probably a more clever way to do this, but we ended up having to unpack, modify, and re-pack the config.tar.bz2 file for the platform.
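The round trip looks roughly like the sketch below. The layout inside the tarball (config/ppc64/...) is from memory and may differ between service packs, so treat the paths as assumptions; the sketch builds a stand-in tarball first so it's self-contained:

```shell
# Stand-in round trip: pack, unpack, flip CONFIG_PPC_64K_PAGES, re-pack
mkdir -p config/ppc64
echo "# CONFIG_PPC_64K_PAGES is not set" > config/ppc64/default
tar -cjf config.tar.bz2 config       # stand-in for the shipped tarball
rm -r config

tar -xjf config.tar.bz2              # unpack
sed -i 's/^# CONFIG_PPC_64K_PAGES is not set/CONFIG_PPC_64K_PAGES=y/' \
    config/ppc64/default             # modify the config option
tar -cjf config.tar.bz2 config       # re-pack for the rpmbuild step
```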

The last interesting aspect was the built-in "kabi" protections. When we first re-built the kernel, the build failed because the change exceeded the kabi tolerance level. Very clever. I assume various kernel interfaces are flagged with kabi values which, when changed, cause the build to fail. In our case, we knew the change would affect things in the kernel, so we modified the tolerance value to allow the kernel re-build.

So. Easy to do, easy to make changes, and for a performance team, easy to minimize how much is changing from one step to the next. By starting with a known entity in the distro kernel, we make one change, verify the performance differences, and then proceed to the next change. Simple. Methodical. Straight-forward.

Monday, May 5, 2008

PowerVM Lx86. Technology that works.

PowerVM Lx86. Running Linux 32-bit x86 apps on Power systems. Slick. Super-easy with Linux. And does it work? Absolutely.

Now, as a performance analyst, I'm often asked: "Is it a performance play?" My quick answer: "Nah - not usually..."   But for everything I've tried: "It just works" - which in and of itself is pretty cool. You really want the best performance for your app? Re-compile and run it natively. Duh. You want easy access to existing x86 compiled apps? Give this product a shot. And in some cases, the performance of the translated product is just fine for the user's needs.

In essence, this product is the flip-side of Transitive's translator technology for Apple which translates older Apple Power applications to run on the new x86-based Apple systems. Check out these web sites if you missed the technology introduction several years ago:


IBM and Transitive (http://transitive.com/customers/ibm) have already introduced the second release (Ver 1.2) of the IBM PowerVM Lx86 product.

Originally discussed in the press as p-AVE (for example, see an article from http://www.it-analysis.com/), IBM's product naming wizards must have been at work with the preliminary name of IBM System p Application Virtual Environment (System p AVE). They later followed it with a newer official name under the IBM PowerVM umbrella: PowerVM Lx86, for x86 Linux applications. "p-AVE" certainly rolled off the tongue far easier than the PowerVM Lx86 name. But the PowerVM naming admit'ably fits better with the overall virtualization strengths of the Power line.

For a page full of pointers and interesting helpful hints, check out http://www-128.ibm.com/developerworks/linux/lx86/.

  • For example, the product is download'able from IBM's web site... gotta dig through the various DeveloperWorks registration pages - but you're looking for: IBM PowerVM Lx86 V1.2 (formerly System p Application Virtual Environment or p-AVE) p-ave-1.2.0.0-1.tar (8,294,400 bytes).

For a clever approach to using PowerVM Lx86, a nice demo was created which you can see on YouTube.

  • I really like this demo since it highlights one of several cases where the translation performance of the product isn't perceivable to an end user. The demo is an actual run, no tricks. As the demo narrative says, when you try to execute an x86 application on the Power system, the x86 executable is automatically recognized and the translation environment invoked.

  • The video clip goes on to show some of the highlights of the Power line where Power logical partitions can be migrated from one physical system to another. More cool stuff.
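The "automatically recognized" step is the interesting bit: on Linux, an executable's target architecture is declared right in its ELF header, which is how a loader (or a translation layer) can tell an i386 binary from a native ppc64 one. Here's a minimal sketch of that check in Python - purely illustrative, and not how Lx86 itself is implemented:

```python
import struct

# ELF e_machine values (from the ELF specification)
EM_386 = 3     # Intel 80386
EM_PPC64 = 21  # PowerPC 64-bit

def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first 20 bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 of the identification area (EI_DATA) says whether the file's
    # fields are little- or big-endian; e_machine is 16 bits at offset 18.
    endian = "<" if header[5] == 1 else ">"
    return struct.unpack_from(endian + "H", header, 18)[0]

def needs_translation(header: bytes) -> bool:
    """True if this binary is i386 and would need a translator on Power."""
    return elf_machine(header) == EM_386
```

On real systems this kind of dispatch is typically wired up through the kernel's binfmt support, which can hand any binary matching a magic byte sequence to a designated interpreter - so the user just types the command and the right thing happens.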


Another example of common product usage is in the world of graphing performance results. Users can check out a really nice set of charting libraries from Advanced Software Engineering (http://www.advsofteng.com/) available with the ChartDirector product. The executable run-time libraries are available for a variety of platforms, including Linux on i386, but alas, not for Linux on ppc64 systems. But when the i386 libraries are installed on a Power system running Linux with the additional PowerVM Lx86 product, Power users can use the graphing routines directly. Again, the perceptible performance differences are minimal, and the full function of the i386 routines is available to the Power users.

The IBM web site for PowerVM Virtualization Software offerings has a good description of the capabilities of the Linux product and the services available for software vendors to enable their apps for native execution while still exploiting the Power systems running Linux with their existing applications.

Keep in mind there are the normal obligatory footnotes and qualifications on what i386 applications can function under this product - check out the product web sites for that information.

Finally, as a performance team, we always tend to agonize over the corner cases which highlight the performance challenges of translating an application from one platform to another, and there certainly are some areas where performance can be a challenge. Java is a good example. Byte codes are first translated to executable code, and that code is then translated again to execute on the Power platform, which can make for a rather poor execution path. If your Java app is a minor piece of a bigger application (the prime example is an application installer), shrug. But whew, if you're thinking about snagging a full, comprehensive Java-based product and running it in translation mode - as opposed to verifying that the Java code runs on the Power platform - I anticipate you may be disappointed with the performance. One would've hoped that the Java world of "write once, run anywhere" would've panned out better than the "write once, test everywhere" implementation.

In the meantime, if you need easy access to x86 executables and applications on your Power systems, give this a shot.

Monday, April 7, 2008

What "one thing" do you want in the Linux kernel?

An interesting question came by this week. If I could have one free wish, one thoughtful choice, one wise selection, what would I want in the Linux kernel?

I was of course assured that my choice didn't mean I'd get it, but it was an interesting hallway poll for the day. I suspect the answers were being pulled together as input for the Linux Foundation meetings happening this week in Austin Tx... lots of people in town.

The mind races.

So back to the Linux Weather Forecast. Whew. What to choose, what to choose.
In no particular order,
  • The new Completely Fair Scheduler (CFS) - the promise of an updated scheduler is intriguing for performance teams. 'Course, on a practical level, thoughts of rather extensive regression testing keep popping up in my mind. Watching the progress of that effort is reassuring though, so this will be cool in the next revs of the distros.

  • Kernel markers - cool technology which will make it easier for tools to hook into the kernel, which of course a performance team is always interested in. We need to make it absolutely seamless and safe for a customer, on a production system, as a protected but definitely non-root user, to gather system metric information.

  • Memory use profiling - now this will be really nice. Way too many times we're asked about the memory usage of an application, which is particularly dependent on other things happening in the operating system.

  • Real-time enhancements - continued integration of the real-time work happening in parallel in the community. This is proving particularly helpful in the HPC cluster space as work continues to improve the deterministic behavior of the operating system.

  • Memory fragmentation avoidance - another longer-term project which positions the kernel for far more aggressive memory management and control of varying page sizes.

  • And numerous other choices... better filesystems in ext4 and btrfs, better virtualization and container support, better energy management, improvements in glibc, etc etc
But really? All of these are in play today. All of them are being polished, updated, with many creative minds at work.
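Some of this is usable today. The memory-use-profiling wish, for example, can be roughly approximated on current 2.6 kernels by totting up a process's /proc/&lt;pid&gt;/smaps - a hedged sketch, not a substitute for real profiling support:

```python
def total_rss_kb(smaps_text: str) -> int:
    """Sum the Rss: lines of a /proc/<pid>/smaps dump (values are in kB)."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Rss:"):
            total += int(line.split()[1])
    return total

# Live usage (needs a real /proc with smaps enabled):
#   with open("/proc/self/smaps") as f:
#       print(total_rss_kb(f.read()), "kB resident")
```

It answers "how much is resident right now?" per mapping, which is a long way from attributing memory to an application over time - hence the wish.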

So what I asked for was for the kernel and Linux community to continue strengthening the ability to help customers easily understand and improve the performance of their applications and the system. Out of the box. Across all of the components. No special adds, rpms, or kernels on their installed system.

The nice part is the Linux programmers I work with are all committed to this. So the work continues - fit, finish, and polish - and continue working with the longer term changes which are being developed.

Tuesday, March 18, 2008

Reboot? Real customers don't reboot.

To get clear, clean, and repeatable performance results, performance teams in the labs generally don't think twice about rebooting the system and starting fresh (after all, who knows what someone has done on the system - particularly those kernel programmers).

In my experience though, to balance this automatic tendency, it's important to note that re-booting a system isn't normal behavior for real-life customers.

For example, when I suggest that a customer re-boot his/her system, there's usually a perceptible pause, sometimes even a quiet chuckle. Turns out most customers are quite happy with their systems running Linux, and simply don't consider re-booting a normal part of operations. I'm not surprised, but it does mean that the tuning options practically available to customers have to be applied dynamically, not as kernel boot options.

This continues to improve in Linux with work across the operating system. The ability to control energy consumption, system resource usage, adding/removing CPUs, and adding/removing memory are all examples of cool things being worked on in the Linux community.

For the performance team, we're particularly interested in the ability to control things like SMT on and off (something needed on Power systems), the number of CPUs running, and minimizing kernel memory fragmentation.
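For the CPU-count piece, sysfs already exposes the dynamic knobs. Files like /sys/devices/system/cpu/online report the kernel's compact CPU list format ("0-3,6,8-9"), and writing 0 or 1 to a cpuN/online file toggles that CPU. A small sketch of the list parsing (Python; the live file operations are shown in comments and need root to actually change anything):

```python
def parse_cpu_list(ranges):
    """Expand a kernel CPU list like '0-3,6,8-9' into [0, 1, 2, 3, 6, 8, 9]."""
    cpus = []
    for part in ranges.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus

# On a live system:
#   online = open("/sys/devices/system/cpu/online").read()
#   print(parse_cpu_list(online))
#   # As root, take cpu2 offline dynamically - no reboot:
#   #   echo 0 > /sys/devices/system/cpu/cpu2/online
```

SMT control on Power has its own tooling layered over the same hot-plug machinery, but the sysfs interface above is the common denominator across architectures.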

Some pieces are there, some are emerging, and some are being invested in. I'll hunt some examples down and post them here in the coming days.

Tuesday, February 12, 2008

RHEL5.1 exploits POWER6

Whew. The months tick by quickly these days. Back in October 2007, Red Hat released a service update which officially supported and exploited the IBM POWER6 systems. The Red Hat exploitation code had been worked on by LTC programmers from around the world - providing a Linux OS version that handily and easily provides very nice performance for enterprise customers in production mode.

The performance teams submitted a series of SPEC.org (see the Appendix in the white paper link below for the list) and single-system Linpack publishes at the time, which have since been reviewed, approved, and published. One of the things we really like to focus on is the "whole software stack" when looking at performance. So a short white paper was written up to explain the various software pieces and how they were tuned. To me this is far more interesting than the bottom-line metrics and the various sparring leadership claims that bounce around the industry. 'Course, it's always fun to have those leadership claims while they last.

The paper - an interesting mix of marketing introductions and some technical details - focuses on the basics of leveraging the platform-specific compilers (IBM's XL C/C++ and Fortran compilers), the platform-tuned math libraries (ESSL), a Linux open source project called libhugetlbfs, MicroQuill's clever SmartHeap product, FDPR-pro - which I've posted on earlier - and IBM's Java, with all of the examples run on the IBM System p 570 system.

Following these activities and working with various customers and interested programmers, there's some interesting work going on with using the oprofile analysis tools on Linux - especially as a non-root user - and working on understanding edge cases where the performance of a workload isn't purely predictable and deterministic. What's interesting is the possible performance variability of a workload when memory being used by a program isn't regularly placed local to each executing core (a classic NUMA-type issue), especially in the case of varying page sizes supported by the Linux operating systems.
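That "is my memory local?" question can be inspected today through /proc/&lt;pid&gt;/numa_maps, which reports how many pages of each mapping sit on each NUMA node. A hedged sketch of tallying that up (illustrative only - field details vary by kernel version):

```python
import re

def pages_per_node(numa_maps_text):
    """Tally pages per NUMA node from a /proc/<pid>/numa_maps dump.

    Each mapping line carries N<node>=<pages> tokens, e.g.
    '2aaaaade000 default anon=3 dirty=3 N0=2 N1=1'.
    """
    totals = {}
    for match in re.finditer(r"N(\d+)=(\d+)", numa_maps_text):
        node, pages = int(match.group(1)), int(match.group(2))
        totals[node] = totals.get(node, 0) + pages
    return totals

# Live usage:
#   with open("/proc/self/numa_maps") as f:
#       print(pages_per_node(f.read()))
```

A heavily skewed tally for a process pinned to a core on another node is exactly the variability signature described above.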

I'll post some examples of using oprofile in different ways this month, and also an example of a Fortran program which can show varying run-times and how to address that.

Friday, February 1, 2008

Winning with Linux on new Power systems

IBM just announced two new POWER6 systems, the IBM System p 520 and IBM System p 550. In essence, p 520 has up to 4 cores of a 4.2GHz POWER6 processor and up to 64GB memory, while the p 550 has up to 8 cores and up to 256GB memory. The two systems are sweet little systems, and I recommend checking them out.

Elisabeth Stahl has a good summary blog post on the leadership publishes submitted earlier this week for these new systems. I point to her blog since she nicely has all of the requisite disclaimers and pointers to the data points submitted to SPEC.org and with SAP this week. It takes a couple of weeks for the various review processes to complete, so it'll be easier to comment on these once they've been reviewed, approved, and published. I have found that Linux programmers like to see the published files and walk through the specific details. Show me.

The cool part is Linux on Power continues to be a parity player in the leadership performance metrics for POWER6 customers (in this case using examples for Linpack, SPEC CPU2006, SPECjbb2005, and SAP). There are summaries of some of the submitted bottom-line numbers for AIX and Linux for the p 520 and the p 550 on IBM's website.

There's a short paper which should be out soon that discusses the simple steps and software products that can be used on Linux on Power to achieve the best performance for POWER6 using Linux. It's based on similar publishes done last October and uses actual results that are published on the SPEC.org website.
One of many bloggers.

Bill Buros

Bill leads an IBM Linux performance team in Austin Tx (the only place really to live in Texas). The team is focused on IBM's Power offerings (old, new, and future) working with IBM's Linux Technology Center (the LTC). While the focus is primarily on Power systems, the team also analyzes and improves overall Linux performance for IBM's xSeries products (both Intel and AMD), driving performance improvements which are both common for Linux and occasionally unique to the hardware offerings.

Performance analysis techniques, tools, and approaches are nicely common across Linux. Having worked for years in performance, there are still daily reminders of how much there is to learn in this space, so in many ways this blog is simply another vehicle in the continuing journey to becoming a more experienced "performance professional". One of several journeys in life.

The Usual Notice

The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions, try as I might to influence them.