Improving performance on Linux (mostly Power)

gcc performance - getting better all of the time

noreply@blogger.com (Bill Buros) — Mon, 06 Jun 2011 14:48:00 +0000

There's quite a bit going on in the world of Linux on Power, where teams have been focused on improvements for gcc performance. Lately, a series of articles has been published on developerWorks which nicely highlights the performance gains that gcc (packaged in the Advance Toolchain) provides over the gcc packaged with the Linux operating system.

Two articles are available which dive into performance gains across a number of workloads embedded in the SPECcpu2006 suite. The approach is simple. Use gcc as bundled with the version and release of the operating system, measure the performance. Then install the Advance Toolchain (a couple of rpms), change the path to gcc, re-build, re-run, and compare the performance.

Advance Toolchain 3.0 performance improvements

Advance Toolchain 4.0 performance improvements

Naturally, your mileage will vary.
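
The measure, switch, re-build, re-run, compare loop described above can be sketched in a few lines of shell. This is only an illustration, not the method used in the articles: the /opt/at4.0 install prefix, the Makefile targets, and the ./workload binary in the comments are assumptions, and speedup_pct is a hypothetical helper.

```shell
#!/bin/sh
# speedup_pct BASE_SECONDS NEW_SECONDS
# Prints how much faster the second run was, as a percentage gain in
# rate: (base / new - 1) * 100. A negative value means a slowdown.
speedup_pct() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base / new - 1) * 100 }'
}

# Typical flow (paths, targets, and the workload are placeholders):
#   make clean && make CC=gcc                  # distro compiler
#   time ./workload                            # note the elapsed seconds
#   make clean && make CC=/opt/at4.0/bin/gcc   # assumed toolchain path
#   time ./workload
#   speedup_pct 10 8
```

Comparing the same workload, on the same system, with only the compiler path changed keeps the experiment down to one variable.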

(this is a re-post of an entry from a new "Think Power Linux" community being developed under IBM's DeveloperWorks realm)

SC10 - supercomputing conference

noreply@blogger.com (Bill Buros) — Fri, 19 Nov 2010 16:43:00 +0000

Just got back from Supercomputing 2010 (SC10) in New Orleans.

The annual conference provides a good perspective of the progress of "super computers", across universities, researchers, and industry. In recent years, the implementation of supercomputing has been slowly evolving towards a mix of compute core technologies, emerging more strongly along the lines of heterogeneous computing.

There was of course too much to see and participate in each day. The two areas where I spent most of my time were catching up on the progress of GPUs (and the variants) and noticing the surprising emerging focus on HPC clouds. I'll post more on those two areas in the coming days.

There was quite a bit of discussion around Exascale computing. Naturally, with so many researchers at the conference, there were numerous perspectives on what that even meant. Exaflops? 1000 times faster? Bigger? Number of cores/GPUs? Power consumption, etc etc. If nothing else, more good fodder for research and areas of discussion. Which is one of the reasons everyone gets together. So all good. Clearly though, there are a lot of very serious challenges coming down the road for the vision of super-computers 8-10 years out.

As usual, the latest Top 500 list was announced at the conference. The announcement was interesting given a pervasive feeling across the conference that measuring supercomputers with a small and relatively trivial program like Linpack was clearly outdated. It was funny how often researchers and speakers would voice annoyance or frustration at the continuing featured aspect of Linpack. I suspect this is not a new sentiment, but I was struck by the number of specific references at this conference.

A while back, HPC Challenge (as a suite of benchmarks) was created to provide a more comprehensive measurement of the many aspects of a super computer. It also provides a number of awards on the varying aspects of super-computing, essentially awarding gold, silver, and bronze awards in each category. I may not have been paying attention well enough, but I didn't get the impression that HPC Challenge was a focus across the conference.

At the conference a new "Top 500" benchmark suite was launched. The new suite, graph500, was introduced and the first results are listed at graph500.org. This new benchmark suite is intended to focus on areas of data intensive computing, another critical aspect of supercomputing. Indeed, it's often the amount of data being processed, consumed, and in many cases transformed into visualization which represents one of several awe-inspiring sides of supercomputing. In fact, this is the killer side of HPC cloud computing - getting data to and from the compute resources in the cloud.

Anyway, came back with over 50 papers from the Proceedings and over 30 Tutorials. Plenty to dig through, discuss, and get more insights. Will work to post more thoughts as interesting pieces are uncovered.

So really, does the Advance Toolchain help performance?

noreply@blogger.com (Bill Buros) — Fri, 12 Nov 2010 21:22:00 +0000

We're often asked whether - and by how much - the Advance Toolchain actually helps performance for applications running on the various distros (RHEL and SLES) on POWER systems.

The Advance Toolchain of course is a set of updated rpms which provide an updated gcc compiler, processor-tuned libraries, and a number of more current tools over what is standard in the distro itself. Check out one of the README files available for more details.

The ability to easily flip to a newer "toolchain" for POWER7 systems has been particularly helpful for many applications. An article was recently completed which demonstrates the relative performance gains when leveraging the latest Advance Toolchain (version 3.0-1) from the University of Illinois over the gcc which comes packaged with each distro release.

The article "Advance Toolchain performance improvements" provides the details and graphs of component-by-component breakdowns of engineering runs of SPECcpu2006® for integer and floating point workloads. The relative performance gains and losses are graphed which provides a quick view of the possibilities of using these libraries and newer gcc.

One of the key highlights is that the graphs very nicely demonstrate the repeating performance perspective of "Well, it depends...". Not all workloads will benefit from the libraries, but many do!

Indeed, a key advantage of the Advance Toolchain is the continuing focus, updates, optimizations, and fixes which provide users with the latest technologies in a form which can be serviced and supported by IBM.

SPEC® and the benchmark names SPECint® and SPECfp® are registered trademarks of the Standard Performance Evaluation Corporation.

IBM and the Jeopardy Challenge

noreply@blogger.com (Bill Buros) — Tue, 15 Jun 2010 18:21:00 +0000

Check out this YouTube video. Nice introduction on having a super-computer competing in a series of example Jeopardy game shows.

This is just one of numerous fun and still seriously challenging projects being worked these days. The IBM Research teams are amazing. They have found some very interesting performance challenges. The project drives advanced technologies, product improvements, system improvements, performance tool improvements, but most of the work is in the realm of demonstrating complex natural language processing in a time-constrained answer and question world.

Busy time for 2.6.32 kernel on Power7

noreply@blogger.com (Bill Buros) — Fri, 21 May 2010 13:48:00 +0000

As things move forward with the launches of IBM's POWER7 systems this year, there's a flurry of activities on many fronts. In particular, we've been pretty busy with the 2.6.32 kernel in many places. Red Hat has a new RHEL Version 6 in beta, the BlueBioU project of course is using 2.6.32, and Novell has announced the latest service pack for SLES 11 which is based on 2.6.32.

Red Hat's RHEL 6 is already in beta (from last month):

* http://press.redhat.com/2010/04/21/red-hat-enterprise-linux-6-beta-available-today-for-public-download/

Novell just announced their latest service pack for SLES 11 which among other things upgrades the kernel to 2.6.32:

* http://www.novell.com/promo/suse/sle11sp1.html

A key proof point at BlueBioU continues to be worked on collaboratively across many teams. Being based on the 2.6.32 enables the easy availability of a number of latest Linux technologies.

* http://bluebiou.rice.edu/

For some example performance FAQs emerging with the work on the 2.6.32 kernel base, check out:

* http://www.ibm.com/developerworks/wikis/display/LinuxP/Performance+FAQs

To check on current questions being posed see the Linux on Power architecture forum at:

* http://www.ibm.com/developerworks/forums/forum.jspa?forumID=375

Some recent questions address the holes in CPU numbering on POWER7 when SMT=2 or SMT=1 is used, how to control the DSCR settings with the ppc64_cpu command, various tools questions, how to dynamically control the SMT settings on a system, page sizes on Linux, etc. Some of the questions are driving functional updates to the commands or approaches in development, so asking a leading question there is always a good thing.
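
The "holes" in CPU numbering are easier to picture with a small sketch. It assumes a simple model in which every core owns a fixed block of consecutive logical CPU numbers (four per core on POWER7) and only the first SMT-mode threads of each block stay online; online_cpus is a hypothetical helper, not part of any tool. On a real system, the ppc64_cpu command from powerpc-utils reports and changes the mode (ppc64_cpu --smt, ppc64_cpu --smt=2) and handles the DSCR setting (ppc64_cpu --dscr).

```shell
#!/bin/sh
# online_cpus CORES THREADS_PER_CORE SMT_MODE
# Prints the logical CPU numbers expected to stay online, assuming each
# core owns THREADS_PER_CORE consecutive CPU numbers and only the first
# SMT_MODE of them are active.
online_cpus() {
    cores=$1 threads=$2 smt=$3
    out=""
    core=0
    while [ "$core" -lt "$cores" ]; do
        t=0
        while [ "$t" -lt "$smt" ]; do
            out="$out $((core * threads + t))"
            t=$((t + 1))
        done
        core=$((core + 1))
    done
    # Unquoted on purpose: drops the leading space.
    echo $out
}
```

For two cores in SMT=2 mode, online_cpus 2 4 2 lists CPUs 0, 1, 4, and 5, so CPUs 2, 3, 6, and 7 are the holes.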

open-source - patents - and what about performance?

noreply@blogger.com (Bill Buros) — Thu, 08 Apr 2010 02:44:00 +0000

Interesting debates arise around us every day. The latest "patent pledge" excitement on the web is interesting to watch. I tried to dig through and read the details of the various lists of patents, but the eyes glaze and I have to wonder who's pulling whose chain.

With some mild interest, I saw the Linux Foundation blog post this evening re-iterating the pledge from 2005 - quoting a statement from Dan Frye.

Linux Foundation - open source patent pledge

Now, I probably don't really count as an unbiased observer, working directly in Dan Frye's organization, but I will observe that in our day-to-day interactions with Linux and customers, our focus is promoting open-source solutions every day. Some of us even take some quiet personal delight in catching up and passing classic IBM proprietary solutions, but our real focus is getting customers up, running, and happy.

Admittedly, my world is focused primarily on helping customers tune and improve the mostly open-source based deployments of fairly complex applications on Linux on POWER systems. In day to day work, I've been most impressed with the varied partners (both open-source and proprietary) that we implicitly and explicitly work with. The dedication to making things "just work" and "then work nicely" strikes me as the path that customers expect us to embrace.

The process of open-source and improving performance is usually a challenging process. We've got lists of cool performance things I'd love to see implemented. The gate to getting these pieces implemented is not the patents, it's the process of getting consensus and convincing the "community" to adopt something that'll work smoothly across the platforms. Once you have that, we've enabled our customers to have the pieces they need to implement, tune, understand, and optimize their applications and software stacks. And all in all, it's reassuring to see the calm reassurance of IBM's commitments and continued "work nice" community approach being reinforced. It's why many of us greatly prefer working in the Linux space.

Now, back to ganglia and CPU utilization. Something's not quite right there. More on that next week.

Busy month for Linux - NCSA picks Linux for Blue Waters

noreply@blogger.com (Bill Buros) — Wed, 24 Mar 2010 12:17:00 +0000

Along the lines of interesting - but not particularly surprising - trends in the industry, we notice that the NCSA has recently endorsed the continuing evolution of HPC workloads towards a Linux base.

Check out the NCSA article titled: "Linux selected as operating system for Blue Waters".

This is particularly encouraging for not only the operating system and in our case the POWER7 platform, but also for the various software stack products that are being developed, improved, enhanced, and deployed in high-demand HPC environments across many industries.

Massive HPC clusters on the scale of projects like Blue Waters are exciting just looking at all of the technologies being worked on. There's a nice Blue Waters project newsletter available which shows the breadth of activities happening around this project.

One of our real-life challenges is to improve the whole food-chain of software products which enable easier deployment of HPC applications on the Linux base and the POWER7 platform. Collaborative projects like the Blue BioU project can provide an expanding community of Linux HPC users access to open-source based POWER7 clusters.

IBM donates Linux-based POWER7 super-computer

noreply@blogger.com (Bill Buros) — Mon, 08 Mar 2010 15:19:00 +0000

Rice University has deployed a sweet POWER7 based cluster based on Linux. The brand new IBM 750 systems in the cluster are based on Linux, Maui, Torque, openmpi, Infiniband, 10Gb Ethernet, the Advance Toolchain, IBM compilers and IBM's ESSL math libraries.

Check out http://bluebiou.rice.edu/.

Over the coming weeks, we'll describe numerous collaborative efforts underway to develop, deploy, leverage, and execute workloads on the cluster.

One of the exciting aspects of the project is the team at Rice University is well versed in managing and deploying open-source-based production-level HPC clusters used by hundreds of students and researchers.

The initial goal is to integrate the cluster into an existing Maui/Torque infrastructure deployed and in use at Rice University. From there, the infrastructure is to be extended to give researchers control of the POWER7 SMT hardware threads on each node, the number of 16MB huge pages if desired, energy optimization techniques, and POWER tuning techniques.
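
As a rough sketch of the huge page piece, a job prologue would first need to work out how many 16MB pages to reserve for a given amount of memory. The helper name hugepages_needed and the idea of doing this in a prologue script are assumptions for illustration; nothing here describes the scripts actually deployed at Rice.

```shell
#!/bin/sh
# hugepages_needed SIZE_MB [PAGE_MB]
# Prints how many huge pages are needed to cover SIZE_MB, rounding up.
# PAGE_MB defaults to 16, the huge page size mentioned above.
hugepages_needed() {
    size_mb=$1
    page_mb=${2:-16}
    echo $(( (size_mb + page_mb - 1) / page_mb ))
}

# As root, the reservation itself goes through the standard sysctl file:
#   echo "$(hugepages_needed 4096)" > /proc/sys/vm/nr_hugepages
```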

And here comes POWER7 !

noreply@blogger.com (Bill Buros) — Thu, 18 Feb 2010 16:58:00 +0000

While it's been some time since I last posted much, time flies when you're working on a new generation of hardware and systems.

The latest POWER7 based systems were just recently announced. Naturally, Linux is supported. Linux already exploits the POWER7 technologies and more is coming. For example, SLES 11 has the POWER7 enabling available and was used for numerous standard benchmarks used when we launch systems and operating system updates. For some quick performance data, check out this link.

IBM Power 750 Performance Data

In the days and weeks coming, several of us will be collaborating to post insights, hints, and tips on using Linux to exploit the capabilities of IBM's latest POWER-based systems. The system capabilities of what's being delivered and what's coming down the pipeline are pretty impressive.

For more links, see Linux Performance.

For a taste of what's coming, see IBM's Statement of Direction on IBM Power Systems high-end servers.

Hey! Who's stealing my CPU cycles?!

noreply@blogger.com (Bill Buros) — Wed, 22 Oct 2008 15:45:00 +0000

I hear this every now and then on the Power systems from customers, programmers, and even peers. In the more recent distro versions, there's a new "st" column in the CPU metrics which tracks the usage of "stolen" CPU cycles, from the perspective of the CPU being measured. This "steal" column has been around for a while, but the most recent service packs of RHEL 5.2 and SLES 10 sp2 have the latest fixes which display the intended values - so the values are getting noticed more.

I believe this "cpu cycle stealing" all came into being when things like Xen were being developed and the programmers wanted a way to account for the CPU cycles which were allocated to another partition. I suspect the programmers were looking at it from the perspective of "my partition", where something devious and nefarious was daring to steal my CPU cycles. Thus the term "stolen CPU cycles". Just guessing though.

This "steal" term is a tad unfortunate. It's been suggested that a more gentle term of "sharing" would be preferred for customers. But digging around the source code I found the term "steal" is fairly pervasive. And what's in the code, tends to end up in the man pages. Ah well.

With Power hardware, there's a mode where the two hardware threads are juggled by the Linux scheduler. This is implemented via cpu pairs (for example, cpu0 and cpu1) which represent the schedule'able individual hardware threads running on the single processor core. This is the SMT mode (simultaneous multi-threaded) on Power.

The term "hardware thread" is with respect to the processor core. Each processor core can have two active hardware threads. Software threads and software processes are scheduled on the processor cores by the operating system via the schedule'able CPUs which correspond to the two hardware threads.

In the SMT realm, each SMT hardware thread can be considered a sibling (in the context of brother or sister) of each other, running on a processor core. So if the two hardware threads are flat-out-busy with work from the operating system and evenly balanced, then each of the corresponding CPUs being scheduled are generally getting 50% of the processor core's cycles.

From a performance perspective, this has tremendous advantages because the processor core can flip between the hardware threads as soon as one thread hits a short-wait for things like memory accesses. Essentially the processor core can fetch the instructions and memory accesses simultaneously for the two hardware threads which improves the efficiency of the core.

In days of old, each CPU's metrics were generally based on the premise that a CPU could get to 100% user busy. Now, the new steal column can account for the processor cycles being shared by the two SMT sibling threads, not to mention additional CPU cycles being shared with other partitions. It's still possible for an individual CPU to go to 100% user busy, while the SMT sibling thread is idle.
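
For the curious, the steal value that the tools display comes from the per-CPU counters in /proc/stat, where the eighth counter on each "cpu" line is steal. Here is a minimal sketch, assuming two samples of the same line taken a moment apart; steal_pct is a hypothetical helper, and real tools such as vmstat and top do their own accounting.

```shell
#!/bin/sh
# steal_pct "FIRST_SAMPLE" "SECOND_SAMPLE"
# Each argument is one "cpu" or "cpuN" line from /proc/stat. Fields 2-9
# of the line are user, nice, system, idle, iowait, irq, softirq, steal.
# Prints steal as a percentage of all ticks between the two samples.
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9; i++) total += $i
          t[NR] = total; s[NR] = $9 }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (s[2] - s[1]) / dt
        }'
}

# Example use on a live system:
#   a=$(grep '^cpu0 ' /proc/stat); sleep 1; b=$(grep '^cpu0 ' /proc/stat)
#   steal_pct "$a" "$b"
```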

For example, in the vmstat output below, the rightmost CPU column is the steal column. On an idle system, this value isn't very meaningful.

# vmstat 1
procs -----------memory------------ ---swap-- ----io----- -system-- ------cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 0  0      0 14578432 408768 943616    0    0     0     0    2    5  0  0 100  0  0
 0  0      0 14578368 408768 943616    0    0     0     0   25   44  0  0 100  0  0
 0  0      0 14578432 408768 943616    0    0     0    32   12   44  0  0 100  0  0
 0  0      0 14578432 408768 943616    0    0     0     0   21   45  0  0 100  0  0

In the next example, pushing do-nothing work on every CPU... (in this case a four-core system, SMT was on, so 8 CPUs were available...), we'll see the vmstat "st" column quickly get to the point where the CPU cycles on average are 50% user and 50% steal.

Try using "top", then press the "1" key to see more easily what's happening on a per-CPU basis.

while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
# vmstat 1
procs -----------memory------------ ---swap-- ----io----- -system-- ------cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 8  0      0 14574400 408704 943488    0    0     0     0   26   42 50  0   0  0 50
 8  0      0 14574400 408704 943488    0    0     0     0   11   34 50  0   0  0 50
 8  0      0 14574400 408704 943488    0    0     0     0   26   42 50  0   0  0 50
 8  0      0 14574656 408704 943488    0    0     0     0   10   34 50  0   0  0 50
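
When watching a longer run, it can help to boil vmstat output like the above down to one number per column. A small sketch, assuming the standard two-line vmstat header; avg_vmstat_column is a hypothetical helper name.

```shell
#!/bin/sh
# avg_vmstat_column NAME
# Reads vmstat output on stdin, finds the column whose header is NAME
# (for example "st" or "us"), and prints the rounded average of that
# column over all data rows.
avg_vmstat_column() {
    awk -v name="$1" '
        col == 0 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        $1 ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n > 0) printf "%.0f\n", sum / n }'
}

# Example use on a live system:
#   vmstat 1 10 | avg_vmstat_column st
```

Note that the first data row vmstat prints is an average since boot, so for short runs it may be worth dropping that row before averaging.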

For customers and technical people who were used to seeing their CPUs up to 100% user busy, this can be... disconcerting... but it's now perfectly normal.. even expected..

I just wish we could distinguish the SMT sharing of CPU cycles, and the CPU cycles being shared with other partitions.

For more details on the process of sharing the CPU cycles, especially when the CPU cycles are being shared between partitions, check out this page where we dive into more (but not yet all) of the gory details...

Measuring Stolen CPU cycles

Linux on Power.. links and portals..

noreply@blogger.com (Bill Buros) — Wed, 15 Oct 2008 12:23:00 +0000

Several of us were playing around recently counting up how many "portals" we could find for information in and around Linux for Power systems. On the performance side, we were specifically interested in seeing whether there was information "out there" that we could leverage that we weren't really aware of. We actually found a lot of performance information which I'll try to highlight more in the weeks coming.

For the portals, we did find an amazing assortment of web pages available. We hit classic marketing portals, hardware and performance information, generic Linux information, technical portals, a couple of old and outdated portals, various download sites for added-value items, lots of IBM forums on developerWorks, IBM Redbooks (always good information), and pointers to wiki pages spanning a number of subjects.

The marketing and customer teams generally point to the IBM Linux page as the primary entry point (portal). The five web tabs out there (Overview, Getting Started, Solutions, About Linux, and Resources) can get the reader to all sorts of official information.

For our list of web sites and pages, rather than just file the list in another email bucket never to be seen again, we created an index page called Quick Links to keep track of what we wanted to hunt down and get updated and more current. We naturally didn't want to call it another portal. 'Course, now we're hunting down subject-matter experts (aka volunteers) to help update the various wiki pages, especially under the developerWorks Linux for Power architecture wiki. We're particularly interested in providing more of the practical details, one example being the HPC Central - Red Hat page where a series of technical wiki pages are available.

Another interesting observation is seeing IBM's classic reliance on the developerWorks forums which we listed on our Quick Links index page. The Linux community is far more used to mailing lists for interactions, questions, and development issues. Forums are fine for questions and answers, but in our mind many of the forums are rarely used, even if the technology or product covered by each forum is helpful and useful. I would expect that we'll start seeding the forums with answers to questions we get from customers, developers, and peers. Help nudge things along. Which will give us more places to link to and get the practical questions answered.

[edit'ed 10/30/2008 - we made the Quick Links page the LinuxP home page]

SystemTap SIGUSR2 tracing

noreply@blogger.com (Bill Buros) — Thu, 09 Oct 2008 18:38:00 +0000

A story of how people can get all excited about "performance issues". And how simple and easy tools can help figure out the root cause.

I was involved recently in an issue with Java performance where a system had "flipped out" and gone to 100% CPU busy across all of the processor cores after a software upgrade. Numerous technical people got involved in a flurry of activity, and of course performance teams were involved because systems tend to perform poorly when the CPUs are busy 100% of the time.

Various people dug into things and determined that something was generating way-too-many SIGUSR2 signals to Java, which apparently was driving the Java engine into constant garbage collection. Many many many Java threads running. It was a sight to behold.

Naturally, everyone involved claimed that *they* were not responsible for the spurious SIGUSR2 signals. So clearly this was a hardware issue. As hardware engineers were being brought in, some peers in the SystemTap team quietly suggested that we use the SystemTap tool to help figure out who was sending the signals. Ahh, a note of reason.

Turns out there's an off-the-shelf SystemTap script available which snags the Signal.send entry point, and is already instrumented to track who's triggering the signal, and then count up the number of times each source triggers a signal.
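
To give a flavor of what such a script looks like, here is a minimal sketch wrapped in a shell helper. It is not the off-the-shelf script we used; it is my own illustration built on the signal.send probe and its sig_name, sig_pid, and pid_name variables from the SystemTap signal tapset. The helper name print_sigusr2_probe and the output layout are made up for this example.

```shell
#!/bin/sh
# print_sigusr2_probe
# Writes a small SystemTap script to stdout. The script counts SIGUSR2
# signals per (sender, target) pair and prints the totals, largest
# first, when the run is stopped.
print_sigusr2_probe() {
    cat <<'EOF'
global senders

probe signal.send {
    if (sig_name == "SIGUSR2")
        senders[execname(), pid(), pid_name, sig_pid]++
}

probe end {
    foreach ([src, spid, dst, dpid] in senders-)
        printf("%-16s %6d -> %-16s %6d  count=%d\n",
               src, spid, dst, dpid, senders[src, spid, dst, dpid])
}
EOF
}

# Example use (as root, on a test system first):
#   print_sigusr2_probe > sigusr2.stp
#   stap sigusr2.stp        # Ctrl-C to stop and print the counts
```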

The script showed who was triggering the signals, which was exactly the "ah-hah!" clue the technical teams wanted and needed. Shortly thereafter a fix was found and applied, and the 100% CPU busy condition was solved. With the CPUs no longer 100% busy for no particular good reason, our performance work was done. If only these were all this easy.

We documented the SystemTap steps over on a wiki page in DeveloperWorks.

http://www.ibm.com/developerworks/wikis/display/LinuxP/SystemTap+SIGUSR2+tracing

My only wish out of this exercise was that the SystemTap community had even more "off-the-shelf" scripts which were certified as "safe to use" on customer real-life systems. I have to admit I get a little leery of popping scripts like this on a customer system - so it's something we carefully tested first on a local system. A quick matrix of scripts which are tested and safe on Distro versions / platform combinations would be very nice to have.

RHEL 5.2 and HPC performance hints

noreply@blogger.com (Bill Buros) — Thu, 17 Jul 2008 12:10:00 +0000

Building on the SLES 10 sp2 kernel build post from a couple of weeks ago, we got the equivalent RHEL 5.2 page posted under the developerWorks umbrella. Mostly the same conceptual steps, but a little different in the specifics. And of course, in the RHEL 5.2 example, we reverse the example from SLES 10 by building a 4KB kernel where "normally" the RHEL 5.2 kernel is based on the 64KB pages. It's a good experiment to play with when you want to see the performance gains that emerge from leveraging larger page sizes.

Re-building a RHEL 5 kernel for Power
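
Before and after an experiment like this, it's worth confirming which base page size the running kernel actually uses. A quick sketch using the standard getconf utility; page_size_kb is a hypothetical helper name.

```shell
#!/bin/sh
# page_size_kb [BYTES]
# Prints a page size in KB. With no argument it asks the running
# kernel through getconf.
page_size_kb() {
    bytes=${1:-$(getconf PAGESIZE)}
    echo $((bytes / 1024))
}

# Example use: a 64KB-page kernel should print 64, a 4KB-page kernel 4.
#   page_size_kb
```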

We linked this in under the HPC Central wiki page where several of us are playing around with adding descriptive how-to's for HPC workloads based on practical experience.

See HPC Central, follow the link to the Red Hat Enterprise Linux page, which is where the kernel page is linked in. We plan to replicate these pieces for SUSE Linux Enterprise Server next month.

The building blocks of HPC

noreply@blogger.com (Bill Buros) — Tue, 15 Jul 2008 12:04:00 +0000

Top 500 again. Linpack HPL. Hitting half a teraflop on a single system.

Using RHEL 5.2 on a single IBM Power 575 system, Linux was able to hit half a teraflop with Linpack. These water-cooled systems are pretty nice. Thirty-two POWER6 cores packed into a fairly dense 2U rack form factor. These systems are designed for clusters, so 14 nodes (14 single systems) can be loaded into a single rack. Water piping winds its way into each system and over the cores (we of course had to pop one open to see how things looked and worked). The systems can be loaded with 128GB or 256GB of memory. A colleague provided a nice summary of the Linpack result over on IBM's developerWorks.

For Linux, there are several interesting pieces, especially as we look at Linpack as one of the key workloads that takes advantage of easy HPC building blocks. RHEL 5.2 comes with 64KB pages. The 64KB pages provide easy performance gains out of the box. The commercial compilers and math libraries provide the tailored and easy exploitation of the POWER6 systems. Running Linpack on clusters is the whole basis for the Top 500 workloads.

It's easy to take advantage of the building blocks in RHEL 5.2. OpenMPI in particular, the Infiniband stack, libraries tuned for the POWER hardware are all included. When we fire up a cluster node, we automatically install these components.

openmpi including -devel packages
openib
libehca
libibverbs-utils
openssl
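
A simple way to check a freshly installed node against a list like the one above is to ask rpm about each package. This is a sketch only: missing_packages and the PKG_QUERY variable are hypothetical names, and openmpi-devel is my reading of "including -devel packages".

```shell
#!/bin/sh
# Command used to ask whether a package is installed. "rpm -q" is the
# usual choice on RHEL; it is a variable so the check can be dry-run.
PKG_QUERY=${PKG_QUERY:-rpm -q}

# missing_packages PKG...
# Prints, one per line, each package the query command does not find.
missing_packages() {
    for pkg in "$@"; do
        $PKG_QUERY "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}

# Example use:
#   missing_packages openmpi openmpi-devel openib libehca \
#       libibverbs-utils openssl
```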

These building blocks allow us to take the half-a-teraflop single system Linpack result and begin running it "out-of-the-box" on multiple nodes. There are cluster experts around that I'm learning from. Lots of interesting new challenges in the interconnect technologies and configurations. In this realm, I'm learning that one of the technology shifts emerging is the 10GbE (10Gb Ethernet) interconnect vs Infiniband. Infiniband has all sorts of learning curves associated with it. Every time I try to do something with Infiniband, I'm finding another thing to learn. It'll be interesting to see whether the 10GbE technology will be more like simply plugging in an Ethernet cable and off we go. A good summer project...

Building a distro kernel on Power - not so bad

noreply@blogger.com (Bill Buros) — Sun, 29 Jun 2008 18:58:00 +0000

This should be simple. And when you know all the steps, it is. But I was surprised how challenging it's been to find easy examples of the steps to re-build a commercially shipping "distro" kernel, in this case the SLES 10 sp1 kernel.

It's probably documented cleverly in the end user documentation - but I'm far too addicted to the ease of googling compared to the inevitable drudgery of digging through user documentation. I always wonder when the "documentation community" will simply shift to wiki pages to document, but more importantly, maintain the correctness and accessibility, of end user documentation.

For this exercise, turns out we wanted to do something simple to a SLES 10 kernel shipping on Power. In our case, we wanted to see if we could re-build the distro kernel to support the 64KB pages available in the Power6 hardware systems. For the performance angle, 64KB pages can often significantly improve the performance of applications. Normally, when working with the Linux community, we simply snag the latest mainline kernel and work with that, but in this case, we were really interested in the specific performance differences between 4KB today and the expected performance of 64KB pages on the same base.

Out of that exercise, we created a new wiki page which documented the steps to re-build the SLES 10 kernel. A peer, Peter Wong, has already documented the RHEL 5.2 steps, we're just waiting for some web site maintenance to complete on the IBM developerWorks infrastructure to get that page posted as well.

For the SLES 10 sp1 (and sp2) kernel re-build instructions, see

http://www.ibm.com/developerworks/wikis/display/LinuxP/Re-building+a+SLES+10+kernel+for+Power

Recently Jon Tollefson was playing around on the SLES 10 sp2 kernel and found that there's a missing file in the SLES 10 sp2 kernel package, so we had to comment out a line in the kernel-ppc64.spec file (modprobe.d-qeth).

One interesting aspect is I had thought the kernel re-build process would be precise and seamless. But there were a few tricks that had to be done to make it work.

One of them was adding control to be able to run the make on all of the CPUs seen by Linux.

%define jobs %(cat /proc/cpuinfo | grep processor | wc -l)

We've been playing recently on some of the sweet top-of-the-line POWER6 systems, in one case the Power 575 system with 32 cores. When running with SMT enabled, that's 64 CPUs that Linux controls. The kernel build goes very fast on that system.
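
Outside of a spec file, the same CPU count can drive a plain make. A minimal sketch, where cpu_count is a hypothetical helper and the optional file argument exists only so the count can be tried against a saved copy of /proc/cpuinfo:

```shell
#!/bin/sh
# cpu_count [CPUINFO_FILE]
# Counts the "processor" entries, one per CPU that Linux schedules on.
# Reads /proc/cpuinfo unless another file is given.
cpu_count() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Example use:
#   make -j"$(cpu_count)"
```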

Second, and there's probably a more clever way to do this, but we ended up having to unpack, modify, and re-pack the config.tar.bz2 file for the platform.
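
The unpack, modify, re-pack step might look roughly like the sketch below. The details are assumptions: I'm guessing the change is flipping CONFIG_PPC_64K_PAGES on, the config/ppc64/default path inside the tarball is a placeholder, and enable_config_option is a hypothetical helper. Check the wiki page for the steps we actually followed.

```shell
#!/bin/sh
# enable_config_option FILE OPTION
# Rewrites FILE so OPTION is set to "y", whether it was previously
# "# OPTION is not set", set to some other value, or absent.
enable_config_option() {
    file=$1 opt=$2
    tmp=$(mktemp) || return 1
    # grep exits 1 when it selects no lines; treat only that as success.
    grep -v -e "^# $opt is not set\$" -e "^$opt=" "$file" > "$tmp" \
        || [ $? -eq 1 ] || return 1
    echo "$opt=y" >> "$tmp"
    # Copy back in place so the file keeps its original permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example flow (the path inside the tarball is a placeholder):
#   tar xjf config.tar.bz2
#   enable_config_option config/ppc64/default CONFIG_PPC_64K_PAGES
#   tar cjf config.tar.bz2 config
```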

The last interesting aspect was the built-in "kabi" protections. When we first re-built the kernel, the build failed because our change exceeded the kabi tolerance level. Very clever. I assume various kernel interfaces are flagged with KABI values, which when changed, cause the build to fail. In our case, we knew it would change things in the kernel, so we modified the tolerance value to allow for the kernel re-build.

So. Easy to do, easy to make changes, and for a performance team, easy to minimize how much is changing from one step to the next. By starting with a known entity in the distro kernel, we make one change, verify the performance differences, and then proceed to the next change. Simple. Methodical. Straight-forward.

PowerVM Lx86. Technology that works.

noreply@blogger.com (Bill Buros) — Tue, 06 May 2008 02:17:00 +0000

PowerVM Lx86. Running Linux 32-bit x86 apps on Power systems. Slick. Super-easy with Linux. And does it work? Absolutely.

Now, as a performance analyst, I'm often asked: "Is it a performance play?" My quick answer: "Nah - not usually..." But for everything I've tried: "It just works" - which in and of itself is pretty cool. You really want the best performance for your app? Re-compile and run it natively. Duh. You want easy access to existing x86 compiled apps? Give this product a shot. And in some cases, the performance of the translated product is just fine for the user's needs.

In essence, this product is the flip-side of Transitive's translator technology for Apple which translates older Apple Power applications to run on the new x86-based Apple systems. Check out these web sites if you missed the technology introduction several years ago:

http://www.transitive.com/customers/apple and

http://www.apple.com/rosetta/

IBM and Transitive (http://transitive.com/customers/ibm) have already introduced the second release (Ver 1.2) of the IBM PowerVM Lx86 product.

Originally discussed in the press as p-AVE (for example, see an article from http://www.it-analysis.com/), IBM's product naming wizards must have been at work with the preliminary name of IBM System p Application Virtual Environment (System p AVE). Later they followed it with a newer official name under the IBM PowerVM umbrella as PowerVM Lx86 for x86 Linux applications. "p-AVE" certainly rolled off the tongue far easier than the PowerVM Lx86 name. But the PowerVM naming admittedly fits better with the overall virtualization strengths of the Power line.

For a page full of pointers and interesting helpful hints, check out http://www-128.ibm.com/developerworks/linux/lx86/.

For example, the product is download'able from IBM's web site... gotta dig through the various DeveloperWorks registration pages - but you're looking for: IBM PowerVM Lx86 V1.2 (formerly System p Application Virtual Environment or p AVE) p-ave-1.2.0.0-1.tar (8,294,400 bytes).

For a clever approach to using PowerVM Lx86, a nice demo was created which you can see on YouTube.

I really like this demo since it highlights one of several cases where the translation performance of the product isn't perceivable to an end user. The demo is an actual run, no tricks. As the demo narrative says, when you try to execute an x86 application on the Power system, the x86 executable is automatically recognized and the translation environment invoked.

The video clip goes on to show some of the highlights of the Power line where Power logical partitions can be migrated from one physical system to another. More cool stuff.

Another example of common product usage is in the world of graphing performance results. Users can check out a really nice set of charting libraries from Advanced Software Engineering (http://www.advsofteng.com/) available with the ChartDirector product. The executable run-time libraries are available for a variety of platforms, including Linux on i386, but alas, not for Linux on ppc64 systems. But when the i386 libraries are installed on a Power system running Linux with the additional PowerVM Lx86 product, Power users can use the graphing routines directly. Again, the perceptible performance differences are minimal, and the full function of the i386 routines is available to the Power users.

The IBM web site for PowerVM Virtualization Software offerings has a good description of the capabilities of the Linux product and the services available for software vendors to enable their apps for native execution while still exploiting the Power systems running Linux with their existing applications.

Keep in mind there are the normal obligatory footnotes and qualifications on what i386 applications can function under this product - check out the product web sites for that information.

Finally, as a performance team, we always tend to agonize over the corner cases which highlight the performance challenges of translating an application from one system platform base to another, and there certainly are some areas where performance can be a challenge. Java is a good example. There are too many steps of translating byte codes to executables, then those executables are translated again to execute on the Power platform, which can make for a rather poor execution path. If your Java app is a minor piece of a bigger application (the prime example is as an application installer), shrug. But whew, if you're thinking about snagging a full comprehensive Java based product and running it in translation mode - as opposed to verifying that the Java code runs on the Power platform - I can anticipate you may be disappointed with the performance. One would've hoped that the Java world of write once, run anywhere would've panned out better than the write once test everywhere implementation.

In the meantime, if you need easy access to x86 executables and applications on your Power systems, give this a shot.

What "one thing" do you want in the Linux kernel?

noreply@blogger.com (Bill Buros) — Mon, 07 Apr 2008 18:06:00 +0000

An interesting question came by this week. If I could have one free wish, one thoughtful choice, one wise selection, what would I want in the Linux kernel?

I was of course assured that my choice didn't mean I'd get it, but it was an interesting hallway poll for the day. I suspect the answers were being pulled together as input for the Linux Foundation meetings happening this week in Austin, TX... lots of people in town.

The mind races.

So back to the Linux Weather Forecast. Whew. What to choose, what to choose.

http://www.linux-foundation.org/en/Linux_Weather_Forecast written up by Jonathan Corbet of lwn.net fame.

In no particular order,

A new completely fair scheduler - the promise of an updated scheduler is intriguing for performance teams. 'Course, on a practical level, the thought of rather extensive regression testing keeps popping up in my mind. Watching the progress of that effort is reassuring though, so this will be cool in the next revs of the distros.
Kernel markers - cool technology which will make it easier for tools to hook into the kernel, which of course a performance team is always interested in. We need to make it absolutely seamless and safe for a customer, on a production system, as a protected but definitely non-root user, to gather system metric information.
Memory use profiling - now this will be really nice - way too many times we're asked about the memory usage of an application - which is particularly dependent on other things happening in the operating system.
Real-time enhancements - continued integration of the real-time work happening in parallel in the community. This is proving particularly helpful in the HPC cluster space as work continues to improve the deterministic behavior of the operating system.
Memory fragmentation avoidance - another longer-term project which positions the kernel for far more aggressive memory management and control of varying page sizes.
And numerous other choices... better filesystems in ext4 and btrfs, better virtualization and container support, better energy management, improvements in glibc, etc etc

But really? All of these are in play today. All of them are being polished, updated, with many creative minds at work.

So what I asked for was for the kernel and Linux community to continue strengthening the ability to help customers easily understand and improve the performance of their applications and the system. Out of the box. Across all of the components. No special adds, rpms, or kernels on their installed system.

The nice part is the Linux programmers I work with are all committed to this. So the work continues - fit, finish, and polish - and continue working with the longer term changes which are being developed.

Reboot? Real customers don't reboot.

noreply@blogger.com (Bill Buros) — Tue, 18 Mar 2008 12:47:00 +0000

To get clear, clean, and repeat'able performance results, performance teams in the labs generally don't think twice about rebooting the system and starting fresh (after all, who knows what someone has done on the system - particularly those kernel programmers).

In my experience though, to balance this automatic tendency, it's important to note that re-booting a system isn't normal behavior for real-life customers.

For example, when I suggest that a customer re-boot his/her system, there's usually a perceptible pause, sometimes even a quiet chuckle. Turns out most customers are quite happy with their systems running Linux, and simply don't consider the process of re-booting as anything normal. I'm not surprised, but it does mean that the tuning options practically available to customers have to be applied dynamically rather than through a kernel boot option.

This aspect continues to be improved in Linux with work across the operating system. The ability to control energy consumption, system resource usage, adding/removing CPUs, and adding/removing memory are all examples of cool things being worked on in the Linux community.

For the performance team, we're particularly interested in the ability to control things like SMT on and off (something needed on Power systems), the number of CPUs running, and minimizing kernel memory fragmentation.
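As a hedged sketch of what this kind of dynamic control can look like: kernels with CPU hotplug support expose a per-CPU "online" file under /sys/devices/system/cpu, and many also expose a summary "online" file holding a CPU list such as 0-63. The helper names below (count_cpus_in_list, set_cpu_online) and the SYSFS_CPU_DIR override are made up for illustration, and whether a particular distro kernel exposes these files is an assumption to check on your own system.

```shell
#!/bin/sh
# Count the CPUs named in a kernel CPU-list string such as "0-3,8-11".
count_cpus_in_list() {
    echo "$1" | tr ',' '\n' | awk -F- '
        NF == 1 && $1 != "" { n += 1 }            # single CPU, e.g. "5"
        NF == 2             { n += $2 - $1 + 1 }  # range, e.g. "0-3"
        END                 { print n + 0 }'
}

# Take one logical CPU offline (0) or bring it online (1). Needs root.
# SYSFS_CPU_DIR is a hypothetical override so the function can be dry-run
# against a scratch directory instead of the real sysfs tree.
set_cpu_online() {
    cpu_dir="${SYSFS_CPU_DIR:-/sys/devices/system/cpu}"
    echo "$2" > "$cpu_dir/cpu$1/online"
}
```

Typical use would be something like count_cpus_in_list "$(cat /sys/devices/system/cpu/online)" before and after a change, so the run log records how many CPUs Linux was actually scheduling on. Where the powerpc-utils ppc64_cpu command is installed, ppc64_cpu --smt=off is another route to the SMT control mentioned above.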

Some pieces are there, some are emerging, and some are being invested in. I'll hunt some examples down and post them here in the coming days.

RHEL5.1 exploits POWER6

noreply@blogger.com (Bill Buros) — Tue, 12 Feb 2008 13:02:00 +0000

Whew. The months tick by quickly these days. Back in October 2007, Red Hat released a service update which officially supported and exploited the IBM POWER6 systems. The Red Hat exploitation code had been worked on by LTC programmers from around the world - providing a Linux OS version that handily and easily provides very nice performance for enterprise customers in production mode.

The performance teams submitted a series of SPEC.org (see the Appendix in the white paper link below for the list) and single-system Linpack publishes at the time, which have since been reviewed, approved, and published. One of the things we really like to focus on is the "whole software stack" when performance is looked at. So a short white paper was written up to explain the various software pieces and how they were tuned. To me this is far more interesting than the bottom-line metrics and the various sparring leadership claims that bounce around the industry. 'Course, it's always fun to have those leadership claims while they last.

The paper - an interesting mix of marketing introductions and then some technical details - really is intended to focus on the basics of leveraging the platform specific compilers (IBM's XL C/C++ and Fortran compilers), platform specific tuned math libraries (ESSL), a Linux open source project called libhugetlbfs, MicroQuill's clever SmartHeap product, FDPR-pro - which I've posted on earlier, IBM's Java, all running examples on the IBM System p 570 system.

Following these activities and working with various customers and interested programmers, there's some interesting work going on with using the oprofile analysis tools on Linux - especially as a non-root user - and working on understanding edge cases where the performance of a workload isn't purely predictable and deterministic. What's interesting is the possible performance variability of a workload when memory being used by a program isn't regularly placed local to each executing core (a classic NUMA-type issue), especially in the case of varying page sizes supported by the Linux operating systems.

I'll post some examples of using oprofile in different ways this month, and also an example of a Fortran program which can show varying run-times and how to address that.

Winning with Linux on new Power systems

noreply@blogger.com (Bill Buros) — Fri, 01 Feb 2008 13:29:00 +0000

IBM just announced two new POWER6 systems, the IBM System p 520 and IBM System p 550. In essence, the p 520 has up to 4 cores of a 4.2GHz POWER6 processor and up to 64GB memory, while the p 550 has up to 8 cores and up to 256GB memory. The two systems are sweet little systems, and I recommend checking them out.

Elisabeth Stahl has a good summary blog post on the leadership publishes submitted earlier this week for these new systems. I point to her blog since she nicely has all of the requisite disclaimers and pointers to the data points submitted to SPEC.org and with SAP this week. It takes a couple of weeks for the various review processes to complete, so it'll be easier to comment on these once they've been reviewed, approved, and published. I have found that Linux programmers like to see the published files and walk through the specific details. Show me.

The cool part is Linux on Power continues to be a parity player in the leadership performance metrics for POWER6 customers (in this case using examples for Linpack, SPEC CPU2006, SPECjbb2005, and SAP). There are summaries of some of the submitted bottom-line numbers for AIX and Linux for the p 520 and the p 550 on IBM's website.

There's a short paper which should be out soon that discusses the simple steps and software products that can be used on Linux on Power to achieve the best performance for POWER6 using Linux. It's based on similar publishes done last October and uses actual results that are published on the SPEC.org website.

Workloads: Standards? Customer? Proxy?

noreply@blogger.com (Bill Buros) — Mon, 21 Jan 2008 13:23:00 +0000

So what's the "Best Workload" to measure performance with? Industry standard workloads? Real-life customer applications? Proxy workloads developed to mimic customer applications? Micro-benchmarks? Marketing inspired metrics?

Well, it depends. Each has a purpose. Each has its drawbacks.

Let's review.

Industry standard workloads, typified by the wide breadth of SPEC.org workloads, are benchmarks painstakingly agreed-to by industry committees. The workloads are specifically intended for cross-industry comparisons of systems and software stacks, which can make a seat on those committees an "interesting" position to be in. I'm not sure I would have the patience and tact needed to function effectively in that realm.

In the Linux community, I have occasionally found some resistance to the world of industry standard workloads, with quick examples of things like SPECcpu2006, SPECjbb2005, SPECweb2005, and many others. Trying not to over-generalize, but sometimes the community views unique performance improvements targeted at these workloads as "benchmark specials". Ok, sometimes I have seen some members of the community get more than a little excited about changes being proposed which were presented as "we gotta have this for xxx benchmark".

I find this attitude frustrating at times, since the workloads usually were developed with the specific intent of addressing how customers use their systems. But note to self, I've discovered that when I can find a real-life customer example of the same workload, the proposed change (fix or improvement) is enthusiastically embraced. With that small change in how a problem is proposed, I've been very impressed with how well the broader Linux community wants to improve performance for real customers.

(by the way, quick aside, some industry standard benchmarks do not pan out because really clever "benchmark specials" DO emerge, and the comparisons then can lose any customer relevant importance)

Real-life customer applications. So why not test with "real" applications more often? While some customers do have specialized example applications that they can share with vendors, usually the customer application represents the "jewels" of their business, so it's not at all surprising that they have no interest in sharing. So some time and energy is spent working to understand the workload characteristics, which can then be leveraged in identifying proxy workloads. Most performance teams have a good set of sample workloads which can serve as examples to look at improvements and bottlenecks.

On occasion, a team can run into a customer application which is unique and not nicely covered by their standard suite. Valerie Henson ran into this a while back, and created a nice little program called eBizzy, which eventually ended up on SourceForge with a third release just made available in January 2008. Numerous performance problems have been uncovered with programs like this, with fixes flowing back into the mainline development trees for Linux and the software stacks. There are many other examples floating around the web.

Proxy Workloads. Creating workloads to mimic customer applications is what I loosely call proxy workloads. These are handy when they are available as source code and unencumbered with various specialized licenses. The workloads can be modified to address what the performance team is focusing on, and more importantly, they're available for the programmers in the community who are trying to reproduce the problem and prototype the fix. Using proxy workloads is a good example of where the combination of customer, performance teams, and programmers can effectively address performance issues with application programs.

Micro-benchmarks. Some people use very small, simplistic, programs for performance assessments. These targeted programs are often packaged together and are fairly quick and easy to execute and get simplistic answers from. These are practically more useful for digging into specific areas of a system implementation, but when used to generalize an overall system's performance, can cause headaches and mis-understandings.

I have seen several over-zealous system admins at some companies take a simple single-thread single-process integer micro-benchmark, run it on their new 1U x86 server with 1GB memory, and then run it on a new very large system (anyone's, this is not vendor exclusive), and draw the obvious conclusion that the new little 1U server is the new preferred server system for massively complex workloads planned for the datacenter. Somehow they completely miss the point that only one thread is running on the larger system with many cores and lots of memory, thus 98% idle. It'd be funny if these people were just sitting in labs, but it's amazing how an authoritatively written email inside a company can create so much extra work in refuting and explaining why the person who wrote it, perhaps, just perhaps, was an idiot.
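To put a rough number on that idle figure, here is a minimal sketch of the arithmetic, assuming every busy thread fully occupies exactly one logical CPU and nothing else is running. The function name idle_percent and the 64-CPU figure are illustrative choices, not measurements.

```shell
#!/bin/sh
# Percentage of logical CPUs left idle when a given number of threads each
# keep one CPU fully busy. Arguments: busy_threads total_logical_cpus
idle_percent() {
    awk -v th="$1" -v cpus="$2" 'BEGIN { printf "%.1f\n", 100 * (1 - th / cpus) }'
}

# One single-threaded micro-benchmark on a system with 64 logical CPUs:
idle_percent 1 64    # prints 98.4
```

The micro-benchmark score says something about one core; it says nothing about the other 63 logical CPUs sitting there waiting for work.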

That said, micro-benchmarks do play a critical role with performance teams, simply because they do in fact highlight particular aspects of a system, which allows some very focused testing and improvement activity. Improvements have to be made in a bigger context though, since you may tweak one part of the system, only to hurt another part.

Marketing inspired metrics. Finally, I'll close with a quick discussion on what I term the "marketing inspired metrics". While some companies in the marketplace are really good at this, I have to chuckle and admit I'm not convinced this is one of IBM's core strengths, or for that matter, Linux.

The ability to make up new performance metrics, highlighting vague workloads in the absence of any concrete comparable system criteria, and formulate an entire marketing campaign around new "performance terminology" is an art form. I shudder when marketing teams get that gleam in their eye and start talking about performance. Uh-oh.

It's so much safer when we tell them "This is Leadership", or they simply task us with "Make it Leadership". Leadership of course is always how you phrase it, and what you choose to compare to, but at least that's usually a facts-based discussion.

Focus on repeat'able data. So performance teams focus on real data, which is supported by analysis and assessments of the data, watching for and correcting the common mistakes in gathering data from a system. In Linux performance teams I've been involved with, there is a fanatical dedication to automating the process of gathering and scrubbing the data, building in safe-guards to catch the common mistakes, so that when work is done on a workload, be it industry standard, a customer application, a proxy workload, or a micro-benchmark, we're focusing on finding and fixing the problems, not on the steps of gathering the data.
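One simple safe-guard of this kind, sketched here under my own assumptions rather than taken from any particular team's harness, is to repeat a run several times and refuse to trust the numbers if they disagree too much. The helper name spread_percent and the sample values are hypothetical.

```shell
#!/bin/sh
# Run-to-run spread of a set of results, as a percentage of their mean:
# (max - min) * 100 / mean. Arguments: one numeric result per run.
spread_percent() {
    printf '%s\n' "$@" | awk '
        NR == 1 { min = $1; max = $1 }
        {
            sum += $1
            if ($1 < min) { min = $1 }
            if ($1 > max) { max = $1 }
        }
        END { printf "%.1f\n", (max - min) * 100 / (sum / NR) }'
}

# Three hypothetical runs of the same workload:
spread_percent 100 102 98    # prints 4.0
```

What counts as "too much" spread depends on the workload; the point is to have the automation compute it every time so nobody has to remember to look.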

And I always try to avoid those marketing inspired metrics, that's for the marketing teams.

Performance Plus

noreply@blogger.com (Bill Buros) — Thu, 17 Jan 2008 19:33:00 +0000

“Higher system performance—the traditional hallmark of UNIX evolution—is still critical, but no longer sufficient. New systems must deliver higher levels of performance plus availability plus efficiency. We call this Performance plus.” —Ross A. Mauri, general manager, IBM System p

While more AIX-centric than I would've hoped for from an article for System p, I did see an interesting article from IBM Systems Magazine which touches on some of the challenges emerging for performance teams in today's marketplace. The article focuses on application availability and energy efficiencies as appropriate parallel focus items in addition to performance metrics and benchmarks. No surprise there, but it is interesting to see IBM's senior executives emerging with a new term - "Performance Plus" - which generally means we'll be living that as a mantra within a month or two.

The challenge comes in emerging with new metrics to numerically quantify the balancing act of system/application availability, energy usage (across cooling, power draw, and peak energy demands), increasingly virtualized servers, and the classic how fast does my application run?

If we could figure out how to make "metrics" a series of coding requests in open-source projects, we could get things going across various mailing lists and get more people cranking out code, ideas, and brand-new metrics. In the meantime, I guess we'll start getting more creative with our Linux brethren world-wide who are working on exactly all of these issues in real-life scenarios. The balancing act is already in progress, the metrics will need to emerge over time.

Executable magic - FDPR-pro

noreply@blogger.com (Bill Buros) — Fri, 11 Jan 2008 23:16:00 +0000

One of the products we use quite frequently is the FDPR-pro product available from IBM. It's one of those "magic" products which allow you to easily optimize a program after it has been compiled and linked.

The program, Feedback Directed Program Restructuring, is available for Linux on Power programs from IBM's alphaWorks web site. Last week the Haifa development team updated the download'able version with a new version - Version 5.4.0.17 - which is what caught my eye.

The official name is Post-Link Optimization for Linux on POWER, which makes for an acronym that has never really caught on - PLOLOP - and various other strange permutations. I seem to recall that the FDPR-pro name had a name conflict somewhere in the world, but that's the name I've been familiar with for years.

The post-link optimization has been used with database engine products, big ISV products, CPU intensive benchmarks and customer workloads. Borrowing shamelessly from the alphaWorks web site, this product is a performance tuning utility which can improve the execution time of user-level application programs. It does this with a multi-step process of running the program in a commonly used mode, where FDPR gathers run-time profiles and uses those profiles to modify/optimize the executable image.

The clever part is that FDPR can be made aware of cache line sizes and specific system characteristics and optimize (if desired) for execution on a particular set of systems. The daunting aspect of FDPR is the number of possible optimizations. As happens with compilers, there are lots of optimizations and areas to focus on.

Using FDPR can become a natural part of the build process of an application. Build, link, then run the extra steps needed by FDPR. FDPR is a supported product and re-structured executables can be supported in normal business commercial environments. Sometimes the performance gains are minimal, and in many cases we'll see performance gains in the 10% to 12% range. More is possible, but it really depends on the application. In rare cases the changes will cause an executable to core-dump, but we've found the team in Haifa extremely helpful and they usually are able to fix problems quickly.

A product to help with the daunting number of FDPR options is another alphaWorks product called ESTO - Expert System for Tuning Optimizations. ESTO automates the process of identifying options and tuning with FDPR for Linux on Power. As explained on the ESTO project page, the approach leverages "iterative compilation" and uses an "adaptive genetic algorithm with exponential population reduction". I have a feeling that the team that built the product description had fun listing the technologies and the approaches used.

In the near future, we're going to get some examples posted up here and on some LoP wikis.

servicelog and hardware performance

noreply@blogger.com (Bill Buros) — Wed, 12 Dec 2007 14:32:00 +0000

Mike Strosaker put up an interesting blog post over the weekend covering Servicelog which I caught on Planet-LTC. As I was reading it, it reminded me of just how critically important easy service'ability is for performance work.

When we do performance work in the Linux labs, we usually have an interesting mix of systems, some newer than new (also known as engineering models), some customer brand-new, some fairly ancient. For the systems we use directly, we have a good feel for the general ongoing performance characteristics of that system. We regularly run performance regressions of standard workloads watching for changes in the Linux software stack. Since we do this on a regular basis, any performance blip (sometimes an unexpected software regression, but usually a nice Linux improvement which is working its way through the process) is seen and we can go look at it. The side effect of this is we also can catch the rare occasion when something has happened on the hardware side.
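The blip-watching described above boils down to comparing each new result against a known baseline with some tolerance. A minimal sketch, assuming a higher-is-better metric and a tolerance chosen by the team (the function name classify_run, the 5% tolerance, and the sample numbers are all hypothetical):

```shell
#!/bin/sh
# Classify a new result against a baseline for a higher-is-better metric.
# Arguments: baseline current tolerance_percent
classify_run() {
    awk -v b="$1" -v c="$2" -v t="$3" 'BEGIN {
        d = (c - b) * 100 / b          # percent change from the baseline
        if (d > t)       { print "improvement" }
        else if (d < -t) { print "regression" }
        else             { print "steady" }
    }'
}

classify_run 1000 940 5     # prints regression
classify_run 1000 1080 5    # prints improvement
classify_run 1000 990 5     # prints steady
```

Either kind of blip is worth a look. A "regression" on an unchanged software stack is exactly the case where the hardware deserves a second glance.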

One of the interesting aspects of the POWER line is the continued strong focus and evolution of the RAS characteristics of the systems. "RAS" can often get lost in the marketing collateral of new systems, and is usually ignored by the programmer community, but it really is pretty important to customers. For us, what this means on a practical level is that on rare occasions, the POWER systems will automagically detect possibly failing components and will disable those components if and when necessary, logging the errors.

For those who are constantly searching for more materials to read, a comprehensive description of the latest POWER RAS characteristics is available at http://www-03.ibm.com/systems/p/hardware/whitepapers/ras.html.
Appendix A of the white paper lists the Linux support provided by the SUSE and Red Hat Linux distributions. A lot of people in the Linux teams around the world have worked on RAS support and have tested this thoroughly. The breadth of the coverage is pretty impressive and continues to be improved.
This coverage is key because it nicely highlights the "enterprise readiness" of Linux for customers.

So why is this important to performance?

Hardware failure examples are rare. So it's something we usually discover after spending way too much time doing software analysis, when we should have checked the service/error logs first. Eventually we get the right performance data which highlights the problem, and THEN we go check the service logs, which confirms what we determined. After we check the service log, we usually look at each other and state the obvious: "Hey, YOU should've checked the service log".

An example over the last year was with an older Power 5 system which was being used by a software vendor to do software performance testing. Unbeknownst to us, the system had cleverly detected that one of the L3 caches was failing, logged it appropriately, and then disabled the failing L3 cache so that the system could continue operating. The performance aspects were REALLY strange. We were running a multi-process multi-threaded workload that max'ed out the whole system, but the results were not balanced which made it look like a scheduler bug.

It was a long story, but eventually we checked the L3 cache sizes and confirmed that one of the L3 caches was "missing". We should've seen the following.... which shows the two 36MB L3-caches on this system...

cd /proc/device-tree/cpus/
find . -wholename "./l3-cache*/i-cache-size" -exec hexdump '{}' \;
0000000 0240 0000
0000004
0000000 0240 0000
0000004

but instead we saw something like the following (this being from memory)... the L3 cache size of one of the caches was zero.

cd /proc/device-tree/cpus/
find . -wholename "./l3-cache*/i-cache-size" -exec hexdump '{}' \;
0000000 0240 0000
0000004
0000000 0000 0000
0000004

The system config confirmed the behavior we were seeing. Once we got the piece fixed, the strange results were taken care of.
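For anyone squinting at the hexdump output above: the property is a 4-byte big-endian integer, so 0240 0000 is 0x02400000 = 37,748,736 bytes, which is 36MB. The sketch below does that conversion and flags any cache that reports zero. It reads the bytes with od so the result does not depend on host byte order. The function names are my own, and the assumption that every l3-cache node carries a 4-byte i-cache-size property comes only from the output shown in this post.

```shell
#!/bin/sh
# Print the size in MB of a 4-byte big-endian device-tree property file.
cache_size_mb() {
    hex=$(od -An -tx1 "$1" | tr -d ' \n')   # e.g. "02400000"
    bytes=$(printf '%d' "0x$hex")
    echo $((bytes / 1048576))
}

# Report every L3 cache under a device-tree cpus directory, flagging zeros.
# Defaults to the path used in this post; pass another directory to test.
report_l3_caches() {
    dir="${1:-/proc/device-tree/cpus}"
    for f in "$dir"/l3-cache*/i-cache-size; do
        [ -e "$f" ] || continue
        mb=$(cache_size_mb "$f")
        if [ "$mb" -eq 0 ]; then
            echo "$f: 0 MB  <-- cache missing or disabled?"
        else
            echo "$f: $mb MB"
        fi
    done
}
```

Running report_l3_caches at the start of every performance run, and saving the output with the results, would have pointed at the missing cache long before we went hunting for a scheduler bug.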

Anyway, circling back to Mike's servicelog post, that looks like an easy piece of software to try out. I popped it onto one of our SLES 10 sp1 servers in the lab, but it needs a Berkeley database library to link in - the website blithely says copy libdb-4.2.a into the source directory - and I don't have time this week to hunt that part down. I've got the .so file, and there's probably some easy way to bridge the two, but I'll try it out early in the new year. If the serviceable event process works as advertised, this would be nice to pop on our systems, and use it to highlight serviceable events on the system. Even better, what would be really cool is for it to send me a polite email that informs me about the error, but I suspect this needs some wrapper systems management software.

The key message for emerging performance analysts is to make sure the hardware is doing what it's supposed to be doing. A related topic in the future will cover the occasional disconnects between what someone tells you was ordered for the hardware you're testing on, and what is really in the hardware system you're testing on. Same story... you need to be sure you know what you're really working with.

Green 500 ?!

noreply@blogger.com (Bill Buros) — Mon, 26 Nov 2007 14:13:00 +0000

Interesting. I flippantly remarked in my last post that I wondered what the energy/thermal ratings were for the Top 500 clustered systems, and then this morning stumbled on the Green 500 list at http://green500.org.

My first thought was "how in the world do you measure power consumption across so many clustered systems??" It appears that they estimate the overall power consumption from whatever can be measured, and then calculate the overall metric from that. The web site even nicely provides a tutorial paper on how to calculate / determine your power consumption.

http://green500.org/docs/tutorials/tutorial.pdf

More stuff to read and try this week. Time to pull the power meter over to a small Linpack test system running the latest RHEL 5.1 release where we recently pushed a number of publishes out and see what's happening. We regularly play with Linpack on the single SMP servers to make sure they scale fairly linearly (ie: 4-core to 8-core to 16-core) and this may be a good way to apply some power consumption metrics to the performance metrics. Linpack is good because it's fairly steady-state for a relatively long period (depending of course on how big you define the problem size to be solved).
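As a back-of-the-envelope sketch of how the two sets of numbers could be combined: performance per watt is commonly expressed as MFLOPS per watt, and scaling can be expressed as measured speedup divided by ideal speedup. The function names and every number below are made up for illustration; none of them are measured results.

```shell
#!/bin/sh
# MFLOPS per watt from a Linpack result in GFLOPS and an average power draw.
# Arguments: gflops average_watts
mflops_per_watt() {
    awk -v g="$1" -v w="$2" 'BEGIN { printf "%.1f\n", g * 1000 / w }'
}

# Scaling efficiency in percent: measured speedup over ideal (linear) speedup.
# Arguments: base_cores base_gflops cores gflops
scaling_efficiency() {
    awk -v bc="$1" -v bg="$2" -v c="$3" -v g="$4" \
        'BEGIN { printf "%.1f\n", (g / bg) / (c / bc) * 100 }'
}

mflops_per_watt 50 400          # prints 125.0
scaling_efficiency 4 20 8 38    # prints 95.0
```

Because Linpack holds a fairly steady state, averaging the power meter readings over the timed portion of the run is a reasonable first approximation for the watts figure.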

Naturally... what, when, and how to measure watts and thermal and energy consumption for systems under test are being debated and discussed across the industry these days. A whole new dimension of being able to say "well, it depends" when asked about performance and the trade-offs. If you get a chance, watch what happens with SPEC.org's SPECpower initial benchmark (at http://www.spec.org/specpower/ ). This initial benchmark is focused on CPU centric workloads, but more dimensions are undoubtedly coming.