Wednesday, October 22, 2008

Hey! Who's stealing my CPU cycles?!

I hear this every now and then about Power systems from customers, programmers, and even peers. In more recent distro versions, there's a new "st" column in the CPU metrics which tracks "stolen" CPU cycles, from the perspective of the CPU being measured. This "steal" column has been around for a while, but the most recent updates of RHEL 5.2 and SLES 10 SP2 have the fixes which display the intended values - so the values are getting noticed more.
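
(For the curious: vmstat and top derive that column from the kernel's per-CPU counters in /proc/stat. On kernels which report it, steal is the eighth counter after the "cpu" label - user, nice, system, idle, iowait, irq, softirq, then steal - measured in clock ticks. A quick sketch to peek at the raw values:)

awk '/^cpu/ {print $1, "steal ticks:", $9}' /proc/stat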

I believe this "cpu cycle stealing" all came into being when things like Xen were being developed and the programmers wanted a way to account for the CPU cycles which were allocated to another partition. I suspect the programmers were looking at it from the perspective of "my partition", where something devious and nefarious was daring to steal my CPU cycles. Thus the term "stolen CPU cycles". Just guessing though.

This "steal" term is a tad unfortunate. It's been suggested that a more gentle term of "sharing" would be preferred for customers. But digging around the source code I found the term "steal" is fairly pervasive. And what's in the code, tends to end up in the man pages. Ah well.

With Power hardware, there's a mode where the two hardware threads of each core are juggled by the Linux scheduler. This is implemented via CPU pairs (for example, cpu0 and cpu1) which represent the individually schedulable hardware threads running on a single processor core. This is SMT (simultaneous multi-threading) mode on Power.
  • The term "hardware thread" is with respect to the processor core. Each processor core can have two active hardware threads. Software threads and processes are scheduled onto the processor cores by the operating system via the schedulable CPUs which correspond to those two hardware threads.
In the SMT realm, the two hardware threads running on a processor core can be considered siblings of each other. So if both hardware threads are flat-out busy with work from the operating system, and evenly balanced, then each of the corresponding CPUs being scheduled is generally getting 50% of the processor core's cycles. (A quick way to see which CPUs are siblings is sketched just below.)
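
(If you're curious which logical CPUs are siblings on a given core, the sysfs topology files can show it, and on Power boxes with powerpc-utils installed the ppc64_cpu command reports the SMT setting. A rough sketch - the exact sysfs file names vary a bit by kernel level, and older kernels may only expose the thread_siblings bitmask:)

# list which logical CPUs share a core (i.e. the SMT sibling hardware threads)
for c in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list; do
    echo "$c: $(cat $c)"
done

# on Power, with powerpc-utils installed, report whether SMT is on
ppc64_cpu --smt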

From a performance perspective, this has tremendous advantages, because the processor core can flip between the hardware threads as soon as one thread hits a short wait for things like memory accesses. Essentially the processor core can fetch instructions and service memory accesses for the two hardware threads simultaneously, which improves the efficiency of the core.

In days of old, each CPU's metrics were generally based on the premise that a CPU could get to 100% user busy. Now, the new steal column can account for the processor cycles being shared by the two SMT sibling threads, not to mention additional CPU cycles being shared with other partitions. It's still possible for an individual CPU to go to 100% user busy while its SMT sibling thread is idle.

For example, in the vmstat output below, the rightmost CPU column is the steal column. On an idle system, this value isn't very meaningful.

# vmstat 1
procs ------------memory----------- ---swap-- -----io---- -system-- ------cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 0  0      0 14578432 408768 943616    0    0     0     0    2    5  0  0 100  0  0
 0  0      0 14578368 408768 943616    0    0     0     0   25   44  0  0 100  0  0
 0  0      0 14578432 408768 943616    0    0     0    32   12   44  0  0 100  0  0
 0  0      0 14578432 408768 943616    0    0     0     0   21   45  0  0 100  0  0

In the next example, we push do-nothing work onto every CPU (in this case a four-core system with SMT on, so 8 CPUs were available). We'll see the vmstat "st" column quickly get to the point where the CPU cycles on average are 50% user and 50% steal.
  • Try using "top" and pressing the "1" key to see more easily what's happening on a per-CPU basis.

while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
# vmstat 1
procs ------------memory----------- ---swap-- -----io---- -system-- ------cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 8  0      0 14574400 408704 943488    0    0     0     0   26   42 50  0   0  0 50
 8  0      0 14574400 408704 943488    0    0     0     0   11   34 50  0   0  0 50
 8  0      0 14574400 408704 943488    0    0     0     0   26   42 50  0   0  0 50
 8  0      0 14574656 408704 943488    0    0     0     0   10   34 50  0   0  0 50

For customers and technical people who were used to seeing their CPUs at 100% user busy, this can be... disconcerting... but it's now perfectly normal, even expected.
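
If top's per-CPU view isn't handy, sysstat's mpstat breaks the same numbers out per CPU (recent versions include a %steal column), which makes the 50/50 split - or an imbalance between sibling threads - easy to spot:

# one-second samples, one line per CPU; %steal is the column to watch
mpstat -P ALL 1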

I just wish we could distinguish the SMT sharing of CPU cycles from the CPU cycles being shared with other partitions.
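
(In the meantime, on shared-processor LPARs there is at least a partition-level view: on Power systems, /proc/ppc64/lparcfg reports things like the partition's entitled capacity and whether it's running in shared-processor mode. It doesn't split the steal column itself, but it gives some context for how much of the sharing is happening at the hypervisor level - the exact field names vary by kernel and firmware level:)

# partition-level settings such as entitled capacity and shared-processor mode
cat /proc/ppc64/lparcfg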

For more details on the process of sharing the CPU cycles, especially when the CPU cycles are being shared between partitions, check out this page where we dive into more (but not yet all) of the gory details...

1 comment:

Anonymous said...

Very helpful! Thanks for clearing this up. I couldn't make heads or tails of these CPU numbers in top. The box I'm looking at should be running at ~100% utilization at all times, but only seemed to be running at 50%... with 50% of every CPU "stolen."

One of many bloggers.

Bill Buros

Bill leads an IBM Linux performance team in Austin, TX (the only place really to live in Texas). The team is focused on IBM's Power offerings (old, new, and future), working with IBM's Linux Technology Center (the LTC). While the focus is primarily on Power systems, the team also analyzes and improves overall Linux performance for IBM's xSeries products (both Intel and AMD), driving performance improvements which are both common for Linux and occasionally unique to the hardware offerings.

Performance analysis techniques, tools, and approaches are nicely common across Linux. Having worked for years in performance, I still get daily reminders of how much there is to learn in this space, so in many ways this blog is simply another vehicle in the continuing journey to becoming a more experienced "performance professional". One of several journeys in life.

The Usual Notice

The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions, try as I might to influence them.