Discussion:
Strange runtime behaviour, job time fluctuates
y***@dodgeit.com
2007-08-10 03:58:55 UTC
I have a completely deterministic executable on an idle machine, and
yet I'm getting some wildly fluctuating running times.

The executable in question is a SPEC benchmark, and so should be
completely deterministic. The machine is idle, no one else is logged
on, and the benchmark gets 99% of the CPU for the duration. The
machine is a CELL Blade running Fedora 7 and the benchmark is single-
threaded and running completely on the PPE, although I've seen this on
POWER5 as well. I'm using the time command to get these measurements.

My problem is simple: I have no explanation for the variable running
time, which fluctuates pretty drastically. The last two runs gave me
7m24s and 9m19s. Does anyone have any idea why this would happen? The
only significant difference between the two runs was in involuntary
context switches (457 vs 564), which suggests that for some reason one
run is getting much less work done per time slice than the other...
and I have no clue why. Things like # of page faults and voluntary
context switches are the same. Now, I don't expect the exact same
running time every invocation, since the rest of the machine isn't
free from outside influences, but like I said, I've made sure that
disturbances to the machine have been minimized and so I don't expect
over 2 minutes of difference in running time. Typically, most of the
times are clustered around either 7m or 9m, for whatever reason.

Have I missed something? I've considered cache, the SMP nature of the
PPE, and scheduling, but I don't see how those might contribute to
this problem. If anyone has any ideas I'd love to hear them.
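
For reference, here is roughly how I collect the numbers. This is a
sketch using GNU time (the shell builtin doesn't report context
switches), and ./benchmark stands in for the actual SPEC binary:

  # repeat the run several times; -v prints wall time, page faults,
  # and voluntary/involuntary context switches
  for i in 1 2 3 4 5; do
      /usr/bin/time -v ./benchmark > /dev/null
  done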
Paul Russell
2007-08-10 04:50:37 UTC
Post by y***@dodgeit.com
I have a completely deterministic executable on an idle machine, and
yet I'm getting some wildly fluctuating running times.
[snip]
Post by y***@dodgeit.com
Have I missed something? I've considered cache, the SMP nature of the
PPE, and scheduling, but I don't see how those might contribute to
this problem. If anyone has any ideas I'd love to hear them.
A couple of thoughts:

- I don't know how many CPUs you have on each blade, but try using
taskset to bind your process to a specific CPU/core.

- Is it possible that your clock speed is getting throttled for some
reason? Some blades have clock-speed control that ramps up and down
with load, and that can also ramp down if the CPU internal temperature
gets too high. Does cpufreq-info tell you anything?
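
Something along these lines (a sketch; ./benchmark stands in for the
actual binary, and cpufreq-info comes from the cpufrequtils package):

  # pin the job to a single core and time it there
  taskset -c 0 /usr/bin/time ./benchmark > /dev/null

  # show the governor, current frequency, and limits for every core
  cpufreq-info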

Paul
Robert M. Riches Jr.
2007-08-10 04:52:44 UTC
Post by y***@dodgeit.com
I have a completely deterministic executable on an idle machine, and
yet I'm getting some wildly fluctuating running times.
[snip]
Post by y***@dodgeit.com
Have I missed something? I've considered cache, the SMP nature of the
PPE, and scheduling, but I don't see how those might contribute to
this problem. If anyone has any ideas I'd love to hear them.
Two things to check in case they _MIGHT_ be the cause:

- CPU temperature that might cause throttling, if those
CPUs do thermal throttling.

- If these CPUs have inverted page tables, as the PowerPC
(and POWER?) architecture does, might there be some
odd effect related to that?
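
A quick way to check, assuming the cpufreq sysfs interface exists for
this platform's driver:

  # current, minimum, and maximum frequency, if the driver exposes them
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq

  # PowerPC kernels also report the clock in /proc/cpuinfo
  grep -i clock /proc/cpuinfo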
--
Robert Riches
***@verizon.net
(Yes, that is one of my email addresses.)
Anton Ertl
2007-08-10 07:46:16 UTC
Post by y***@dodgeit.com
My problem is simple: I have no explanation for the variable running
time, which fluctuates pretty drastically. The last two runs gave me
7m24s and 9m19s. Does anyone have any idea why this would happen? The
only significant difference between the two runs was in involuntary
context switches (457 vs 564), which suggests that for some reason one
run is getting much less work done per time slice than the other...
and I have no clue why.
They are getting pretty much the same average interval between
preemptions: 444s/457 ≈ 0.97s for the faster run vs. 559s/564 ≈ 0.99s
for the slower one, just under a second either way. It's not
surprising that a longer-running job gets preempted more often.

Concerning your problem, I have no idea. For a normal CPU I would be
thinking about effects from the MMU and caches (mapping several hot
pages to the same cache set, causing increased conflict misses). The
PPE does have an MMU and caches (it is the SPEs that run from local
store), so an effect like that can't be ruled out here.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Cedric Roux
2007-08-10 12:31:26 UTC
Post by y***@dodgeit.com
I have a completely deterministic executable on a idle machine, and
yet I'm getting some wildly fluctuating running times.
[snip]
Post by y***@dodgeit.com
Have I missed something? I've considered cache, the SMP nature of the
PPE, and scheduling, but I don't see how those might contribute to
this problem. If anyone has any ideas I'd love to hear them.
1 - Maybe cron jobs?

2 - Is the computer connected to the Internet? Processing
network packets might slow it down.
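
Both are easy to rule out (a sketch; eth0 is a guess at the interface
name):

  # look for cron jobs that could fire mid-run
  crontab -l
  ls /etc/cron.d /etc/cron.daily /etc/cron.hourly

  # take the network down for the duration of a run (as root)
  ifconfig eth0 down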
tortoise
2007-08-14 09:15:09 UTC
Post by y***@dodgeit.com
I have a completely deterministic executable on a idle machine, and
yet I'm getting some wildly fluctuating running times.
The executable in question is a SPEC benchmark, and so should be
completely deterministic. The machine is idle, no one else is logged
You could boot single-user.
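
One way to get there without rebooting, on a sysvinit-style system
(as root):

  # switch to the single-user runlevel; stops most daemons
  telinit 1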
Post by y***@dodgeit.com
on, and the benchmark gets 99% of the CPU for the duration. The
machine is a CELL Blade running Fedora 7 and the benchmark is single-
Which kernel is that?

Is your kernel configured to allow CPU frequency scaling? Many are,
to save energy: when the machine sits idle the clock slows down, and
then it has to ramp up again. I have heard people on the Debian list
complain that their Macs sometimes seemed to stick at the lower
frequencies.
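
If scaling is enabled, you can pin the clock at full speed for the
test. A sketch, assuming the cpufreq sysfs files exist on this kernel
and the driver offers the performance governor:

  # as root: force the performance governor on every core
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done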
Post by y***@dodgeit.com
threaded and running completely on the PPE, although I've seen this on
POWER5 as well. I'm using the time command to get these measurements.
My problem is simple, I have no explanation for the variable running
time, which fluctuates pretty drastically. The last two runs gave me
7m24s and 9m19s. Does anyone have any idea why this would happen? The
Any code will run faster the second time than the first, because
instructions that formerly had to be loaded from disk are now
cached in memory.
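
So it is worth discarding the first run (a sketch; ./benchmark again
stands in for the real binary):

  # untimed warm-up run to pull the binary and its data into the page cache
  ./benchmark > /dev/null

  # then time the runs you actually report
  /usr/bin/time ./benchmark > /dev/null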

BTW, all the Cell systems I have seen have at most 512MB of RAM, and
some have only 256MB (I have not used them; I have been shopping
around, thinking of buying one). That is considerably less than
usual, so I wonder if they have more sophisticated memory-management
control.

If I had the time (and a Cell system to test), I might try another
distro or kernel or two.
Post by y***@dodgeit.com
only significant difference between the two runs was in involuntary
context switches (457 vs 564), which suggests that for some reason one
run is getting much less work done per time slice than the other...
and I have no clue why. Things like # of page faults and voluntary
context switches are the same. Now, I don't expect the exact same
running time every invocation, since the rest of the machine isn't
free from outside influences, but like I said, I've made sure that
disturbances to the machine have been minimized and so I don't expect
over 2 minutes of difference in running time. Typically, most of the
times are clustered around either 7m or 9m, for whatever reason.
Actually, that difference is only about 26% (444s vs. 559s). It is
worrisome, though.
Post by y***@dodgeit.com
Have I missed something? I've considered cache, the SMP nature of the
PPE, and scheduling, but I don't see how those might contribute to
this problem. If anyone has any ideas I'd love to hear them.