- In June of 1997, Margo Seltzer and Aaron Brown published a paper in
- SIGMETRICS called "Operating System Benchmarking in the Wake of Lmbench:
- A Case Study of the Performance of NetBSD on the Intel x86 Architecture".
- This paper claims to have found flaws in the original lmbench work.
- With the exception of one bug, which we have of course fixed, we find
- the claims inaccurate, misleading, and petty. We don't understand
- what appears to be a pointless attack on something that has obviously
- helped many researchers and industry people alike. lmbench was warmly
- received and is widely used and referenced. We stand firmly behind the
- work and results of the original benchmark. We continue to improve and
- extend the benchmark. Our focus continues to be on providing a useful,
- accurate, portable benchmark suite that is widely used. As always, we
- welcome constructive feedback.
- To ease the concerns of gentle benchmarkers around the world, we have
- spent at least 4 weeks reverifying the results. We modified lmbench to
- eliminate any effects of
- . clock resolution
- . loop overhead
- . timing interface overhead
- Our prediction was that this would not make any difference, and our
- prediction was correct. All of the results reported in lmbench 1.x are
- valid except the file reread benchmark, which may be 20% optimistic on
- some platforms.
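- For reference, here is a minimal sketch (written for this note, not the
- lmbench source) of how the loop and timing-interface overheads can be
- measured, so that they can be subtracted from a timed region or shown
- to be negligible relative to it:
-
-     /*
-      * Measure (a) the cost of the timing interface itself and (b) the
-      * per-iteration cost of an empty benchmark loop.
-      */
-     #include <stdio.h>
-     #include <sys/time.h>
-
-     static double
-     now(void)                   /* seconds, microsecond resolution */
-     {
-             struct timeval  tv;
-
-             gettimeofday(&tv, 0);
-             return (tv.tv_sec + tv.tv_usec * 1e-6);
-     }
-
-     int
-     main(void)
-     {
-             volatile int    i;  /* volatile keeps the empty loop */
-             int             n = 1000000;
-             double          t0, timer, loop;
-
-             t0 = now();         /* (a) timing interface overhead */
-             for (i = 0; i < n; i++)
-                     now();
-             timer = (now() - t0) / n;
-
-             t0 = now();         /* (b) empty loop overhead */
-             for (i = 0; i < n; i++)
-                     ;
-             loop = (now() - t0) / n;
-
-             printf("timer %.1f ns/call, loop %.2f ns/iteration\n",
-                 timer * 1e9, loop * 1e9);
-             return (0);
-     }
-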
- We've spent a great deal of time and energy, for free, at the expense
- of our full time jobs, to address the issues raised by hbench. We feel
- that we were needlessly forced into a lose/lose situation of arguing
- with a fellow researcher. We intend no disrespect towards their work,
- but did not feel that it was appropriate for what we see as incorrect
- and misleading claims to go unanswered.
- We wish to move on to the more interesting and fruitful work of extending
- lmbench in substantial ways.
- Larry McVoy & Carl Staelin, June 1997
- --------------------------------------------------------------------------
- Detailed responses to their claims:
- Claim 1:
- "it did not have the statistical rigor and self-consistency
- needed for detailed architectural studies"
- Reply:
- This is an unsubstantiated claim; the paper presents no numbers to
- back it up.
- Claim 2:
- "with a reasonable compiler, the test designed to read and touch
- data from the file system buffer cache never actually touched
- the data"
- Reply:
-
- Yes, this was a bug in lmbench 1.0. It has been fixed.
- On platforms such as a 120 MHz Pentium we see a change of about 20%
- in the results, i.e., without the bug fix the reported bandwidth is
- about 20% higher than it should be.
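- The essence of the fix, as a sketch (the function below is
- illustrative, not the lmbench source): the read loop must both touch
- every word and hand a result back to the caller, so the compiler
- cannot optimize the touching away.
-
-     #include <unistd.h>
-
-     int
-     read_and_touch(int fd, int *buf, int nbytes)
-     {
-             int     sum = 0;
-             int     i, n;
-
-             n = read(fd, (void *)buf, nbytes);
-             for (i = 0; i < n / (int)sizeof(int); i++)
-                     sum += buf[i];  /* actually touch the data */
-             return (sum);           /* result is used, not dead code */
-     }
-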
- Claim 3:
- This is a multi part claim:
- a) gettimeofday() is too coarse.
- Reply:
-
- The implication is that there are a number of benchmarks in
- lmbench that finish in less time than the clock resolution,
- with correspondingly incorrect results. There is exactly one
- benchmark, TCP connection latency, where this is true, and that
- is by design, not by mistake. All other tests run long enough
- to overcome 10ms clocks (most modern clocks have microsecond
- resolution).
- Seltzer/Brown point out that lmbench 1.x couldn't accurately
- measure the L1/L2 cache bandwidths. lmbench 1.x didn't attempt
- to report L1/L2 cache bandwidths, so it seems a little
- unreasonable to imply inaccuracy in something the benchmark
- didn't measure. It's not hard to get this right, by the way;
- we do so handily in lmbench 2.0.
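- A sketch of the standard way to overcome a coarse clock (illustrative,
- not the lmbench source): scale the iteration count up until the timed
- interval dwarfs the clock resolution, then divide.
-
-     #include <sys/time.h>
-
-     /*
-      * Keep doubling the iteration count until the timed interval is
-      * large compared to a coarse (e.g. 10ms) clock, then report the
-      * per-iteration time in microseconds.  op() stands in for the
-      * operation being measured.
-      */
-     double
-     usecs_per_op(void (*op)(void))
-     {
-             struct timeval  s, e;
-             double          elapsed;
-             long            i, n = 1;
-
-             for (;;) {
-                     gettimeofday(&s, 0);
-                     for (i = 0; i < n; i++)
-                             op();
-                     gettimeofday(&e, 0);
-                     elapsed = (e.tv_sec - s.tv_sec) * 1e6 +
-                         (e.tv_usec - s.tv_usec);
-                     if (elapsed >= 1e6)     /* 100x a 10ms tick */
-                             return (elapsed / n);
-                     n *= 2;
-             }
-     }
-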
- b) TCP connection latency is reported as 0 on the DEC Alpha.
- Reply:
- We could have easily run the TCP connection latency benchmark in
- a loop long enough to overcome the clock resolution. We were,
- and are, well aware of the problem on DEC Alpha boxes. We run
- only a few iterations of this benchmark because the benchmark
- causes a large number of sockets to get stuck in TIME_WAIT,
- part of the TCP shutdown protocol. Almost all protocol stacks
- degrade somewhat in performance when there are large numbers of
- old sockets in their queues. We felt that the degraded
- performance was not representative of what users would see.
- So we run only a small number (about 1000) of iterations and
- report the result. We do not consider changing the benchmark
- to be the correct answer; DEC needs to fix their clocks if they
- wish to see accurate results for this test.
- We would welcome a portable solution to this problem. Reading
- hardware specific cycle counters is not portable.
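- For concreteness, here is the shape of a connection-latency loop (a
- sketch with an assumed local server on port 31234, not the lmbench
- source); the socket from every connection lingers in TIME_WAIT after
- the close, which is why the iteration count is kept small:
-
-     #include <stdio.h>
-     #include <string.h>
-     #include <unistd.h>
-     #include <arpa/inet.h>
-     #include <netinet/in.h>
-     #include <sys/socket.h>
-     #include <sys/time.h>
-
-     int
-     main(void)
-     {
-             struct sockaddr_in      sin;
-             struct timeval          s, e;
-             int                     i, fd, n = 1000;
-
-             memset(&sin, 0, sizeof(sin));
-             sin.sin_family = AF_INET;
-             sin.sin_port = htons(31234);    /* assumed test server */
-             sin.sin_addr.s_addr = inet_addr("127.0.0.1");
-
-             gettimeofday(&s, 0);
-             for (i = 0; i < n; i++) {
-                     fd = socket(AF_INET, SOCK_STREAM, 0);
-                     connect(fd, (struct sockaddr *)&sin, sizeof(sin));
-                     close(fd);      /* socket goes to TIME_WAIT */
-             }
-             gettimeofday(&e, 0);
-             printf("%.1f usec per connection\n",
-                 ((e.tv_sec - s.tv_sec) * 1e6 +
-                 (e.tv_usec - s.tv_usec)) / n);
-             return (0);
-     }
-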
- Claim 4:
- "lmbench [..] was inconsistent in its statistical treatment of
- the data"
- ...
- "The most-used statistical policy in lmbench is to take the
- minimum of a few repetitions of the measurement"
- Reply:
- Both of these claims are false, as can be seen by a quick inspection
- of the code. The most commonly used timing method (16 of the 19 tests
- use it) is:
- start_timing
- do the test N times
- stop_timing
- report results in terms of duration / N
-
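- In C, that pattern looks like the following sketch (illustrative, not
- the lmbench source); note that nothing here takes a minimum over
- repeated runs:
-
-     #include <stdio.h>
-     #include <unistd.h>
-     #include <sys/time.h>
-
-     int
-     main(void)
-     {
-             struct timeval  s, e;
-             int             i, n = 100000;
-
-             gettimeofday(&s, 0);
-             for (i = 0; i < n; i++)
-                     getppid();      /* the operation under test */
-             gettimeofday(&e, 0);
-             printf("%.3f usec per operation\n",
-                 ((e.tv_sec - s.tv_sec) * 1e6 +
-                 (e.tv_usec - s.tv_usec)) / n);
-             return (0);
-     }
-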
- In fact, the /only/ case where a minimum is used is in the
- context switch test.
- The claim goes on to say that taking the minimum causes
- incorrect results in the case of the context switch test.
- This is another unsupportable claim, one that shows a clear lack
- of understanding of the context switch test. The real issue is
- cache conflicts due to page placement in the cache. Page placement
- is not under our control; it is under the control of the
- operating system. We did not, and do not, subscribe to the theory
- that one should use better ``statistical methods'' to eliminate
- the variance in the context switch benchmark. The variance is
- what actually happened, and it is what happens to real applications.
- The authors also claim "if the virtually-contiguous pages of
- the buffer are randomly assigned to physical addresses, as they
- are in many systems, ... then there is a good probability that
- pages of the buffer will conflict in the cache".
- We agree with the second part but heartily disagree with
- the first. It's true that NetBSD doesn't solve this problem;
- it doesn't follow that others don't. Any vendor-supplied
- operating system that didn't do this on a direct-mapped L2
- cache would suffer dramatically compared to its competition.
- We know for a fact that Solaris, IRIX, and HPUX do this.
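- To make the conflict concrete, here is a sketch (sizes are
- illustrative) of how a page maps to a bin of a direct-mapped cache;
- an operating system that does page coloring chooses physical pages so
- that virtually contiguous pages land in different bins:
-
-     #define PAGE_SIZE   4096
-     #define CACHE_SIZE  (1024 * 1024)   /* direct-mapped L2, example */
-     #define NBINS       (CACHE_SIZE / PAGE_SIZE)
-
-     /*
-      * The bin is determined by the page's physical address, which the
-      * OS picks.  Two pages in the same bin evict each other's lines.
-      */
-     int
-     cache_bin(unsigned long paddr)
-     {
-             return ((paddr / PAGE_SIZE) % NBINS);
-     }
-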
- A final claim is that they produced a modified version of the
- context switch benchmark that does not have the variance of
- the lmbench version. We could not confirm this: we ran that
- benchmark on an SGI MP and saw the same variance as with the
- original benchmark.
- Claim 5:
- "The lmbench bandwidth tests use inconsistent methods of accessing
- memory, making it hard to directly compare the results of, say
- memory read bandwidth with memory write bandwidth, or file reread
- bandwidth with memory copy bandwidth"
- ...
- "On the Alpha processor, memory read bandwidth via array indexing
- is 26% faster than via pointer indirection; the Pentium Pro is
- 67% faster when reading with array indexing, and an unpipelined
- i386 is about 10% slower when writing with pointer indirection"
- Reply:
- Reading that, it would appear that they are suggesting that
- their numbers are up to 67% different from the lmbench numbers.
- We can only assume that this was deliberately misleading.
- Our results are identical to theirs. How can this be?
- . We used array indexing for reads, and so did they.
- They /implied/ that we did it differently, when in fact
- we use exactly the same technique. They get about
- 87MB/sec on reads on a P6; so do we. We challenge
- the authors to demonstrate the implied 67% difference
- between their numbers and ours. In fact, we challenge
- them to demonstrate a 1% difference.
- . We use pointers for writes exactly because we wanted
- comparable numbers. The read case is a load and
- an integer add per word. If we used array indexing
- for the stores, it would be only a store per word.
- On older systems, the stores can appear to go faster
- because the load/add is slower than a single store.
- While the authors did their best to confuse the issue, the
- results speak for themselves. We coded up the write benchmark
- our way and their way (both loops are sketched below the table).
- Results for an Intel P6, in MB/sec:
-             pointer   array   difference
- L1 $            587     710       18%
- L2 $            414     398        4%
- memory           53      53        0%
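- The two write loops, as sketches (not the exact lmbench source):
-
-     /* pointer form: a store plus a pointer add per word, matching
-      * the load-plus-add cost of the read loop */
-     void
-     wr_pointer(int *p, int *end)
-     {
-             while (p < end)
-                     *p++ = 1;
-     }
-
-     /* array form; per the text above, effectively a store per word */
-     void
-     wr_array(int *a, int n)
-     {
-             int     i;
-
-             for (i = 0; i < n; i++)
-                     a[i] = 1;
-     }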
-
- Claim 5a:
- The harmonic mean stuff.
- Reply:
- They just don't understand modern architectures. The harmonic mean
- theory is fine if and only if the processor can't do two things at
- once. Many modern processors can indeed do more than one thing at
- once; the concept is known as superscalar, and it can and does
- include the load/store units. If the processor supports both
- outstanding loads and outstanding stores, the harmonic mean theory
- fails.
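- A worked example with assumed numbers makes the point:
-
-     /*
-      * If a copy's reads and writes cannot overlap, the time to move N
-      * bytes is N/rd + N/wr, which is where a harmonic-mean style
-      * prediction comes from.  With overlapping loads and stores the
-      * time approaches max(N/rd, N/wr), and the prediction comes out
-      * far too low.  rd and wr below are assumed figures, in MB/sec.
-      */
-     #include <stdio.h>
-
-     int
-     main(void)
-     {
-             double  rd = 100.0, wr = 100.0;
-             double  serial = 1.0 / (1.0 / rd + 1.0 / wr);   /* 50 */
-             double  overlapped = rd < wr ? rd : wr;         /* 100 */
-
-             printf("copy: %.0f MB/sec serialized, up to %.0f MB/sec "
-                 "with overlapped loads and stores\n", serial, overlapped);
-             return (0);
-     }
-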
- Claim 6:
- "we modified the memory copy bandwidth to use the same size
- data types as the memory read and write benchmark (which use the
- machine's native word size); originally, on 32-bit machines, the
- copy benchmark used 64-bit types whereas the memory read/write
- bandwidth tests used 32-bit types"
- Reply:
- The change was to use 32 bit types for bcopy. On even relatively
- modern systems, such as a 586, this change has no impact - the
- benchmark is bound by the memory subsystem. On older systems, the
- use of multiple load/store instructions, as required for the smaller
- types, resulted in lower results than the memory system could produce.
- The extra processor cycles required actually slow down the results.
- This is still true today for in-cache numbers. For example, an R10K
- shows L1 cache bandwidths of 750MB/sec vs 377MB/sec with 64 bit
- vs 32 bit loads. It was our intention to show the larger number, and
- that requires the larger types.
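- A sketch of the two copy inner loops (modern fixed-width types for
- clarity, not the original source); the 64-bit version issues half as
- many loads and stores per byte, which is what the in-cache numbers
- above reflect:
-
-     #include <stdint.h>
-     #include <stddef.h>
-
-     void
-     copy64(uint64_t *dst, uint64_t *src, size_t nbytes)
-     {
-             uint64_t        *end = src + nbytes / sizeof(uint64_t);
-
-             while (src < end)
-                     *dst++ = *src++;  /* one load + store per 8 bytes */
-     }
-
-     void
-     copy32(uint32_t *dst, uint32_t *src, size_t nbytes)
-     {
-             uint32_t        *end = src + nbytes / sizeof(uint32_t);
-
-             while (src < end)
-                     *dst++ = *src++;  /* one load + store per 4 bytes */
-     }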
-
- Perhaps because the authors have not ported their benchmark to
- non-Intel platforms, they have not noticed this. The Intel
- platform does not have native 64 bit types, so it does two
- loads/stores for what C says is a 64 bit type. Just because it
- makes no difference on Intel does not mean it makes no difference
- elsewhere.