hbench-REBUTTAL

In June of 1997, Margo Seltzer and Aaron Brown published a paper in
Sigmetrics called "Operating System Benchmarking in the Wake of Lmbench:
A Case Study of the Performance of NetBSD on the Intel x86 Architecture".
This paper claims to have found flaws in the original lmbench work.
With the exception of one bug, which we have of course fixed, we find
the claims inaccurate, misleading, and petty. We don't understand
what appears to be a pointless attack on something that has obviously
helped many researchers and industry people alike. lmbench was warmly
received and is widely used and referenced. We stand firmly behind the
work and results of the original benchmark. We continue to improve and
extend the benchmark. Our focus continues to be on providing a useful,
accurate, portable benchmark suite that is widely used. As always, we
welcome constructive feedback.
To ease the concerns of gentle benchmarkers around the world, we have
spent at least 4 weeks reverifying the results. We modified lmbench to
eliminate any effects of

    . clock resolution
    . loop overhead
    . timing interface overhead

Our prediction was that this would not make any difference, and our
prediction was correct. All of the results reported in lmbench 1.x are
valid except the file reread benchmark, which may be 20% optimistic on
some platforms.
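For the curious, the calibration involved is of the following general
form (a sketch of the technique only, not lmbench source; getpid() here
is just a stand-in for whatever operation is being measured):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>

    static double now(void)        /* wall clock, in seconds */
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        enum { N = 100000 };
        volatile int i;
        double t0, timer, loop, work;

        t0 = now();                             /* timing interface overhead */
        for (i = 0; i < N; ++i) (void)now();
        timer = (now() - t0) / N;

        t0 = now();                             /* loop overhead */
        for (i = 0; i < N; ++i) ;
        loop = (now() - t0) / N;

        t0 = now();                             /* the operation under test */
        for (i = 0; i < N; ++i) (void)getpid();
        work = (now() - t0) / N;

        printf("timer %.0f ns, loop %.0f ns, op %.0f ns net\n",
            timer * 1e9, loop * 1e9, (work - loop) * 1e9);
        return 0;
    }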
We've spent a great deal of time and energy, for free, at the expense
of our full-time jobs, to address the issues raised by hbench. We feel
that we were needlessly forced into a lose/lose situation of arguing
with a fellow researcher. We intend no disrespect towards their work,
but did not feel that it was appropriate for what we see as incorrect
and misleading claims to go unanswered.

We wish to move on to the more interesting and fruitful work of
extending lmbench in substantial ways.

Larry McVoy & Carl Staelin, June 1997
--------------------------------------------------------------------------

Detailed responses to their claims:
Claim 1:
    "it did not have the statistical rigor and self-consistency
    needed for detailed architectural studies"

Reply:
    This is an unsubstantiated claim. There are no numbers which back
    up this claim.
Claim 2:
    "with a reasonable compiler, the test designed to read and touch
    data from the file system buffer cache never actually touched
    the data"

Reply:
    Yes, this was a bug in lmbench 1.0. It has been fixed.

    On platforms such as a 120 MHz Pentium, we see a change of about
    20% in the results, i.e., without the bug fix it is about 20%
    faster.
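    The fix is of the following general form (a sketch, not the lmbench
    source; function and variable names are ours): the data that is read
    must feed a result the compiler cannot discard, so the loads that
    touch it cannot be optimized away.

        #include <unistd.h>

        int
        reread(int fd, char *buf, int nbytes)
        {
            int *p = (int *)buf;
            int *end = (int *)(buf + nbytes);
            int sum = 0;

            if (read(fd, buf, nbytes) != nbytes)
                return -1;
            while (p < end)
                sum += *p++;        /* actually touch the data */
            return sum;             /* used, so it is not dead code */
        }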
Claim 3:
    This is a multi-part claim:

a) gettimeofday() is too coarse.

Reply:
    The implication is that there are a number of benchmarks in
    lmbench that finish in less time than the clock resolution,
    with correspondingly incorrect results. There is exactly one
    benchmark, TCP connection latency, where this is true and that
    is by design, not by mistake. All other tests run long enough
    to overcome 10ms clocks (most modern clocks are microsecond
    resolution).
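    The clock resolution itself is easy to check; a sketch (ours, not
    lmbench source) that spins on gettimeofday() and reports the
    smallest increment it ever observes:

        #include <stdio.h>
        #include <sys/time.h>

        int
        main(void)
        {
            struct timeval a, b;
            long delta, min = 1000000;
            int i;

            for (i = 0; i < 1000; ++i) {
                gettimeofday(&a, 0);
                do {
                    gettimeofday(&b, 0);
                    delta = (b.tv_sec - a.tv_sec) * 1000000
                        + (b.tv_usec - a.tv_usec);
                } while (delta == 0);
                if (delta < min)
                    min = delta;
            }
            printf("gettimeofday() resolution <= %ld usec\n", min);
            return 0;
        }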
    Seltzer/Brown point out that lmbench 1.x couldn't accurately
    measure the L1/L2 cache bandwidths. lmbench 1.x didn't attempt
    to report L1/L2 cache bandwidths, so it would seem a little
    unreasonable to imply inaccuracy in something the benchmark
    didn't measure. It's not hard to get this right, by the way; we
    do so handily in lmbench 2.0.
b) TCP connection latency is reported as 0 on the DEC Alpha.

Reply:
    We could have easily run the TCP connection latency benchmark in
    a loop long enough to overcome the clock resolution. We were,
    and are, well aware of the problem on DEC Alpha boxes. We run
    only a few iterations of this benchmark because the benchmark
    causes a large number of sockets to get stuck in TIME_WAIT,
    part of the TCP shutdown protocol. Almost all protocol stacks
    degrade somewhat in performance when there are large numbers of
    old sockets in their queues. We felt that showing the degraded
    performance was not representative of what users would see.
    So we run only a small number (about 1000) of iterations and
    report the result. We would not consider changing the benchmark
    to be the correct answer - DEC needs to fix their clocks if they
    wish to see accurate results for this test.

    We would welcome a portable solution to this problem. Reading
    hardware-specific cycle counters is not portable.
Claim 4:
    "lmbench [..] was inconsistent in its statistical treatment of
    the data"
    ...
    "The most-used statistical policy in lmbench is to take the
    minimum of a few repetitions of the measurement"

Reply:
    Both of these claims are false, as can be seen by a quick inspection
    of the code. The most commonly used timing method (16/19 tests
    use this) is

        start_timing
        do the test N times
        stop_timing
        report results in terms of duration / N
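    Concretely, that pattern is just the following (a sketch, not
    lmbench source itself; getpid() stands in for whatever is being
    measured):

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/time.h>

        int
        main(void)
        {
            enum { N = 100000 };
            struct timeval start, stop;
            double usecs;
            int i;

            gettimeofday(&start, 0);            /* start_timing */
            for (i = 0; i < N; ++i)
                (void)getpid();                 /* do the test N times */
            gettimeofday(&stop, 0);             /* stop_timing */

            usecs = (stop.tv_sec - start.tv_sec) * 1e6
                + (stop.tv_usec - start.tv_usec);
            printf("%.3f usec per operation\n", usecs / N);
            return 0;
        }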
    In fact, the /only/ case where a minimum is used is in the
    context switch test.
    The claim goes on to say that taking the minimum causes
    incorrect results in the case of the context switch test.
    Another unsupportable claim, one that shows a clear lack of
    understanding of the context switch test. The real issue is cache
    conflicts due to page placement in the cache. Page placement is
    not under our control; it is under the control of the
    operating system. We did not, and do not, subscribe to the theory
    that one should use better ``statistical methods'' to eliminate
    the variance in the context switch benchmark. The variance is
    what actually happened and happens to real applications.

    The authors also claim "if the virtually-contiguous pages of
    the buffer are randomly assigned to physical addresses, as they
    are in many systems, ... then there is a good probability that
    pages of the buffer will conflict in the cache".

    We agree with the second part but heartily disagree with
    the first. It's true that NetBSD doesn't solve this problem.
    It doesn't follow that others don't. Any vendor-supplied
    operating system that didn't do this on a direct-mapped L2
    cache would suffer dramatically compared to its competition.
    We know for a fact that Solaris, IRIX, and HPUX do this.

    A final claim is that they produced a modified version of the
    context switch benchmark that does not have the variance of
    the lmbench version. We could not reproduce this. We ran that
    benchmark on an SGI MP and saw the same variance as the original
    benchmark.
Claim 5:
    "The lmbench bandwidth tests use inconsistent methods of accessing
    memory, making it hard to directly compare the results of, say
    memory read bandwidth with memory write bandwidth, or file reread
    bandwidth with memory copy bandwidth"
    ...
    "On the Alpha processor, memory read bandwidth via array indexing
    is 26% faster than via pointer indirection; the Pentium Pro is
    67% faster when reading with array indexing, and an unpipelined
    i386 is about 10% slower when writing with pointer indirection"

Reply:
    In reading that, it would appear that they are suggesting that
    their numbers are up to 67% different from the lmbench numbers.
    We can only assume that this was deliberately misleading.

    Our results are identical to theirs. How can this be?
    . We used array indexing for reads; so did they.
      They /implied/ that we did it differently, when in fact
      we use exactly the same technique. They get about
      87MB/sec on reads on a P6, so do we. We challenge
      the authors to demonstrate the implied 67% difference
      between their numbers and ours. In fact, we challenge
      them to demonstrate a 1% difference.

    . We use pointers for writes exactly because we wanted
      comparable numbers. The read case is a load and
      an integer add per word. If we used array indexing
      for the stores, it would be only a store per word.
      On older systems, the stores can appear to go faster
      because the load/add is slower than a single store.
    While the authors did their best to confuse the issue, the
    results speak for themselves. We coded up the write benchmark
    our way and their way. Results for an Intel P6, in MB/sec:

                   pointer    array    difference
        L1 $          587      710        18%
        L2 $          414      398         4%
        memory         53       53         0%
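    For reference, the two inner loops in question look roughly like
    this (a simplified sketch; the real benchmarks unroll these loops,
    which is omitted here, and the names are ours):

        int sum;                        /* global, so the reads stay live */

        void
        rd(int *buf, int nwords)        /* read: load + integer add per word */
        {
            int i, s = 0;

            for (i = 0; i < nwords; ++i)
                s += buf[i];            /* array indexing */
            sum = s;
        }

        void
        wr(int *buf, int nwords)        /* write: one store per word */
        {
            int *p = buf;
            int *end = buf + nwords;

            while (p < end)
                *p++ = 1;               /* pointer indirection */
        }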
Claim 5a:
    The harmonic mean stuff.

Reply:
    They just don't understand modern architectures. The harmonic mean
    theory is fine if and only if the processor can't do two things at
    once. Many modern processors can indeed do more than one thing at
    once; the concept is known as superscalar, and it can and does
    include the load/store units. If the processor supports both
    outstanding loads and outstanding stores, the harmonic mean theory
    fails.
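    To spell out the arithmetic (our reading of the model, with made-up
    round numbers): if read bandwidth is R and write bandwidth is W, the
    serialized model behind the harmonic mean says copying n bytes takes
    n/R + n/W seconds, because every load must finish before the matching
    store can issue. With R = W = 100 MB/sec that predicts 50 MB/sec of
    data copied per second. A superscalar processor that overlaps
    outstanding loads with outstanding stores can approach 100 MB/sec,
    twice the model's prediction, which is why the theory fails on such
    machines.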
Claim 6:
    "we modified the memory copy bandwidth to use the same size
    data types as the memory read and write benchmark (which use the
    machine's native word size); originally, on 32-bit machines, the
    copy benchmark used 64-bit types whereas the memory read/write
    bandwidth tests used 32-bit types"

Reply:
    The change was to use 32-bit types for bcopy. On even relatively
    modern systems, such as a 586, this change has no impact - the
    benchmark is bound by the memory subsystem. On older systems, the
    use of multiple load/store instructions, as required for the smaller
    types, resulted in lower results than the memory system could
    produce. The extra processor cycles required actually slow down the
    results. This is still true today for in-cache numbers. For example,
    an R10K shows L1 cache bandwidths of 750MB/sec and 377MB/sec with
    64-bit and 32-bit loads, respectively. It was our intention to show
    the larger number, and that requires the larger types.

    Perhaps because the authors have not ported their benchmark to
    non-Intel platforms, they have not noticed this. The Intel
    platform does not have native 64-bit types, so it does two
    load/stores for what C says is a 64-bit type. Just because it
    makes no difference on Intel does not mean it makes no difference
    elsewhere.
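    For reference, the difference is between copy inner loops of the
    following form (a sketch with modern C type names, not either
    benchmark's source):

        #include <stdint.h>
        #include <stddef.h>

        void
        copy64(uint64_t *dst, uint64_t *src, size_t nbytes)
        {
            size_t i, n = nbytes / sizeof(uint64_t);

            for (i = 0; i < n; ++i)
                dst[i] = src[i];    /* one 64-bit load/store pair; two
                                     * instructions each on 32-bit x86 */
        }

        void
        copy32(uint32_t *dst, uint32_t *src, size_t nbytes)
        {
            size_t i, n = nbytes / sizeof(uint32_t);

            for (i = 0; i < n; ++i)
                dst[i] = src[i];    /* one 32-bit load/store pair */
        }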