  1. .\" $X$ xroff -mgs $file
  2. .\" $tty$ groff -mgs $file | colcrt - | more
  3. .\" $lpr$ groff -mgs $file > ${file}.lpr
  4. .\" Define a page top that looks cool
  5. .de PT
  6. .if \\n%>1 \{\
  7. . sp -.1i
  8. . ps 14
  9. . ft 3
  10. . nr big 24
  11. . nr space \\w'XXX'
  12. . nr titlewid \\w'\\*[title]'
  13. . nr barwid (\\n[LL]-(\\n[titlewid]+(2*\\n[space])))/2
  14. . ds ln \\l'\\n[barwid]u'\\h'-\\n[barwid]u'\v'-.25'
  15. . ds bar \\s(\\n[big]\\*(ln\\*(ln\\*(ln\\*(ln\\*(ln\v'1.25'\\h'\\n[barwid]u'\\s0
  16. . ce 1
  17. \\*[bar]\h'\\n[space]u'\v'-.15'\\*[title]\v'.15'\h'\\n[space]u'\\*[bar]
  18. . ps
  19. . sp -.70
  20. . ps 12
  21. \\l'\\n[LL]u'
  22. . ft
  23. . ps
  24. .\}
  25. ..
  26. .\" Define a page bottom that looks cool
  27. .de BT
  28. . ps 9
  29. \v'-1'\\l'\\n(LLu'
  30. . sp -1
  31. . tl '\(co 1994 \\*[author]'\\*(DY'%'
  32. . ps
  33. ..
  34. .\" Configuration
  35. .VARPS
  36. .nr HM 1.0i
  37. .nr FM 1i
  38. .if t .nr PO .75i
  39. .if t .nr LL 7.0i
  40. .if n .nr PO .25i
  41. .if n .nr LL 7.5i
  42. .nr PS 11
  43. .nr VS \n(PS+2
  44. .ds title Portable Tools for Performance Analysis
  45. .ds author Larry McVoy
  46. .TL
  47. lmbench:
  48. .sp .5
  49. \*[title]
  50. .br
  51. \s8Revision $Revision$ of $Date$\s0
  52. .AU
  53. \*[author]
  54. .AI
  55. .ps -2
  56. lm@sgi.com\**
  57. (415) 390-1804
  58. .ps +2
  59. .AB
A description of a set of benchmarks for measuring system performance.
The benchmarks include latency measurements of basic system operations
such as memory, processes, networking, and disks, and bandwidth measurements
of memory, disks, and networking.
The benchmarks have been run under a wide variety of Unix systems.
The benchmarks are freely distributed under
the GNU General Public License, with the additional restriction
that results may be reported only if the benchmarks are unmodified.
.AE
.sp 2
.if t .2C
.FS
This work was mostly done while the author was an employee of Sun Microsystems
Computer Corporation.
.FE
.NH 1
Introduction
.LP
The purpose of this project is to provide the computer community with tools
for performance analysis of basic operations of their computer systems.
The tools are designed
to be both portable and comparable over a wide set of Unix systems.\**
.FS
The tools have been run on
AIX,
BSDI,
HP-UX,
IRIX,
Linux,
NetBSD,
OSF/1,
Solaris,
and
SunOS by the author.
.FE
The interfaces that the tools use have been carefully chosen to be as portable
and standard as possible. It is an explicit intent of the benchmark to measure
standard interfaces. Users of this benchmark may not report results from
modified versions of the benchmarks.\**
.FS
For example, the context switch benchmark may not use a \f(CWyield()\fP
primitive instead of pipes; the networking benchmarks must use the socket
interfaces, not TLI or some other interface.
.FE
.PP
The purpose of
this document is to describe each of the benchmarks.
.PP
The benchmarks are loosely divided into latency, bandwidth, and ``other''
categories.
.NH 1
Latency measurements
.LP
The latency measurements included in this suite are process creation times
(including address space extension via mmap()),
basic operating system entry cost, context switching, inter process
communication, file system latency,
disk latency (you must be the super user to get
disk latency results), and memory latency.
.PP
Process benchmarks are used to measure the basic process primitives,
such as creating a new process, running a different program, and context
switching. Process creation benchmarks are of particular interest
to distributed systems since many remote operations include the creation
of a remote process to shepherd the remote operation to completion.
Context switching is important for the same reasons.
.PP
Inter process communication latency is important because many operations
are control messages that tell another process (frequently on another
system) to do something. The latency of telling the remote process to
do something is pure overhead and is frequently in the critical path
of important functions, such as distributed databases.\**
.FS
The performance of the TCP latency benchmark has proven to be a good
estimate of the performance of the Oracle database lock manager.
.FE
.PP
The inter process communication latency benchmarks are roughly the same
idea: pass a small message (a byte or so) back and forth between two
processes. The reported results are always the microseconds it takes
to do one round trip. If you are interested in a one way timing, then
about half the round trip is right (however, the CPU cycles tend to be
somewhat asymmetric for a one way trip).
.NH 2
Process forks/exits
.LP
Create a child process which does nothing but
terminate. Results are reported in creations per second.
The benchmark is measuring how fast the OS can create a new address
space and process context.
The child process is spawned via the \f(CBfork\fP() interface,
not the \f(CBvfork\fP() interface.
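.LP
A minimal sketch of the kind of loop being timed follows; it is not the
benchmark source, and the timing machinery and error handling are omitted:
.DS
.ft CB
#include <sys/wait.h>
#include <unistd.h>

void
fork_exit(int n)
{
        int     i;

        for (i = 0; i < n; i++) {
                if (fork() == 0)
                        _exit(0);       /* child does nothing but terminate */
                wait(0);                /* parent reaps the child */
        }
}
.DE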
.NH 2
Simple process creates I
.LP
Create a child process which then runs a new program that does nothing
but print ``hello world'' and exit. The difference between this
benchmark and the previous is the running of a new program. The
time difference between this and the previous benchmark is the cost
of starting a new (simple) program. That cost is especially noticeable
on (some) systems that have shared libraries. Shared libraries can
introduce a substantial (10s of milliseconds) start up cost. This
benchmark is intended to quantify the time/space tradeoff of shared
libraries.
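.LP
Roughly, one iteration of the benchmark looks like the sketch below; the
name of the ``hello world'' program is made up for illustration:
.DS
.ft CB
#include <sys/wait.h>
#include <unistd.h>

void
fork_exec(void)
{
        if (fork() == 0) {
                execlp("hello", "hello", (char *)0);  /* run the new program */
                _exit(1);               /* reached only if the exec fails */
        }
        wait(0);
}
.DE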
.NH 2
Simple process creates II
.LP
Create a child process which runs the same new program except that the
program is started by the system shell. This is a clone of the C
library \f(CBsystem\fP() interface. The intent is to educate users
about the cost of this interface. I have long felt that using the
Bourne shell, especially a dynamically linked Bourne shell, to start up
processes is overkill; perhaps these numbers will convince others of the
same thing. A better choice would be Plan 9's \f(CBrc\fP shell (which
is, by the way, free software).
.NH 2
Memory mapping
.LP
Memory mapping is the process of making a file part of a process' address
space, allowing direct access to the file's pages. It is an alternative
to the traditional read and write interfaces. Memory mapping is extensively
used for linking in shared libraries at run time. This benchmark measures
the speed at which mappings can be created as well as removed. Results
are reported in mappings per second, and the results can be graphed as the
test is run over a series of different sizes.
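.LP
The measured operation is essentially a map/unmap pair, as sketched below
(\f(CBfd\fP is an already opened file and \f(CBlen\fP the mapping size;
error checks are omitted):
.DS
.ft CB
#include <sys/types.h>
#include <sys/mman.h>

void
map_unmap(int fd, size_t len)
{
        void    *p;

        p = mmap(0, len, PROT_READ, MAP_SHARED, fd, 0);  /* create mapping */
        munmap(p, len);                                   /* remove it */
}
.DE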
.NH 2
Context switches
.LP
Measures process context switch time.\** A context switch is defined as
the time it takes to save the state of one process and restore the state
of another process.
Typical context switch benchmarks measure just the minimal context switch
time, i.e., the time to switch between two processes that are doing nothing
but context switching. That approach is misleading because systems may
have multiple active processes and the processes typically have more state
(hot cache lines) than just the code required to force another context
switch. This benchmark takes that into consideration and varies both
the number and the size of the processes.
.FS
A previous version of this benchmark included several system calls
in addition to the context switch, resulting in grossly over inflated
context switch times.
.FE
.PP
The benchmark is a ring of two to twenty processes that are connected
with Unix pipes. A token is passed from process to process, forcing
context switches. The benchmark measures the time it takes to pass
the token two thousand times from process to process. Each hand off
of the token has two costs: (a) the context switch, and (b) the cost
of passing the token. In order to get just the context switching time,
the benchmark first measures the cost of passing the token through a
ring of pipes in a single process. This time is defined as the cost
of passing the token and is not included in the reported context switch
time.
.PP
When the processes are larger than the default baseline of ``zero''
(where zero means just big enough to do the benchmark), the cost
of the context switch includes the cost of restoring user level
state (cache lines). This is accomplished by having the process
allocate an array of data and sum it as a series of integers
after receiving the token but before passing the token to the
next process. Note that the overhead mentioned above includes
the cost of accessing the data but because it is measured in
just one address space, the cost is typically the cost with hot
caches. So the context switch time does not include anything
other than the context switch provided that all the processes
fit in the cache. If there are cache misses (as is common), the
cost of the context switch includes the cost of those cache misses.
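.LP
In outline, each process in the ring runs a loop like the one below, where
\f(CBrd\fP and \f(CBwr\fP are the pipes to its neighbors and
\f(CBsum_data\fP() stands for the summing of the optional array; this is a
sketch of the idea, not the benchmark source:
.DS
.ft CB
#include <unistd.h>

void
ring_member(int rd, int wr, int *data, int size, int passes)
{
        char    token;
        int     i, sum = 0;

        for (i = 0; i < passes; i++) {
                read(rd, &token, 1);    /* wait for the token (context switch) */
                if (size > 0)
                        sum += sum_data(data, size);  /* touch this process' state */
                write(wr, &token, 1);   /* pass the token to the next process */
        }
}
.DE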
.PP
Results for an HP system running at 100 MHz are shown below.
This is a particularly nice system for this benchmark because the
results are quite close to what is expected from a machine with a
256KB cache. As the size and number of processes are both increased,
processes start falling out of the cache, resulting in higher context
switch times.
.LP
.so ctx.pic
.NH 2
Null system calls
.LP
Measures the cost of entering and exiting (without pausing) the
operating system. This is accomplished by repeatedly writing one byte
to \f(CB/dev/null\fP, a pseudo device driver that does nothing but
discard the data. Results are reported as system calls per second.
.PP
It is important to note that the system call chosen actually does the
work on all systems, to the best of my knowledge. There are some
systems that optimize trivial system calls, such as \f(CBgetpid\fP(),
to return the answer without a true entry into the OS proper. Writing
to \f(CB/dev/null\fP has not been optimized.
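.LP
The measured loop amounts to the following sketch (the byte count and the
timing code are omitted):
.DS
.ft CB
#include <fcntl.h>
#include <unistd.h>

void
null_calls(int n)
{
        int     i, fd = open("/dev/null", O_WRONLY);
        char    c = 0;

        for (i = 0; i < n; i++)
                write(fd, &c, 1);       /* one trip into the OS and back */
        close(fd);
}
.DE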
.NH 2
Pipe latency
.LP
This benchmark measures the OS; there is almost no code executed at
user level. The benchmark measures the round trip time of a small message
being passed back and forth between two processes through a pair of
Unix pipes.
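.LP
In outline, the benchmark looks like the sketch below; \f(CBp1\fP and
\f(CBp2\fP are the two pipes and the parent's loop is the timed part:
.DS
.ft CB
/* parent: one timed round trip */
write(p1[1], &c, 1);            /* send a byte to the child */
read(p2[0], &c, 1);             /* wait for it to come back */

/* child: bounce the byte back forever */
for (;;) {
        read(p1[0], &c, 1);
        write(p2[1], &c, 1);
}
.DE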
.NH 2
TCP/IP latency
.LP
This benchmark measures the OS
networking code and the driver code; there is almost no code executed at
user level. The benchmark measures the round trip time of a small message
being passed back and forth between two processes through an AF_INET
socket. Note that both remote and local results may be reported.
.NH 2
UDP/IP latency
.LP
This benchmark measures the OS
networking code and the driver code; there is almost no code executed at
user level. The benchmark measures the round trip time of a small message
being passed back and forth between two processes through an AF_INET socket.
Note that both remote
and local results may be reported.
.LP
It is interesting to note that the TCP performance is sometimes
greater than the UDP performance.
This is contrary to expectations since
the TCP protocol is a reliable, connection oriented protocol, and as such
is expected to carry more overhead.
Why this is so is an exercise left to the
reader.
.NH 2
RPC latency (TCP and UDP)
.LP
Actually two latency benchmarks: Sun RPC over TCP/IP and over UDP/IP.
This benchmark consists of the user level RPC code layered over the TCP
or UDP sockets. The benchmark measures the round trip time of a small
message being passed back and forth between two processes. Note that
both remote and local results may be reported.
.LP
Using the TCP or the UDP benchmarks as a baseline, it
is possible to see how much the RPC code is costing.
.NH 2
TCP/IP connect latency
.LP
This benchmark measures the time it takes to get a TCP/IP socket and
connect it to a remote server.
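.LP
The timed operation is essentially a socket/connect pair, sketched below
(\f(CBaddr\fP is the server's address, filled in beforehand; error
handling is omitted):
.DS
.ft CB
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

void
connect_once(struct sockaddr_in *addr)
{
        int     s = socket(AF_INET, SOCK_STREAM, 0);

        connect(s, (struct sockaddr *)addr, sizeof(*addr));  /* measured */
        close(s);
}
.DE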
.NH 2
File system latency
.LP
A benchmark that measures how fast the file system can do basic, common
operations, such as creates and deletes of small files.
.NH 2
Page fault latency
.LP
A benchmark that measures how fast the file system can pagefault in a
page that is not in memory.
.NH 2
Disk latency
.LP
A benchmark that is designed to measure the overhead of a disk
operation. Results are reported as operations per second.
.PP
The benchmark is designed with SCSI disks in mind. It actually simulates
a large number of disks in the following way. The benchmark reads 512 byte
chunks sequentially from the raw disk device (raw disks are unbuffered
and are not read ahead by Unix). The benchmark ``knows'' that most
disks have read ahead buffers that read ahead the next 32-128 kilobytes.
Furthermore, the benchmark ``knows'' that the disks rotate and read ahead
faster than the processor can request the chunks of data.\**
.FS
This may not always be true - a processor could be fast enough to make the
requests faster than the rotating disk. If we take 3MB/sec to be disk
speed, a fair speed, and divide that by 512, that is 6144 IOs/second, or
163 microseconds per IO. I don't know of any processor/OS/io controller
combinations that can do an
IO in 163 microseconds.
.FE
So the benchmark is basically reading small chunks of data from the
disk's track buffer. Another way to look at this is that the benchmark
is doing memory to memory transfers across a SCSI channel.
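.LP
The heart of the benchmark is a sequential read loop over the raw device,
roughly as sketched below (the device name is made up and the timing code
is omitted):
.DS
.ft CB
#include <fcntl.h>
#include <unistd.h>

void
disk_ops(int n)
{
        int     i, fd = open("/dev/rsd0c", O_RDONLY);   /* raw disk device */
        char    buf[512];

        for (i = 0; i < n; i++)
                read(fd, buf, sizeof(buf));  /* 512 bytes from the track buffer */
        close(fd);
}
.DE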
.PP
No matter how you look at it, the resulting number represents a
\fBlower\fP bound on the overhead of a disk I/O. In point of fact,
the real numbers will be higher on SCSI systems. Most SCSI controllers
will not disconnect if the request can be satisfied immediately; that is
the case here. In practice, the overhead numbers will be higher because
the processor will send the request, disconnect, get interrupted,
reconnect, and transfer.
.PP
It is possible to generate loads of upwards of 500 IOPs on a single
SCSI disk using this technique. It is useful to do that to figure out
how many drives could be supported on a system before there are no
more processor cycles to handle the load. Using this trick, you
do not have to hook up 30 drives, you simulate them.
.NH 2
Memory read latency
.LP
This is perhaps the most interesting benchmark in the suite. The
entire memory hierarchy is measured, including onboard cache latency
and size, external cache latency and size, main memory latency, and TLB
miss latency.
.PP
The benchmark varies two parameters, array size and array stride.
For each size, a list of pointers is created for all of the different
strides. Then the list is walked like so
.DS
.ft CB
mov r0,(r0) # C code: p = *p;
.DE
The time to do about fifty thousand loads (the list wraps) is measured and
reported. The time reported is pure latency time and may be zero even though
the load instruction does not execute in zero time. Zero is defined as one
clock cycle; in other words, the time reported is \fBonly\fP memory latency
time, as it does not include the instruction execution time. It is assumed
that all processors can do a load instruction (not counting stalls) in one
processor cycle. In other words, if the processor cache load time
is 60 nanoseconds on a 20 nanosecond processor, the load latency reported
would be 40 nanoseconds; the missing 20 nanoseconds are for the load instruction
itself. Processors that can manage to get the load address out to the
address pins before the end of the load cycle get some free time in this
benchmark (I don't think any processors can do that).
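.PP
A rough C sketch of the setup and the timed loop is shown below; the stride
is given in units of pointers here, and the real benchmark is more careful
about how the chain covers the array and about loop overhead:
.DS
.ft CB
#define NLOADS  50000

long
walk(char **array, int nptrs, int stride)
{
        char    **p = array;
        int     i;

        for (i = 0; i < nptrs; i++)     /* each element points stride ahead, */
                array[i] = (char *)&array[(i + stride) % nptrs];  /* wrapping */

        for (i = 0; i < NLOADS; i++)    /* the timed part */
                p = (char **)*p;        /* p = *p: one dependent load each time */
        return (p - array);             /* use p so the loop is not removed */
}
.DE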
.PP
Note that this benchmark has been validated by logic analyzer measurements
on an SGI Indy. The
clever reader might realize that the last few nanoseconds of inaccuracy could
be rounded off by realizing that the latency is always going to be a multiple
of the processor clock period.
.PP
The raw data is a series of data sets. Each data set is a stride size,
with array size varied from about one kilobyte up to eight megabytes.
When these data sets are all plotted together (using a log base 2 scale
for the size variable), the data will be seen to contain a series of
horizontal plateaus. The first is the onboard data cache latency (if there
is an onboard cache). The point where the lines start to go up marks the
size of the cache. The second is the external cache, the third is the
main memory, and the last is main memory plus TLB miss cost. In addition
to this information, the cache line size can be derived by noticing which
strides are faster than main memory times. The first stride that is
main memory speed is likely to be the cache line size. The reason is
that the strides that are faster than memory indicate that the benchmark is
getting more than one hit per cache line. Note that prefetching may confuse
you.
.PP
The graph below shows a particularly nicely made machine, a DEC Alpha.
This machine is nice because (a) it shows the latencies and sizes of
the on chip level 1 and motherboard level 2 caches, and (b) because it
has the best all around numbers, especially considering it can support a
4MB level 2 cache. Nice work, DEC.
.so mem.pic
.NH 1
Bandwidth measurements
.LP
One of my former managers\** once noted that ``Unix is Swahili for bcopy().''
I believe that he was indicating his belief that the operating system spent
most of its time moving data from one place to another, via various means.
I tend to agree and have measured the various ways that data can be moved.
The ways that are measured are: through pipes, TCP sockets, library bcopy()
and hand unrolled bcopy(), the read() interface, through the mmap() interface,
and direct memory read and write (no copying).
.FS
Ken Okin
.FE
.NH 2
Pipe bandwidth
.LP
Bandwidth measurement between two local processes communicating through
a Unix pipe. Results are in megabytes per second.
.NH 2
TCP/IP socket bandwidth
.LP
Bandwidth measurement using TCP/IP sockets. Results are reported in megabytes
per second.
Results are reported for local, ethernet, FDDI, and ATM, where possible.
Results range from 1-10+ megabytes per second. Any system delivering
more than 10 MB/second over TCP is doing very well by 1994 standards.
.PP
Note that for local measurements, the system is actually moving
twice as much data, since the data is being moved to/from the same host.
.PP
Local bandwidths are (sometimes) useful for determining the overhead of the
protocol stack (as well as other OS tasks, such as context switching).
Note, however, that some implementations (such as Solaris 2.x) have
``fast pathed'' loopback IP which skews the results. The fast path
uses a larger MTU and does not do checksums.
.PP
The sockets are configured to use the largest receive/send buffers that the OS
will allow. This is done to allow maximum bandwidth. Sun's 4.x TCP/IP
subsystem (and probably BSD's as well) defaults to 4KB send/receive buffers,
which is too small. (It would be better if the OS noted that this was a
high volume / high bandwidth connection and automatically grew the buffers.
Hint, hint.)
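.PP
The buffer sizing is done with \f(CBsetsockopt\fP(), roughly as below;
\f(CBsock\fP is the already created socket and the size shown is only
illustrative, since the benchmark asks for the largest value the OS
will accept:
.DS
.ft CB
#include <sys/types.h>
#include <sys/socket.h>

int     size = 64 * 1024;       /* illustrative value */

setsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char *)&size, sizeof(size));
setsockopt(sock, SOL_SOCKET, SO_RCVBUF, (char *)&size, sizeof(size));
.DE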
.NH 2
bcopy bandwidths
.LP
A simple benchmark that measures how fast data can be copied. A hand
unrolled version and the C library version are tested. Results are
reported in megabytes per second. Note that a typical system is actually
moving about three times as much memory as the reported result. A copy
is actually a read, a write which causes a cache line read, and a write
back.
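.LP
The hand unrolled copy is along the lines of the sketch below; the actual
unrolling factor and word size may differ:
.DS
.ft CB
void
unrolled_copy(int *dst, int *src, int nwords)   /* nwords a multiple of 8 */
{
        int     i;

        for (i = 0; i < nwords; i += 8) {
                dst[i+0] = src[i+0];    dst[i+1] = src[i+1];
                dst[i+2] = src[i+2];    dst[i+3] = src[i+3];
                dst[i+4] = src[i+4];    dst[i+5] = src[i+5];
                dst[i+6] = src[i+6];    dst[i+7] = src[i+7];
        }
}
.DE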
.NH 2
Read bandwidth
.LP
Most VM systems cache file pages for reuse. This benchmark measures the
speed at which those pages can be reused. It is important to notice
that this is not a disk read measurement; it is a memory read measurement.
Results are reported in megabytes per second.
.NH 2
Mmap read bandwidth
.LP
The same measurement as the previous benchmark except that it maps the
file, avoiding the copy from kernel to user buffer.
Results are reported in megabytes per second.
.NH 2
Memory read bandwidth
.LP
A large array is repeatedly read sequentially.
Results reported in megabytes per second.
.NH 2
Memory write bandwidth
.LP
A large array is repeatedly written sequentially.
Results reported in megabytes per second.
.NH 1
Other measurements
.LP
.NH 2
Processor cycle time
mhz
.LP
Calculates the clock rate and cycle time of the processor. This is the
standard loop in which a series of interlocked operations are timed,
and then the megahertz is derived from the timing. The operations
are purposefully interlocked to defeat the superscalar features of the
system under test.
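.PP
The flavor of the timing loop is sketched below; each operation depends on
the previous one, so a superscalar machine cannot overlap them. The
instruction mix shown (adds) and the \f(CBuse\fP() call are illustrative
only:
.DS
.ft CB
void
spin(int n)
{
        register int    i, a = 1;

        for (i = 0; i < n; i++) {
                a += a;         /* each add depends on the one before it; */
                a += a;         /* time the loop and divide by the number */
                a += a;         /* of operations to get the cycle time */
                a += a;
        }
        use(a);                 /* keep the compiler from removing the loop */
}
.DE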
.PP
There are actually three versions of mhz, a generic one that works on
most systems, and two specific versions for SuperSPARC and rs6000
systems.
.PP
It turns out that the
SuperSPARC processor has two ALU's that are run at twice the clock rate,
allowing two interlocked operations to complete in one processor clock.\**
.FS
Credit and thanks to John Mashey of SGI/MIPS fame, who kindly took the
time to figure out why the benchmark wasn't working on SuperSPARC
systems. He explained the SuperSPARC pipeline and the solution to the
problem.
.FE
Fortunately, the ALU's are asymmetric and can not do two shifts in
one processor clock. Shifts are used on SuperSPARC systems.
.PP
IBM rs6000 systems have a C compiler that does not honor the
``register'' directive in unoptimized code. The IBM loop looks
like it is doing half as many instructions as the others. This
is on purpose; each add on the IBM is actually two instructions
(I think it is a load/add/store or something like that).
.NH 1
Acknowledgments
.LP
I would like to acknowledge Sun Microsystems for supporting the development
of this project. In particular, my personal thanks to Paul Borrill,
Director of the Architecture and Performance group, for conceiving and
supporting the development of these benchmarks.
.PP
My thanks to John Mashey and Neal Nuckolls of Silicon Graphics for reviews,
comments, and explanations of the more obscure problems.
.PP
My thanks to Satya Nishtala of Sun Microsystems for (a) listening to me
complain about memory latencies over and over, (b) doing something about
it in future SPARC systems, and (c) reviewing the memory latency results
and explaining IBM's sub-blocking scheme (I still don't really understand
it but he does. Ask him).
.NH 1
Obtaining the benchmarks
.LP
The benchmarks will be posted to the Usenet comp.benchmarks group. In addition,
mail sent to \f(CBarchives@slovax.engr.sgi.com\fP with a request for
\f(CBlmbench.shar\fP
sources will get the latest and greatest.