- .\" $X$ xroff -mgs $file
- .\" $tty$ groff -mgs $file | colcrt - | more
- .\" $lpr$ groff -mgs $file > ${file}.lpr
- .\" Define a page top that looks cool
- .de PT
- .if \\n%>1 \{\
- . sp -.1i
- . ps 14
- . ft 3
- . nr big 24
- . nr space \\w'XXX'
- . nr titlewid \\w'\\*[title]'
- . nr barwid (\\n[LL]-(\\n[titlewid]+(2*\\n[space])))/2
- . ds ln \\l'\\n[barwid]u'\\h'-\\n[barwid]u'\v'-.25'
- . ds bar \\s(\\n[big]\\*(ln\\*(ln\\*(ln\\*(ln\\*(ln\v'1.25'\\h'\\n[barwid]u'\\s0
- . ce 1
- \\*[bar]\h'\\n[space]u'\v'-.15'\\*[title]\v'.15'\h'\\n[space]u'\\*[bar]
- . ps
- . sp -.70
- . ps 12
- \\l'\\n[LL]u'
- . ft
- . ps
- .\}
- ..
- .\" Define a page bottom that looks cool
- .de BT
- . ps 9
- \v'-1'\\l'\\n(LLu'
- . sp -1
- . tl '\(co 1994 \\*[author]'\\*(DY'%'
- . ps
- ..
- .\" Configuration
- .VARPS
- .nr HM 1.0i
- .nr FM 1i
- .if t .nr PO .75i
- .if t .nr LL 7.0i
- .if n .nr PO .25i
- .if n .nr LL 7.5i
- .nr PS 11
- .nr VS \n(PS+2
.ds title Portable Tools for Performance Analysis
.ds author Larry McVoy
.TL
lmbench:
.sp .5
\*[title]
.br
\s8Revision $Revision$ of $Date$\s0
.AU
\*[author]
.AI
.ps -2
lm@sgi.com\**
(415) 390-1804
.ps +2
.AB
A description of a set of benchmarks for measuring system performance.
The benchmarks include latency measurements of basic system operations
such as memory, processes, networking, and disks, and bandwidth measurements
of memory, disks, and networking.
The benchmarks have been run under a wide variety of Unix systems.
The benchmarks are freely distributed under
the GNU General Public License, with the additional restriction
that results may be reported only if the benchmarks are unmodified.
.AE
.sp 2
.if t .2C
.FS
This work was mostly done while the author was an employee of Sun Microsystems
Computer Corporation.
.FE
.NH 1
Introduction
.LP
The purpose of this project is to provide the computer community with tools
for performance analysis of basic operations of their computer systems.
The tools are designed
to be both portable and comparable over a wide set of Unix systems.\**
.FS
The tools have been run on
AIX,
BSDI,
HP-UX,
IRIX,
Linux,
NetBSD,
OSF/1,
Solaris,
and
SunOS by the author.
.FE
The interfaces that the tools use have been carefully chosen to be as portable
and standard as possible. It is an explicit intent of the benchmark to measure
standard interfaces. Users of this benchmark may not report results from
modified versions of the benchmarks.\**
.FS
For example, the context switch benchmark may not use a \f(CWyield()\fP
primitive instead of pipes; the networking benchmarks must use the socket
interfaces, not TLI or some other interface.
.FE
.PP
The purpose of
this document is to describe each of the benchmarks.
.PP
The benchmarks are loosely divided into latency, bandwidth, and ``other''
categories.
.NH 1
Latency measurements
.LP
The latency measurements included in this suite are process creation times
(including address space extension via mmap()),
basic operating system entry cost, context switching, interprocess
communication, file system latency,
disk latency (you must be the super user to get
disk latency results), and memory latency.
.PP
Process benchmarks are used to measure the basic process primitives,
such as creating a new process, running a different program, and context
switching. Process creation benchmarks are of particular interest
to distributed systems since many remote operations include the creation
of a remote process to shepherd the remote operation to completion.
Context switching is important for the same reasons.
.PP
Interprocess communication latency is important because many operations
are control messages that tell another process (frequently on another
system) to do something. The latency of telling the remote process to
do something is pure overhead and is frequently in the critical path
of important functions, such as distributed databases.\**
.FS
The performance of the TCP latency benchmark has proven to be a good
estimate of the performance of the Oracle database lock manager.
.FE
.PP
The interprocess communication latency benchmarks share the same
idea: pass a small message (a byte or so) back and forth between two
processes. The reported result is always the number of microseconds it
takes to do one round trip. If you are interested in one-way latency,
then roughly half the round-trip time is a reasonable estimate (note,
however, that the CPU costs of the two directions tend to be somewhat
asymmetric).
.NH 2
Process forks/exits
.LP
Create a child process which does nothing but
terminate. Results are reported in creations per second.
The benchmark is measuring how fast the OS can create a new address
space and process context.
The child process is spawned via the \f(CBfork\fP() interface,
not the \f(CBvfork\fP() interface.
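.PP
To make the mechanics concrete, here is a minimal sketch of such a timing
loop (this is not the benchmark's actual source; the iteration count and
the use of \f(CBgettimeofday\fP() are arbitrary choices):
.DS
.ft CB
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    int i, n = 1000;            /* arbitrary iteration count */
    struct timeval a, b;
    double secs;

    gettimeofday(&a, 0);
    for (i = 0; i < n; i++) {
        if (fork() == 0)
            _exit(0);           /* child: terminate immediately */
        wait(0);                /* parent: reap the child */
    }
    gettimeofday(&b, 0);
    secs = (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    printf("%.0f creations/sec\en", n / secs);
    return (0);
}
.ft
.DE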
.NH 2
Simple process creates I
.LP
Create a child process which then runs a new program that does nothing
but print ``hello world'' and exit. The difference between this
benchmark and the previous one is the running of a new program. The
time difference between this and the previous benchmark is the cost
of starting a new (simple) program. That cost is especially noticeable
on (some) systems that have shared libraries. Shared libraries can
introduce a substantial (tens of milliseconds) start-up cost. This
benchmark is intended to quantify the time/space tradeoff of shared
libraries.
.NH 2
Simple process creates II
.LP
Create a child process which runs the same new program, except that the
program is started by the system shell. This is a clone of the C
library \f(CBsystem\fP() interface. The intent is to educate users
about the cost of this interface. I have long felt that using the
Bourne shell, especially a dynamically linked Bourne shell, to start up
processes is overkill; perhaps these numbers will convince others of the
same thing. A better choice would be Plan 9's \f(CBrc\fP shell (which
is, by the way, free software).
.NH 2
Memory mapping
.LP
Memory mapping is the process of making a file part of a process' address
space, allowing direct access to the file's pages. It is an alternative
to the traditional read and write interfaces. Memory mapping is extensively
used for linking in shared libraries at run time. This benchmark measures
the speed at which mappings can be created as well as removed. Results
are reported in mappings per second, and the results can be graphed as the
test is run over a series of different sizes.
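.PP
In outline, the timed portion of the test looks like the following sketch
(the 1MB size and the command-line file are assumptions; the timing code,
which brackets the loop as in the earlier fork sketch, is elided):
.DS
.ft CB
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>

int
main(int ac, char **av)
{
    int i, n = 1000, fd;
    size_t size = 1024 * 1024;  /* assumes the file is at least 1MB */
    void *p;

    fd = open(av[1], O_RDONLY); /* file named on the command line */
    if (fd < 0)
        exit(1);
    for (i = 0; i < n; i++) {   /* timed region: map, then unmap */
        p = mmap(0, size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            exit(1);
        munmap(p, size);
    }
    return (0);
}
.ft
.DE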
.NH 2
Context switches
.LP
Measures process context switch time.\** A context switch is defined as
the time it takes to save the state of one process and restore the state
of another process.
Typical context switch benchmarks measure just the minimal context switch
time, i.e., the time to switch between two processes that are doing nothing
but context switching. That approach is misleading because systems may
have multiple active processes and the processes typically have more state
(hot cache lines) than just the code required to force another context
switch. This benchmark takes that into consideration and varies both
the number and the size of the processes.
.FS
A previous version of this benchmark included several system calls
in addition to the context switch, resulting in grossly inflated
context switch times.
.FE
.PP
The benchmark is a ring of two to twenty processes that are connected
with Unix pipes. A token is passed from process to process, forcing
context switches. The benchmark measures the time it takes to pass
the token two thousand times from process to process. Each handoff
of the token has two costs: (a) the context switch, and (b) the cost
of passing the token. In order to get just the context switching time,
the benchmark first measures the cost of passing the token through a
ring of pipes in a single process. This time is defined as the cost
of passing the token and is not included in the reported context switch
time.
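.PP
In outline, a two-process version of the ring looks like the sketch
below; the real benchmark generalizes this to a ring of up to twenty
processes and, as described above, subtracts the token-passing cost
measured within a single process:
.DS
.ft CB
#include <unistd.h>

int
main(void)
{
    int p1[2], p2[2], i;        /* two pipes close a two-process ring */
    char token = 0;

    pipe(p1);
    pipe(p2);
    if (fork() == 0) {
        for (;;) {              /* child: bounce the token back */
            read(p1[0], &token, 1);
            write(p2[1], &token, 1);
        }
    }
    for (i = 0; i < 2000; i++) {    /* parent: 2000 timed round trips */
        write(p1[1], &token, 1);
        read(p2[0], &token, 1);
    }
    return (0);
}
.ft
.DE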
.PP
When the processes are larger than the default baseline of ``zero''
(where zero means just big enough to do the benchmark), the cost
of the context switch includes the cost of restoring user level
state (cache lines). This is accomplished by having each process
allocate an array of data and sum it as a series of integers
after receiving the token but before passing the token to the
next process. Note that the overhead mentioned above includes
the cost of accessing the data, but because it is measured in
just one address space, it is typically the cost with hot
caches. So the reported context switch time includes nothing
other than the context switch, provided that all the processes
fit in the cache. If there are cache misses (as is common), the
cost of the context switch includes the cost of those cache misses.
.PP
Results for an HP system running at 100 MHz are shown below.
This is a particularly nice system for this benchmark because the
results are quite close to what is expected from a machine with a
256KB cache. As the size and number of processes are both increased,
processes start falling out of the cache, resulting in higher context
switch times.
.LP
.so ctx.pic
.NH 2
Null system calls
.LP
Measures the cost of entering and exiting (without pausing) the
operating system. This is accomplished by repeatedly writing one byte
to \f(CB/dev/null\fP, a pseudo device driver that does nothing but
discard the data. Results are reported as system calls per second.
.PP
It is important to note that the system call chosen actually does the
work on all systems, to the best of my knowledge. There are some
systems that optimize trivial system calls, such as \f(CBgetpid\fP(),
to return the answer without a true entry into the OS proper. Writing
to \f(CB/dev/null\fP has not been optimized.
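.PP
The inner loop amounts to the following sketch (the iteration count is
arbitrary and the timer code around the loop is elided):
.DS
.ft CB
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    char c = 0;
    int i, n = 100000, fd;

    fd = open("/dev/null", O_WRONLY);
    for (i = 0; i < n; i++)     /* timed region */
        write(fd, &c, 1);       /* one byte; the driver discards it */
    return (0);
}
.ft
.DE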
.NH 2
Pipe latency
.LP
This benchmark measures the OS; there is almost no code executed at
user level. The benchmark measures the round trip time of a small message
being passed back and forth between two processes through a pair of
Unix pipes.
.NH 2
TCP/IP latency
.LP
This benchmark measures the OS
networking code and the driver code; there is almost no code executed at
user level. The benchmark measures the round trip time of a small message
being passed back and forth between two processes through an AF_INET
socket. Note that both remote and local results may be reported.
.NH 2
UDP/IP latency
.LP
This benchmark measures the OS
networking code and the driver code; there is almost no code executed at
user level. The benchmark measures the round trip time of a small message
being passed back and forth between two processes through an AF_INET socket.
Note that both remote
and local results may be reported.
.LP
It is interesting to note that the TCP performance is sometimes
greater than the UDP performance.
This is contrary to expectations, since
the TCP protocol is a reliable, connection-oriented protocol and as such
is expected to carry more overhead.
Why this is so is left as an exercise for the
reader.
.NH 2
RPC latency (TCP and UDP)
.LP
These are actually two latency benchmarks: Sun RPC over TCP/IP and over
UDP/IP.
Each benchmark consists of the user level RPC code layered over the TCP
or UDP sockets. The benchmark measures the round trip time of a small
message being passed back and forth between two processes. Note that
both remote and local results may be reported.
.LP
Using the TCP or the UDP benchmarks as a baseline, it
is possible to see how much the RPC code costs.
.NH 2
TCP/IP connect latency
.LP
This benchmark measures the time it takes to create a TCP/IP socket and
connect it to a remote server.
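.PP
The timed operation amounts to the sketch below; filling in the server's
address (\f(CBsin\fP) is elided, and \f(CBtimed_connect\fP is a
hypothetical helper name, not part of the benchmark's interface:
.DS
.ft CB
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* one timed iteration: create a socket, connect it, close it */
int
timed_connect(struct sockaddr_in *sin)
{
    int s;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(s, (struct sockaddr *)sin, sizeof(*sin)) < 0)
        return (-1);
    close(s);
    return (0);
}
.ft
.DE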
.NH 2
File system latency
.LP
A benchmark that measures how fast the file system can do basic, common
operations, such as creates and deletes of small files.
.NH 2
Page fault latency
.LP
A benchmark that measures how fast the file system can fault in a
page that is not in memory.
.NH 2
Disk latency
.LP
A benchmark that is designed to measure the overhead of a disk
operation. Results are reported as operations per second.
.PP
The benchmark is designed with SCSI disks in mind. It actually simulates
a large number of disks in the following way. The benchmark reads 512-byte
chunks sequentially from the raw disk device (raw disks are unbuffered
and are not read ahead by Unix). The benchmark ``knows'' that most
disks have read-ahead buffers that read ahead the next 32-128 kilobytes.
Furthermore, the benchmark ``knows'' that the disks rotate and read ahead
faster than the processor can request the chunks of data.\**
.FS
This may not always be true: a processor could be fast enough to make the
requests faster than the rotating disk. If we take 3MB/sec to be the disk
speed, a fair speed, and divide that by 512, that is 6144 I/Os per second,
or 163 microseconds per I/O. I don't know of any processor/OS/IO-controller
combination that can do an
I/O in 163 microseconds.
.FE
So the benchmark is basically reading small chunks of data from the
disk's track buffer. Another way to look at this is that the benchmark
is doing memory-to-memory transfers across a SCSI channel.
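.PP
The reading loop itself is trivial; the sketch below uses a hypothetical
raw device name, and the timer code is elided:
.DS
.ft CB
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    char buf[512];
    int i, n = 10000, fd;

    /* hypothetical raw device; super user access is required */
    fd = open("/dev/rsd0c", O_RDONLY);
    for (i = 0; i < n; i++)             /* timed region */
        read(fd, buf, sizeof(buf));     /* sequential 512-byte chunks */
    return (0);
}
.ft
.DE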
.PP
No matter how you look at it, the resulting number represents a
\fBlower\fP bound on the overhead of a disk I/O. In point of fact,
the real numbers will be higher on SCSI systems. Most SCSI controllers
will not disconnect if the request can be satisfied immediately, which is
the case here. In practice, the overhead numbers will be higher because
the processor will send the request, disconnect, get interrupted,
reconnect, and transfer.
.PP
It is possible to generate loads upwards of 500 I/Os per second on a single
SCSI disk using this technique. It is useful to do that to figure out
how many drives a system could support before there are no
more processor cycles to handle the load. Using this trick, you
do not have to hook up 30 drives; you simulate them.
.NH 2
Memory read latency
.LP
This is perhaps the most interesting benchmark in the suite. The
entire memory hierarchy is measured, including onboard cache latency
and size, external cache latency and size, main memory latency, and TLB
miss latency.
.PP
The benchmark varies two parameters, array size and array stride.
For each size, a list of pointers is created for all of the different
strides. Then the list is walked like so
.DS
.ft CB
mov r0,(r0) # C code: p = *p;
.DE
The time to do about fifty thousand loads (the list wraps) is measured and
reported. The time reported is pure latency time and may be zero even though
the load instruction does not execute in zero time. Zero is defined as one
clock cycle; in other words, the time reported is \fBonly\fP memory latency
time, as it does not include the instruction execution time. It is assumed
that all processors can do a load instruction (not counting stalls) in one
processor cycle. In other words, if the processor cache load time
is 60 nanoseconds on a 20-nanosecond processor, the load latency reported
would be 40 nanoseconds; the missing 20 nanoseconds is for the load
instruction itself. Processors that can manage to get the load address out
to the address pins before the end of the load cycle get some free time in
this benchmark (I don't think any processors can do that).
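.PP
In C, the construction and the timed walk look roughly like the sketch
below (the size and stride values are placeholders; the real benchmark
also unrolls the walking loop):
.DS
.ft CB
#include <stdlib.h>

int
main(void)
{
    int i, n = 4096, spi = 16;  /* placeholder size/stride (in pointers) */
    char **array = malloc(n * spi * sizeof(char *));
    char **p;

    /* each element points spi elements ahead; the last wraps around */
    for (i = 0; i < (n - 1) * spi; i += spi)
        array[i] = (char *)&array[i + spi];
    array[i] = (char *)&array[0];

    p = array;
    for (i = 0; i < 50000; i++) /* timed walk: one dependent load each */
        p = (char **)*p;
    return ((long)p & 1);       /* use p so the loop is not optimized away */
}
.ft
.DE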
.PP
Note that this benchmark has been validated by logic analyzer measurements
on an SGI Indy. The
clever reader might realize that the last few nanoseconds of inaccuracy
could be rounded off by realizing that the latency is always going to be a
multiple of the processor cycle time.
.PP
The raw data is a series of data sets. Each data set is for one stride size,
with the array size varied from about one kilobyte up to eight megabytes.
When these data sets are all plotted together (using a log base 2 scale
for the size variable), the data will be seen to contain a series of
horizontal plateaus. The first is the onboard data cache latency (if there
is an onboard cache). The point where the lines start to go up marks the
size of the cache. The second plateau is the external cache, the third is
main memory, and the last is main memory plus TLB miss cost. In addition
to this information, the cache line size can be derived by noticing which
strides are faster than main memory times. The first stride that runs at
main memory speed is likely to be the cache line size. The reason is
that the strides that are faster than memory indicate that the benchmark is
getting more than one hit per cache line. Note that prefetching may confuse
you.
.PP
The graph below shows a particularly nicely made machine, a DEC Alpha.
This machine is nice because (a) it shows the latencies and sizes of
the on-chip level 1 and motherboard level 2 caches, and (b) it
has the best all-around numbers, especially considering it can support a
4MB level 2 cache. Nice work, DEC.
.so mem.pic
.NH 1
Bandwidth measurements
.LP
One of my former managers\** once noted that ``Unix is Swahili for bcopy().''
I believe that he was indicating his belief that the operating system spent
most of its time moving data from one place to another, via various means.
I tend to agree and have measured the various ways that data can be moved.
The ways that are measured are: through pipes, through TCP sockets, via the
library bcopy() and a hand-unrolled bcopy(), through the read() interface,
through the mmap() interface,
and via direct memory reads and writes (no copying).
.FS
Ken Okin
.FE
.NH 2
Pipe bandwidth
.LP
Bandwidth measurement between two local processes communicating through
a Unix pipe. Results are reported in megabytes per second.
.NH 2
TCP/IP socket bandwidth
.LP
Bandwidth measurement using TCP/IP sockets. Results are reported in megabytes
per second.
Results are reported for local, Ethernet, FDDI, and ATM, where possible.
Results range from 1 to 10+ megabytes per second. Any system delivering
more than 10 MB/second over TCP is doing very well by 1994 standards.
.PP
Note that for local measurements, the system is actually moving
twice as much data, since the data is being moved to and from the same host.
.PP
Local bandwidths are (sometimes) useful for determining the overhead of the
protocol stack (as well as other OS tasks, such as context switching).
Note, however, that some implementations (such as Solaris 2.x) have
``fast pathed'' loopback IP, which skews the results. The fast path
uses a larger MTU and does not do checksums.
.PP
The sockets are configured to use the largest receive/send buffers that the
OS will allow. This is done to allow maximum bandwidth. Sun's 4.x TCP/IP
subsystem (and probably BSD's as well) defaults to 4KB send/receive buffers,
which is too small. (It would be better if the OS noticed that this was a
high volume, high bandwidth connection and automatically grew the buffers.
Hint, hint.)
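.PP
The buffer configuration amounts to the following sketch; 64KB is an
assumed request, and on systems that cap the value the code would have
to probe downward until \f(CBsetsockopt\fP() succeeds:
.DS
.ft CB
#include <sys/socket.h>

void
grow_buffers(int s)     /* hypothetical helper name */
{
    int size = 64 * 1024;

    setsockopt(s, SOL_SOCKET, SO_SNDBUF, (char *)&size, sizeof(size));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, (char *)&size, sizeof(size));
}
.ft
.DE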
.NH 2
bcopy bandwidths
.LP
A simple benchmark that measures how fast data can be copied. A hand
unrolled version and the C library version are tested. Results are
reported in megabytes per second. Note that a typical system is actually
moving about three times as much memory as the reported result: a copy
is actually a read, plus a write that causes a cache line read, plus a
later write-back of that line.
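.PP
The hand-unrolled copy has the following shape (a sketch only; the
actual unrolling depth and word size may differ, and it assumes aligned
buffers and a length that is a multiple of eight words):
.DS
.ft CB
void
unrolled_copy(int *d, int *s, int nbytes)
{
    while (nbytes > 0) {        /* eight words per iteration */
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
        d += 8; s += 8;
        nbytes -= 8 * sizeof(int);
    }
}
.ft
.DE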
.NH 2
Read bandwidth
.LP
Most VM systems cache file pages for reuse. This benchmark measures the
speed at which those pages can be reused. It is important to notice
that this is not a disk read measurement; it is a memory read measurement.
Results are reported in megabytes per second.
.NH 2
Mmap read bandwidth
.LP
The same measurement as the previous benchmark, except that it maps the
file, avoiding the copy from kernel to user buffer.
Results are reported in megabytes per second.
.NH 2
Memory read bandwidth
.LP
A large array is repeatedly read sequentially.
Results are reported in megabytes per second.
.NH 2
Memory write bandwidth
.LP
A large array is repeatedly written sequentially.
Results are reported in megabytes per second.
.NH 1
Other measurements
.LP
.NH 2
Processor cycle time (mhz)
.LP
Calculates the megahertz and clock speed of the processor. This is the
standard loop in which a series of interlocked operations are timed,
and then the megahertz is derived from the timing. The operations
are purposefully interlocked to defeat any superscalar execution in the
system under test.
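.PP
The generic loop is a chain of operations in which each one needs the
result of the one before it, something like the sketch below. Timing
8*n adds and assuming one add completes per clock gives the clock rate;
that assumption is exactly what breaks down in the SuperSPARC and
RS/6000 cases described next:
.DS
.ft CB
/* run 8*n dependent adds; each add needs the previous result */
int
interlocked_adds(int n)
{
    register int a = 1, b = 2;
    int i;

    for (i = 0; i < n; i++) {
        a += b; b += a; a += b; b += a;
        a += b; b += a; a += b; b += a;
    }
    return (a + b);     /* defeat dead-code elimination */
}
.ft
.DE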
.PP
There are actually three versions of mhz: a generic one that works on
most systems, and two specific versions for SuperSPARC and RS/6000
systems.
.PP
It turns out that the
SuperSPARC processor has two ALUs that run at twice the clock rate,
allowing two interlocked operations to complete in one processor clock.\**
.FS
Credit and thanks to John Mashey of SGI/MIPS fame, who kindly took the
time to figure out why the benchmark wasn't working on SuperSPARC
systems. He explained the SuperSPARC pipeline and the solution to the
problem.
.FE
Fortunately, the ALUs are asymmetric and cannot do two shifts in
one processor clock, so shifts are used on SuperSPARC systems.
.PP
IBM RS/6000 systems have a C compiler that does not honor the
``register'' directive in unoptimized code. The IBM loop looks
like it is doing half as many instructions as the others. This
is on purpose; each add on the IBM is actually two instructions
(I think it is a load/add/store or something like that).
.NH 1
Acknowledgments
.LP
I would like to acknowledge Sun Microsystems for supporting the development
of this project. In particular, my personal thanks to Paul Borrill,
Director of the Architecture and Performance group, for conceiving and
supporting the development of these benchmarks.
.PP
My thanks to John Mashey and Neal Nuckolls of Silicon Graphics for reviews,
comments, and explanations of the more obscure problems.
.PP
My thanks to Satya Nishtala of Sun Microsystems for (a) listening to me
complain about memory latencies over and over, (b) doing something about
it in future SPARC systems, and (c) reviewing the memory latency results
and explaining IBM's sub-blocking scheme (I still don't really understand
it, but he does. Ask him).
.NH 1
Obtaining the benchmarks
.LP
The benchmarks will be posted to the Usenet comp.benchmarks group. In
addition,
mail sent to \f(CBarchives@slovax.engr.sgi.com\fP with a request for
\f(CBlmbench.shar\fP
sources will get the latest and greatest.