.\" $Id$
.TH LAT_MEM_RD 8 "$Date$" "(c)1994 Larry McVoy" "LMBENCH"
.SH NAME
lat_mem_rd \- memory read latency benchmark
.SH SYNOPSIS
.B lat_mem_rd
.I size_in_megabytes
.I stride [stride stride...]
.SH DESCRIPTION
.B lat_mem_rd
measures memory read latency for varying memory sizes and strides. The
results are reported in nanoseconds per load and have been verified
accurate to within a few nanoseconds on an SGI Indy.
.PP
The
entire memory hierarchy is measured, including onboard cache latency
and size, external cache latency and size, main memory latency, and TLB
miss latency.
.PP
Only data accesses are measured; the instruction cache is not measured.
.PP
The benchmark runs as two nested loops. The outer loop iterates over the
stride sizes. The inner loop iterates over the array sizes. For each array
size, the benchmark creates a ring of pointers, each pointing forward one
stride. The array is traversed by
.sp
.ft CB
p = (char **)*p;
.ft
.sp
in a for loop (the overhead of the for loop is not significant; the loop
body is unrolled to 1000 loads). The loop stops after a million loads.
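.PP
The ring construction and traversal described above can be sketched
roughly as follows. This is a minimal illustration, not the benchmark's
actual source: the helper names build_ring and chase are hypothetical,
the loop is not unrolled, and no timing is done. It assumes the stride
is a multiple of the pointer size and that the buffer is suitably aligned
(as memory returned by malloc is).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: link every stride-th slot of buf into a ring.
 * Each slot stores the address of the slot one stride ahead; the last
 * slot points back to the first. Returns the number of slots linked.
 */
size_t build_ring(char *buf, size_t size, size_t stride)
{
    /* Assumption: slots must be able to hold an aligned pointer. */
    assert(stride >= sizeof(char *) && stride % sizeof(char *) == 0);
    assert(size >= stride);

    size_t n = size / stride;
    for (size_t i = 0; i < n; i++) {
        char **slot = (char **)(buf + i * stride);
        *slot = buf + ((i + 1) % n) * stride;
    }
    return n;
}

/*
 * Hypothetical helper: perform `loads` dependent loads around the ring.
 * Each load's address comes from the previous load, so the loads
 * cannot overlap and the time per iteration approximates load latency.
 */
char **chase(char **p, size_t loads)
{
    while (loads-- > 0)
        p = (char **)*p;
    return p;
}
```

Returning the final pointer from chase gives the caller something to use,
which keeps a compiler from discarding the loop as dead code.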
.PP
The size of the array varies from 512 bytes to (typically) eight megabytes.
For the small sizes, the cache will have an effect, and the loads will be
much faster. This becomes much more apparent when the data is plotted.
.SH OUTPUT
The output format is intended as input to \fBxgraph\fP or some similar program
(I use a Perl script that produces pic input).
One set of data is produced for each stride. The data set title
is the stride size and the data points are the array size in megabytes
(floating point value) and the load latency over all points in that array.
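.PP
As an illustration of what one data point might look like, the sketch
below formats an array size in megabytes and a latency in nanoseconds on
a single line. The function name format_point and the precisions (five
decimals for the size, three for the latency) are assumptions, chosen so
that a 1024-byte array prints as 0.00098; the program's real format
string may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "size_mb latency_ns\n" into out.
 * Returns the number of characters written (excluding the NUL),
 * or a negative value if the line did not fit in cap bytes.
 */
int format_point(char *out, size_t cap, double size_mb, double latency_ns)
{
    assert(out != NULL && cap > 0);

    /* Assumed precisions; not taken from the benchmark's source. */
    int n = snprintf(out, cap, "%.5f %.3f\n", size_mb, latency_ns);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```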
.SH "INTERPRETING THE OUTPUT"
The output is best examined as a graph, which typically shows four
plateaus. The graph should be plotted with log base 2 of the
array size on the X axis and the latency on the Y axis. Each stride
is then plotted as a curve. The plateaus that appear correspond to
the onboard cache (if present), external cache (if present), main
memory latency, and TLB miss latency.
.PP
As a rough guide, you may be able to extract the latencies of the
various parts as follows, but you should really look at the graphs,
since these rules of thumb do not always work (some systems do not
have onboard cache, for example).
.IP "onboard cache" 16
Try a stride of 128 and an array size of .00098.
.IP "external cache"
Try a stride of 128 and an array size of .125.
.IP "main memory"
Try a stride of 128 and an array size of 8.
.IP "TLB miss"
Try the largest stride and the largest array.
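.PP
The array sizes in these rules of thumb are in megabytes, so it can help
to convert from bytes and then look up the nearest data point in one
stride's data set. The sketch below assumes 1 megabyte = 2^20 bytes,
which is consistent with an array size of .00098 being about 1 KB and
an array size of .125 being 128 KB. The names struct point, bytes_to_mb,
and latency_near are hypothetical and are not part of lat_mem_rd.

```c
#include <assert.h>
#include <stddef.h>

/* One data point from a single stride's data set. */
struct point {
    double size_mb;     /* array size in megabytes */
    double latency_ns;  /* load latency in nanoseconds */
};

/* Convert a byte count to megabytes, assuming 1 MB = 2^20 bytes. */
double bytes_to_mb(size_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}

/* Return the latency of the point whose size is closest to want_mb. */
double latency_near(const struct point *pts, size_t n, double want_mb)
{
    assert(pts != NULL && n > 0);

    size_t best = 0;
    double best_diff = -1.0;  /* negative means "no candidate yet" */
    for (size_t i = 0; i < n; i++) {
        double diff = pts[i].size_mb - want_mb;
        if (diff < 0)
            diff = -diff;
        if (best_diff < 0 || diff < best_diff) {
            best_diff = diff;
            best = i;
        }
    }
    return pts[best].latency_ns;
}
```

For example, latency_near(pts, n, bytes_to_mb(128 * 1024)) would pick the
point nearest an array size of .125, the external cache rule of thumb.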
.SH BUGS
This program is dependent on the correct operation of
.BR mhz (8).
If you are getting numbers that seem off, check that
.BR mhz (8)
is giving you a clock rate that you believe.
.SH ACKNOWLEDGEMENT
Funding for the development of
this tool was provided by Sun Microsystems Computer Corporation.
.SH "SEE ALSO"
lmbench(8).