  1. $Id$
  2. Time the make.
  3. Make a bw_tcp mode that measures bandwidth for each block and graph that
  4. as offset/bandwidth.
  5. Make the disk stuff autosize such that you get the same number of data
  6. points regardless of disk size.
  7. Make sure that all memory referencing benchmarks reference all the
  8. memory. So no partial references in the BENCH() macro, it has
  9. to call something that touches all of it.
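A minimal sketch of the "touch all of it" helper that the BENCH() macro could call. The names touch_all and sink are hypothetical; sink only stands in for whatever use_result does in the real suite.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for use_result(): storing into a volatile
 * keeps the compiler from discarding the loads as dead code. */
static volatile unsigned long sink;

/* Read every word of buf exactly once and return the sum, so that
 * no part of the region is skipped by the benchmark. */
unsigned long touch_all(const unsigned long *buf, size_t nwords)
{
    unsigned long sum = 0;
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < nwords; i++)
        sum += buf[i];
    sink = sum;
    return sum;
}
```

Because the helper is a real function call, the timed loop in the macro stays trivial and the overhead to subtract is just the call.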
  10. Make all benchmarks use the timing overhead for the loop and all call
  11. use_result. So all are in function calls.
  12. Integrate in the GNU os naming scheme for results & bin.
  13. [done]
  14. Fix the getsummary to include simple calls.
  15. Make a "fast" target that does only 128 byte stride mem latency and
  16. 2 process context switches.
  17. [done]
  18. The loop overhead for bandwidths is too high - make it 128 loads.
  19. [done]
  20. Think about the issues of int/long long/double load/stores. Maybe do them
  21. all.
  22. Make all results print out bandwidths in powers of 10/sizes in powers of two.
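One possible reading of the units rule, as a sketch: bandwidth uses MB = 10^6 bytes, sizes use KB = 2^10 bytes. The function names are hypothetical and not taken from the suite.

```c
#include <assert.h>
#include <stddef.h>

/* Bandwidth in MB/s where MB = 10^6 bytes (power of 10).
 * Bytes per microsecond is numerically the same as 10^6 bytes
 * per second, so no further scaling is needed. */
double bw_mb_per_sec(double bytes, double usecs)
{
    assert(usecs > 0.0);
    return bytes / usecs;
}

/* Size in KB where KB = 2^10 bytes (power of two). */
double size_kb(size_t bytes)
{
    return (double)bytes / 1024.0;
}
```

So an 8388608 byte buffer is reported as 8192 KB, but moving it in one second is about 8.39 MB/s, not 8.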
  23. Make the compiling of the benchmark itself be part of the benchmark.
  24. memsize needs to be a little more forceful about trying for the memory,
  25. and should do gettimeofday() around each load.
  26. [done - run it three times in the script]
  27. Put sleeps in the lat_* client side in a retry loop for the port. This is
  28. in case the other guy hasn't registered yet.
  29. [I think this is OK, I found a bug in lat_rpc]
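A sketch of the retry loop, assuming the client can express "try to reach the port" as a callback that returns >= 0 on success. retry_with_sleep, fake_server and fake_connect are hypothetical names; in a real lat_* client the callback would wrap the connect or port lookup.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Call try_fn(arg) up to max_tries times, sleeping delay_usecs
 * between failed attempts. Returns the first result >= 0, or -1. */
int retry_with_sleep(int (*try_fn)(void *), void *arg,
                     int max_tries, unsigned long delay_usecs)
{
    struct timespec ts;
    int i, r;

    assert(try_fn != NULL && max_tries > 0);
    ts.tv_sec = (time_t)(delay_usecs / 1000000UL);
    ts.tv_nsec = (long)(delay_usecs % 1000000UL) * 1000L;
    for (i = 0; i < max_tries; i++) {
        r = try_fn(arg);
        if (r >= 0)
            return r;
        if (i + 1 < max_tries)
            nanosleep(&ts, NULL);   /* give the server time to register */
    }
    return -1;
}

/* Hypothetical server that only becomes reachable on a given attempt. */
struct fake_server { int ready_after; int calls; };

int fake_connect(void *arg)
{
    struct fake_server *s = arg;

    s->calls++;
    return s->calls >= s->ready_after ? 0 : -1;
}
```

Bounding the number of tries keeps a missing server from hanging the whole run.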
  30. Make a version of memory latency that chases N lists. This is to
  31. measure multiple outstanding load implementations.
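A sketch of chasing N lists at once. Loads within one chain depend on each other, loads in different chains do not, so hardware that supports several outstanding loads can overlap them. build_ring, chase_n and MAX_CHAINS are hypothetical names; a real benchmark would likely unroll a fixed N by hand so the pointers stay in registers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHAINS 8

/* Link len cells of base, spaced stride elements apart, into a ring:
 * each cell holds the address of the next cell. */
void build_ring(void **base, size_t len, size_t stride)
{
    size_t i;

    assert(base != NULL && len > 0 && stride > 0);
    for (i = 0; i < len; i++)
        base[i * stride] = &base[((i + 1) % len) * stride];
}

/* Advance every chain by `steps` loads, interleaving the chains.
 * Final positions are written back to heads[] so the work is used. */
void chase_n(void **heads[], int nchains, size_t steps)
{
    void **p[MAX_CHAINS];
    size_t s;
    int c;

    assert(nchains > 0 && nchains <= MAX_CHAINS);
    for (c = 0; c < nchains; c++)
        p[c] = heads[c];
    for (s = 0; s < steps; s++)
        for (c = 0; c < nchains; c++)
            p[c] = (void **)*p[c];  /* independent of the other chains */
    for (c = 0; c < nchains; c++)
        heads[c] = p[c];
}
```

If time per step stays flat as N grows from 1 to k and then rises, k is a plausible estimate of the number of outstanding loads at that level of the hierarchy.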
  32. Make the lat_mem_rd walk in different strides. Try and make it so that
  33. you flip back and forth between the same cache line. So if you assume
  34. 4 byte pointers and 8 byte cache lines, you do
  35. [ x _ ] [ _ x ] [ x _ ] [ _ x ]
  36. such that the stride switches between 4 & 12 bytes. Make sure this screws
  37. up HP's prefetch.
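A sketch of the flip-flop layout: element k lives in cache line k, in the first slot for even k and the last slot for odd k. With 4 byte pointers and 8 byte lines that gives offsets 0, 12, 16, 28, so the stride alternates between 12 and 4. flip_offset and build_flip_chain are hypothetical names, and whether this defeats any particular prefetcher is untested.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of the k-th pointer: line k, first slot when k is
 * even, last slot when k is odd. */
size_t flip_offset(size_t k, size_t ptr_size, size_t line_size)
{
    assert(ptr_size > 0 && line_size >= 2 * ptr_size);
    return k * line_size + (k % 2) * (line_size - ptr_size);
}

/* Build a circular pointer chain over nlines cache lines in buf.
 * buf must be pointer-aligned and hold nlines * line_size bytes. */
void **build_flip_chain(char *buf, size_t nlines, size_t line_size)
{
    size_t k;

    assert(buf != NULL && nlines > 0);
    assert(line_size % sizeof(void *) == 0);
    for (k = 0; k < nlines; k++) {
        size_t here = flip_offset(k, sizeof(void *), line_size);
        size_t next = flip_offset((k + 1) % nlines, sizeof(void *), line_size);

        *(void **)(buf + here) = buf + next;
    }
    return (void **)buf;
}
```

The chain can then be walked with the usual p = *p loop; only the layout differs from a fixed-stride walk.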
  38. The lat_fs and lat_pagefault numbers are not being reported yet.
  39. [done]
  40. Documentation on porting.
  41. Check that file size is right in the benchmarking system.
  42. Compiler version info included in results. XXX - do this!
  43. Assembler output for the files that need it.
  44. Get rid of what strings.
  45. [done]
  46. memory store latency (complex)
  47. Why can't I just use the read one and make it write?
  48. Well, because the read one is list oriented and I need to figure
  49. out reasonable math for the write case. The read one is a load
  50. per statement whereas the write one will be more work, I think.
  51. RPC numbers reserved for the benchmark.
  52. Check all the error outputs and make sure they are consistent.
  53. Each of these tests could take a quick stab at guessing the answer
  54. and adjust N to match. For example, the networking latency numbers
  55. can vary from 400 to 15,000 usecs, depending on the network.
  56. [done]
  57. On a similar note, the bandwidth measurements should autosize such that
  58. they run for at least 100 milliseconds over an 8K to 32MB range.
  59. [done]
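One way the autosizing could work, as a sketch: double the iteration count until a single timed run lasts at least the minimum. autosize, now_usecs and spin_work are hypothetical names; only the 100 millisecond floor comes from the item above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

#define MIN_RUN_USECS 100000.0  /* 100 milliseconds, from the item above */

static double now_usecs(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

/* Double n until one run of work(n, arg) takes at least min_usecs.
 * Returns the final n and stores that run's time in *elapsed_usecs. */
unsigned long autosize(void (*work)(unsigned long, void *), void *arg,
                       double min_usecs, double *elapsed_usecs)
{
    unsigned long n = 1;
    double start, elapsed;

    assert(work != NULL && elapsed_usecs != NULL);
    for (;;) {
        start = now_usecs();
        work(n, arg);
        elapsed = now_usecs() - start;
        if (elapsed >= min_usecs || n > (~0UL >> 1))
            break;              /* long enough, or n cannot double again */
        n *= 2;
    }
    *elapsed_usecs = elapsed;
    return n;
}

/* Hypothetical workload: n batches of 1000 volatile stores. */
static volatile unsigned long spin_sink;

void spin_work(unsigned long n, void *arg)
{
    unsigned long i;

    (void)arg;
    for (i = 0; i < n * 1000UL; i++)
        spin_sink += i;
}
```

A call such as autosize(spin_work, NULL, MIN_RUN_USECS, &elapsed) then yields both the iteration count and the time to divide by it. Doubling costs at most about twice the final run time in total, which seems acceptable for a one-off calibration.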
  60. On all the normalized graphs, make sure that they mean the same thing.
  61. I do not think that the bandwidth measurements are "correct" in this
  62. sense.
  63. Move the k/m postfix routine into timing.c and make it an interface.
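A sketch of what the k/m postfix interface might look like once it lives in timing.c. The name parse_size and the "return 0 on malformed input" convention are assumptions, not the suite's actual interface.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Convert "512", "8k" or "32m" to a byte count. k means 2^10 and
 * m means 2^20, in either case. Returns 0 if the string has no
 * digits or has trailing junk. */
unsigned long parse_size(const char *s)
{
    char *end;
    unsigned long n;

    assert(s != NULL);
    n = strtoul(s, &end, 10);
    if (end == s)
        return 0;               /* no digits at all */
    switch (tolower((unsigned char)*end)) {
    case 'k':
        n *= 1024UL;
        end++;
        break;
    case 'm':
        n *= 1024UL * 1024UL;
        end++;
        break;
    default:
        break;
    }
    return *end == '\0' ? n : 0;
}
```

With this, the 8K to 32MB range mentioned earlier would be parse_size("8k") through parse_size("32m").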
  64. Document the timing.c interfaces.
  65. Run the whole suite through gcc -Wall and fix all the errors. Also make
  66. sure that it compiles and has the right sizes for 64 bit OS.
  67. Make bw_file_rd & bw_mmap_rd include the cost of opening & mapping the
  68. file to get an apples to apples comparison. Also fix it so that they
  69. run long enough on the smaller sizes.
  70. [done]
  71. [Mon Jul 1 13:30:01 PDT 1996, after meeting w/ Kevin]
  72. Do the load latency like so
  73. loop:
  74. load r1
  75. {
  76. increase the number of nops until they start to make the
  77. run time longer - the last one was the memory latency.
  78. }
  79. use the register
  80. {
  81. increase the number of nops until they start to make the
  82. run time longer - the last one was the cache fill shadow.
  83. }
  84. repeat
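The timed loop itself needs hand-made assembler, but the "increase the nops until the run time gets longer" step can be sketched in C. Here times[i] is assumed to hold the measured run time with i nops inserted; last_free_nops and the tolerance parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the largest nop count that did not lengthen the run:
 * the index just before the first time that exceeds the zero-nop
 * baseline by more than the given fractional tolerance. */
size_t last_free_nops(const double *times, size_t n, double tolerance)
{
    size_t i;

    assert(times != NULL && n > 0 && tolerance >= 0.0);
    for (i = 1; i < n; i++)
        if (times[i] > times[0] * (1.0 + tolerance))
            return i - 1;
    return n - 1;               /* never got slower within this range */
}
```

Applied after the load it gives the memory latency in nop times; applied after the use of the register it gives the cache fill shadow. The tolerance absorbs measurement noise and would need tuning per machine.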
  85. Do the same thing w/ a varying number of loads (& their uses), showing
  86. the number of outstanding loads implemented to L1, L2, mem.
  87. Do hand made assembler to get accurate numbers. Provide C source that
  88. mimics the hand made assembler for new machines.
  89. Think about a report format for the hardware stuff that showed the
  90. numbers as triples L1/L2/mem (or quadruples for alphas).
  91. Clock thoughts Fri Jul 12 1996: I can't count on anything greater than
  92. 10 millisecond accuracy. I don't want to depend on the counters either.
  93. That leaves me with the choice of either not doing anything that is less
  94. than 10 milliseconds, doing it in a loop, or what?
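A sketch of the arithmetic behind the "doing it in a loop" option: if the clock is only good to R usecs and the result should be within p percent, the loop must run for at least R * 100 / p usecs in total. min_iterations is a hypothetical name and the 1 percent target in the example is an assumption; the 10 millisecond resolution and the 400 to 15,000 usec latencies come from the notes above.

```c
#include <assert.h>

/* Iterations needed so that a clock with the given resolution
 * contributes at most max_error_pct percent error, given a rough
 * guess of the cost of one operation. Rounds up at each step. */
unsigned long min_iterations(unsigned long resolution_usecs,
                             unsigned long max_error_pct,
                             unsigned long est_op_usecs)
{
    unsigned long total_usecs;

    assert(resolution_usecs > 0 && max_error_pct > 0 && est_op_usecs > 0);
    total_usecs = (resolution_usecs * 100UL + max_error_pct - 1) / max_error_pct;
    return (total_usecs + est_op_usecs - 1) / est_op_usecs;
}
```

With a 10 millisecond clock and a 1 percent target the loop has to run for a full second: 2500 iterations of a 400 usec operation, or 67 iterations of a 15,000 usec one.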