miti
/
graphene


			
				
					
						
						
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135
							$Id$

Time the make.

Make a bw_tcp mode that measures bandwidth for each block and graph that
as offset/bandwidth.

Make the disk stuff autosize such that you get the same number of data
points regardless of disk size.

Make sure that all memory referencing benchmarks reference all the
memory.  So no partial references in the BENCH() macro, it has
to call something that touches all of it.

Make all benchmarks use the timing overhead for the loop and all call
use_result.  So all are in function calls.

Integrate in the GNU os naming scheme for results & bin.  
[done]

Fix teh getsummary to include simple calls.

Make a "fast" target that does only 128 byte stride mem latency and 
2 process context switches.
[done]

The loop overhead for bandwidths is too high - make it 128 loads.
[done]

Think about the issues of int/long long/double load/stores.  Maybe do them
all.

Make all results print out bandwidths in powers of 10/sizes in powers of two.

Make the compiling of the benchmark itself be part of the benchmark.

memsize needs to be a little more forceful about trying for the memory.
and should do gettimeofday() around each load.
[done - run it three times in the script]

Put sleeps in the lat_* client side in a retry loop for the port.  This is 
in case the other guy hasn't registered yet.
[I think this is OK, I found a bug in lat_rpc]

Make a version of memory latency that chases N lists.  This is to
measure multiple outstanding load implementations.

Make the lat_mem_rd walk in different strides.  Try and make it so that
you flip back and forth between the same cache line.  So if you assume
4 byte pointers and 8 byte cache lines, you do

	[ x _ ] [ _ x ] [ x _ ] [ _ x]

such that the stride switches between 4 & 12 bytes.  Make sure this screws
up HP's prefetch.

The lat_fs and lat_pagefault numbers are not being reported yet.
[done]

Documentation on porting.

Check that file size is right in the benchmarking system.

Compiler version info included in results.  XXX - do this!

Assembler output for the files that need it.

Get rid of what strings.
[done]

memory store latency (complex)
	Why can't I just use the read one and make it write?
	Well, because the read one is list oriented and I need to figure
	out reasonable math for the write case.  The read one is a load
	per statement whereas the write one will be more work, I think.

RPC numbers reserved for the benchmark.

Check all the error outputs and make sure they are consistent.

Each of these tests could take a quick stab at guessing the answer
and adjust N to match.  For example, the networking latency numbers
can vary from 400 to 15,000 usecs, depending on the network.
	[done]

On a similar note, the bandwidth measurements should autosize such that
they run for at least 100 milliseconds.  Over an 8K to 32MB range.
	[done]

On all the normalized graphs, make sure that they mean the same thing.
I do not think that the bandwidth measurements are "correct" in this
sense.

Move the k/m postfix routine into timing.c and make it an interface.

Document the timing.c interfaces.

Run the whole suite through gcc -Wall and fix all the errors.  Also make
sure that it compiles and has the right sizes for 64 bit OS.

Make bw_file_rd & bw_mmap_rd include the cost of opening & mapping the
file to get an apples to apples comparison.  Also fix it so that they
run long enough on the smaller sizes.
	[done]

[Mon Jul  1 13:30:01 PDT 1996, after meeting w/ Kevin]

Do the load latency like so

	loop:
		load	r1
		{
		increase the number of nops until they start to make the
		run time longer - the last one was the memory latency.
		}
		use the register
		{
		increase the number of nops until they start to make the
		run time longer - the last one was the cache fill shadow.
		}
		repeat

Do the same thing w/ a varying number of loads (& their uses), showing 
the number of outstanding loads implemented to L1, L2, mem.

Do hand made assembler to get accurate numbers.  Provide C source that 
mimics the hand made assembler for new machines.

Think about a report format for the hardware stuff that showed the
numbers as triples L1/L2/mem (or quadruples for alphas).

Clock thoughts Fri Jul 12 1996:  I can't count on anything greater than
10 millisecond accuracy.  I don't want to depend on the counters either.
That leaves me with the choice of either not doing anything that is less
than 10 milliseconds, doing it in a loop, or what?