16 years ago · fd412549fd
--- a/doc/spec/proposals/151-path-selection-improvements.txt
+++ b/doc/spec/proposals/151-path-selection-improvements.txt
@@ -2,7 +2,7 @@ Filename: 151-path-selection-improvements.txt
 
				 Title: Improving Tor Path Selection
			
 
				 Author: Fallon Chen, Mike Perry
			
 
				 Created: 5-Jul-2008
			
 
				-Status: Draft
			
 
				+Status: Implemented
			
 
				 
			
 
				 Overview
			
 
				 
			
@@ -22,51 +22,37 @@ Implementation
 
				 
			
 
				   Storing Build Times
			
 
				 
			
 
				-    Circuit build times will be stored in the circular array
			
 
				-    'circuit_build_times' consisting of uint16_t elements as milliseconds.
			
 
				-    The total size of this array will be based on the number of circuits
			
 
				+    Circuit build times are stored in the circular array
			
 
				+    'circuit_build_times' consisting of uint32_t elements as milliseconds.
			
 
				+    The total size of this array is based on the number of circuits
			
 
				     it takes to converge on a good fit of the long term distribution of
			
 
				     the circuit builds for a fixed link. We do not want this value to be
			
 
				     too large, because it will make it difficult for clients to adapt to
			
 
				     moving between different links.
			
 
				 
			
 
				-    From our initial observations, this value appears to be on the order 
			
 
				-    of 1000, but will be configurable in a #define NCIRCUITS_TO_OBSERVE.
			
 
				-    The exact value for this #define will be determined by performing
			
 
				-    goodness of fit tests using measurments obtained from the shufflebt.py
			
 
				-    script from TorFlow.
			
 
				+    From our observations, this value appears to be on the order of 1000, 
			
 
				+    but is configurable in a #define NCIRCUITS_TO_OBSERVE.
			
 
				  
			
 
				   Long Term Storage
			
 
				 
			
 
				-    The long-term storage representation will be implemented by storing a 
			
 
				+    The long-term storage representation is implemented by storing a 
			
 
				     histogram with BUILDTIME_BIN_WIDTH millisecond buckets (default 50) when 
			
 
				-    writing out the statistics to disk. The format of this histogram on disk 
			
 
				-    is yet to be finalized, but it will likely be of the format 
			
 
				-    'CircuitBuildTime <bin> <count>', with the total specified as 
			
 
				-    'TotalBuildTimes <total>'
			
 
				+    writing out the statistics to disk. The format this takes in the
			
 
				+    state file is 'CircuitBuildTime <bin-ms> <count>', with the total 
			
 
				+    specified as 'TotalBuildTimes <total>'
			
 
				     Example:
			
 
				 
			
 
				     TotalBuildTimes 100
			
 
				-    CircuitBuildTimeBin 1 50
			
 
				-    CircuitBuildTimeBin 2 25
			
 
				-    CircuitBuildTimeBin 3 13
			
 
				+    CircuitBuildTimeBin 0 50
			
 
				+    CircuitBuildTimeBin 50 25
			
 
				+    CircuitBuildTimeBin 100 13
			
 
				     ...
			
 
				 
			
 
				-    Reading the histogram in will entail multiplying each bin by the 
			
 
				-    BUILDTIME_BIN_WIDTH and then inserting <count> values into the 
			
 
				-    circuit_build_times array each with the value of
			
 
				-    <bin>*BUILDTIME_BIN_WIDTH. In order to evenly distribute the 
			
 
				-    values in the circular array, a form of index skipping must
			
 
				-    be employed. Values from bin #N with bin count C and total T
			
 
				-    will occupy indexes specified by N+((T/C)*k)-1, where k is the
			
 
				-    set of integers ranging from 0 to C-1.
			
 
				-
			
 
				-    For example, this would mean that the values from bin 1 would
			
 
				-    occupy indexes 1+(100/50)*k-1, or 0, 2, 4, 6, 8, 10 and so on.
			
 
				-    The values for bin 2 would occupy positions 1, 5, 9, 13. Collisions
			
 
				-    will be inserted at the first empty position in the array greater 
			
 
				-    than the selected index (which may requiring looping around the 
			
 
				-    array back to index 0).
			
 
				+    Reading the histogram in will entail inserting <count> values
			
 
				+    into the circuit_build_times array each with the value of
			
 
				+    <bin-ms> milliseconds. In order to evenly distribute the values
			
 
				+    in the circular array, the Fisher-Yates shuffle will be performed
			
 
				+    after reading values from the bins.
			
 
				 
			
 
				   Learning the CircuitBuildTimeout
			
 
				 
			
@@ -77,14 +63,28 @@ Implementation
 
				     fitting the data using the estimators at
			
 
				     http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation.
			
 
				 
			
 
				-    The timeout itself will be calculated by solving the CDF for the 
			
 
				-    a percentile cutoff BUILDTIME_PERCENT_CUTOFF. This value
			
 
				-    represents the percentage of paths the Tor client will accept out of
			
 
				-    the total number of paths. We have not yet determined a good
			
 
				-    cutoff for this mathematically, but 85% seems a good choice for now.
			
 
				+    The timeout itself is calculated by using the Quartile function (the
			
 
				+    inverted CDF) to give us the value on the CDF such that
			
 
				+    BUILDTIME_PERCENT_CUTOFF (80%) of the mass of the distribution is
			
 
				+    below the timeout value.
			
 
				 
			
 
				-    From http://en.wikipedia.org/wiki/Pareto_distribution#Definition,
			
 
				-    the calculation we need is pow(BUILDTIME_PERCENT_CUTOFF/100.0, k)/Xm. 
			
 
				+    Thus, we expect that the Tor client will accept the fastest 80% of 
			
 
				+    the total number of paths on the network.
			
 
				+
			
 
				+  Detecting Changing Network Conditions
			
 
				+
			
 
				+    We attempt to detect both network connectivty loss and drastic
			
 
				+    changes in the timeout characteristics. Network connectivity loss
			
 
				+    is detected by recording a timestamp every time Tor either completes
			
 
				+    a TLS connection or receives a cell. If this timestamp is more than 
			
 
				+    90 seconds in the past, circuit timeouts are no longer counted.
			
 
				+
			
 
				+    If more than MAX_RECENT_TIMEOUT_RATE (80%) of the past 
			
 
				+    RECENT_CIRCUITS (20) time out, we assume the network connection
			
 
				+    has changed, and we discard all buildtimes history and compute
			
 
				+    a new timeout by estimating a new Pareto curve using the
			
 
				+    position on the Pareto Quartile function for the ratio of
			
 
				+    timeouts. 
			
 
				 
			
 
				   Testing
			
 
				 
			
@@ -104,7 +104,7 @@ Implementation
 
				     of a new circuit, and the hard cutoff triggers destruction of the 
			
 
				     circuit.
			
 
				 
			
 
				-    Good values for hard and soft cutoffs seem to be 85% and 65% 
			
 
				+    Good values for hard and soft cutoffs seem to be 80% and 60% 
			
 
				     respectively, but we should eventually justify this with observation.
			
 
				 
			
 
				   When to Begin Calculation