| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147 | 
							- Filename: 151-path-selection-improvements.txt
 
- Title: Improving Tor Path Selection
 
- Version:
 
- Last-Modified:
 
- Author: Fallon Chen, Mike Perry
 
- Created: 5-Jul-2008
 
- Status: Draft
 
- Overview
 
-   The performance of paths selected can be improved by adjusting the
 
-   CircuitBuildTimeout and avoiding failing guard nodes. This proposal
 
-   describes a method of tracking buildtime statistics at the client, and 
 
-   using those statistics to adjust the CircuitBuildTimeout.
 
- Motivation
 
-   Tor's performance can be improved by excluding those circuits that
 
-   have long buildtimes (and by extension, high latency). For those Tor
 
-   users who require better performance and have lower requirements for
 
-   anonymity, this would be a very useful option to have.
 
- Implementation
 
-   Storing Build Times
 
-     Circuit build times will be stored in the circular array
 
-     'circuit_build_times' consisting of uint16_t elements as milliseconds.
 
-     The total size of this array will be based on the number of circuits
 
-     it takes to converge on a good fit of the long term distribution of
 
-     the circuit builds for a fixed link. We do not want this value to be
 
-     too large, because it will make it difficult for clients to adapt to
 
-     moving between different links.
 
-     From our initial observations, this value appears to be on the order 
 
-     of 1000, but will be configurable in a #define NCIRCUITS_TO_OBSERVE.
 
-     The exact value for this #define will be determined by performing
 
-     goodness of fit tests using measurments obtained from the shufflebt.py
 
-     script from TorFlow.
 
-  
 
-   Long Term Storage
 
-     The long-term storage representation will be implemented by storing a 
 
-     histogram with BUILDTIME_BIN_WIDTH millisecond buckets (default 50) when 
 
-     writing out the statistics to disk. The format of this histogram on disk 
 
-     is yet to be finalized, but it will likely be of the format 
 
-     'CircuitBuildTime <bin> <count>', with the total specified as 
 
-     'TotalBuildTimes <total>'
 
-     Example:
 
-     TotalBuildTimes 100
 
-     CircuitBuildTimeBin 1 50
 
-     CircuitBuildTimeBin 2 25
 
-     CircuitBuildTimeBin 3 13
 
-     ...
 
-     Reading the histogram in will entail multiplying each bin by the 
 
-     BUILDTIME_BIN_WIDTH and then inserting <count> values into the 
 
-     circuit_build_times array each with the value of
 
-     <bin>*BUILDTIME_BIN_WIDTH. In order to evenly distribute the 
 
-     values in the circular array, a form of index skipping must
 
-     be employed. Values from bin #N with bin count C and total T
 
-     will occupy indexes specified by N+((T/C)*k)-1, where k is the
 
-     set of integers ranging from 0 to C-1.
 
-     For example, this would mean that the values from bin 1 would
 
-     occupy indexes 1+(100/50)*k-1, or 0, 2, 4, 6, 8, 10 and so on.
 
-     The values for bin 2 would occupy positions 1, 5, 9, 13. Collisions
 
-     will be inserted at the first empty position in the array greater 
 
-     than the selected index (which may requiring looping around the 
 
-     array back to index 0).
 
-   Learning the CircuitBuildTimeout
 
-     Based on studies of build times, we found that the distribution of
 
-     circuit buildtimes appears to be a Pareto distribution. 
 
-     We will calculate the parameters for a Pareto distribution
 
-     fitting the data using the estimators at
 
-     http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation.
 
-     The timeout itself will be calculated by solving the CDF for the 
 
-     a percentile cutoff BUILDTIME_PERCENT_CUTOFF. This value
 
-     represents the percentage of paths the Tor client will accept out of
 
-     the total number of paths. We have not yet determined a good
 
-     cutoff for this mathematically, but 85% seems a good choice for now.
 
-     From http://en.wikipedia.org/wiki/Pareto_distribution#Definition,
 
-     the calculation we need is pow(BUILDTIME_PERCENT_CUTOFF/100.0, k)/Xm. 
 
-   Testing
 
-     After circuit build times, storage, and learning are implemented,
 
-     the resulting histogram should be checked for consistency by
 
-     verifying it persists across successive Tor invocations where 
 
-     no circuits are built. In addition, we can also use the existing
 
-     buildtime scripts to record build times, and verify that the histogram 
 
-     the python produces matches that which is output to the state file in Tor,
 
-     and verify that the Pareto parameters and cutoff points also match.
 
-   
 
-   Soft timeout vs Hard Timeout
 
-    
 
-     At some point, it may be desirable to change the cutoff from a 
 
-     single hard cutoff that destroys the circuit to a soft cutoff and
 
-     a hard cutoff, where the soft cutoff merely triggers the building
 
-     of a new circuit, and the hard cutoff triggers destruction of the 
 
-     circuit.
 
-     Good values for hard and soft cutoffs seem to be 85% and 65% 
 
-     respectively, but we should eventually justify this with observation.
 
-   When to Begin Calculation
 
-     The number of circuits to observe (NCIRCUITS_TO_CUTOFF) before 
 
-     changing the CircuitBuildTimeout will be tunable via a #define. From 
 
-     our measurements, a good value for NCIRCUITS_TO_CUTOFF appears to be 
 
-     on the order of 100.
 
-   Dealing with Timeouts
 
-     Timeouts should be counted as the expectation of the region of 
 
-     of the Pareto distribution beyond the cutoff. The proposal will
 
-     be updated with this value soon.
 
-     Also, in the event of network failure, the observation mechanism 
 
-     should stop collecting timeout data.
 
-   Client Hints
 
-     Some research still needs to be done to provide initial values
 
-     for CircuitBuildTimeout based on values learned from modem
 
-     users, DSL users, Cable Modem users, and dedicated links. A 
 
-     radiobutton in Vidalia should eventually be provided that
 
-     sets CircuitBuildTimeout to one of these values and also 
 
-     provide the option of purging all learned data, should any exist.
 
-     These values can either be published in the directory, or
 
-     shipped hardcoded for a particular Tor version.
 
-     
 
- Issues
 
-   Impact on anonymity
 
-     Since this follows a Pareto distribution, large reductions on the
 
-     timeout can be achieved without cutting off a great number of the
 
-     total paths. This will eliminate a great deal of the performance
 
-     variation of Tor usage.
 
 
  |