| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990 | 
							- Filename: 108-mtbf-based-stability.txt
 
- Title: Base "Stable" Flag on Mean Time Between Failures
 
- Version: $Revision$
 
- Last-Modified: $Date$
 
- Author: Nick Mathewson
 
- Created: 10-Mar-2007
 
- Status: Closed
 
- Implemented-In: 0.2.0.x
 
- Overview:
 
-    This document proposes that we change how directory authorities set the
 
-    stability flag from inspection of a router's declared Uptime to the
 
-    authorities' perceived mean time between failure for the router.
 
- Motivation:
 
-    Clients prefer nodes that the authorities call Stable.  This flag is (as
 
-    of 0.2.0.0-alpha-dev) set entirely based on the node's declared value for
 
-    uptime.  This creates an opportunity for malicious nodes to declare
 
-    falsely high uptimes in order to get more traffic.
 
- Spec changes:
 
-    Replace the current rule for setting the Stable flag with:
 
-    "Stable" -- A router is 'Stable' if it is active and its observed Stability
 
-    for the past month is at or above the median Stability for active routers.
 
-    Routers are never called stable if they are running a version of Tor
 
-    known to drop circuits stupidly. (0.1.1.10-alpha through 0.1.1.16-rc
 
-    are stupid this way.)
 
-    Stability shall be defined as the weighted mean length of the runs
 
-    observed by a given directory authority.  A run begins when an authority
 
-    decides that the server is Running, and ends when the authority decides
 
-    that the server is not Running.  In-progress runs are counted when
 
-    measuring Stability.  When calculating the mean, runs are weighted by
 
-    $\alpha ^ t$, where $t$ is time elapsed since the end of the run, and
 
-    $0 < \alpha < 1$.  Time when an authority is down do not count to the
 
-    length of the run.
 
- Rejected Alternative:
 
-    "A router's Stability shall be defined as the sum of $\alpha ^ d$ for every
 
-    $d$ such that the router was considered reachable for the entire day
 
-    $d$ days ago.
 
-    This allows a simpler implementation: every day, we multiply
 
-    yesterday's Stability by alpha, and if the router was observed to be
 
-    available every time we looked today, we add 1.
 
-    Instead of "day", we could pick an arbitrary time unit.  We should
 
-    pick alpha to be high enough that long-term stability counts, but low
 
-    enough that the distant past is eventually forgotten.  Something
 
-    between .8 and .95 seems right.
 
-    (By requiring that routers be up for an entire day to get their
 
-    stability increased, instead of counting fractions of a day, we
 
-    capture the notion that stability is more like "probability of
 
-    staying up for the next hour" than it is like "probability of being
 
-    up at some randomly chosen time over the next hour."  The former
 
-    notion of stability is far more relevant for long-lived circuits.)
 
- Limitations:
 
-    Authorities can have false positives and false negatives when trying to
 
-    tell whether a router is up or down.  So long as these aren't terribly
 
-    wrong, and so long as they aren't significantly biased, we should be able
 
-    to use them to estimate stability pretty well.
 
-    Probing approaches like the above could miss short incidents of
 
-    downtime.  If we use the router's declared uptime, we could detect
 
-    these: but doing so would penalize routers who reported their uptime
 
-    accurately.
 
- Implementation:
 
-    For now, the easiest way to store this information at authorities
 
-    would probably be in some kind of periodically flushed flat file.
 
-    Later, we could move to Berkeley db or something if we really had to.
 
-    For each router, an authority will need to store:
 
-      The router ID.
 
-      Whether the router is up.
 
-      The time when the current run started, if the router is up.
 
-      The weighted sum length of all previous runs.
 
-      The time at which the weighted sum length was last weighted down.
 
-    Servers should probe at random intervals to test whether servers are
 
-    running.
 
 
  |