| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137 | 
Abstract   This document explains how to tell about how many Tor users there   are, and how many there are in which country.  Statistics are   involved.Motivation   There are a few reasons we need to keep track of which countries   Tor users (in aggregate) are coming from:      - Resource allocation.  Knowing about underserved countries with        lots of users can let us know about where we need to direct        translation and outreach efforts.      - Anticensorship.  Sudden drops in usage on a national basis can        indicate the arrival of a censorious firewall.      - Sponsor outreach and self-evalutation.  Many people and        organizations who are interested in funding The Tor Project's        work want to know that we're successfully serving parts of the        world they're interested in, and that efforts to expand our        userbase are actually succeeding.  So do we.Goals   We want to know approximately how many Tor users there are, and which   countries they're in, even in the presence of a hypothetical   "directory guard" feature.  Some uncertainty is okay, but we'd like   to be able to put a bound on the uncertainty.   We need to make sure this information isn't exposed in a way that   helps an adversary.Methods for current clients:   Every client downloads network status documents.  There are   currently three methods (one hypothetical) for clients to get them.      - 0.1.2.x clients (and earlier) fetch a v2 networkstatus        document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30        minutes].      - 0.2.0.x clients fetch a v3 networkstatus consensus document        at a random interval between when their current document is no        longer freshest, and when their current document is about to        expire.        [In both of the above cases, clients choose a running        directory cache at random with odds roughly proportional to        its bandwidth.  If they're just starting, they know a XXXX FIXME -NM]      - In some future version, clients will choose directory caches        to serve as their "directory guards" to avoid profiling        attacks, similarly to how clients currently start all their        circuits at guard nodes.    We assume that a directory cache can tell which of these three    categories a client is in by the format of its status request.    A directory cache can be made to count distinct client IP    addresses that make a certain request of it in a given timeframe,    and total requests made to it over that timeframe.  For the first    two cases, a cache can get a  picture of the overall    number and countries of users in the network by dividing the IP    count by the probability with which they (as a cache) would be    chosen.  Assuming that our listed bandwidth is such that we expect    to be chosen with probability P for any given request, and we've    been counting IPs for long enough that we expect the average    client to have made N requests, they will have visited us at least    once with probability P' = 1-(1-P)^N, and so we divide the IP    counts we've seen by P' for our estimate.  To estimate total    number of clients of a given type, determine how many requests a    client of that type will make over that time, and assume we'll    have seen P of them.    Both of these numbers are useful: the IP counts will give the    total number of IPs connecting to the network, and the request    counts will give the total number of users on the network at any    given time.    Notes:       - [Over H hours, the N for V2 clients is 2*H, and the N for V3         clients is currently around H/2 or H/3.]       - (We should only count requests that we actually intend to answer;         503 requests shouldn't count.)       - These measurements should also be taken at a directory         authority if possible: their picture of the network is skewed         by clients that fetch from them directly.  These clients,         however, are all the clients that are just bootstrapping         (assuming that the fallback-consensus feature isn't yet used         much).       - These measurements also overestimate the V2 download rate if         some downloads fail and clients retry them later after backing         off.Methods for directory guards:    If directory guards are in use, directory guards get a picture of    all those users who chose them as a guard when they were listed    as a good choice for a guard, and who are also on the network    now.  The cleanest data here will come from nodes that were listed    as good new-guards choices for a while, and have not been so for a    while longer (to study decay rates); nodes that have been listed    as good new-guard choices consistently for a long time (to get a    sample of the network); and nodes that have been listed as good    new-guard choices only recently (to get a sample of new users and    users whose guards have died out.)    Since directory guards are currently unspecified, we'll need to    make some guesses about how they'll turn out to work.  Here are    a couple of approaches that could work.       - We could have clients pick completely new directory guards on         a rolling basis every two months or so.  This would ensure         that staying as a guard for a while would be sufficient to         see a sample of users.  This is potentially advantageous for         load-balancing the network as well, though it might lose some         of the benefits of directory guard.  We need to quantify the         impact of this; it might not actually make stuff worse in         practice, if most guards don't stay good guards for a month         or two.       - We could try to collect statistics at several directory         guards and combine their statisics, but we would need to make         sure that for all time, at least one of the directory guards         had been recommended as a good choice for new guards.  By         looking at new-IP rates for guards, we could get an idea of         user uptake; for looking at old-IP decay rates, we could get         an idea of turnover.  This approach would entail significant         complexity, and we'd probably need to record more information         than we'd really like to.
 |