| 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788 |
- Abstract
- This document explains how to tell about how many Tor users there
- are, and how many there are in which country. Statistics are
- involved.
- Motivation
- There are a few reasons we need to keep track of which countries
- Tor users (in aggregate) are coming from:
- - Resource allocation. Knowing about underserved countries with
- lots of users can let us know about where we need to direct
- translation and outreach efforts.
- - Anticensorship. Sudden drops in usage on a national basis can
- indicate the arrival of a censorious firewall.
- - Sponsor outreach and self-evalutation. Many people and
- organizations who are interested in funding The Tor Project's
- work want to know that we're successfully serving parts of the
- world they're interested in, and that efforts to expand our
- userbase are actually succeeding. So, when you come right
- down to it, do we.
- Goals
- We want to know about how many Tor users there are, and which
- countries they're in, even in the presence of a hypothetical
- "directory guard" feature. Some uncertainty is okay, but we'd like
- to be able to put a bound on the uncertainty.
- We need to make sure this information isn't exposed in a way that
- helps an adversary.
- Methods:
- Every client downloads network status documents. There are
- currently three methods (one hypothetical) for clients to get them.
- - 0.1.2.x clients (and earlier) fetch a v2 networkstatus
- document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30
- minutes].
- - 0.2.0.x clients fetch a v3 networkstatus consensus document
- at a random interval between when their current document is no
- longer freshest, and when their current document is about to
- expire.
- [In both of the above cases, clients choose a directory cache at
- random with odds roughly proportional to its bandwidth.]
- - In some future version, clients will choose directory caches
- to serve as their "directory guards" to avoid profiling
- attacks, similarly to how clients currently start all their
- circuits at guard nodes.
- We assume that a directory cache can tell which of these three
- categories a client is in by the format of its status request.
- A directory cache can be made to count distinct client IP
- addresses that make a certain request of it in a given timeframe.
- For the first two cases, a cache can get a picture of the overall
- number and countries of users in the network by dividing the IP
- count by the probability with which they (as a cache) would be
- chosen. Assuming that our listed bandwidth is such that we expect
- to be chosen with probability P for any given request, and we've
- been counting IPs for long enough that we expect the average
- client to have made N requests, they will have visited us at least
- once with probability P' = 1-(1-P)^N, and so we divide the IP
- counts we've seen by P' for our estimate.
- If directory guards are in use, directory guards get a picture of
- all those users who chose them as a guard when they were listed
- as a good choice for a guard, and who are also on the network
- now. The cleanest data here will come from nodes that were listed
- as good new-guards choices for a while, and have not been so for a
- while longer (to study decay rates); nodes that have been listed
- as good new-guard choices consistently for a long time (to get a
- sample of the network); and nodes that have been listed as good
- new-guard choices only recently (to get a sample of new users and
- users whose guards have died out.)
- Note that these measurements *shouldn't* be taken at directory
- authorities: their picture of the network is too skewed by the
- special cases in which clients fetch from them directly.
|