xxx-geoip-survey-plan.txt

  1. Abstract
  2. This document explains how to estimate roughly how many Tor users
  3. there are, and how many of them are in each country. Statistics are
  4. involved.
  5. Motivation
  6. There are a few reasons we need to keep track of which countries
  7. Tor users (in aggregate) are coming from:
  8. - Resource allocation. Knowing about underserved countries with
  9. lots of users can let us know about where we need to direct
  10. translation and outreach efforts.
  11. - Anticensorship. Sudden drops in usage on a national basis can
  12. indicate the arrival of a censorious firewall.
  13. - Sponsor outreach and self-evaluation. Many people and
  14. organizations who are interested in funding The Tor Project's
  15. work want to know that we're successfully serving parts of the
  16. world they're interested in, and that efforts to expand our
  17. userbase are actually succeeding. So do we.
  18. Goals
  19. We want to know roughly how many Tor users there are, and which
  20. countries they're in, even in the presence of a hypothetical
  21. "directory guard" feature. Some uncertainty is okay, but we'd like
  22. to be able to put a bound on the uncertainty.
  23. We need to make sure this information isn't exposed in a way that
  24. helps an adversary.
  25. Methods for current clients:
  26. Every client downloads network status documents. There are
  27. currently three methods (one hypothetical) for clients to get them.
  28. - 0.1.2.x clients (and earlier) fetch a v2 networkstatus
  29. document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30
  30. minutes].
  31. - 0.2.0.x clients fetch a v3 networkstatus consensus document
  32. at a random interval between when their current document is no
  33. longer freshest, and when their current document is about to
  34. expire.
  35. [In both of the above cases, clients choose a running
  36. directory cache at random with odds roughly proportional to
  37. its bandwidth.]
  38. - In some future version, clients will choose directory caches
  39. to serve as their "directory guards" to avoid profiling
  40. attacks, similarly to how clients currently start all their
  41. circuits at guard nodes.
  42. We assume that a directory cache can tell which of these three
  43. categories a client is in by the format of its status request.
  44. A directory cache can be made to count the distinct client IP
  45. addresses that make a certain request of it in a given timeframe,
  46. and the total requests made to it over that timeframe. For the
  47. first two cases, a cache can get a picture of the overall number
  48. and countries of users in the network by dividing its IP count by
  49. the probability that a client would have chosen it. Suppose the
  50. cache's listed bandwidth is such that it expects to be chosen with
  51. probability P for any given request, and it has been counting IPs
  52. for long enough that the average client is expected to have made N
  53. requests. Each client will then have visited the cache at least
  54. once with probability P' = 1-(1-P)^N, so the cache divides the IP
  55. counts it has seen by P' for its estimate. To estimate the total
  56. number of clients of a given type from request counts, determine
  57. how many requests a client of that type will make over that time,
  58. and assume the cache has seen a fraction P of them.
  59. Both of these numbers are useful: the IP counts will give the
  60. total number of IPs connecting to the network, and the request
  61. counts will give the total number of users on the network at any
  62. given time.
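
The two estimates above can be sketched in Python as follows. This is
only an illustration of the arithmetic: the function names and the
example figures for P, N, and the observed counts are made up for the
sketch, and the request-based estimate assumes every client of a given
type makes about N requests in the measurement window.

```python
def prob_seen_at_least_once(p, n):
    """Probability P' that a client making n requests chose us at least once,
    if each request reaches us independently with probability p."""
    return 1.0 - (1.0 - p) ** n


def estimate_distinct_ips(ips_seen, p, n):
    """Scale the distinct-IP count seen at this cache up to the whole network."""
    return ips_seen / prob_seen_at_least_once(p, n)


def estimate_clients_from_requests(requests_seen, p, n):
    """Estimate clients from request totals: we expect to see a fraction p of
    all requests, and each client makes about n requests in the window."""
    return requests_seen / (p * n)


# Hypothetical figures: we are chosen for 1% of requests, and each client
# makes 48 requests over the measurement window.
p, n = 0.01, 48
print(prob_seen_at_least_once(p, n))
print(estimate_distinct_ips(1000, p, n))
print(estimate_clients_from_requests(1200, p, n))
```

Note that the IP-based estimate is sensitive to N: if clients make
fewer requests than assumed, P' is overstated and the estimate comes
out too low.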
  63. Notes:
  64. - [Over H hours, the N for V2 clients is 2*H, and the N for V3
  65. clients is currently around H/2 or H/3. [***FIGURE THIS
  66. OUT***XXXX]]
  67. - (We should only count requests that we actually intend to answer;
  68. requests that we refuse with a 503 shouldn't count.)
  69. - These measurements *shouldn't* be taken at directory
  70. authorities: their picture of the network is too skewed by the
  71. special cases in which clients fetch from them directly.
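
A minimal sketch of the counting a directory cache would need to do,
assuming the cache already knows the client type and response status
for each request. The class, its method names, and the country_of
lookup are all hypothetical; a real cache would consult a GeoIP
database and would have to take care that per-IP state is never
exposed, only the aggregate counts.

```python
from collections import defaultdict


class RequestCounter:
    """Counts distinct client IPs and total requests per (client type, country)."""

    def __init__(self, country_of):
        # country_of is a hypothetical callable mapping an IP string to a
        # country code; it stands in for a real GeoIP lookup.
        self._country_of = country_of
        self._ips = defaultdict(set)
        self._requests = defaultdict(int)

    def note_request(self, ip, client_type, status_code):
        """Record one networkstatus request, unless we declined to answer it."""
        if status_code == 503:
            return  # requests we refuse should not be counted
        key = (client_type, self._country_of(ip))
        self._ips[key].add(ip)
        self._requests[key] += 1

    def summary(self):
        """Return {(client_type, country): (distinct_ips, total_requests)}."""
        return {k: (len(self._ips[k]), self._requests[k]) for k in self._requests}


# Usage with a toy lookup table (documentation-range addresses).
lookup = {"192.0.2.1": "de", "192.0.2.2": "de", "198.51.100.7": "ir"}
counter = RequestCounter(lambda ip: lookup.get(ip, "??"))
counter.note_request("192.0.2.1", "v3", 200)
counter.note_request("192.0.2.1", "v3", 200)
counter.note_request("192.0.2.2", "v3", 503)
counter.note_request("198.51.100.7", "v2", 200)
print(counter.summary())
```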
  72. Methods for directory guards:
  73. If directory guards are in use, directory guards get a picture of
  74. all those users who chose them as a guard when they were listed
  75. as a good choice for a guard, and who are also on the network
  76. now. The cleanest data here will come from nodes that were listed
  77. as good new-guard choices for a while, and have not been so for a
  78. while longer (to study decay rates); nodes that have been listed
  79. as good new-guard choices consistently for a long time (to get a
  80. sample of the network); and nodes that have been listed as good
  81. new-guard choices only recently (to get a sample of new users and
  82. users whose guards have died out.)
  83. Since directory guards are currently unspecified, we'll need to
  84. make some guesses about how they'll turn out to work. Here are
  85. a couple of approaches that could work.
  86. - We could have clients pick completely new directory guards on
  87. a rolling basis every two months or so. This would ensure
  88. that staying as a guard for a while would be sufficient to
  89. see a sample of users. This is potentially advantageous for
  90. load-balancing the network as well, though it might lose some
  91. of the benefits of directory guards. We need to quantify the
  92. impact of this; it might not actually make stuff worse in
  93. practice, if most guards don't stay good guards for a month
  94. or two.
  95. - We could try to collect statistics at several directory
  96. guards and combine their statistics, but we would need to make
  97. sure that for all time, at least one of the directory guards
  98. had been recommended as a good choice for new guards. By
  99. looking at new-IP rates for guards, we could get an idea of
  100. user uptake; by looking at old-IP decay rates, we could get
  101. an idea of turnover. This approach would entail significant
  102. complexity, and we'd probably need to record more information
  103. than we'd really like to.
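
The new-IP and old-IP decay rates mentioned in the second approach are
not defined precisely here. One possible reading, sketched below,
compares the sets of client IPs a guard saw in two consecutive
measurement periods. The function names and the choice of denominators
are assumptions made for the sketch, not part of any specified design.

```python
def new_ip_rate(previous_ips, current_ips):
    """Fraction of IPs seen in the current period that were not seen before."""
    if not current_ips:
        return 0.0
    return len(current_ips - previous_ips) / len(current_ips)


def old_ip_decay_rate(previous_ips, current_ips):
    """Fraction of previously seen IPs that did not return in the current period."""
    if not previous_ips:
        return 0.0
    return len(previous_ips - current_ips) / len(previous_ips)


# Toy data using documentation-range addresses.
period1 = {"192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"}
period2 = {"192.0.2.3", "192.0.2.4", "192.0.2.5"}
print(new_ip_rate(period1, period2))        # one of three current IPs is new
print(old_ip_decay_rate(period1, period2))  # two of four old IPs did not return
```

Keeping whole IP sets from one period to the next is exactly the kind
of extra record-keeping this approach would require.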