  1. Filename: 126-geoip-fetching.txt
  2. Title: Getting GeoIP data and publishing usage summaries
  3. Version: $Revision: 11988 $
  4. Last-Modified: $Date: 2007-10-16 12:59:42 -0400 (Tue, 16 Oct 2007) $
  5. Author: Roger Dingledine
  6. Created: 2007-11-24
  7. Status: Needs-Research
  8. 1. Background and motivation
  9. Right now we can keep a rough count of Tor users, both total and by
  10. country, by watching connections to a single directory mirror. Being
  11. able to get usage estimates is useful both for our funders (to
  12. demonstrate progress) and for our own development (so we know how
  13. quickly we're scaling and can design accordingly, and so we know which
  14. countries and communities to focus on more). This need for information
  15. is the only reason we haven't deployed "directory guards" (think of
  16. them like entry guards but for directory information; in practice,
  17. it would seem that Tor clients should simply use their entry guards
  18. as their directory guards; see also proposal 125).
  19. With the move toward bridges, we will no longer be able to track Tor
  20. clients that use bridges, since they use their bridges as directory
  21. guards. Further, we need to be able to learn which bridges stop seeing
  22. use from certain countries (and are thus likely blocked), so we can
  23. avoid giving them out to other users in those countries.
  24. Right now we already do GeoIP lookups in Vidalia: Vidalia draws relays
  25. and circuits on its 'network map', and it performs anonymized GeoIP
  26. lookups to its central servers to know where to put the dots. Vidalia
  27. caches answers it gets -- to reduce delay, to reduce overhead on
  28. the network, and to reduce anonymity issues where users reveal their
  29. knowledge about the network through which IP addresses they ask about.
  30. But with the advent of bridges, Tor clients are asking about IP
  31. addresses that aren't in the main directory. In particular, bridge
  32. users inform the central Vidalia servers about each bridge as they
  33. discover it and their Vidalia tries to map it.
  34. Also, we wouldn't mind letting Vidalia do a GeoIP lookup on the client's
  35. own IP address, so it can provide a more useful map.
  36. Finally, Vidalia's central servers leave users open to partitioning
  37. attacks, even if they can't target specific users. Further, as we
  38. start using GeoIP results for more operational or security-relevant
  39. goals, such as avoiding or including particular countries in circuits,
  40. it becomes more important that users can't be singled out in terms of
  41. their IP-to-country mapping beliefs.
  42. 2. The available GeoIP databases
  43. There are at least two classes of GeoIP database out there: "IP to
  44. country", which tells us the country code for the IP address but
  45. no more details, and "IP to city", which tells us the country code,
  46. the name of the city, and some basic latitude/longitude guesses.
  47. A recent ip-to-country.csv is 3421362 bytes. Compressed, it is 564252
  48. bytes. A typical line is:
  49. "205500992","208605279","US","USA","UNITED STATES"
  50. http://ip-to-country.webhosting.info/node/view/5
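To make the format concrete, here is a minimal sketch of a local
lookup against lines shaped like the sample above. It assumes the first
two fields are the inclusive start and end of an IPv4 range written as
32-bit integers, and that ranges do not overlap. The function names
(load_ranges, lookup) are hypothetical and not part of Tor or Vidalia.

```python
import bisect
import csv
import io
import ipaddress

# The sample line quoted above; a real file would have many such lines.
SAMPLE = '"205500992","208605279","US","USA","UNITED STATES"\n'


def load_ranges(fileobj):
    """Return a list of (start, end, country_code), sorted by start."""
    ranges = []
    for row in csv.reader(fileobj):
        # Skip blank lines, comments, and anything that is not a range.
        if len(row) < 3 or not row[0].isdigit() or not row[1].isdigit():
            continue
        ranges.append((int(row[0]), int(row[1]), row[2]))
    ranges.sort()
    return ranges


def lookup(ranges, addr):
    """Return the country code for a dotted-quad address, or None."""
    value = int(ipaddress.IPv4Address(addr))
    # Rebuilt on every call to keep the sketch short; cache it in practice.
    starts = [r[0] for r in ranges]
    # Find the last range whose start is <= value, then check its end.
    i = bisect.bisect_right(starts, value) - 1
    if i >= 0 and ranges[i][0] <= value <= ranges[i][1]:
        return ranges[i][2]
    return None


ranges = load_ranges(io.StringIO(SAMPLE))
print(lookup(ranges, "12.64.0.1"))
```

Because the lookup is local, nobody else learns which addresses the
client asked about.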
  51. Similarly, the MaxMind GeoLite Country database is also about 500KB
  52. compressed.
  53. http://www.maxmind.com/app/geolitecountry
  54. The MaxMind GeoLite City database gives more fine-grained detail such
  55. as geo coordinates and city name. Vidalia currently makes use of this
  56. information. On the other hand, it's 16MB compressed. A typical line is:
  57. 206.124.149.146,Bellevue,WA,US,47.6051,-122.1134
  58. http://www.maxmind.com/app/geolitecity
  59. There are other databases out there, like
  60. http://www.hostip.info/faq.html
  61. http://www.webconfs.com/ip-to-city.php
  62. that want more attention, but for now let's assume that all the db's
  63. are around this size.
  64. 3. What we'd like to solve
  65. Goal #1a: Tor relays collect IP-to-country user stats and publish
  66. sanitized versions.
  67. Goal #1b: Tor bridges collect IP-to-country user stats and publish
  68. sanitized versions.
  69. Goal #2a: Vidalia learns IP-to-city stats for Tor relays, for better
  70. mapping.
  71. Goal #2b: Vidalia learns IP-to-country stats for Tor relays, so the user
  72. can pick countries for her paths.
  73. Goal #3: Vidalia doesn't do external lookups on bridge relay addresses.
  74. Goal #4: Vidalia resolves the Tor client's IP-to-country or IP-to-city
  75. for better mapping.
  76. Goal #5: Reduce partitioning opportunities where Vidalia central
  77. servers can give different (distinguishing) responses.
  78. 4. Solution overview
  79. Our goal is to allow Tor relays, bridges, and clients to learn enough
  80. GeoIP information so they can do local private queries.
  81. 4.1. The IP-to-country db
  82. Directory authorities should publish a "geoip" file that contains
  83. IP-to-country mappings. Directory caches will mirror it, and Tor clients
  84. and relays (including bridge relays) will fetch it. Thus we can solve
  85. goals 1a and 1b (publish sanitized usage info). Controllers could also
  86. use this to solve goal 2b (choosing path by country attributes). It
  87. also solves goal 4 (learning the Tor client's country), though for
  88. huge countries like the US we'd still need to decide where the "middle"
  89. should be when we're mapping that address.
  90. The IP-to-country details are described further in Sections 5 and
  91. 6 below.
  92. 4.2. The IP-to-city db
  93. In an ideal world, the IP-to-city db would be small enough that we
  94. could distribute it in the above manner too. But for now, it is too
  95. large. Here's where the design choice forks.
  96. Option A: Vidalia should continue doing its anonymized IP-to-city
  97. queries. Thus we can achieve goals 2a and 2b. We would solve goal
  98. 3 by only doing lookups on descriptors that are purpose "general"
  99. (see Section 4.2.1 for how). We would leave goal 5 unsolved.
  100. Option B: Each directory authority should keep an IP-to-city db,
  101. look up the value for each router it lists, and include that line in
  102. the router's network-status entry. The network-status consensus would
  103. then use the line that appears in the majority of votes. This approach
  104. also solves goals 2a and 2b, goal 3 (Vidalia doesn't do any lookups
  105. at all now), and goal 5 (reduced partitioning risks).
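The majority rule in Option B could be sketched as follows. This is
only an illustration: it assumes each authority's vote contributes one
IP-to-city line per router (or nothing), and that "majority" means
strictly more than half of all votes. The name consensus_city_line is
hypothetical.

```python
from collections import Counter


def consensus_city_line(votes):
    """Pick the IP-to-city line listed by a strict majority of votes.

    votes has one entry per authority: the line that authority listed
    for the router, or None if it listed none.  Returns None when no
    single line reaches a majority.
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    line, n = counts.most_common(1)[0]
    # Strictly more than half of all votes, including abstentions.
    return line if n > len(votes) / 2 else None


votes = [
    "206.124.149.146,Bellevue,WA,US,47.6051,-122.1134",
    "206.124.149.146,Bellevue,WA,US,47.6051,-122.1134",
    "206.124.149.146,Seattle,WA,US,47.6062,-122.3321",
]
print(consensus_city_line(votes))
```

Authorities with different db versions may disagree on many routers, so
a real design would also need to say what the consensus lists when no
line wins.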
  106. Option B has the advantage that Vidalia can simplify its operation,
  107. and the advantage that this consensus IP-to-city data is available to
  108. other controllers besides just Vidalia. But it has the disadvantage
  109. that the networkstatus consensus becomes larger, even though most of
  110. the GeoIP information won't change from one consensus to the next. Is
  111. there another reasonable location for it that can provide similar
  112. consensus security properties?
  113. 4.2.1. Controllers can query for router annotations
  114. Vidalia needs to stop doing queries on bridge relay IP addresses.
  115. It could do that by only doing lookups on descriptors that are in
  116. the networkstatus consensus, but that precludes designs like Blossom
  117. that might want to map its relay locations. The best answer is that it
  118. should learn the router annotations, with a new controller 'getinfo'
  119. command:
  120. "GETINFO router-annotations/id/<OR identity>" or
  121. "GETINFO router-annotations/name/<OR nickname>"
  122. which would respond with something like
  123. @downloaded-at 2007-11-29 08:06:38
  124. @source "128.31.0.34"
  125. @purpose bridge
  126. [We could also make the answer include the digest for the router in
  127. question, which would enable us to ask GETINFO router-annotations/all.
  128. Is this worth it? -RD]
  129. Then Vidalia can avoid doing lookups on descriptors with purpose
  130. "bridge". Even better would be to add a new annotation "@private true"
  131. so Vidalia can know how to handle new purposes that we haven't created
  132. yet. Vidalia could special-case "bridge" for now, for compatibility
  133. with the current 0.2.0.x-alphas.
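A controller could act on such a reply roughly as sketched below. The
parsing assumes one "@key value" annotation per line, as in the example
response above. Treating a missing @purpose as "general" is our
assumption, and both function names are hypothetical.

```python
def parse_annotations(reply_text):
    """Parse '@key value' lines into a dict, dropping quotes."""
    annotations = {}
    for line in reply_text.splitlines():
        line = line.strip()
        if not line.startswith("@"):
            continue
        key, _, value = line[1:].partition(" ")
        annotations[key] = value.strip().strip('"')
    return annotations


def may_do_external_lookup(annotations):
    """Decide whether a GeoIP query about this router is acceptable."""
    # The proposed "@private true" annotation covers future purposes.
    if annotations.get("private") == "true":
        return False
    # Special-case "bridge" for compatibility with 0.2.0.x-alphas.
    # Assumption: a router with no @purpose line is purpose "general".
    return annotations.get("purpose", "general") != "bridge"


reply = """@downloaded-at 2007-11-29 08:06:38
@source "128.31.0.34"
@purpose bridge"""
print(may_do_external_lookup(parse_annotations(reply)))
```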
  134. 4.3. Recommendation
  135. My overall recommendation is that we should implement 4.1 soon
  136. (e.g. early in 0.2.1.x), and we can go with 4.2 option A for now,
  137. with the hope that later we discover a better way to distribute the
  138. IP-to-city info and can switch to 4.2 option B.
  139. Below we discuss more how to go about achieving 4.1.
  140. 5. Publishing and caching the GeoIP (IP-to-country) database
  141. Each v3 directory authority should put a copy of the "geoip" file in
  142. its datadirectory. Then its network-status votes should include a hash
  143. of this file (Recommended-geoip-hash: %s), and the resulting consensus
  144. directory should specify the consensus hash.
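As a sketch of the hashing step: the proposal does not say which hash
function or encoding to use, so the example below assumes SHA-1 written
as uppercase hex. The helper names are hypothetical.

```python
import hashlib


def geoip_digest(data):
    """Digest of the geoip file contents (assumed: SHA-1, upper hex)."""
    return hashlib.sha1(data).hexdigest().upper()


def vote_line(data):
    """Format the line an authority would put in its vote."""
    return "Recommended-geoip-hash: %s" % geoip_digest(data)


def matches_consensus(data, consensus_hash):
    """Check a downloaded geoip file against the consensus hash."""
    return geoip_digest(data) == consensus_hash.strip().upper()


sample = b'"205500992","208605279","US","USA","UNITED STATES"\n'
print(vote_line(sample))
```

A client or mirror would call something like matches_consensus() before
replacing its cached copy, so a single cache cannot feed it a different
file than everybody else gets.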
  145. There should be a new URL for fetching this geoip db (by "current.z"
  146. for testing purposes, and by hash.z for typical downloads). Authorities
  147. should fetch and serve the one listed in the consensus, even when they
  148. vote for their own. This would argue for storing the cached version
  149. in a better filename than "geoip".
  150. Directory mirrors should keep a copy of this file available via the
  151. same URLs.
  152. We assume that the file would change at most a few times a month. Should
  153. Tor ship with a bootstrap geoip file? An out-of-date geoip file may
  154. open you up to partitioning attacks, but for the most part it won't
  155. be that different.
  156. There should be a config option to disable updating the geoip file,
  157. in case users want to use their own file (e.g. they have a proprietary
  158. GeoIP file they prefer to use). In that case we leave it up to the
  159. user to update his geoip file out-of-band.
  160. [XXX Should consider forward/backward compatibility, e.g. if we want
  161. to move to a new geoip file format. -RD]
  162. 6. Controllers use the IP-to-country db for mapping and for path building
  163. Down the road, Vidalia can use the IP-to-country mappings for placing
  164. on its map:
  165. - The location of the client
  166. - The location of the bridges, or other relays not in the
  167. networkstatus, on the map.
  168. - Any relays that it doesn't yet have an IP-to-city answer for.
  169. Other controllers can also use it to set EntryNodes, ExitNodes, etc.
  170. in a per-country way.
  171. To support these features, we need to export the IP-to-country data
  172. via the Tor controller protocol.
  173. Is it sufficient just to add a new GETINFO command?
  174. GETINFO ip-to-country/128.31.0.34
  175. 250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"
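If the reply looks like the example above, a controller could parse it
with something like the following. The reply format is only the one
proposed here and may change, and parse_ip_to_country is a hypothetical
helper name.

```python
import csv


def parse_ip_to_country(reply_line):
    """Parse one reply line of the proposed ip-to-country GETINFO.

    Returns (address, (code2, code3, country_name)).
    Raises ValueError if the line does not have the expected shape.
    """
    status, _, rest = reply_line.strip().partition("ip-to-country/")
    if not status.startswith("250") or not rest:
        raise ValueError("unexpected reply: %r" % reply_line)
    addr, sep, value = rest.partition("=")
    if not sep:
        raise ValueError("missing '=' in reply: %r" % reply_line)
    # The value is a comma-separated list of double-quoted strings.
    fields = next(csv.reader([value]))
    if len(fields) != 3:
        raise ValueError("expected three fields: %r" % reply_line)
    return addr, tuple(fields)


line = '250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"'
print(parse_ip_to_country(line))
```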
  176. 7. Relays and bridges use the IP-to-country db for usage summaries
  177. Once bridges have a GeoIP database locally, they can start to publish
  178. sanitized summaries of client usage -- how many users they see and from
  179. what countries. This might also be a more useful way for ordinary Tor
  180. relays to convey the level of usage they see, which would allow us to
  181. switch to using directory guards for all users by default.
  182. But how to safely summarize this information without opening too many
  183. anonymity leaks seems hard...
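One possible starting point, offered only as an illustration and not as
an answer to the question above: count unique clients per country and
round each count up to a multiple of some granularity before
publishing, so that exact numbers never leave the relay. The rounding
rule, the granularity of 8, and the function name are all our
assumptions. The input is the list of country codes the relay got from
its local GeoIP lookups.

```python
import math
from collections import Counter


def summarize_usage(country_codes, granularity=8):
    """Return {country: count rounded up to a multiple of granularity}.

    country_codes should hold one entry per unique client address seen
    in the reporting period, not one entry per connection.
    """
    counts = Counter(country_codes)
    return {
        cc: math.ceil(n / granularity) * granularity
        for cc, n in sorted(counts.items())
    }


seen = ["us"] * 3 + ["de"] * 9
print(summarize_usage(seen))
```

Rounding alone does not hide everything: a single client from a rare
country still shows up as a nonzero bucket, so the anonymity question
remains open.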