  1. Filename: 126-geoip-reporting.txt
  2. Title: Getting GeoIP data and publishing usage summaries
  3. Version: $Revision$
  4. Last-Modified: $Date$
  5. Author: Roger Dingledine
  6. Created: 2007-11-24
  7. Status: Closed
  8. Implemented-In: 0.2.0.x
  9. 0. Status
  10. In 0.2.0.x, this proposal is implemented to the extent needed to
  11. address its motivations. See notes below with the text "RESOLUTION"
  12. for details.
  13. 1. Background and motivation
  14. Right now we can keep a rough count of Tor users, both total and by
  15. country, by watching connections to a single directory mirror. Being
  16. able to get usage estimates is useful both for our funders (to
  17. demonstrate progress) and for our own development (so we know how
  18. quickly we're scaling and can design accordingly, and so we know which
  19. countries and communities to focus on more). This need for information
  20. is the only reason we haven't deployed "directory guards" (think of
  21. them like entry guards but for directory information; in practice,
  22. it would seem that Tor clients should simply use their entry guards
  23. as their directory guards; see also proposal 125).
  24. With the move toward bridges, we will no longer be able to track Tor
  25. clients that use bridges, since they use their bridges as directory
  26. guards. Further, we need to be able to learn which bridges stop seeing
  27. use from certain countries (and are thus likely blocked), so we can
  28. avoid giving them out to other users in those countries.
  29. Right now we already do GeoIP lookups in Vidalia: Vidalia draws relays
  30. and circuits on its 'network map', and it performs anonymized GeoIP
  31. lookups to its central servers to know where to put the dots. Vidalia
  32. caches answers it gets -- to reduce delay, to reduce overhead on
  33. the network, and to reduce anonymity issues where users reveal their
  34. knowledge about the network through which IP addresses they ask about.
  35. But with the advent of bridges, Tor clients are asking about IP
  36. addresses that aren't in the main directory. In particular, bridge
  37. users inform the central Vidalia servers about each bridge as they
  38. discover it and their Vidalia tries to map it.
  39. Also, we wouldn't mind letting Vidalia do a GeoIP lookup on the client's
  40. own IP address, so it can provide a more useful map.
  41. Finally, Vidalia's central servers leave users open to partitioning
  42. attacks, even if they can't target specific users. Further, as we
  43. start using GeoIP results for more operational or security-relevant
  44. goals, such as avoiding or including particular countries in circuits,
  45. it becomes more important that users can't be singled out in terms of
  46. their IP-to-country mapping beliefs.
  47. 2. The available GeoIP databases
  48. There are at least two classes of GeoIP database out there: "IP to
  49. country", which tells us the country code for the IP address but
  50. no more details, and "IP to city", which tells us the country code,
  51. the name of the city, and some basic latitude/longitude guesses.
  52. A recent ip-to-country.csv is 3421362 bytes. Compressed, it is 564252
  53. bytes. A typical line is:
  54. "205500992","208605279","US","USA","UNITED STATES"
  55. http://ip-to-country.webhosting.info/node/view/5
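As a rough illustration of how a relay or controller might use such a file locally, here is a minimal Python sketch that parses lines in the format shown above and finds the country code for an address by binary search. The function names (parse_geoip_line, load_table, lookup_country) are hypothetical, and the sketch assumes the ranges in the file do not overlap.

```python
import bisect
import csv
import ipaddress

def parse_geoip_line(line):
    """Parse one line of the form "start","end","CC","CCC","COUNTRY NAME"."""
    start, end, cc2, _cc3, _name = next(csv.reader([line]))
    return int(start), int(end), cc2

def load_table(lines):
    """Return (start, end, country code) entries sorted by start address."""
    return sorted(parse_geoip_line(l) for l in lines if l.strip())

def lookup_country(table, addr):
    """Return the two-letter code for a dotted-quad address, or None."""
    ip = int(ipaddress.IPv4Address(addr))
    starts = [entry[0] for entry in table]
    # Find the last range whose start is <= ip, then check its end.
    i = bisect.bisect_right(starts, ip) - 1
    if i >= 0 and table[i][0] <= ip <= table[i][1]:
        return table[i][2]
    return None

table = load_table(['"205500992","208605279","US","USA","UNITED STATES"'])
print(lookup_country(table, "12.64.0.1"))  # an address inside the sample range
print(lookup_country(table, "10.0.0.1"))   # an address the sample does not cover
```

A real implementation would build the list of start addresses once rather than on every lookup; it is rebuilt here only to keep the sketch short.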
  56. Similarly, the maxmind GeoLite Country database is also about 500KB
  57. compressed.
  58. http://www.maxmind.com/app/geolitecountry
  59. The maxmind GeoLite City database gives more finegrained detail like
  60. geo coordinates and city name. Vidalia currently makes use of this
  61. information. On the other hand it's 16MB compressed. A typical line is:
  62. 206.124.149.146,Bellevue,WA,US,47.6051,-122.1134
  63. http://www.maxmind.com/app/geolitecity
  64. There are other databases out there, like
  65. http://www.hostip.info/faq.html
  66. http://www.webconfs.com/ip-to-city.php
  67. that want more attention, but for now let's assume that all the db's
  68. are around this size.
  69. 3. What we'd like to solve
  70. Goal #1a: Tor relays collect IP-to-country user stats and publish
  71. sanitized versions.
  72. Goal #1b: Tor bridges collect IP-to-country user stats and publish
  73. sanitized versions.
  74. Goal #2a: Vidalia learns IP-to-city stats for Tor relays, for better
  75. mapping.
  76. Goal #2b: Vidalia learns IP-to-country stats for Tor relays, so the user
  77. can pick countries for her paths.
  78. Goal #3: Vidalia doesn't do external lookups on bridge relay addresses.
  79. Goal #4: Vidalia resolves the Tor client's IP-to-country or IP-to-city
  80. for better mapping.
  81. Goal #5: Reduce partitioning opportunities where Vidalia central
  82. servers can give different (distinguishing) responses.
  83. 4. Solution overview
  84. Our goal is to allow Tor relays, bridges, and clients to learn enough
  85. GeoIP information so they can do local private queries.
  86. 4.1. The IP-to-country db
  87. Directory authorities should publish a "geoip" file that contains
  88. IP-to-country mappings. Directory caches will mirror it, and Tor clients
  89. and relays (including bridge relays) will fetch it. Thus we can solve
  90. goals 1a and 1b (publish sanitized usage info). Controllers could also
  91. use this to solve goal 2b (choosing path by country attributes). It
  92. also solves goal 4 (learning the Tor client's country), though for
  93. huge countries like the US we'd still need to decide where the "middle"
  94. should be when we're mapping that address.
  95. The IP-to-country details are described further in Sections 5 and
  96. 6 below.
  97. [RESOLUTION: The geoip file in 0.2.0.x is not distributed through
  98. Tor. Instead, it is shipped with the bundle.]
  99. 4.2. The IP-to-city db
  100. In an ideal world, the IP-to-city db would be small enough that we
  101. could distribute it in the above manner too. But for now, it is too
  102. large. Here's where the design choice forks.
  103. Option A: Vidalia should continue doing its anonymized IP-to-city
  104. queries. Thus we can achieve goals 2a and 2b. We would solve goal
  105. 3 by only doing lookups on descriptors that are purpose "general"
  106. (see Section 4.2.1 for how). We would leave goal 5 unsolved.
  107. Option B: Each directory authority should keep an IP-to-city db,
  108. look up the value for each router it lists, and include that line in
  109. the router's network-status entry. The network-status consensus would
  110. then use the line that appears in the majority of votes. This approach
  111. also solves goals 2a and 2b, goal 3 (Vidalia doesn't do any lookups
  112. at all now), and goal 5 (reduced partitioning risks).
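The majority rule in Option B could be sketched as follows. This is only an illustration: the function name and the format of the per-router line are hypothetical, and "majority" is read here as "listed by more than half of all votes".

```python
from collections import Counter

def consensus_geoip_line(votes):
    """Return the line listed by more than half of the votes, else None.

    `votes` holds one entry per authority: the IP-to-city line that
    authority listed for a router, or None if it listed none.
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    line, n = counts.most_common(1)[0]
    # Require a strict majority of all votes, including abstentions.
    return line if n > len(votes) / 2 else None

votes = ["Bellevue,WA,US"] * 3 + ["Seattle,WA,US", None]
print(consensus_geoip_line(votes))
```

With no strict majority the sketch emits nothing for that router, which is one possible choice; the proposal does not say what should happen in that case.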
  113. Option B has the advantage that Vidalia can simplify its operation,
  114. and the advantage that this consensus IP-to-city data is available to
  115. other controllers besides just Vidalia. But it has the disadvantage
  116. that the networkstatus consensus becomes larger, even though most of
  117. the GeoIP information won't change from one consensus to the next. Is
  118. there another reasonable location for it that can provide similar
  119. consensus security properties?
  120. [RESOLUTION: IP-to-city is not supported.]
  121. 4.2.1. Controllers can query for router annotations
  122. Vidalia needs to stop doing queries on bridge relay IP addresses.
  123. It could do that by only doing lookups on descriptors that are in
  124. the networkstatus consensus, but that precludes designs like Blossom
  125. that might want to map its relay locations. The best answer is that it
  126. should learn the router annotations, with a new controller 'getinfo'
  127. command:
  128. "GETINFO desc-annotations/id/<OR identity>"
  129. which would respond with something like
  130. @downloaded-at 2007-11-29 08:06:38
  131. @source "128.31.0.34"
  132. @purpose bridge
  133. [We could also make the answer include the digest for the router in
  134. question, which would enable us to ask GETINFO router-annotations/all.
  135. Is this worth it? -RD]
  136. Then Vidalia can avoid doing lookups on descriptors with purpose
  137. "bridge". Even better would be to add a new annotation "@private true"
  138. so Vidalia can know how to handle new purposes that we haven't created
  139. yet. Vidalia could special-case "bridge" for now, for compatibility
  140. with the current 0.2.0.x-alphas.
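A controller-side check along these lines might look like the sketch below. It parses an annotation reply in the format shown above and decides whether an external lookup seems acceptable. The function names are hypothetical, "@private true" is the annotation suggested in this section rather than something Tor emits today, and treating a missing purpose as safe to look up is an assumption of the sketch.

```python
def parse_annotations(reply_text):
    """Parse '@key value' lines into a dict, dropping surrounding quotes."""
    out = {}
    for line in reply_text.splitlines():
        line = line.strip()
        if not line.startswith("@"):
            continue
        key, _, value = line[1:].partition(" ")
        out[key] = value.strip().strip('"')
    return out

def may_lookup_externally(annotations):
    """True if an external GeoIP lookup on this descriptor seems acceptable."""
    if annotations.get("private") == "true":
        return False
    # Special-case "bridge" for compatibility with current annotations.
    return annotations.get("purpose") != "bridge"

sample = '@downloaded-at 2007-11-29 08:06:38\n@source "128.31.0.34"\n@purpose bridge\n'
print(may_lookup_externally(parse_annotations(sample)))
```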
  141. 4.3. Recommendation
  142. My overall recommendation is that we should implement 4.1 soon
  143. (e.g. early in 0.2.1.x), and we can go with 4.2 option A for now,
  144. with the hope that later we discover a better way to distribute the
  145. IP-to-city info and can switch to 4.2 option B.
  146. Below we discuss more how to go about achieving 4.1.
  147. 5. Publishing and caching the GeoIP (IP-to-country) database
  148. Each v3 directory authority should put a copy of the "geoip" file in
  149. its datadirectory. Then its network-status votes should include a hash
  150. of this file (Recommended-geoip-hash: %s), and the resulting consensus
  151. directory should specify the consensus hash.
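For concreteness, an authority could produce its vote line roughly as in this sketch. The proposal does not name a digest algorithm or an encoding, so SHA-256 and lowercase hex are assumptions made only for illustration, as is the function name.

```python
import hashlib

def recommended_geoip_hash_line(geoip_bytes):
    """Format the vote line for the given geoip file contents.

    SHA-256 in hex is an assumption; the proposal leaves the hash unspecified.
    """
    digest = hashlib.sha256(geoip_bytes).hexdigest()
    return "Recommended-geoip-hash: %s" % digest

sample_file = b'"205500992","208605279","US","USA","UNITED STATES"\n'
print(recommended_geoip_hash_line(sample_file))
```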
  152. There should be a new URL for fetching this geoip db (by "current.z"
  153. for testing purposes, and by hash.z for typical downloads). Authorities
  154. should fetch and serve the one listed in the consensus, even when they
  155. vote for their own. This would argue for storing the cached version
  156. in a better filename than "geoip".
  157. Directory mirrors should keep a copy of this file available via the
  158. same URLs.
  159. We assume that the file would change at most a few times a month. Should
  160. Tor ship with a bootstrap geoip file? An out-of-date geoip file may
  161. open you up to partitioning attacks, but for the most part it won't
  162. be that different.
  163. There should be a config option to disable updating the geoip file,
  164. in case users want to use their own file (e.g. they have a proprietary
  165. GeoIP file they prefer to use). In that case we leave it up to the
  166. user to update his geoip file out-of-band.
  167. [XXX Should consider forward/backward compatibility, e.g. if we want
  168. to move to a new geoip file format. -RD]
  169. [RESOLUTION: Not done over Tor.]
  170. 6. Controllers use the IP-to-country db for mapping and for path building
  171. Down the road, Vidalia could use the IP-to-country mappings for placing
  172. on its map:
  173. - The location of the client
  174. - The location of the bridges, or other relays not in the
  175. networkstatus, on the map.
  176. - Any relays that it doesn't yet have an IP-to-city answer for.
  177. Other controllers can also use it to set EntryNodes, ExitNodes, etc
  178. in a per-country way.
  179. To support these features, we need to export the IP-to-country data
  180. via the Tor controller protocol.
  181. Is it sufficient just to add a new GETINFO command?
  182. GETINFO ip-to-country/128.31.0.34
  183. 250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"
  184. [RESOLUTION: Not done now, except for the getinfo command.]
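A controller could split the reply format proposed above as in the following sketch. It assumes the reply looks exactly like the example in this section; the function name is hypothetical, and the format Tor actually implements may differ.

```python
import csv

def parse_ip_to_country_reply(line):
    """Split the proposed reply into (address, cc2, cc3, country name)."""
    status, _, rest = line.partition("ip-to-country/")
    if not status.startswith("250") or not rest:
        raise ValueError("unexpected reply: %r" % line)
    addr, _, fields = rest.partition("=")
    # The value is a comma-separated list of quoted strings.
    cc2, cc3, name = next(csv.reader([fields]))
    return addr, cc2, cc3, name

reply = '250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"'
print(parse_ip_to_country_reply(reply))
```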
  185. 6.1. Other interfaces
  186. Robert Hogan has also suggested a
  187. GETINFO relays-by-country/cn
  188. as well as torrc options for ExitCountryCodes, EntryCountryCodes,
  189. ExcludeCountryCodes, etc.
  190. [RESOLUTION: Not implemented in 0.2.0.x. Fodder for a future proposal.]
  191. 7. Relays and bridges use the IP-to-country db for usage summaries
  192. Once bridges have a GeoIP database locally, they can start to publish
  193. sanitized summaries of client usage -- how many users they see and from
  194. what countries. This might also be a more useful way for ordinary Tor
  195. relays to convey the level of usage they see, which would allow us to
  196. switch to using directory guards for all users by default.
  197. But how to safely summarize this information without opening too many
  198. anonymity leaks?
  199. 7.1. Attacks to think about
  200. First, note that we need to have a large enough time window that we're
  201. not aiding correlation attacks much. I hope 24 hours is enough. So
  202. that means no publishing stats until you've been up at least 24 hours.
  203. And you can't publish follow-up stats more often than every 24 hours,
  204. or people could look at the differential.
  205. Second, note that we need to be sufficiently vague about the IP
  206. addresses we're reporting. We are hoping that just specifying the
  207. country will be vague enough. But a) what about active attacks where
  208. we convince a bridge to use a GeoIP db that labels each suspect IP
  209. address as a unique country? We have to assume that the consensus GeoIP
  210. db won't be malicious in this way. And b) could such singling-out
  211. attacks occur naturally, for example because of countries that have
  212. a very small IP space? We should investigate that.
  213. 7.2. Granularity of users
  214. Do we only want to report countries that have a sufficient anonymity set
  215. (that is, number of users) for the day? For example, we might avoid
  216. listing any countries that have seen fewer than five addresses over
  217. the 24 hour period. This approach would be helpful in reducing the
  218. singling-out opportunities -- in the extreme case, we could imagine a
  219. situation where one blogger from the Sudan used Tor on a given day, and
  220. we can discover which entry guard she used.
  221. But I fear that especially for bridges, seeing only one hit from a
  222. given country in a given day may be quite common.
  223. As a compromise, we should start out with an "Other" category in
  224. the reported stats, which is the sum of unlisted countries; if that
  225. category is consistently interesting, we can think harder about how
  226. to get the right data from it safely.
  227. But note that bridge summaries will not be made public individually,
  228. since doing so would help people enumerate bridges. Whereas summaries
  229. from normal relays will be public. So perhaps that means we can afford
  230. to be more specific in bridge summaries? In particular, I'm thinking the
  231. "other" category should be used by public relays but not for bridges
  232. (or if it is, used with a lower threshold).
  233. Even for countries that have many Tor users, we might not want to be
  234. too specific about how many users we've seen. For example, we might
  235. round down the number of users we report to the nearest multiple of 5.
  236. My instinct for now is that this won't be that useful.
  237. 7.3. Other issues
  238. Another note: we'll likely be overreporting in the case of users with
  239. dynamic IP addresses: if they rotate to a new address over the course
  240. of the day, we'll count them twice. So be it.
  241. 7.4. Where to publish the summaries?
  242. We designed extrainfo documents for information like this. So they
  243. should just be more entries in the extrainfo doc.
  244. But if we want to publish summaries every 24 hours (no more often,
  245. no less often), aren't we tied to the router descriptor publishing
  246. schedule? That is, if we publish a new router descriptor at the 18
  247. hour mark, and nothing much has changed at the 24 hour mark, won't
  248. the new descriptor get dropped as being "cosmetically similar", and
  249. then nobody will know to ask about the new extrainfo document?
  250. One solution would be to make and remember the 24 hour summary at the
  251. 24 hour mark, but not actually publish it anywhere until we happen to
  252. publish a new descriptor for other reasons. If we happen to go down
  253. before publishing a new descriptor, then so be it, at least we tried.
  254. 7.5. What if the relay is unreachable or goes to sleep?
  255. Even if you've been up for 24 hours, if you were hibernating for 18
  256. of them, then we're not getting as much fuzziness as we'd like. So
  257. I guess that means that we need a 24-hour period of being "awake"
  258. before we're willing to publish a summary. A similar attack works if
  259. you've been awake but unreachable for the first 18 of the 24 hours. As
  260. another example, a bridge that's on a laptop might be suspended for
  261. some of each day.
  262. This implies that some relays and bridges will never publish summary
  263. stats, because they're not ever reliably working for 24 hours in
  264. a row. If a significant percentage of our reporters end up being in
  265. this boat, we should investigate whether we can accumulate 24 hours of
  266. "usefulness", even if there are holes in the middle, and publish based
  267. on that.
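The idea of accumulating 24 hours of "usefulness" across gaps could be tracked as in this sketch. The class and method names are hypothetical, times are plain seconds supplied by the caller, and what counts as "awake" (not hibernating, reachable, not suspended) is left to the caller.

```python
class UptimeAccumulator:
    """Accumulate seconds of usable uptime, allowing holes in the middle."""
    REQUIRED = 24 * 60 * 60

    def __init__(self):
        self.usable_seconds = 0
        self.awake_since = None   # None while asleep or unreachable

    def wake(self, now):
        if self.awake_since is None:
            self.awake_since = now

    def sleep(self, now):
        if self.awake_since is not None:
            self.usable_seconds += now - self.awake_since
            self.awake_since = None

    def total(self, now):
        running = now - self.awake_since if self.awake_since is not None else 0
        return self.usable_seconds + running

    def ready_to_publish(self, now):
        return self.total(now) >= self.REQUIRED

HOUR = 3600
acc = UptimeAccumulator()
acc.wake(0)
acc.sleep(6 * HOUR)    # awake for 6 hours, then suspended
acc.wake(24 * HOUR)    # resumes 18 hours later
print(acc.ready_to_publish(30 * HOUR), acc.ready_to_publish(42 * HOUR))
```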
  268. What other issues are like this? It seems that just moving to a new
  269. IP address shouldn't be a reason to cancel stats publishing, assuming
  270. we were usable at each address.
  271. 7.6. IP addresses that aren't in the geoip db
  272. Some IP addresses aren't in the public geoip databases. In particular,
  273. I've found that a lot of African countries are missing, but there
  274. are also some common ones in the US that are missing, like parts of
  275. Comcast. We could just lump unknown IP addresses into the "other"
  276. category, but it might be useful to gather a general sense of how many
  277. lookups are failing entirely, by adding a separate "Unknown" category.
  278. We could also contribute back to the geoip db, by letting bridges set
  279. a config option to report the actual IP addresses that failed their
  280. lookup. Then the bridge authority operators can manually make sure
  281. the correct answer will be in later geoip files. This config option
  282. should be disabled by default.
  283. 7.7. Bringing it all together
  284. So here's the plan:
  285. 24 hours after starting up (modulo Section 7.5 above), bridges and
  286. relays should construct a daily summary of client countries they've
  287. seen, including the above "Unknown" category (Section 7.6) as well.
  288. Non-bridge relays lump all countries with fewer than K (e.g. K=5) users
  289. into the "Other" category (see Sec 7.2 above), whereas bridge relays are
  290. willing to list a country even when it has only one user for the day.
  291. Whenever we have a daily summary on record, we include it in our
  292. extrainfo document whenever we publish one. The daily summary we
  293. remember locally gets replaced with a newer one when another 24
  294. hours pass.
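The plan above can be sketched as follows. The function name and the country_of callback are hypothetical. The sketch assumes "Unknown" is always reported as its own category, even when its count is below K, which the text does not spell out.

```python
from collections import Counter

def build_daily_summary(country_of, addresses, is_bridge, k=5):
    """Count distinct client addresses per country for one 24-hour period.

    country_of(addr) returns a country code, or None if the lookup fails.
    """
    counts = Counter()
    for addr in set(addresses):            # each address counts once
        counts[country_of(addr) or "Unknown"] += 1
    if is_bridge:
        return dict(counts)                # bridges list even single users
    summary = Counter()
    for country, n in counts.items():
        if country != "Unknown" and n < k:
            summary["Other"] += n          # too few users to list by name
        else:
            summary[country] += n
    return dict(summary)

geo = {"192.0.2.%d" % i: "US" for i in range(6)}
geo.update({"198.51.100.1": "DE", "198.51.100.2": "DE"})
seen = list(geo) + ["203.0.113.5", "192.0.2.0"]   # one unmapped, one repeat
print(build_daily_summary(geo.get, seen, is_bridge=False))
print(build_daily_summary(geo.get, seen, is_bridge=True))
```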
  295. 7.8. Some forward secrecy
  296. How should we remember addresses locally? If we convert them into
  297. country-codes immediately, we will count them again if we see them
  298. again. On the other hand, we don't really want to keep a list hanging
  299. around of all IP addresses we've seen in the past 24 hours.
  300. Step one is that we should never write this stuff to disk. Keeping it
  301. only in ram will make things somewhat better. Step two is to avoid
  302. keeping any timestamps associated with it: rather than a rolling
  303. 24-hour window, which would require us to remember the various times
  304. we've seen that address, we can instead just throw out the whole list
  305. every 24 hours and start over.
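The two steps above amount to something like this sketch: an in-memory set with a single period-start time and no per-address timestamps. The class and method names are hypothetical, and a real implementation would build the daily summary before discarding the set.

```python
class SeenAddresses:
    """In-memory set of client addresses, discarded wholesale each period."""
    PERIOD = 24 * 60 * 60

    def __init__(self, now):
        self.period_start = now   # the only timestamp we keep
        self.seen = set()         # never written to disk

    def note(self, addr, now):
        """Record addr; return True if it is new in the current period."""
        if now - self.period_start >= self.PERIOD:
            self.seen = set()     # throw out the whole list and start over
            self.period_start = now
        if addr in self.seen:
            return False
        self.seen.add(addr)
        return True

s = SeenAddresses(now=0)
print(s.note("192.0.2.7", 10), s.note("192.0.2.7", 20), s.note("192.0.2.7", 86400))
```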
  306. We could hash the addresses, and then compare hashes when deciding if
  307. we've seen a given address before. We could even do keyed hashes. Or
  308. Bloom filters. But if our goal is to defend against an adversary
  309. who steals a copy of our ram while we're running and then does
  310. guess-and-check on whatever blob we're keeping, we're in bad shape.
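To make the guess-and-check concern concrete, here is a small sketch of the keyed-hash variant. HMAC-SHA256 and a random 32-byte key are assumptions chosen for illustration, and the names are hypothetical. Since the key must sit in the same RAM as the set, an adversary who copies both can test any candidate address.

```python
import hashlib
import hmac
import os

KEY = os.urandom(32)   # lives only in RAM; lost on restart

def tag(addr, key=KEY):
    """Keyed hash of an address, stored instead of the address itself."""
    return hmac.new(key, addr.encode("ascii"), hashlib.sha256).digest()

seen = {tag("192.0.2.7")}

def adversary_check(stolen_key, stolen_set, candidate):
    """Guess-and-check by someone who copied the key and the set from RAM."""
    return tag(candidate, stolen_key) in stolen_set

print(adversary_check(KEY, seen, "192.0.2.7"))
print(adversary_check(KEY, seen, "192.0.2.8"))
```

The keyed hash does help against someone who obtains only the set of tags without the key, but that is a narrower adversary than the one described above.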
  311. We could drop the last octet of the IP address as soon as we see
  312. it. That would cause us to undercount some users from cablemodem and
  313. DSL networks that have a high density of Tor users. And it wouldn't
  314. really help that much -- indeed, the extent to which it does help is
  315. exactly the extent to which it makes our stats less useful.
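A short sketch of the undercounting effect, using documentation-range addresses and a hypothetical helper name: two clients in the same /24 collapse into one entry once the last octet is dropped.

```python
def drop_last_octet(addr):
    """Zero the final octet of a dotted-quad IPv4 address."""
    a, b, c, _ = addr.split(".")
    return "%s.%s.%s.0" % (a, b, c)

clients = ["198.51.100.4", "198.51.100.77", "203.0.113.9"]
distinct_full = len(set(clients))
distinct_truncated = len({drop_last_octet(a) for a in clients})
print(distinct_full, distinct_truncated)
```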
  316. Other ideas?