- Filename: 126-geoip-fetching.txt
- Title: Getting GeoIP data and publishing usage summaries
- Version: $Revision: 11988 $
- Last-Modified: $Date: 2007-10-16 12:59:42 -0400 (Tue, 16 Oct 2007) $
- Author: Roger Dingledine
- Created: 2007-11-24
- Status: Needs-Research
- 1. Background and motivation
- Right now we can keep a rough count of Tor users, both total and by
- country, by watching connections to a single directory mirror. Being
- able to get usage estimates is useful both for our funders (to
- demonstrate progress) and for our own development (so we know how
- quickly we're scaling and can design accordingly, and so we know which
- countries and communities to focus on more). This need for information
- is the only reason we haven't deployed "directory guards" (think of
- them like entry guards but for directory information; in practice,
- it would seem that Tor clients should simply use their entry guards
- as their directory guards; see also proposal 125).
- With the move toward bridges, we will no longer be able to track Tor
- clients that use bridges, since they use their bridges as directory
- guards. Further, we need to be able to learn which bridges stop seeing
- use from certain countries (and are thus likely blocked), so we can
- avoid giving them out to other users in those countries.
- Right now we already do GeoIP lookups in Vidalia: Vidalia draws relays
- and circuits on its 'network map', and it performs anonymized GeoIP
- lookups to its central servers to know where to put the dots. Vidalia
- caches answers it gets -- to reduce delay, to reduce overhead on
- the network, and to reduce anonymity issues where users reveal their
- knowledge about the network through which IP addresses they ask about.
- But with the advent of bridges, Tor clients are asking about IP
- addresses that aren't in the main directory. In particular, bridge
- users inform the central Vidalia servers about each bridge as they
- discover it and their Vidalia tries to map it.
- Also, we wouldn't mind letting Vidalia do a GeoIP lookup on the client's
- own IP address, so it can provide a more useful map.
- Finally, Vidalia's central servers leave users open to partitioning
- attacks, even if they can't target specific users. Further, as we
- start using GeoIP results for more operational or security-relevant
- goals, such as avoiding or including particular countries in circuits,
- it becomes more important that users can't be singled out in terms of
- their IP-to-country mapping beliefs.
- 2. The available GeoIP databases
- There are at least two classes of GeoIP database out there: "IP to
- country", which tells us the country code for the IP address but
- no more details, and "IP to city", which tells us the country code,
- the name of the city, and some basic latitude/longitude guesses.
- A recent ip-to-country.csv is 3421362 bytes. Compressed, it is 564252
- bytes. A typical line is:
- "205500992","208605279","US","USA","UNITED STATES"
- http://ip-to-country.webhosting.info/node/view/5
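- As a rough sketch of how a relay or controller might do a local
- lookup against a file in this format: the code below assumes each row
- is "start","end","CC","CCC","NAME" with start and end being the
- inclusive IPv4 range as 32-bit integers, which is how the sample line
- reads. The function names are hypothetical, not part of Tor.

```python
import bisect
import csv
import io
import ipaddress


def load_ranges(csv_text):
    """Parse rows of "start","end","CC",... into a list sorted by start."""
    ranges = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or short rows
        try:
            ranges.append((int(row[0]), int(row[1]), row[2]))
        except ValueError:
            continue  # skip comments or header rows
    ranges.sort()
    return ranges


def lookup_country(ranges, ip):
    """Return the two-letter country code for an IPv4 address, or None."""
    value = int(ipaddress.IPv4Address(ip))
    # Find the last range whose start is <= value. The (value, inf) key
    # sorts after every range that starts exactly at value.
    i = bisect.bisect_right(ranges, (value, float("inf"))) - 1
    if i >= 0 and ranges[i][0] <= value <= ranges[i][1]:
        return ranges[i][2]
    return None


SAMPLE = '"205500992","208605279","US","USA","UNITED STATES"\n'
ranges = load_ranges(SAMPLE)
print(lookup_country(ranges, "12.64.0.1"))
```

- The binary search keeps each lookup logarithmic in the number of
- ranges, which matters if a relay classifies every incoming client.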
- Similarly, the maxmind GeoLite Country database is also about 500KB
- compressed.
- http://www.maxmind.com/app/geolitecountry
- The maxmind GeoLite City database gives more fine-grained detail, such
- as geo coordinates and city name. Vidalia currently makes use of this
- information. On the other hand it's 16MB compressed. A typical line is:
- 206.124.149.146,Bellevue,WA,US,47.6051,-122.1134
- http://www.maxmind.com/app/geolitecity
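- For illustration, a line in the shape shown above could be split into
- named fields as follows. The field names (ip, city, region, country,
- latitude, longitude) are guesses from the sample line, not taken from
- the maxmind documentation, and this naive split would break on a city
- name that contains a comma.

```python
from typing import NamedTuple


class CityRecord(NamedTuple):
    ip: str
    city: str
    region: str
    country: str
    latitude: float
    longitude: float


def parse_city_line(line):
    """Split one comma-separated IP-to-city line into a CityRecord."""
    ip, city, region, country, lat, lon = line.strip().split(",")
    return CityRecord(ip, city, region, country, float(lat), float(lon))


record = parse_city_line("206.124.149.146,Bellevue,WA,US,47.6051,-122.1134")
print(record.city, record.country, record.latitude, record.longitude)
```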
- There are other databases out there, like
- http://www.hostip.info/faq.html
- http://www.webconfs.com/ip-to-city.php
- that deserve a closer look, but for now let's assume that all the
- databases are around this size.
- 3. What we'd like to solve
- Goal #1a: Tor relays collect IP-to-country user stats and publish
- sanitized versions.
- Goal #1b: Tor bridges collect IP-to-country user stats and publish
- sanitized versions.
- Goal #2a: Vidalia learns IP-to-city stats for Tor relays, for better
- mapping.
- Goal #2b: Vidalia learns IP-to-country stats for Tor relays, so the user
- can pick countries for her paths.
- Goal #3: Vidalia doesn't do external lookups on bridge relay addresses.
- Goal #4: Vidalia resolves the Tor client's IP-to-country or IP-to-city
- for better mapping.
- Goal #5: Reduce partitioning opportunities where Vidalia central
- servers can give different (distinguishing) responses.
- 4. Solution overview
- Our goal is to allow Tor relays, bridges, and clients to learn enough
- GeoIP information so they can do local private queries.
- 4.1. The IP-to-country db
- Directory authorities should publish a "geoip" file that contains
- IP-to-country mappings. Directory caches will mirror it, and Tor clients
- and relays (including bridge relays) will fetch it. Thus we can solve
- goals 1a and 1b (publish sanitized usage info). Controllers could also
- use this to solve goal 2b (choosing path by country attributes). It
- also solves goal 4 (learning the Tor client's country), though for
- huge countries like the US we'd still need to decide where the "middle"
- should be when we're mapping that address.
- The IP-to-country details are described further in Sections 5 and
- 6 below.
- 4.2. The IP-to-city db
- In an ideal world, the IP-to-city db would be small enough that we
- could distribute it in the above manner too. But for now, it is too
- large. Here's where the design choice forks.
- Option A: Vidalia should continue doing its anonymized IP-to-city
- queries. Thus we can achieve goals 2a and 2b. We would solve goal
- 3 by only doing lookups on descriptors that are purpose "general"
- (see Section 4.2.1 for how). We would leave goal 5 unsolved.
- Option B: Each directory authority should keep an IP-to-city db,
- look up the value for each router it lists, and include that line in
- the router's network-status entry. The network-status consensus would
- then use the line that appears in the majority of votes. This approach
- also solves goals 2a and 2b, goal 3 (Vidalia doesn't do any lookups
- at all now), and goal 5 (reduced partitioning risks).
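- The majority rule in Option B could be sketched like this. It is only
- an illustration: it assumes each authority contributes one IP-to-city
- line per router as a plain string, and it treats "majority" as a
- strict majority of the votes cast. The function name is hypothetical.

```python
from collections import Counter


def consensus_line(votes):
    """Return the line that appears in a strict majority of votes, or None."""
    if not votes:
        return None
    line, count = Counter(votes).most_common(1)[0]
    return line if count > len(votes) / 2 else None


votes = [
    "Bellevue,WA,US,47.6051,-122.1134",
    "Bellevue,WA,US,47.6051,-122.1134",
    "Seattle,WA,US,47.6062,-122.3321",
]
print(consensus_line(votes))
```

- With no strict majority the sketch returns None, meaning the consensus
- would carry no IP-to-city line for that router.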
- Option B has the advantage that Vidalia can simplify its operation,
- and the advantage that this consensus IP-to-city data is available to
- other controllers besides just Vidalia. But it has the disadvantage
- that the networkstatus consensus becomes larger, even though most of
- the GeoIP information won't change from one consensus to the next. Is
- there another reasonable location for it that can provide similar
- consensus security properties?
- 4.2.1. Controllers can query for router annotations
- Vidalia needs to stop doing queries on bridge relay IP addresses.
- It could do that by only doing lookups on descriptors that are in
- the networkstatus consensus, but that precludes designs like Blossom
- that might want to map its relay locations. The best answer is that it
- should learn the router annotations, with a new controller 'getinfo'
- command:
- "GETINFO router-annotations/id/<OR identity>" or
- "GETINFO router-annotations/name/<OR nickname>"
- which would respond with something like
- @downloaded-at 2007-11-29 08:06:38
- @source "128.31.0.34"
- @purpose bridge
- [We could also make the answer include the digest for the router in
- question, which would enable us to ask GETINFO router-annotations/all.
- Is this worth it? -RD]
- Then Vidalia can avoid doing lookups on descriptors with purpose
- "bridge". Even better would be to add a new annotation "@private true"
- so Vidalia can know how to handle new purposes that we haven't created
- yet. Vidalia could special-case "bridge" for now, for compatibility
- with the current 0.2.0.x-alphas.
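- A controller-side check along these lines might look like the sketch
- below. It assumes the reply body is one "@key value" annotation per
- line, as in the example above, and that "@private true" is spelled
- exactly that way. Refusing the lookup when no purpose is given is an
- extra conservative assumption, and both function names are
- hypothetical.

```python
def parse_annotations(reply_text):
    """Turn lines like '@purpose bridge' into a {'purpose': 'bridge'} dict."""
    annotations = {}
    for line in reply_text.splitlines():
        line = line.strip()
        if not line.startswith("@"):
            continue  # ignore anything that is not an annotation
        key, _, value = line[1:].partition(" ")
        annotations[key] = value.strip().strip('"')
    return annotations


def may_look_up_externally(annotations):
    """Decide whether a controller may send this router's IP to a GeoIP server."""
    if annotations.get("private") == "true":
        return False
    purpose = annotations.get("purpose")
    if purpose is None:
        return False  # assumption: stay conservative when purpose is unknown
    return purpose != "bridge"  # special case for current 0.2.0.x-alphas


REPLY = '''@downloaded-at 2007-11-29 08:06:38
@source "128.31.0.34"
@purpose bridge
'''
annotations = parse_annotations(REPLY)
print(may_look_up_externally(annotations))
```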
- 4.3. Recommendation
- My overall recommendation is that we should implement 4.1 soon
- (e.g. early in 0.2.1.x), and we can go with 4.2 option A for now,
- with the hope that later we discover a better way to distribute the
- IP-to-city info and can switch to 4.2 option B.
- Below we discuss more how to go about achieving 4.1.
- 5. Publishing and caching the GeoIP (IP-to-country) database
- Each v3 directory authority should put a copy of the "geoip" file in
- its datadirectory. Then its network-status votes should include a hash
- of this file (Recommended-geoip-hash: %s), and the resulting consensus
- directory should specify the consensus hash.
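- To make the hash step concrete, here is a small sketch. The proposal
- does not say which digest to use, so SHA-1 in lowercase hex is purely
- an assumption here, and the helper names are hypothetical. Only the
- "Recommended-geoip-hash: %s" line format comes from the text above.

```python
import hashlib


def geoip_digest(data):
    """Digest of the geoip file contents (SHA-1 hex is an assumption)."""
    return hashlib.sha1(data).hexdigest()


def vote_line(data):
    """Format the line an authority would put in its network-status vote."""
    return "Recommended-geoip-hash: %s" % geoip_digest(data)


def matches_consensus(data, consensus_hash):
    """Check a downloaded geoip file against the hash named in the consensus."""
    return geoip_digest(data) == consensus_hash.strip().lower()


geoip_bytes = b'"205500992","208605279","US","USA","UNITED STATES"\n'
line = vote_line(geoip_bytes)
print(line)
print(matches_consensus(geoip_bytes, line.split(": ", 1)[1]))
```

- A cache or client would run the same check after each download, and
- discard a file whose digest does not match the consensus.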
- There should be a new URL for fetching this geoip db (by "current.z"
- for testing purposes, and by hash.z for typical downloads). Authorities
- should fetch and serve the one listed in the consensus, even when they
- vote for their own. This would argue for storing the cached version
- in a better filename than "geoip".
- Directory mirrors should keep a copy of this file available via the
- same URLs.
- We assume that the file would change at most a few times a month. Should
- Tor ship with a bootstrap geoip file? An out-of-date geoip file may
- open you up to partitioning attacks, but for the most part it won't
- be that different.
- There should be a config option to disable updating the geoip file,
- in case users want to use their own file (e.g. they have a proprietary
- GeoIP file they prefer to use). In that case we leave it up to the
- user to update his geoip file out-of-band.
- [XXX Should consider forward/backward compatibility, e.g. if we want
- to move to a new geoip file format. -RD]
- 6. Controllers use the IP-to-country db for mapping and for path building
- Down the road, Vidalia can use the IP-to-country mappings for placing
- on its map:
- - The location of the client
- - The location of the bridges, or other relays not in the
- networkstatus, on the map.
- - Any relays that it doesn't yet have an IP-to-city answer for.
- Other controllers can also use it to set EntryNodes, ExitNodes, etc.
- in a per-country way.
- To support these features, we need to export the IP-to-country data
- via the Tor controller protocol.
- Is it sufficient just to add a new GETINFO command?
- GETINFO ip-to-country/128.31.0.34
- 250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"
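- If the reply does take the shape shown above, a controller could
- parse it roughly as follows. The reply format is only the one floated
- in this section, not an implemented command, and the function name is
- hypothetical. The sketch accepts "250" followed by "+", "-" or a space
- so that it does not depend on which reply style is finally chosen.

```python
import csv


def parse_ip_to_country(reply_line):
    """Return (ip, (cc2, cc3, name)) from a one-line ip-to-country reply."""
    status, sep, rest = reply_line[:3], reply_line[3:4], reply_line[4:]
    if status != "250" or sep not in ("+", "-", " "):
        raise ValueError("unexpected reply: %r" % reply_line)
    key, eq, value = rest.strip().partition("=")
    prefix = "ip-to-country/"
    if not eq or not key.startswith(prefix):
        raise ValueError("not an ip-to-country reply: %r" % reply_line)
    # The value is a comma-separated list of quoted strings.
    fields = next(csv.reader([value]))
    if len(fields) != 3:
        raise ValueError("expected three fields, got %d" % len(fields))
    return key[len(prefix):], tuple(fields)


reply = '250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"'
ip, country = parse_ip_to_country(reply)
print(ip, country)
```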
- 7. Relays and bridges use the IP-to-country db for usage summaries
- Once bridges have a GeoIP database locally, they can start to publish
- sanitized summaries of client usage -- how many users they see and from
- what countries. This might also be a more useful way for ordinary Tor
- relays to convey the level of usage they see, which would allow us to
- switch to using directory guards for all users by default.
- But safely summarizing this information without opening too many
- anonymity leaks seems hard.
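- Purely to make the shape of such a summary concrete, the sketch below
- counts distinct client addresses per country and rounds each count up
- to a multiple of a bin size. The rounding step and the bin size of 8
- are illustrative assumptions, not something this proposal specifies,
- and nothing here argues that rounding alone is a sufficient defense.
- The function name and the "??" code for unknown addresses are
- hypothetical.

```python
import math
from collections import defaultdict


def summarize_usage(client_ips, country_of, bin_size=8):
    """Count distinct client IPs per country, rounded up to bin_size."""
    seen = defaultdict(set)
    for ip in client_ips:
        # country_of is any local lookup returning a country code or None.
        seen[country_of(ip) or "??"].add(ip)
    return {
        cc: math.ceil(len(ips) / bin_size) * bin_size
        for cc, ips in sorted(seen.items())
    }


table = {"1.1.1.1": "us", "2.2.2.2": "us", "3.3.3.3": "de"}
observed = ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"]
summary = summarize_usage(observed, table.get)
print(summary)
```

- Only the rounded per-country totals would be published. The raw
- address sets stay on the relay and should be discarded afterwards.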