|
@@ -1,10 +1,10 @@
|
|
Filename: 126-geoip-fetching.txt
|
|
Filename: 126-geoip-fetching.txt
|
|
-Title: Fetching GeoIP databases for clients, relays, and bridges
|
|
+Title: Getting GeoIP data and publishing usage summaries
|
|
Version: $Revision: 11988 $
|
|
Version: $Revision: 11988 $
|
|
Last-Modified: $Date: 2007-10-16 12:59:42 -0400 (Tue, 16 Oct 2007) $
|
|
Last-Modified: $Date: 2007-10-16 12:59:42 -0400 (Tue, 16 Oct 2007) $
|
|
Author: Roger Dingledine
|
|
Author: Roger Dingledine
|
|
Created: 2007-11-24
|
|
Created: 2007-11-24
|
|
-Status: Open
|
|
+Status: Researching
|
|
|
|
|
|
1. Background and motivation
|
|
1. Background and motivation
|
|
|
|
|
|
@@ -17,7 +17,7 @@ Status: Open
|
|
is the only reason we haven't deployed "directory guards" (think of
|
|
is the only reason we haven't deployed "directory guards" (think of
|
|
them like entry guards but for directory information; in practice,
|
|
them like entry guards but for directory information; in practice,
|
|
it would seem that Tor clients should simply use their entry guards
|
|
it would seem that Tor clients should simply use their entry guards
|
|
- as their directory guards).
|
|
+ as their directory guards; see also proposal 125).
|
|
|
|
|
|
With the move toward bridges, we will no longer be able to track Tor
|
|
With the move toward bridges, we will no longer be able to track Tor
|
|
clients that use bridges, since they use their bridges as directory
|
|
clients that use bridges, since they use their bridges as directory
|
|
@@ -25,40 +25,137 @@ Status: Open
|
|
use from certain countries (and are thus likely blocked), so we can
|
|
use from certain countries (and are thus likely blocked), so we can
|
|
avoid giving them out to other users in those countries.
|
|
avoid giving them out to other users in those countries.
|
|
|
|
|
|
- Right now we support GeoIP lookups through Vidalia: Vidalia draws relays
|
|
+ Right now we already do GeoIP lookups in Vidalia: Vidalia draws relays
|
|
and circuits on its 'network map', and it performs anonymized GeoIP
|
|
and circuits on its 'network map', and it performs anonymized GeoIP
|
|
lookups to its central servers to know where to put the dots. Vidalia
|
|
lookups to its central servers to know where to put the dots. Vidalia
|
|
caches answers it gets -- to reduce delay, to reduce overhead on
|
|
caches answers it gets -- to reduce delay, to reduce overhead on
|
|
the network, and to reduce anonymity issues where users reveal their
|
|
the network, and to reduce anonymity issues where users reveal their
|
|
- behavior through which IP addresses they ask about.
|
|
+ knowledge about the network through which IP addresses they ask about.
|
|
|
|
|
|
But with the advent of bridges, Tor clients are asking about IP
|
|
But with the advent of bridges, Tor clients are asking about IP
|
|
addresses that aren't in the main directory. In particular, bridge
|
|
addresses that aren't in the main directory. In particular, bridge
|
|
- users tell the central Vidalia servers about each bridge as they
|
|
+ users inform the central Vidalia servers about each bridge as they
|
|
discover it and their Vidalia tries to map it.
|
|
discover it and their Vidalia tries to map it.
|
|
|
|
|
|
Also, we wouldn't mind letting Vidalia do a GeoIP lookup on the client's
|
|
Also, we wouldn't mind letting Vidalia do a GeoIP lookup on the client's
|
|
own IP address, so it can provide a more useful map.
|
|
own IP address, so it can provide a more useful map.
|
|
|
|
|
|
- Also, Vidalia's central servers leave users open to partitioning
|
|
+ Finally, Vidalia's central servers leave users open to partitioning
|
|
attacks, even if they can't target specific users. Further, as we
|
|
attacks, even if they can't target specific users. Further, as we
|
|
start using GeoIP results for more operational or security-relevant
|
|
start using GeoIP results for more operational or security-relevant
|
|
goals, such as avoiding or including particular countries in circuits,
|
|
goals, such as avoiding or including particular countries in circuits,
|
|
it becomes more important that users can't be singled out in terms of
|
|
it becomes more important that users can't be singled out in terms of
|
|
their IP-to-country mapping beliefs.
|
|
their IP-to-country mapping beliefs.
|
|
|
|
|
|
- This proposal describes a way for Tor relays, bridges, and clients to
|
|
+2. The available GeoIP databases
|
|
- download a local copy of a GeoIP database, so they can do local private
|
|
|
|
- queries. Thus we can avoid sending detailed queries to central servers.
|
|
|
|
|
|
|
|
-2. Publishing and caching the GeoIP database
|
|
+ There are at least two classes of GeoIP database out there: "IP to
|
|
|
|
+ country", which tells us the country code for the IP address but
|
|
|
|
+ no more details, and "IP to city", which tells us the country code,
|
|
|
|
+ the name of the city, and some basic latitude/longitude guesses.
|
|
|
|
|
|
- We assume that we use a free GeoIP db, like ip2country. We will need
|
|
+ A recent ip-to-country.csv is 3421362 bytes. Compressed, it is 564252
|
|
- to standardize on its format; see Section 5.
|
|
+ bytes. A typical line is:
|
|
|
|
+ "205500992","208605279","US","USA","UNITED STATES"
|
|
|
|
+ http://ip-to-country.webhosting.info/node/view/5
|
|
|
|
+
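As a rough illustration of how a local, private lookup against a file in this format could work, here is a minimal Python sketch. It assumes each line carries a range start and end as decimal IPv4 integers followed by the two-letter country code, as in the sample line above. The function names load_ranges and lookup are made up for this example and are not part of Tor.

```python
import bisect
import csv
import io
import ipaddress

def load_ranges(csv_text):
    """Parse ip-to-country.csv text into a sorted list of (start, end, code)."""
    ranges = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        ranges.append((int(row[0]), int(row[1]), row[2]))
    ranges.sort()
    return ranges

def lookup(ranges, ip):
    """Return the two-letter country code for a dotted-quad IPv4 address, or None."""
    n = int(ipaddress.IPv4Address(ip))
    # Index of the last range whose start is <= n (hypothetical helper logic).
    i = bisect.bisect_right(ranges, (n, float("inf"), "")) - 1
    if i >= 0 and ranges[i][0] <= n <= ranges[i][1]:
        return ranges[i][2]
    return None

# The sample line from the text covers 12.63.178.64 through 12.111.16.95.
ranges = load_ranges('"205500992","208605279","US","USA","UNITED STATES"\n')
print(lookup(ranges, "12.100.0.1"))  # prints US
```

Because the lookup runs entirely against a local copy of the file, no query about a particular IP address ever leaves the machine.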
|
|
|
|
+ Similarly, the maxmind GeoLite Country database is also about 500KB
|
|
|
|
+ compressed.
|
|
|
|
+ http://www.maxmind.com/app/geolitecountry
|
|
|
|
+
|
|
|
|
+ The maxmind GeoLite City database gives more fine-grained detail, such
|
|
|
|
+ as geo coordinates and city name. Vidalia currently makes use of this
|
|
|
|
+ information. On the other hand it's 16MB compressed. A typical line is:
|
|
|
|
+ 206.124.149.146,Bellevue,WA,US,47.6051,-122.1134
|
|
|
|
+ http://www.maxmind.com/app/geolitecity
|
|
|
|
+
|
|
|
|
+ There are other databases out there, like
|
|
|
|
+ http://www.hostip.info/faq.html
|
|
|
|
+ http://www.webconfs.com/ip-to-city.php
|
|
|
|
+ that want more attention, but for now let's assume that all the db's
|
|
|
|
+ are around this size.
|
|
|
|
+
|
|
|
|
+3. What we'd like to solve
|
|
|
|
+
|
|
|
|
+ Goal #1a: Tor relays collect IP-to-country user stats and publish
|
|
|
|
+ sanitized versions.
|
|
|
|
+ Goal #1b: Tor bridges collect IP-to-country user stats and publish
|
|
|
|
+ sanitized versions.
|
|
|
|
+
|
|
|
|
+ Goal #2a: Vidalia learns IP-to-city stats for Tor relays, for better
|
|
|
|
+ mapping.
|
|
|
|
+ Goal #2b: Vidalia learns IP-to-country stats for Tor relays, so the user
|
|
|
|
+ can pick countries for her paths.
|
|
|
|
+
|
|
|
|
+ Goal #3: Vidalia doesn't do external lookups on bridge relay addresses.
|
|
|
|
+
|
|
|
|
+ Goal #4: Vidalia resolves the Tor client's IP-to-country or IP-to-city
|
|
|
|
+ for better mapping.
|
|
|
|
+
|
|
|
|
+ Goal #5: Reduce partitioning opportunities where Vidalia central
|
|
|
|
+ servers can give different (distinguishing) responses.
|
|
|
|
+
|
|
|
|
+4. Solution overview
|
|
|
|
+
|
|
|
|
+ Our goal is to allow Tor relays, bridges, and clients to learn enough
|
|
|
|
+ GeoIP information so they can do local private queries.
|
|
|
|
+
|
|
|
|
+4.1. The IP-to-country db
|
|
|
|
+
|
|
|
|
+ Directory authorities should publish a "geoip" file that contains
|
|
|
|
+ IP-to-country mappings. Directory caches will mirror it, and Tor clients
|
|
|
|
+ and relays (including bridge relays) will fetch it. Thus we can solve
|
|
|
|
+ goals 1a and 1b (publish sanitized usage info). Controllers could also
|
|
|
|
+ use this to solve goal 2b (choosing path by country attributes). It
|
|
|
|
+ also solves goal 4 (learning the Tor client's country), though for
|
|
|
|
+ huge countries like the US we'd still need to decide where the "middle"
|
|
|
|
+ should be when we're mapping that address.
|
|
|
|
+
|
|
|
|
+ The IP-to-country details are described further in Sections 5 and
|
|
|
|
+ 6 below.
|
|
|
|
+
|
|
|
|
+4.2. The IP-to-city db
|
|
|
|
+
|
|
|
|
+ In an ideal world, the IP-to-city db would be small enough that we
|
|
|
|
+ could distribute it in the above manner too. But for now, it is too
|
|
|
|
+ large. Here's where the design choice forks.
|
|
|
|
+
|
|
|
|
+ Option A: Vidalia should continue doing its anonymized IP-to-city
|
|
|
|
+ queries. Thus we can achieve goals 2a and 2b. We would solve goal
|
|
|
|
+ 3 by only doing lookups on descriptors that are purpose "general"
|
|
|
|
+ (or, alternately, by only doing lookups on descriptors that are in
|
|
|
|
+ the networkstatus consensus). We would leave goal 5 unsolved.
|
|
|
|
+
|
|
|
|
+ Option B: Each directory authority should keep an IP-to-city db,
|
|
|
|
+ look up the value for each router it lists, and include that line in
|
|
|
|
+ the router's network-status entry. The network-status consensus would
|
|
|
|
+ then use the line that appears in the majority of votes. This approach
|
|
|
|
+ also solves goals 2a and 2b, goal 3 (Vidalia doesn't do any lookups
|
|
|
|
+ at all now), and goal 5 (reduced partitioning risks).
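The majority rule in Option B could be sketched as follows. This is only an illustration under assumptions the proposal does not settle: each authority contributes one IP-to-city line per router, a line wins only with a strict majority of the votes, and the line format simply mirrors the GeoLite City fields. The function name consensus_geoip_line is hypothetical.

```python
from collections import Counter

def consensus_geoip_line(vote_lines):
    """Return the line listed by a strict majority of votes, else None.

    vote_lines holds one IP-to-city line per authority vote for a single
    router (an assumption; the proposal does not define the vote format).
    """
    if not vote_lines:
        return None
    line, count = Counter(vote_lines).most_common(1)[0]
    return line if count > len(vote_lines) / 2 else None

votes = [
    "Bellevue,WA,US,47.6051,-122.1134",
    "Bellevue,WA,US,47.6051,-122.1134",
    "Seattle,WA,US,47.6062,-122.3321",
]
print(consensus_geoip_line(votes))  # prints the Bellevue line (2 of 3 votes)
```

With no majority the sketch returns None, which corresponds to leaving the line out of the consensus entry. Whether that is the right fallback is an open design question.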
|
|
|
|
+
|
|
|
|
+ Option B has the advantage that Vidalia can simplify its operation,
|
|
|
|
+ and the advantage that this consensus IP-to-city data is available to
|
|
|
|
+ other controllers besides just Vidalia. But it has the disadvantage
|
|
|
|
+ that the networkstatus consensus becomes larger, even though most of
|
|
|
|
+ the GeoIP information won't change from one consensus to the next. Is
|
|
|
|
+ there another reasonable location for it that can provide similar
|
|
|
|
+ consensus security properties?
|
|
|
|
+
|
|
|
|
+4.3. Recommendation
|
|
|
|
+
|
|
|
|
+ My overall recommendation is that we should implement 4.1 soon
|
|
|
|
+ (e.g. early in 0.2.1.x), and we can go with 4.2 option A for now,
|
|
|
|
+ with the hope that later we discover a better way to distribute the
|
|
|
|
+ IP-to-city info and can switch to 4.2 option B.
|
|
|
|
+
|
|
|
|
+ Below we discuss more how to go about achieving 4.1.
|
|
|
|
+
|
|
|
|
+5. Publishing and caching the GeoIP (IP-to-country) database
|
|
|
|
|
|
Each v3 directory authority should put a copy of the "geoip" file in
|
|
Each v3 directory authority should put a copy of the "geoip" file in
|
|
- its datadirectory. Then its votes should include a hash of this file,
|
|
+ its datadirectory. Then its network-status votes should include a hash
|
|
- and the resulting consensus directory should specify the consensus hash.
|
|
+ of this file (Recommended-geoip-hash: %s), and the resulting consensus
|
|
|
|
+ directory should specify the consensus hash.
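A minimal sketch of the hashing step, for illustration only. The proposal does not name the hash function or say whether the hash covers the compressed or uncompressed file. This example assumes a hex-encoded SHA-1 of the uncompressed contents and assumes the ".z" download is zlib-compressed. The function names are hypothetical.

```python
import hashlib
import zlib

def geoip_hash(data: bytes) -> str:
    """Hex digest of the uncompressed geoip file (SHA-1 is an assumption)."""
    return hashlib.sha1(data).hexdigest()

def vote_line(data: bytes) -> str:
    """Format the line an authority would put in its network-status vote."""
    return "Recommended-geoip-hash: %s" % geoip_hash(data)

def fetched_file_matches(compressed: bytes, consensus_hash: str) -> bool:
    """Check a downloaded hash.z body against the hash in the consensus."""
    return geoip_hash(zlib.decompress(compressed)) == consensus_hash

data = b'"205500992","208605279","US","USA","UNITED STATES"\n'
print(vote_line(data))
print(fetched_file_matches(zlib.compress(data), geoip_hash(data)))  # prints True
```

A client or cache that fetches by hash can run the last check before replacing its local copy, so a mirror cannot hand out a file the authorities did not agree on.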
|
|
|
|
|
|
There should be a new URL for fetching this geoip db (by "current.z"
|
|
There should be a new URL for fetching this geoip db (by "current.z"
|
|
for testing purposes, and by hash.z for typical downloads). Authorities
|
|
for testing purposes, and by hash.z for typical downloads). Authorities
|
|
@@ -70,55 +167,42 @@ Status: Open
|
|
same URLs.
|
|
same URLs.
|
|
|
|
|
|
We assume that the file would change at most a few times a month. Should
|
|
We assume that the file would change at most a few times a month. Should
|
|
- Tor ship with a bootstrap geoip file?
|
|
+ Tor ship with a bootstrap geoip file? An out-of-date geoip file may
|
|
-
|
|
+ open you up to partitioning attacks, but for the most part it won't
|
|
-3. Clients use it for Vidalia
|
|
+ be that different.
|
|
-
|
|
|
|
- Tor fetches the geoip file as above, and puts it in Tor's DataDirectory.
|
|
|
|
- Then we could have a status event that tells controllers that a new
|
|
|
|
- geoip file has arrived.
|
|
|
|
-
|
|
|
|
- Then Vidalia would either read the file directly, or we would add
|
|
|
|
- a control protocol interface for querying. Since Tor probably needs
|
|
|
|
- to parse the file itself (see Section 4 below), offering the control
|
|
|
|
- interface is probably cleanest.
|
|
|
|
|
|
|
|
There should be a config option to disable updating the geoip file,
|
|
There should be a config option to disable updating the geoip file,
|
|
in case users want to use their own file (e.g. they have a proprietary
|
|
in case users want to use their own file (e.g. they have a proprietary
|
|
GeoIP file they prefer to use). In that case we leave it up to the
|
|
GeoIP file they prefer to use). In that case we leave it up to the
|
|
user to update his geoip file out-of-band.
|
|
user to update his geoip file out-of-band.
|
|
|
|
|
|
-4. Bridges use it for usage summaries
|
|
+ [XXX Should consider forward/backward compatibility, e.g. if we want
|
|
|
|
+ to move to a new geoip file format. -RD]
|
|
|
|
|
|
- Once bridges have a GeoIP database locally, they can start to publish
|
|
+6. Controllers use the IP-to-country db for mapping and for path building
|
|
- sanitized summaries of client usage -- how many users they see and from
|
|
|
|
- what countries. This might also be a more useful way for ordinary Tor
|
|
|
|
- relays to convey the level of usage they see.
|
|
|
|
|
|
|
|
- But how to safely summarize this information without opening too many
|
|
+ Vidalia can use the IP-to-country mappings to place on its map:
|
|
- anonymity leaks seems hard, so I'm going to leave it for a different
|
|
+ - The location of the client
|
|
- proposal.
|
|
+ - The location of the bridges, or other relays not in the
|
|
|
|
+ networkstatus.
|
|
|
|
+ - Any relays that it doesn't yet have an IP-to-city answer for.
|
|
|
|
|
|
-5. Which db to use?
|
|
+ Controllers can also use it to set EntryNodes, ExitNodes, etc. in a
|
|
|
|
+ per-country way. To support this feature, we need to export the
|
|
|
|
+ IP-to-country data via the Tor controller protocol.
|
|
|
|
|
|
- A recent ip-to-country.csv is 3421362 bytes. Compressed, it is 564252
|
|
+ Is it sufficient just to add a new GETINFO command? For example:
|
|
- bytes. This isn't so bad. But we can easily cut it down further; some
|
|
+ GETINFO ip-to-country/128.31.0.34
|
|
- sample lines are:
|
|
+ 250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"
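To show what a controller would have to do with such a reply, here is a small Python sketch that parses the example line above. The reply format is only floated as a question in this proposal, so everything here is an assumption about a command that may end up looking different. The function name parse_ip_to_country_reply is hypothetical.

```python
import csv

PREFIX = "ip-to-country/"

def parse_ip_to_country_reply(line):
    """Split a reply line into (ip, [code2, code3, country_name]).

    Accepts "250" followed by "+", "-", or " ", then key=value, as in the
    example reply in the text. Raises ValueError on anything else.
    """
    if len(line) < 4 or line[:3] != "250" or line[3] not in "+- ":
        raise ValueError("not a 250 reply: %r" % line)
    key, sep, value = line[4:].partition("=")
    if not sep or not key.startswith(PREFIX):
        raise ValueError("unexpected key: %r" % key)
    # The value uses the same quoted, comma-separated form as the csv file.
    fields = next(csv.reader([value]))
    return key[len(PREFIX):], fields

reply = '250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"'
print(parse_ip_to_country_reply(reply))
```

A controller could then feed the two-letter code into whatever per-country policy it uses when setting EntryNodes or ExitNodes.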
|
|
- "205500992","208605279","US","USA","UNITED STATES"
|
|
|
|
- "208605280","208605311","CA","CAN","CANADA"
|
|
|
|
- "208605312","210784255","US","USA","UNITED STATES"
|
|
|
|
- My guess is the compression will solve most of the redundancy, so we
|
|
|
|
- can stick with the default format.
|
|
|
|
- http://ip-to-country.webhosting.info/node/view/5
|
|
|
|
|
|
|
|
- The maxmind GeoLite Country database is also about 500KB compressed.
|
|
+7. Relays and bridges use the IP-to-country db for usage summaries
|
|
- http://www.maxmind.com/app/geolitecountry
|
|
|
|
|
|
|
|
- The maxmind GeoLite City database gives more finegrained detail, such
|
|
+ Once bridges have a GeoIP database locally, they can start to publish
|
|
- as geo coordinates and city name. Vidalia currently makes use of this
|
|
+ sanitized summaries of client usage -- how many users they see and from
|
|
- information. On the other hand it's 16MB compressed, which would seem
|
|
+ what countries. This might also be a more useful way for ordinary Tor
|
|
- to be out of our reach.
|
|
+ relays to convey the level of usage they see, which would allow us to
|
|
- http://www.maxmind.com/app/geolitecity
|
|
+ switch to using directory guards for all users by default.
|
|
|
|
|
|
- What other options are there?
|
|
+ But how to safely summarize this information without opening too many
|
|
|
|
+ anonymity leaks seems hard...
|
|
|
|
|