18 years ago · 628697acfa
--- a/doc/spec/proposals/126-geoip-reporting.txt
+++ b/doc/spec/proposals/126-geoip-reporting.txt
@@ -205,7 +205,7 @@ Status: Needs-Research
 
															 6. Controllers use the IP-to-country db for mapping and for path building
														
 
															-  Down the road, vidalia can use the IP-to-country mappings for placing
														
 
															+  Down the road, Vidalia could use the IP-to-country mappings for placing
														
 
															   on its map:
														
 
															   - The location of the client
														
 
															   - The location of the bridges, or other relays not in the
														
@@ -222,6 +222,14 @@ Status: Needs-Research
 
															     GETINFO ip-to-country/128.31.0.34
														
 
															     250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"
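
A controller-side sketch of reading the reply shown above. The reply layout is taken only from the single example in this section, and the helper name parse_ip_to_country is hypothetical, so treat this as an illustration rather than a description of any shipped controller code.

```python
import re

# Assumption: a reply line looks like the one example above, i.e.
#   250+ip-to-country/<ipv4>="<2-letter>","<3-letter>","<NAME>"
_REPLY_RE = re.compile(r'^250[+-]ip-to-country/(?P<ip>[0-9.]+)=(?P<rest>.*)$')

def parse_ip_to_country(line):
    """Return (ip, two_letter_code, three_letter_code, country_name)."""
    match = _REPLY_RE.match(line.strip())
    if match is None:
        raise ValueError("not an ip-to-country reply: %r" % line)
    # Pull out the quoted fields so that commas inside a country
    # name do not split it.
    fields = re.findall(r'"([^"]*)"', match.group("rest"))
    if len(fields) != 3:
        raise ValueError("expected three quoted fields: %r" % line)
    return (match.group("ip"), fields[0], fields[1], fields[2])
```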
														
 
															+6.1. Other interfaces
														
 
															+
														
 
															+  Robert Hogan has also suggested a
														
 
															+    GETINFO relays-by-country/cn
														
 
															+
														
 
															+  as well as torrc options for ExitCountryCodes, EntryCountryCodes,
														
 
															+  ExcludeCountryCodes, etc.
														
 
															+
														
 
															 7. Relays and bridges use the IP-to-country db for usage summaries
														
 
															   Once bridges have a GeoIP database locally, they can start to publish
														
@@ -231,5 +239,156 @@ Status: Needs-Research
 
															   switch to using directory guards for all users by default.
														
 
															   But how to safely summarize this information without opening too many
														
 
															-  anonymity leaks seems hard...
														
 
															+  anonymity leaks?
														
 
															+
														
 
															+7.1. Attacks to think about
														
 
															+
														
 
															+  First, note that we need to have a large enough time window that we're
														
 
															+  not aiding correlation attacks much. I hope 24 hours is enough. So
														
 
															+  that means no publishing stats until you've been up at least 24 hours.
														
 
															+  And you can't publish follow-up stats more often than every 24 hours,
														
 
															+  or people could look at the differential.
														
 
															+
														
 
															+  Second, note that we need to be sufficiently vague about the IP
														
 
															+  addresses we're reporting. We are hoping that just specifying the
														
 
															+  country will be vague enough. But a) what about active attacks where
														
 
															+  we convince a bridge to use a GeoIP db that labels each suspect IP
														
 
															+  address as a unique country? We have to assume that the consensus GeoIP
														
 
															+  db won't be malicious in this way. And b) could such singling-out
														
 
															+  attacks occur naturally, for example because of countries that have
														
 
															+  a very small IP space? We should investigate that.
														
 
															+
														
 
															+7.2. Granularity of users
														
 
															+
														
 
															+  Do we only want to report countries that have a large enough anonymity set
														
 
															+  (that is, number of users) for the day? For example, we might avoid
														
 
															+  listing any countries that have seen fewer than five addresses over
														
 
															+  the 24 hour period. This approach would be helpful in reducing the
														
 
															+  singling-out opportunities -- in the extreme case, we could imagine a
														
 
															+  situation where one blogger from the Sudan used Tor on a given day, and
														
 
															+  we can discover which entry guard she used.
														
 
															+
														
 
															+  But I fear that especially for bridges, seeing only one hit from a
														
 
															+  given country in a given day may be quite common.
														
 
															+
														
 
															+  As a compromise, we should start out with an "Other" category in
														
 
															+  the reported stats, which is the sum of unlisted countries; if that
														
 
															+  category is consistently interesting, we can think harder about how
														
 
															+  to get the right data from it safely.
														
 
															+
														
 
															+  But note that bridge summaries will not be made public individually,
														
 
															+  since doing so would help people enumerate bridges, whereas summaries
														
 
															+  from normal relays will be public. So perhaps that means we can afford
														
 
															+  to be more specific in bridge summaries? In particular, I'm thinking the
														
 
															+  "Other" category should be used by public relays but not by bridges
														
 
															+  (or if it is, used with a lower threshold).
														
 
															+
														
 
															+  Even for countries that have many Tor users, we might not want to be
														
 
															+  too specific about how many users we've seen. For example, we might
														
 
															+  round down the number of users we report to the nearest multiple of 5.
														
 
															+  My instinct for now is that this won't be that useful.
														
 
															+
														
 
															+7.3. Other issues
														
 
															+
														
 
															+  Another note: we'll likely be overreporting in the case of users with
														
 
															+  dynamic IP addresses: if they rotate to a new address over the course
														
 
															+  of the day, we'll count them twice. So be it.
														
 
															+
														
 
															+7.4. Where to publish the summaries?
														
 
															+
														
 
															+  We designed extrainfo documents for information like this. So they
														
 
															+  should just be more entries in the extrainfo doc.
														
 
															+
														
 
															+  But if we want to publish summaries every 24 hours (no more often,
														
 
															+  no less often), aren't we tied to the router descriptor publishing
														
 
															+  schedule? That is, if we publish a new router descriptor at the 18
														
 
															+  hour mark, and nothing much has changed at the 24 hour mark, won't
														
 
															+  the new descriptor get dropped as being "cosmetically similar", and
														
 
															+  then nobody will know to ask about the new extrainfo document?
														
 
															+
														
 
															+  One solution would be to make and remember the 24 hour summary at the
														
 
															+  24 hour mark, but not actually publish it anywhere until we happen to
														
 
															+  publish a new descriptor for other reasons. If we happen to go down
														
 
															+  before publishing a new descriptor, then so be it; at least we tried.
														
 
															+
														
 
															+7.5. What if the relay is unreachable or goes to sleep?
														
 
															+
														
 
															+  Even if you've been up for 24 hours, if you were hibernating for 18
														
 
															+  of them, then we're not getting as much fuzziness as we'd like. So
														
 
															+  I guess that means that we need a 24-hour period of being "awake"
														
 
															+  before we're willing to publish a summary. A similar attack works if
														
 
															+  you've been awake but unreachable for the first 18 of the 24 hours. As
														
 
															+  another example, a bridge that's on a laptop might be suspended for
														
 
															+  some of each day.
														
 
															+
														
 
															+  This implies that some relays and bridges will never publish summary
														
 
															+  stats, because they're not ever reliably working for 24 hours in
														
 
															+  a row. If a significant percentage of our reporters end up being in
														
 
															+  this boat, we should investigate whether we can accumulate 24 hours of
														
 
															+  "usefulness", even if there are holes in the middle, and publish based
														
 
															+  on that.
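
One way the "accumulate 24 hours of usefulness" idea could look, as a rough sketch. The class name, its methods, and the notion that the caller tells us when the relay becomes usable or unusable (hibernation, unreachability, suspend) are all assumptions made for illustration; times are plain seconds.

```python
DAY = 24 * 60 * 60  # seconds

class UsefulTimeAccumulator:
    """Hypothetical: sums awake-and-reachable time across gaps."""

    def __init__(self, needed=DAY):
        self.needed = needed
        self.accumulated = 0       # useful seconds from finished intervals
        self.usable_since = None   # start of the current useful interval

    def mark_usable(self, now):
        if self.usable_since is None:
            self.usable_since = now

    def mark_unusable(self, now):
        # Close the current interval; time spent in the hole is not counted.
        if self.usable_since is not None:
            self.accumulated += max(0, now - self.usable_since)
            self.usable_since = None

    def total(self, now):
        running = 0
        if self.usable_since is not None:
            running = max(0, now - self.usable_since)
        return self.accumulated + running

    def ready(self, now):
        """True once enough useful time has built up to publish."""
        return self.total(now) >= self.needed
```

With this shape, a relay that is up for 6 hours, hibernates for 18, and then comes back would need another 18 useful hours before ready() returns true.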
														
 
															+
														
 
															+  What other issues are like this? It seems that just moving to a new
														
 
															+  IP address shouldn't be a reason to cancel stats publishing, assuming
														
 
															+  we were usable at each address.
														
 
															+
														
 
															+7.6. IP addresses that aren't in the geoip db
														
 
															+
														
 
															+  Some IP addresses aren't in the public geoip databases. In particular,
														
 
															+  I've found that a lot of African countries are missing, but there
														
 
															+  are also some common ones in the US that are missing, like parts of
														
 
															+  Comcast. We could just lump unknown IP addresses into the "Other"
														
 
															+  category, but it might be useful to gather a general sense of how many
														
 
															+  lookups are failing entirely, by adding a separate "Unknown" category.
														
 
															+
														
 
															+  We could also contribute back to the geoip db, by letting bridges set
														
 
															+  a config option to report the actual IP addresses that failed their
														
 
															+  lookup. Then the bridge authority operators can manually make sure
														
 
															+  the correct answer will be in later geoip files. This config option
														
 
															+  should be disabled by default.
														
 
															+
														
 
															+7.7. Bringing it all together
														
 
															+
														
 
															+  So here's the plan:
														
 
															+
														
 
															+  24 hours after starting up (modulo Section 7.5 above), bridges and
														
 
															+  relays should construct a daily summary of client countries they've
														
 
															+  seen, including the above "Unknown" category (Section 7.6) as well.
														
 
															+
														
 
															+  Non-bridge relays lump all countries with fewer than K (e.g. K=5) users
														
 
															+  into the "Other" category (see Section 7.2 above), whereas bridge relays are
														
 
															+  willing to list a country even when it has only one user for the day.
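
The summary construction described in the two paragraphs above might be sketched like this. The function name, the input shape (a map from each distinct client address to its country code, or None when the lookup failed), and the choice to keep "Unknown" out of the "Other" bucket even when it is small are assumptions; the text above does not settle that last point.

```python
from collections import Counter

def build_daily_summary(country_of_address, is_bridge, k=5):
    """Hypothetical: per-country user counts for one 24-hour period."""
    counts = Counter()
    for country in country_of_address.values():
        # Failed geoip lookups get their own category (Section 7.6).
        counts["Unknown" if country is None else country] += 1

    if is_bridge:
        # Bridges list a country even when it has a single user.
        return dict(counts)

    summary = {}
    other = 0
    for country, n in counts.items():
        if country != "Unknown" and n < k:
            other += n          # too few users: fold into "Other"
        else:
            summary[country] = n
    if other:
        summary["Other"] = other
    return summary
```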
														
 
															+
														
 
															+  Once we have a daily summary on record, we include it in every
														
 
															+  extrainfo document we publish. The daily summary we
														
 
															+  remember locally gets replaced with a newer one when another 24
														
 
															+  hours pass.
														
 
															+
														
 
															+7.8. Some forward secrecy
														
 
															+
														
 
															+  How should we remember addresses locally? If we convert them into
														
 
															+  country-codes immediately, we will count them again if we see them
														
 
															+  again. On the other hand, we don't really want to keep a list hanging
														
 
															+  around of all IP addresses we've seen in the past 24 hours.
														
 
															+
														
 
															+  Step one is that we should never write this stuff to disk. Keeping it
														
 
															+  only in RAM will make things somewhat better. Step two is to avoid
														
 
															+  keeping any timestamps associated with it: rather than a rolling
														
 
															+  24-hour window, which would require us to remember the various times
														
 
															+  we've seen that address, we can instead just throw out the whole list
														
 
															+  every 24 hours and start over.
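
A minimal sketch of the "no timestamps, throw out the whole list" approach from the paragraph above. The class and method names are made up for illustration, and the caller is assumed to invoke rotate() once per 24-hour period. The only time kept is a single window start; nothing is stored per address, and nothing here touches disk.

```python
class DailyAddressSet:
    """Hypothetical in-memory record of addresses seen this period."""

    def __init__(self, start_time):
        self.window_start = start_time  # one timestamp for the whole set
        self.seen = set()               # addresses only, no per-entry times

    def note_address(self, addr):
        """Record addr; return True if it is new in the current window."""
        if addr in self.seen:
            return False
        self.seen.add(addr)
        return True

    def rotate(self, now):
        """Hand back the finished window's addresses and start over."""
        finished, self.seen = self.seen, set()
        self.window_start = now
        return finished
```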
														
 
															+
														
 
															+  We could hash the addresses, and then compare hashes when deciding if
														
 
															+  we've seen a given address before. We could even do keyed hashes. Or
														
 
															+  Bloom filters. But if our goal is to defend against an adversary
														
 
															+  who steals a copy of our RAM while we're running and then does
														
 
															+  guess-and-check on whatever blob we're keeping, we're in bad shape.
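
To make the guess-and-check concern concrete, here is a sketch of a keyed-hash set together with the attack. HMAC-SHA256 and the names below are choices made for this illustration, not something the proposal specifies. The point is that the key lives in the same RAM as the digests, so an adversary who copies both can simply try candidate addresses.

```python
import hashlib
import hmac
import ipaddress
import os

class HashedAddressSet:
    """Hypothetical: remembers keyed hashes of addresses, not addresses."""

    def __init__(self):
        self.key = os.urandom(32)   # kept only in memory
        self.digests = set()

    def _digest(self, addr):
        packed = ipaddress.ip_address(addr).packed
        return hmac.new(self.key, packed, hashlib.sha256).digest()

    def note_address(self, addr):
        """Record addr; return True if it was not seen before."""
        d = self._digest(addr)
        if d in self.digests:
            return False
        self.digests.add(d)
        return True

def guess_and_check(stolen_key, stolen_digests, candidates):
    """What an adversary holding a RAM copy could do."""
    found = []
    for addr in candidates:
        packed = ipaddress.ip_address(addr).packed
        d = hmac.new(stolen_key, packed, hashlib.sha256).digest()
        if d in stolen_digests:
            found.append(addr)
    return found
```

Since the IPv4 space is small, enumerating every candidate is cheap once the key is known.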
														
 
															+
														
 
															+  We could drop the last octet of the IP address as soon as we see
														
 
															+  it. That would cause us to undercount some users from cablemodem and
														
 
															+  DSL networks that have a high density of Tor users. And it wouldn't
														
 
															+  really help that much -- indeed, the extent to which it does help is
														
 
															+  exactly the extent to which it makes our stats less useful.
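
The undercounting effect of dropping the last octet can be seen in a small sketch. It assumes IPv4 addresses only, and the helper name is hypothetical.

```python
import ipaddress

def drop_last_octet(addr):
    """Zero the final octet of an IPv4 address (keep only its /24)."""
    net = ipaddress.ip_network(addr + "/24", strict=False)
    return str(net.network_address)

def count_users(addresses, truncate):
    """Count distinct entries, optionally after truncation."""
    if truncate:
        return len({drop_last_octet(a) for a in addresses})
    return len(set(addresses))
```

Two users in the same /24 collapse into a single entry, which is both the privacy gain and the loss of accuracy.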
														
 
															+
														
 
															+  Other ideas?