xxx-exit-scanning-outline.txt

1. Scanning process
   A. Non-HTML/JS HTTP mime types compared via SHA1 hash
   B. Dynamic HTTP content filtered at 4 levels:
      1. IP change+Tor cookie utilization
         - Tor cookies replayed with new IP in case of changes
      2. HTML Tag+Attribute+JS comparison
         - Comparisons made based only on "relevant" HTML tags
           and attributes
      3. HTML Tag+Attribute+JS diffing
         - Tags, attributes and JS AST nodes that change between
           non-Tor fetches are pruned from comparison
      4. URLs with > N% of node failures removed
         - Results purged from filesystem at end of scan loop
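
The comparison and filtering steps in A and B can be sketched roughly as
below. This is a minimal illustration, not the scanner's actual code: the
function names, the representation of page content as sets of
(tag, attribute) pairs, and the way the N% threshold is passed in are all
assumptions made for the example.

```python
import hashlib


def sha1_matches(direct_body: bytes, exit_body: bytes) -> bool:
    """Static (non-HTML/JS) content: compare the two fetches by SHA1 digest."""
    return hashlib.sha1(direct_body).digest() == hashlib.sha1(exit_body).digest()


def unexpected_items(exit_items: set, direct_a: set, direct_b: set) -> set:
    """Dynamic content: report items seen via the exit but not in the baseline.

    Each item is a hypothetical (tag, attribute) pair. Items that differ
    between the two non-Tor fetches are treated as dynamic and pruned
    before the exit fetch is compared against what remains.
    """
    dynamic = direct_a ^ direct_b      # changed between the non-Tor fetches
    baseline = direct_a & direct_b     # stable across the non-Tor fetches
    return (exit_items - dynamic) - baseline


def drop_noisy_urls(failed_nodes_by_url: dict, nodes_tested: int,
                    max_failure_pct: float) -> dict:
    """Keep only URLs whose share of failing nodes is at most N percent."""
    if nodes_tested <= 0:
        raise ValueError("nodes_tested must be positive")
    kept = {}
    for url, failed in failed_nodes_by_url.items():
        if 100.0 * len(failed) / nodes_tested <= max_failure_pct:
            kept[url] = failed
    return kept
```

A URL that fails for most nodes is more likely to be dynamic in a way the
filters missed than to be modified by every exit, which is why such URLs
are dropped rather than reported.
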
   C. SSL scanning handles some forms of dynamic certs
      1. Catalogs certs for all IPs resolved locally
         by getaddrinfo over the duration of the scan.
         - Updated each test.
      2. If the domain presents a new cert for each IP, this
         is noted on the failure result for the node
      3. If the same IP presents two different certs locally,
         the cert list is first refreshed, and if it happens
         again, discarded
      4. An N% node failure filter also applies
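
One way to picture the cert catalog in C is a per-domain map from IP to
cert fingerprint with a refresh-then-discard rule. The sketch below is
hypothetical: the class name, method names, and string return values are
invented for illustration, and fetching certs over the network is left
out entirely.

```python
class CertCatalog:
    """Per-domain record of cert fingerprints seen locally, keyed by IP."""

    def __init__(self):
        self.certs_by_ip = {}    # ip -> fingerprint seen on a local fetch
        self.refreshed = False   # True once the list has been rebuilt
        self.discarded = False   # True once the domain is given up on

    def record(self, ip: str, fingerprint: str) -> str:
        """Record a locally fetched cert; return 'ok', 'refresh' or 'discard'."""
        known = self.certs_by_ip.get(ip)
        if known is None or known == fingerprint:
            self.certs_by_ip[ip] = fingerprint
            return "ok"
        if not self.refreshed:
            # First conflict for one IP: start the cert list over.
            self.certs_by_ip = {ip: fingerprint}
            self.refreshed = True
            return "refresh"
        # Second conflict: the domain is too dynamic to scan.
        self.discarded = True
        return "discard"

    def new_cert_per_ip(self) -> bool:
        """True if every cataloged IP presented a different cert."""
        fingerprints = list(self.certs_by_ip.values())
        return len(fingerprints) > 1 and len(set(fingerprints)) == len(fingerprints)

    def is_known(self, fingerprint: str) -> bool:
        """True if a cert seen through an exit matches any cataloged cert."""
        return fingerprint in self.certs_by_ip.values()
```

In this sketch, `new_cert_per_ip()` corresponds to the note attached to a
failure result in item 2, and a `'discard'` return corresponds to dropping
the domain in item 3.
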
   D. Scanner can be restarted from any point in the event
      of scanner or system crashes, or graceful shutdown.
      - Results+scan state pickled to filesystem continuously
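
Continuous pickling as in D only helps after a crash if a half-written
file can never replace a good one. A minimal sketch of one way to get
that, assuming a single state object and a hypothetical checkpoint path
(the write-to-temp-then-rename approach is an assumption here, not
something the outline specifies):

```python
import os
import pickle
import tempfile


def save_state(state, path: str) -> None:
    """Pickle state to a temp file, then atomically move it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(state, handle)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path: str, default):
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        return default
```

Calling `save_state` after each test result and `load_state` at startup
gives restart-from-any-point behavior for whatever is kept in the state
object. Pickle files should only be loaded from a trusted data directory,
since unpickling can execute arbitrary code.
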
2. Cron job checks results periodically for reporting
   A. Divides failures into three types of BadExit based on type,
      frequency over time, and incident rate
   B. Writes reject lines to approved-routers for those three types:
      1. ID Hex based (for misconfig/network problems easily fixed)
      2. IP based (for content modification)
      3. IP+mask based (for continuous/egregious content modification)
   C. Emails results to tor-scanners@freehaven.net
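
The three-way split in A and B might look like the sketch below. The
failure type labels, the incident threshold, and the /24 default mask are
placeholders chosen for the example, since the outline does not state the
real cutoffs. The sketch stops at choosing what to reject and does not
attempt to reproduce the approved-routers line syntax.

```python
import ipaddress
from enum import Enum


class BadExitKind(Enum):
    ID_HEX = "id-hex"      # misconfig/network problems, easily fixed
    IP = "ip"              # content modification
    IP_MASK = "ip+mask"    # continuous/egregious content modification


def classify(failure_type: str, incidents: int,
             egregious_threshold: int = 5) -> BadExitKind:
    """Pick a BadExit type; the labels and threshold are hypothetical."""
    if failure_type != "content_modification":
        return BadExitKind.ID_HEX
    if incidents >= egregious_threshold:
        return BadExitKind.IP_MASK
    return BadExitKind.IP


def reject_target(kind: BadExitKind, id_hex: str, ip: str,
                  prefix_len: int = 24) -> str:
    """Return the identifier that the reject line would be keyed on."""
    if kind is BadExitKind.ID_HEX:
        return id_hex
    if kind is BadExitKind.IP:
        return ip
    # strict=False masks off the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))
```

Keying the mildest category on the ID hex and the harsher ones on the
address means the severity of a listing tracks how hard it is to shed.
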
3. Human Review and Appeal
   A. ID Hex-based BadExit is meant to be easily removable
      without needing to beg us.
      - Should this behavior be encouraged?
   B. Optionally can reserve IP-based BadExits for human review
      1. Results are encapsulated fully on the filesystem and can be
         reviewed without network access
      2. Soat has --rescan to rescan failed nodes from a data directory
         - New set of URLs used