Extract data from the "Share and Multiply" dataset for use with MGen.

Justin Tracey 4b84f3685a add some (too) minimal docs and markov model script 6 months ago
hmm 4b84f3685a add some (too) minimal docs and markov model script 6 months ago
src 88e3e914be nit: use default hashmap 6 months ago
Cargo.toml 903b430b76 incorporate file sizes into message lengths 1 year ago
README.md 4b84f3685a add some (too) minimal docs and markov model script 6 months ago

README.md

This repo contains tools to extract empirical distributions from the "Share and Multiply" (SaM) dataset of WhatsApp chat metadata.

More thorough documentation is coming soon, but the gist is:

  • Download the json_files.zip file that the SaM dataset authors provide, and extract it somewhere.
  • Run the extract tool to pare and serialize the SaM data.
  • Use the tools in hmm to label messages as "active" or "idle".
  • Run the process tool to generate all empirical distributions other than message sizes.
  • Run the message-lens tool to generate distributions for message sizes.
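
For readers unfamiliar with the term, an "empirical distribution" here just means the relative frequency of each observed value. The following is a minimal sketch of that idea in Rust, not the code used by the process or message-lens tools; the function name `empirical_distribution` and the sample values are made up for illustration, and the real tools' input and output formats are not shown here.

```rust
use std::collections::HashMap;

/// Map each observed value to its relative frequency among the samples.
/// (Hypothetical helper for illustration; not part of this repo's tools.)
fn empirical_distribution(samples: &[u64]) -> HashMap<u64, f64> {
    // Count how many times each value occurs.
    let mut counts: HashMap<u64, u64> = HashMap::new();
    for &s in samples {
        *counts.entry(s).or_insert(0) += 1;
    }
    // Normalize the counts so the frequencies sum to 1.
    // An empty input yields an empty map, so no division by zero occurs.
    let total = samples.len() as f64;
    counts
        .into_iter()
        .map(|(value, count)| (value, count as f64 / total))
        .collect()
}

fn main() {
    // Made-up message sizes in bytes; not taken from the SaM dataset.
    let sizes = [10, 10, 20, 40];
    let mut entries: Vec<(u64, f64)> = empirical_distribution(&sizes).into_iter().collect();
    // HashMap iteration order is unspecified, so sort before printing.
    entries.sort_by_key(|&(value, _)| value);
    for (value, prob) in entries {
        println!("{value}\t{prob}");
    }
}
```

Sampling from such a distribution then amounts to picking a value with probability equal to its frequency.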