Extract data from the "Share and Multiply" dataset for use with MGen.

Justin Tracey 4b84f3685a add some (too) minimal docs and markov model script 6 months ago
hmm 4b84f3685a add some (too) minimal docs and markov model script 6 months ago
src 88e3e914be nit: use default hashmap 6 months ago
Cargo.toml 903b430b76 incorporate file sizes into message lengths 1 year ago
README.md 4b84f3685a add some (too) minimal docs and markov model script 6 months ago

README.md

This repo contains tools to extract empirical distributions from the "Share and Multiply" (SaM) dataset of WhatsApp chat metadata.

More thorough documentation is coming soon, but the gist is:

  • Download the json_files.zip file that the SaM dataset authors provide, and extract it somewhere.
  • Run the extract tool to pare and serialize the SaM data.
  • Use the tools in the hmm directory to label messages as "active" or "idle".
  • Run the process tool to generate all empirical distributions other than message sizes.
  • Run the message-lens tool to generate distributions for message sizes.
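
To make "empirical distribution" concrete, here is a minimal sketch of how one could be built from a list of observed message sizes. This is not the code used by the process or message-lens tools: the function name `empirical_distribution` and the sample values are hypothetical, and the real tools may bin, weight, or serialize the data differently.

```rust
use std::collections::HashMap;

/// Map each observed value to its relative frequency in `samples`.
/// Hypothetical helper for illustration; not part of this repo.
fn empirical_distribution(samples: &[u64]) -> HashMap<u64, f64> {
    // Count how many times each value occurs.
    let mut counts: HashMap<u64, u64> = HashMap::new();
    for &s in samples {
        *counts.entry(s).or_insert(0) += 1;
    }
    // Normalize the counts so the frequencies sum to 1.
    let total = samples.len() as f64;
    counts
        .into_iter()
        .map(|(value, count)| (value, count as f64 / total))
        .collect()
}

fn main() {
    // Made-up message sizes in bytes; not taken from the SaM dataset.
    let sizes: [u64; 6] = [12, 40, 12, 7, 40, 12];
    let dist = empirical_distribution(&sizes);

    // Sort by value so the output order is stable.
    let mut entries: Vec<_> = dist.iter().collect();
    entries.sort_by_key(|(value, _)| **value);
    for (value, prob) in entries {
        println!("{value}\t{prob:.3}");
    }
}
```

Sampling from such a distribution then amounts to drawing a value with probability equal to its recorded frequency.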