Extract data from the "Share and Multiply" dataset for use with MGen.

Justin Tracey 80eb388175 add additional data 2 months ago


This repo contains tools to extract empirical distributions from the "Share and Multiply" (SaM) dataset of WhatsApp chat metadata.

More thorough documentation is coming soon, but the gist is:

  • Download the json_files.zip file provided with the SaM dataset, and extract it somewhere.
  • Run the extract tool to pare and serialize the SaM data. (Using chat*.json in any of the following commands means using all available chats; you can use a subset for faster processing, so long as you're consistent.)
      cargo run --bin extract stats/ json_files/chat*.json
  • Use the tools in hmm to label messages as "active" or "idle".
    • Install the dependencies via pip:
        pip install -r requirements.txt
    • Run the shell script, which invokes the Python script in parallel:
        ./parallel_run.sh ../stats/ stats2/
    • Note that these scripts in particular assume you will only be simulating up to 1 hour of conversation.
  • Run the process tool to generate all empirical distributions other than message sizes.
      cargo run --bin process dists/ hmm/stats2/ json_files/chat*.json
  • Run the message-lens tool to generate distributions for message sizes. This takes an optional -s argument giving a source of file sizes (it must come first if provided, sorry for the jank). If you don't have a source of your own, you can use the one we provide, based on public WhatsApp groups in 2023. If you don't want to simulate sending files, omit the argument.
      cargo run --bin message-lens -- -s data/file_sizes.dat dists/ json_files/chat*.json

At this point, dists/ will contain distributions ready for use in MGen, organized by the user being simulated.
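If you want to run the whole pipeline in one go, the steps above can be chained in a small shell function. This is only a sketch, not something shipped in the repo: the names `run`, `sam_pipeline`, and `DRY_RUN` are hypothetical, and it assumes you start in the repository root, that json_files/ was extracted there, and that requirements.txt lives in hmm/.

```shell
#!/bin/sh
# Sketch of the extraction pipeline, meant to be run from the repository root.
# run, sam_pipeline, and DRY_RUN are hypothetical names, not part of this repo.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: sam_pipeline [json_dir] [file_sizes]
# Pass an empty string as file_sizes to skip simulating file transfers.
sam_pipeline() {
    json_dir="${1:-json_files}"
    file_sizes="${2-data/file_sizes.dat}"

    # 1. Pare and serialize the SaM data into stats/.
    run cargo run --bin extract stats/ "$json_dir"/chat*.json || return 1

    # 2. Label messages as "active" or "idle".
    #    Assumption: requirements.txt and parallel_run.sh both live in hmm/.
    ( cd hmm &&
      run pip install -r requirements.txt &&
      run ./parallel_run.sh ../stats/ stats2/ ) || return 1

    # 3. Generate all empirical distributions other than message sizes.
    run cargo run --bin process dists/ hmm/stats2/ "$json_dir"/chat*.json || return 1

    # 4. Generate message size distributions; -s must come first if provided.
    if [ -n "$file_sizes" ]; then
        run cargo run --bin message-lens -- -s "$file_sizes" dists/ "$json_dir"/chat*.json
    else
        run cargo run --bin message-lens -- dists/ "$json_dir"/chat*.json
    fi
}
```

Setting DRY_RUN=1 before calling sam_pipeline prints the commands instead of running them, which is a cheap way to check the paths before committing to a long extraction.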

Provided data

We also provide some data that may be useful:

  • data/file_sizes.dat contains, as mentioned above, our own findings for the distribution of file sizes in public WhatsApp groups as monitored in 2023.
  • data/dyadic_count.dat and data/group_count.dat give the relative frequency of the number of dyadic conversations (i.e., one-on-one conversations) and group conversations users were in, respectively, as sourced from "Analysis of Group-Based Communication in WhatsApp".
  • data/group_sizes.dat lists the absolute frequency of each possible group size in WhatsApp, from 0 to 255, as found in the SaM dataset.
  • data/group_sizes.no_individual.dat is the same, but with the counts for groups of size 0, 1, and 2 all set to 0, as they are better modeled by the dyadic data. This file also includes a second row enumerating 0 to 255, for easier use with MGen-style parsing of the distribution.
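To make the relationship between the two group size files concrete, here is a rough sketch of how the no_individual variant could be derived from a row of counts. The file layout is an assumption on our part for illustration (a single row of comma-separated counts, indexed by group size starting at 0), and `zero_small_groups` is a hypothetical helper, not a tool in this repo; check the actual files before relying on it.

```shell
# Hypothetical helper: reads one row of comma-separated counts on stdin
# (assumed layout), zeroes the counts for group sizes 0, 1, and 2, and
# appends a second row enumerating the group sizes.
zero_small_groups() {
    awk -F, -v OFS=, '
        NR == 1 {
            n = NF
            # Fields 1..3 hold the counts for sizes 0, 1, and 2.
            for (i = 1; i <= 3 && i <= n; i++) $i = 0
            print
            # Second row: the sizes themselves, 0 through n-1.
            for (i = 0; i < n; i++)
                printf "%s%s", i, (i < n - 1 ? OFS : "\n")
        }
    '
}
```

For example, `zero_small_groups < data/group_sizes.dat` would write the two-row result to standard output, provided the input really uses the assumed layout.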