Extract data from the "Share and Multiply" dataset for use with MGen.
Latest commit: 3fc10e1441 "add better docs" by Justin Tracey, 7 months ago.
This repo contains tools to extract empirical distributions from the "Share and Multiply" (SaM) dataset of WhatsApp chat metadata.
More thorough documentation is coming soon, but the gist is:

1. Get the `json_files.zip` file they provide, and extract it somewhere.
2. Run the `extract` tool to pare and serialize the SaM data.
   (Using `chat*.json` in any of the following commands means using all available chats; you can use a subset for faster processing, so long as you're consistent.)

   `cargo run --bin extract stats/ json_files/chat*.json`
3. Run `hmm` to label messages as "active" or "idle".

   `pip install -r requirements.txt`

   `./parallel_run.sh ../stats/ stats2/`
4. Run the `process` tool to generate all empirical distributions other than message sizes.

   `cargo run --bin process dists/ hmm/stats2/ json_files/chat*.json`
5. Run the `message-lens` tool to generate distributions for message sizes.
   This takes an optional argument for file sizes (it must come first if provided, sorry for the jank).
   If you have a source for file sizes, you can provide it here.
   If you don't have a source, you can use the one we provide, based on public WhatsApp groups in 2023.
   If you don't want to simulate sending files, you can omit the argument.

   `cargo run --bin message-lens -- -s data/file_sizes.dat dists/ json_files/chat*.json`

At this point, `dists/` will contain distributions ready for use in MGen, organized by the user being simulated.
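
The steps above can be sketched as a single script. This is a minimal, hedged example rather than something shipped in this repo: the `DRY_RUN` and `CHATS` variables and the `run`/`run_in` helpers are hypothetical names introduced here, and running the `hmm` commands from inside the `hmm/` directory is an assumption based on the `../stats/` and `hmm/stats2/` paths. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the pipeline above as one script. Hypothetical pieces (not part
# of this repo): the DRY_RUN and CHATS variables and the run/run_in helpers.
set -eu

# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
# A single glob reused by every step, so each tool sees the same chats.
CHATS="${CHATS:-json_files/chat*.json}"

# Run (or just print) a command in the current directory.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run (or just print) a command inside another directory.
run_in() {
    local dir="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ (cd $dir && $*)"
    else
        (cd "$dir" && "$@")
    fi
}

# $CHATS is left unquoted on purpose so the shell expands the glob.
run cargo run --bin extract stats/ $CHATS

# Assumption: the hmm commands are run from inside the hmm/ directory,
# which is what the ../stats/ and hmm/stats2/ paths suggest.
run_in hmm pip install -r requirements.txt
run_in hmm ./parallel_run.sh ../stats/ stats2/

run cargo run --bin process dists/ hmm/stats2/ $CHATS
run cargo run --bin message-lens -- -s data/file_sizes.dat dists/ $CHATS
```

Setting `CHATS` once (for example `CHATS='json_files/chat1*.json'`) is one way to stay consistent when working with a subset of chats, since every step then receives the same file list. Set `DRY_RUN=0` to actually execute the commands.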