j3tracey/SaM-extractor: Extract data from the "Share and Multiply" dataset for use with MGen. @ 3fc10e14414b2841171953fcebed6fcf531c5bab

Latest commit: Justin Tracey	3fc10e1441	add better docs	6 months ago
hmm	338aead827	hmm: fix bug in parallel_run.sh preventing waiting	6 months ago
src	88e3e914be	nit: use default hashmap	6 months ago
Cargo.toml	903b430b76	incorporate file sizes into message lengths	1 year ago
README.md	3fc10e1441	add better docs	6 months ago

This repo contains tools to extract empirical distributions from the "Share and Multiply" (SaM) dataset of WhatsApp chat metadata.

More thorough documentation is coming soon, but the gist is:

1. Download the json_files.zip file provided with the SaM dataset, and extract it somewhere.
2. Run the extract tool to pare and serialize the SaM data:
       cargo run --bin extract stats/ json_files/chat*.json
   (Using chat*.json in any of these commands means using all available chats; you can use a subset for faster processing, so long as you use the same subset in every step.)
3. Use the tools in hmm to label messages as "active" or "idle". From inside the hmm directory:
   - Install the dependencies:
         pip install -r requirements.txt
   - Run the shell script, which invokes the Python script in parallel:
         ./parallel_run.sh ../stats/ stats2/
4. Run the process tool to generate all empirical distributions other than message sizes:
       cargo run --bin process dists/ hmm/stats2/ json_files/chat*.json
5. Run the message-lens tool to generate distributions for message sizes. It takes an optional file-sizes argument (-s, as in the command below), which must come first if provided (sorry for the jank). If you have your own source of file sizes, pass it here. If you don't have one, you can use the one we provide, based on public WhatsApp groups in 2023. If you don't want to simulate sending files, omit the argument.
       cargo run --bin message-lens -- -s data/file_sizes.dat dists/ json_files/chat*.json
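
The steps above can be chained from a small driver script. The following is a minimal sketch, not part of this repo: the function names `build_pipeline` and `run_pipeline` are hypothetical, and it assumes that the hmm steps are run from inside the hmm directory and that leaving out `-s` and its path is how the file-sizes argument is omitted. The list of chat files is fixed once and reused, so every step sees the same chats.

```python
import subprocess
from pathlib import Path


def build_pipeline(chat_files, file_sizes="data/file_sizes.dat"):
    """Return the README's commands as (working_dir, argv) pairs, in order.

    chat_files is sorted once and reused so every step uses the same chats.
    Pass file_sizes=None to leave out the optional file-sizes argument
    (assumption: omitting "-s <path>" entirely is the supported way to do so).
    """
    chats = sorted(str(p) for p in chat_files)
    if not chats:
        raise ValueError("no chat files given")

    lens = ["cargo", "run", "--bin", "message-lens", "--"]
    if file_sizes is not None:
        lens += ["-s", file_sizes]  # must come first when provided
    lens += ["dists/"] + chats

    return [
        (".", ["cargo", "run", "--bin", "extract", "stats/"] + chats),
        # Assumption: both hmm steps are run from inside the hmm directory.
        ("hmm", ["pip", "install", "-r", "requirements.txt"]),
        ("hmm", ["./parallel_run.sh", "../stats/", "stats2/"]),
        (".", ["cargo", "run", "--bin", "process", "dists/", "hmm/stats2/"] + chats),
        (".", lens),
    ]


def run_pipeline(repo_root, steps):
    """Run each step in its working directory, stopping at the first failure."""
    for cwd, argv in steps:
        subprocess.run(argv, cwd=Path(repo_root) / cwd, check=True)
```

Run from the repository root, something like `run_pipeline(".", build_pipeline(glob.glob("json_files/chat*.json")))` would execute all five commands in order (with `glob` imported); passing a shorter list of chat files processes a subset while keeping the steps consistent with each other.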

At this point, dists/ will contain distributions ready for use in MGen, organized by the user being simulated.