This repo contains tools to extract empirical distributions from the ["Share and Multiply" (SaM) dataset](https://figshare.com/articles/dataset/WhatsApp_Data_Set/19785193) of WhatsApp chat metadata.

More thorough documentation is coming soon, but the gist is:

- Download the `json_files.zip` file they provide, and extract it somewhere.
- Run the `extract` tool to pare and serialize the SaM data. (Using `chat*.json` in any of the following commands means using all available chats; you can use a subset for faster processing, so long as you're consistent.)

  ``cargo run --bin extract stats/ json_files/chat*.json``
- Use the tools in `hmm` to label messages as "active" or "idle".
  - install the dependencies via `pip install -r requirements.txt`
  - run the shell script to invoke the python script in parallel

    ``./parallel_run.sh ../stats/ stats2/``
  - **Note that these scripts in particular assume you will only be simulating up to 1 hour of conversation.**
- Run the `process` tool to generate all empirical distributions other than message sizes.

  ``cargo run --bin process dists/ hmm/stats2/ json_files/chat*.json``
- Run the `message-lens` tool to generate distributions for message sizes. This takes an optional argument for file sizes (must be first if provided, sorry for the jank). If you have a source for file sizes, you can provide it here. If you don't want to simulate sending files, you can omit it. If you don't have a source, you can use the one we provide based on public WhatsApp groups in 2023.

  ``cargo run --bin message-lens -- -s data/file_sizes.dat dists/ json_files/chat*.json``

At this point, `dists/` will contain distributions ready for use in MGen, organized by the user being simulated.

## Provided data

We also provide some data that may be useful. As mentioned above, `data/file_sizes.dat` contains our own findings for distribution of file sizes in public WhatsApp groups as monitored in 2023.

The `data/dyadic_count.dat` and `data/group_count.dat` files give the relative frequencies of the number of dyadic conversations (i.e., one-on-one conversations) and group conversations that users were in, respectively, as sourced from ["Analysis of Group-Based Communication in WhatsApp"](https://doi.org/10.1007/978-3-319-26925-2_17). The `data/group_sizes.dat` file lists the absolute frequencies of each possible group size in WhatsApp, from 0 to 255, from the SaM dataset. The `data/group_sizes.no_individual.dat` file is the same, but with the counts for groups of size 0, 1, and 2 all set to 0, as those are better modeled by the dyadic data. This file also includes a second row enumerating 0 to 255, for easier use with MGen-style parsing of the distribution.
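
To illustrate how the `no_individual` variant relates to the full group size counts, here is a minimal Python sketch that zeroes the counts for sizes 0, 1, and 2 and emits a second row enumerating the sizes. The comma-separated two-row layout, the function names `drop_individual` and `format_rows`, and the toy counts are all assumptions made for illustration; they are not taken from this repo's tools or data files, so check the provided `.dat` files for the real layout.

```python
def drop_individual(counts):
    """Return a copy of counts with the entries for sizes 0, 1, and 2 set to 0.

    counts[i] is taken to be the absolute frequency of groups of size i.
    """
    return [0 if size <= 2 else count for size, count in enumerate(counts)]


def format_rows(counts):
    """Format counts as two comma-separated rows: the counts, then the sizes.

    This layout is an assumption for illustration, not the repo's documented format.
    """
    count_row = ",".join(str(c) for c in counts)
    size_row = ",".join(str(s) for s in range(len(counts)))
    return count_row + "\n" + size_row + "\n"


# Toy counts for group sizes 0..5 (made-up numbers, not from the SaM dataset).
toy_counts = [0, 4, 90, 12, 7, 3]
filtered = drop_individual(toy_counts)
print(format_rows(filtered), end="")
```

With the toy input, this prints the counts row `0,0,0,12,7,3` followed by the sizes row `0,1,2,3,4,5`.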