
add additional data

Justin Tracey 4 months ago
parent
commit
80eb388175
6 changed files with 11 additions and 0 deletions
  1. + 9 - 0 README.md
  2. + 0 - 0 data/dyadic_count.dat
  3. + 0 - 0 data/file_sizes.dat
  4. + 0 - 0 data/group_count.dat
  5. + 1 - 0 data/group_sizes.dat
  6. + 1 - 0 data/group_sizes.no_individual.dat

+ 9 - 0
README.md

@@ -9,6 +9,7 @@ More thorough documentation is coming soon, but the gist is:
    - install the dependencies via `pip install -r requirements.txt`
    - run the shell script to invoke the python script in parallel
      ``./parallel_run.sh ../stats/ stats2/``
+   - **Note that these scripts in particular assume you will only be simulating up to 1 hour of conversation.**
  - Run the `process` tool to generate all empirical distributions other than message sizes.
    ``cargo run --bin process dists/ hmm/stats2/ json_files/chat*.json``
  - Run the `message-lens` tool to generate distributions for message sizes.
@@ -19,3 +20,11 @@ More thorough documentation is coming soon, but the gist is:
    ``cargo run --bin message-lens -- -s data/file_sizes.dat dists/ json_files/chat*.json``
 
 At this point, `dists/` will contain distributions ready for use in MGen, organized by the user being simulated.
+
+## Provided data
+
+We also provide some data that may be useful.
+As mentioned above, `data/file_sizes.dat` contains our own findings for the distribution of file sizes in public WhatsApp groups, as monitored in 2023.
+The `data/dyadic_count.dat` and `data/group_count.dat` files give the relative frequencies of the number of dyadic conversations (i.e., one-on-one conversations) and group conversations users were in, respectively, as sourced from ["Analysis of Group-Based Communication in WhatsApp"](https://doi.org/10.1007/978-3-319-26925-2_17).
+The `data/group_sizes.dat` file lists the absolute frequencies of each possible group size in WhatsApp, from 0 to 255, from the SaM dataset.
+The `data/group_sizes.no_individual.dat` file is the same, but with groups of size 0, 1, and 2 all set to a count of 0, as they are better modeled by the dyadic data (this file also includes a second row enumerating 0 to 255 for easier use with MGen-style parsing of the distribution).
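
To make the count format above concrete, here is a minimal sketch (not part of the repository) of loading one of these files in Rust, the language the `process` and `message-lens` tools are written in. It assumes `data/group_sizes.dat` is a single comma-separated row of integer counts, as shown in the diff further down, and converts the absolute frequencies to relative frequencies; the function name and the tab-separated output are illustrative only.

```rust
use std::fs;

/// Read a single-row, comma-separated count file and return the counts.
fn parse_counts(path: &str) -> Vec<u64> {
    fs::read_to_string(path)
        .expect("could not read count file")
        .trim()
        .split(',')
        .map(|field| field.parse().expect("count was not an integer"))
        .collect()
}

fn main() {
    // Index i holds the number of observed groups of size i (0 through 255).
    let counts = parse_counts("data/group_sizes.dat");
    let total: u64 = counts.iter().sum();

    // Convert the absolute frequencies to relative frequencies, analogous to
    // the relative fractions stored in the count files.
    for (size, count) in counts.iter().enumerate() {
        println!("{}\t{:.6}", size, *count as f64 / total as f64);
    }
}
```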

File diff suppressed because it is too large
+ 0 - 0
data/dyadic_count.dat


File diff suppressed because it is too large
+ 0 - 0
data/file_sizes.dat


File diff suppressed because it is too large
+ 0 - 0
data/group_count.dat


+ 1 - 0
data/group_sizes.dat

@@ -0,0 +1 @@
+0,55,2290,177,220,202,171,145,151,135,151,97,94,102,82,94,80,70,56,59,65,44,58,49,43,51,35,38,44,37,39,28,45,25,27,18,14,23,21,22,24,23,14,13,7,20,15,10,11,13,16,9,12,8,8,6,14,9,13,7,16,10,3,6,6,8,9,5,5,7,7,11,13,5,3,6,9,5,5,5,6,4,7,8,7,1,4,11,6,4,7,7,5,1,6,6,6,6,5,5,5,6,3,5,8,4,1,9,3,3,4,5,4,10,6,8,3,5,5,3,5,3,9,5,4,2,2,2,6,1,3,2,3,2,1,1,4,1,5,2,5,0,3,0,1,1,7,4,1,1,3,2,0,1,1,0,3,1,2,2,0,1,3,7,1,4,1,3,1,1,2,2,0,1,0,0,1,0,1,2,1,1,2,4,0,2,0,0,0,1,3,0,0,3,0,2,1,0,3,2,2,1,4,3,0,1,0,1,0,2,2,1,4,0,0,2,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,2,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,2,0,1,0,0,0
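
Under the same comma-separated assumption, the `no_individual` variant described in the README addition could be derived from this row with a short sketch like the one below (illustrative only; the input path and printing to stdout are assumptions): zero the counts for sizes 0, 1, and 2, then emit a second row enumerating the sizes for MGen-style parsing.

```rust
use std::fs;

fn main() {
    // Read the single row of absolute group-size counts (sizes 0 through 255).
    let mut counts: Vec<u64> = fs::read_to_string("data/group_sizes.dat")
        .expect("could not read data/group_sizes.dat")
        .trim()
        .split(',')
        .map(|field| field.parse().expect("count was not an integer"))
        .collect();

    // Sizes 0, 1, and 2 are better modeled by the dyadic data, so zero them.
    for count in counts.iter_mut().take(3) {
        *count = 0;
    }

    // Print the count row, then a second row enumerating the sizes,
    // matching the layout described for group_sizes.no_individual.dat.
    let count_row: Vec<String> = counts.iter().map(|c| c.to_string()).collect();
    let size_row: Vec<String> = (0..counts.len()).map(|s| s.to_string()).collect();
    println!("{}", count_row.join(","));
    println!("{}", size_row.join(","));
}
```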

File diff suppressed because it is too large
+ 1 - 0
data/group_sizes.no_individual.dat

