
add some (too) minimal docs and markov model script

Justin Tracey 4 months ago
parent
commit
4b84f3685a
5 changed files with 129 additions and 0 deletions
  1. README.md (+8 −0)
  2. hmm/README.md (+30 −0)
  3. hmm/get_w.py (+67 −0)
  4. hmm/parallel_run.sh (+20 −0)
  5. hmm/requirements.txt (+4 −0)

+ 8 - 0
README.md

@@ -0,0 +1,8 @@
+This repo contains tools to extract empirical distributions from the ["Share and Multiply" (SaM) dataset](https://figshare.com/articles/dataset/WhatsApp_Data_Set/19785193) of WhatsApp chat metadata.
+
+More thorough documentation is coming soon, but the gist is:
+ - Download the `json_files.zip` file they provide, and extract it somewhere.
+ - Run the `extract` tool to pare and serialize the SaM data.
+ - Use the tools in `hmm` to label messages as "active" or "idle".
+ - Run the `process` tool to generate all empirical distributions other than message sizes.
+ - Run the `message-lens` tool to generate distributions for message sizes.
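+
+The tools in `hmm` are documented in `hmm/README.md`.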

+ 30 - 0
hmm/README.md

@@ -0,0 +1,30 @@
+Usage:
+
+`python3 get_w.py target_dir stats_file1 [stats_file2 ...]`
+
+where each `stats_file` is the output of the SaM extractor's `extract` command for a single user; or, to run on all CPU cores:
+
+`./parallel_run.sh stats_dir_in stats_dir_out`
+
+Once a user's messages have been labeled here, the SaM extractor's `process` command can be run on the output.
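+
+Concretely, `get_w.py` reads the first two comma-separated lines of each stats file (per-minute message counts, then the length of each conversation fragment), copies the first four input lines through unchanged, and appends two more lines: the inferred state label for each minute, and the two fitted Poisson rates.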
+
+
+## Design
+
+Ideally, users in our simulation would generate conversations according to some large machine learning model that takes as input all previous inter-arrival times (IATs: the time elapsed since the last message was sent or received) and message sizes, and predicts the user's next message size and response time.
+Such a model would be far too expensive to scale to simulations of any meaningful size.
+Instead, we simplify to four states: idle, idle and just sent a message, idle and just received a message, and active.
+See the state machine in the paper, which can be thought of as a slightly modified Markov model with two notions of time (namely, real time, which is continuous, and receiving messages, which is discrete).
+Alternatively, it can be thought of as a true Markov model with states that grow linearly with the number of participants in the conversation (where receiving a message is instead represented by "another user's state" sending a message), but this would create far too many state transitions to model in practice.
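+
+As a rough illustration (a sketch only, with placeholder transition rules rather than the paper's actual state machine), the shape of the model looks like:
+
+```python
+from enum import Enum, auto
+
+class State(Enum):
+    IDLE = auto()           # idle
+    IDLE_SENT = auto()      # idle, and just sent a message
+    IDLE_RECEIVED = auto()  # idle, and just received a message
+    ACTIVE = auto()         # actively conversing
+
+# Transitions fire on two kinds of events: continuous time passing
+# (a sampled delay expiring) and the discrete arrival of a message.
+def on_receive(state: State) -> State:
+    # placeholder rule: an arriving message marks an idle user as
+    # "just received", and keeps an active user active
+    return State.IDLE_RECEIVED if state is State.IDLE else State.ACTIVE
+
+def on_timeout(state: State) -> State:
+    # placeholder rule: with no activity, annotations decay to plain idle
+    return State.IDLE
+```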
+
+At first glance, a Hidden Markov Model (HMM) or one of its extensions would appear well-suited to finding the parameters of our state machine.
+Unfortunately, there is no established way to train HMMs or their extensions with multiple notions of time (continuous and discrete).
+Specifically, there is no efficient way to construct an HMM where state transitions may occur because *either* some time has passed or a message was received.
+
+Rather than attempt to fit abstract models to these distributions, then, we opt to use empirical distributions (i.e., sampling directly from recorded values) wherever possible.
+To do this, however, we must first determine how to categorize messages, so we know when to sample from which distribution.
+To avoid the problem of multiple notions of time, we temporarily discard any notion of receiving messages, and reduce our state machine to two states: idle and active.
+We then bin each user's sent messages by conversation and minute: for each minute of each conversation, we count the number of messages the user sent.
+Because our simulations are not intended to cover more than an hour of traffic, and because conversations can be sparse (data indexed by the minute over the course of multiple years becomes expensive), conversations are further broken down into fragments.
+Each fragment starts and ends with an hour of 0 message counts (or the end of the conversation, whichever comes first), and any silence of more than two hours since the last message splits the conversation into two fragments.
+These fragments are then fed into an HMM learning algorithm (with Poisson emissions) for each user, which learns a transition matrix.
+Once we have that, we can predict the state each message was sent in and label it accordingly, for use in generating the empirical distributions we will actually simulate with.
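+
+A minimal sketch of the fragmenting step (this happens upstream in the SaM extractor, not in this directory; it assumes the conversation has already been reduced to per-minute counts):
+
+```python
+HOUR = 60           # minutes per hour
+MAX_GAP = 2 * HOUR  # silences longer than this split a conversation
+
+def fragment(minute_counts):
+    """Split one conversation's per-minute message counts into fragments.
+
+    Each fragment is padded to begin and end with an hour of zero counts
+    (this sketch always pads the full hour), and any silence of more
+    than two hours splits the conversation in two.
+    """
+    active = [i for i, c in enumerate(minute_counts) if c > 0]
+    fragments = []
+    if not active:
+        return fragments
+    start = prev = active[0]
+    for i in active[1:]:
+        if i - prev > MAX_GAP:
+            fragments.append([0] * HOUR + minute_counts[start:prev + 1] + [0] * HOUR)
+            start = i
+        prev = i
+    fragments.append([0] * HOUR + minute_counts[start:prev + 1] + [0] * HOUR)
+    return fragments
+
+# get_w.py then takes the concatenation of all fragments as its "counts"
+# line and the per-fragment lengths as its "lens" line.
+```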

+ 67 - 0
hmm/get_w.py

@@ -0,0 +1,67 @@
+import os
+import sys
+
+import numpy as np
+from hmmlearn import hmm
+
+
+def get_states(counts, lens):
+    if len(counts) == 0 or len(lens) == 0:
+        return None, None
+    counts = np.array(counts)
+    lens = np.array(lens)
+
+    scores = list()
+    models = list()
+    for idx in range(10):  # ten different random starting states
+        # define our hidden Markov model
+        # (because we always prepend an hour of 0 messages,
+        # and because it helps to ensure what the first state represents,
+        # we set the probability of starting in the first state to 1,
+        # and don't include start probability as a parameter to update)
+        model = hmm.PoissonHMM(n_components=2, random_state=idx,
+                               n_iter=10, params='tl', init_params='tl',
+                               startprob_prior=np.array([1.0, 0.0]),
+                               lambdas_prior=np.array([[0.01], [0.1]]))
+        model.startprob_ = np.array([1.0, 0.0])
+        model.fit(counts[:, None], lens)
+        models.append(model)
+        try:
+            scores.append(model.score(counts[:, None], lens))
+        except Exception:
+            # record a sentinel so scores stays aligned with models
+            scores.append(-np.inf)
+            print("ignoring failed model scoring")
+
+    # get the best model
+    model = models[np.argmax(scores)]
+    try:
+        states = model.predict(counts[:, None], lens)
+    except Exception:
+        print("failed to predict")
+        return None, None
+    # make state 0 always the low-rate ("idle") state
+    if model.lambdas_[0][0] > model.lambdas_[1][0]:
+        states = [int(not s) for s in states]
+
+    return (','.join(str(s) for s in states),
+            ','.join(str(l) for l in model.lambdas_.flatten()))
+
+
+target_dir = sys.argv[1]
+for file_path in sys.argv[2:]:
+    with open(file_path) as f:
+        lines = f.readlines()
+
+    # the first two lines hold per-minute message counts and the
+    # length of each conversation fragment, both comma-separated
+    counts = [int(n) for n in lines[0].strip().split(',')]
+    lens = [int(n) for n in lines[1].strip().split(',')]
+
+    states, lambdas = get_states(counts, lens)
+    if states is None:
+        continue
+
+    # copy the four original lines through unchanged, then append the
+    # inferred per-minute state labels and the fitted Poisson rates
+    file_out = os.path.join(target_dir, os.path.basename(file_path))
+    with open(file_out, 'w') as f:
+        for line in lines[:4]:
+            print(line.strip(), file=f)
+        print(states, file=f)
+        print(lambdas, file=f)

+ 20 - 0
hmm/parallel_run.sh

@@ -0,0 +1,20 @@
+#!/bin/bash
+
+if [[ $# -lt 2 ]] ; then
+    echo "usage: $0 stats_dir_in stats_dir_out"
+    exit 1
+fi
+
+stats_dir_in="$1"
+stats_dir_out="$2"
+
+# split the input files evenly across CPU cores, rounding up so that
+# no more than $(nproc) processes are launched
+n_files=$(ls "$stats_dir_in" | wc -l)
+n_procs=$(nproc)
+N=$(( (n_files + n_procs - 1) / n_procs ))
+
+ls "$stats_dir_in" | while mapfile -t -n "$N" files_per_proc && [ ${#files_per_proc[@]} -gt 0 ]; do
+    # prepend the input directory to each file name
+    files=("${files_per_proc[@]/#/$stats_dir_in/}")
+    python3 get_w.py "$stats_dir_out" "${files[@]}" &
+done
+wait
+
+echo "all done"

+ 4 - 0
hmm/requirements.txt

@@ -0,0 +1,4 @@
+hmmlearn
+numpy