
Update README for USENIX Security artifact

Ian Goldberg, 1 year ago
commit de34fc7e3b

1 changed file with 312 additions and 83 deletions: README.md

@@ -1,83 +1,312 @@
-CONTENTS OF THIS FILE
----------------------
-
- * Introduction
- * Setting up the docker and running the experimental scripts
- * Running the code
- * Important Warning
-
-
-INTRODUCTION
-------------
- `/preprocessing` holds the code for the 3-party preprocessing phase
- `/2p-preprocessing` holds the OT-based code for the 2-party preprocessing phase
- `/duoram-online` holds the code for the online phase of the 3-party Duoram and for 2-party writes. For 2-party writes, switch off the `-ThreeParty` flag in the Makefile.
- `/cpir-read` includes the code for CPIR-based reads
-
-
-Setting up the docker and running the experimental scripts
-----------------------------------------------------------
--    Go to the docker folder: `cd Docker/`
--   `sh build-docker.sh` builds the docker image
--   `sh run-docker.sh` launches three docker containers for parties P2, P0, and P1
--   `sh set-networking.sh` sets up the latency and bandwidth constraints among the three docker containers launched; pass the latency and bandwidth as arguments.
--    For example `sh set-networking.sh 1ms  40mbit`. 
--   `sh unset-networking.sh` removes the latency and bandwidth settings.
--   `sh experiments.sh` runs the pre-processing phase of the 3-party computation.
--   `sh experiments-online.sh` runs the online phase of the 3-party computation.
--   `sh experiments-OT.sh` runs the OTs to generate blinds for the 2-party preprocessing phase.
--   `sh experiments-2p-preproc.sh` runs the pre-processing phase of the 2-party computation.
--   `sh experiments-cpir.sh` runs the CPIR-based reads.
--   `sh cleanup.sh` stops the three dockers.
-
-Running the code
------------------
-
-    1.  3-Party Preprocessing:
-        Let ip0, ip1, and ip2 be the IP addresses of the three docker containers. Consider a database of size `2^N`. Suppose the goal is to generate `r` DPFs.
-        The pre-processing generates DPFs with leaves of 128 bits. We use the first 64 bits for the writes. The remaining 64 bits are used for share conversion (converting XOR-shared flag vectors to additive shares).
-         - `cd duoram/preprocessing && make`
-         - `./p2preprocessing ip1 ip0 N r`
-         - `./preprocessing1  ip2 ip0 N r`
-         - `./preprocessing0  ip2 ip1 N r`
-        
-        This writes the evaluations of DPFs into a file which we use in the online phase.
-
-        To run the preprocessing with debug flags set:
-        -  `./debugp2preprocessing ip1 ip0 N r`
-        -  `./debugpreprocessing1  ip2 ip0 N r`
-        -  `./debugpreprocessing0  ip2 ip1 N r`
-
-
-    2.  3-Party Online Phase:
-         Let ip0, ip1, and ip2 be the IP addresses of the three docker containers. Consider a database of size `2^N`.  
-         - `cd duoram/duoram-online && make`
-         - `./p2      ip1 ip0 N W D I A` 
-         - `./duoram1 ip0 ip2 N W D I A`
-         - `./duoram0 ip1 ip2 N W D I A`
-
-        To run the online phase with debug flags set:
-        - `./debugp2 ip1 ip0 N W D I A`
-        - `./debugduoram1  ip2 ip0 N W D I A`
-        - `./debugduoram0  ip2 ip1 N W D I A`
-
-    3. 2-Party Preprocessing:
-        Let ip0 and ip1 be the IP addresses of the two docker containers. To generate `W` blinding factors:
-         - `cd duoram/2p-preprocessing && make`
-         - `./OT ip0 ip1 1 128*W`
-         - `./OT ip0 ip1 1 128*W`
-         - `./preprocessing1  ip0 N r`
-         - `./preprocessing0  ip1 N r`
-    
-    4. 2-Party Read:
-         - `cd duoram/cpir-read/cxx && make`
-         - `./spir_test1 ip0 N threads npreproc queries`
-         - `./spir_test0 ip1 N threads npreproc queries`
-
- Important Warning
- --------------------
-This software is for performance testing ONLY! The purpose of this software is to evaluate the performance of the system, NOT to be used in a deployment scenario. The software is full of security vulnerabilities. 
-
- 
-
- 
+# Duoram USENIX Security 2023 artifact
+
+Adithya Vadapalli, adithya.vadapalli@uwaterloo.ca  
+Ryan Henry, ryan.henry@ucalgary.ca  
+Ian Goldberg, iang@uwaterloo.ca
+
+This repo contains scripts to run the code and reproduce the graphs in our paper:
+
+Adithya Vadapalli, Ryan Henry, Ian Goldberg. Duoram: A Bandwidth-Efficient Distributed ORAM for 2- and 3-Party Computation. USENIX Security Symposium 2023. [https://eprint.iacr.org/2022/1747](https://eprint.iacr.org/2022/1747)
+
+It also downloads two other repos, which we use to compare our work to two prior systems:
+
+  - [Our dockerization](https://git-crysp.uwaterloo.ca/iang/floram-docker/src/usenixsec23_artifact)
+    of [Doerner and shelat's published code](https://gitlab.com/neucrypt/floram/) for their
+    [Floram](https://dl.acm.org/doi/10.1145/3133956.3133967)
+  - [Our dockerization](https://git-crysp.uwaterloo.ca/iang/circuit-oram-docker/src/usenixsec23_artifact)
+    of [Wei's published code](https://github.com/Boyoung-/circuit-oram-3pc) for
+    Jarecki and Wei's [3-party Circuit ORAM](https://eprint.iacr.org/2018/347)
+
+## Requirements
+
+To use this repo, you will need:
+
+  - A system with an x86 processor that supports AVX2 instructions.  This
+    instruction set was released in 2013, so most recent processors should
+    be fine.  We have tested it on both Intel and AMD processors.  On
+    Linux, run `grep avx2 /proc/cpuinfo` to see if your CPU can be used:
+    if the output shows your CPU flags, then it can be; if the output is
+    empty, it cannot.  (See also the snippet after this list.)
+
+  - A basic Linux installation, with `git` and `docker` installed.  We
+    have tested it on Ubuntu 20.04 and Ubuntu 22.04, with
+    `apt install git docker.io`
+
+  - At least 16 GB of available RAM.  To run the optional "large" tests,
+    you will require at least 660 GB of available RAM (an atypical machine, to be sure, which is why the large tests are optional, and not essential to our major claims).
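+
+For example, a quick way to run the CPU check mentioned in the first requirement above (this just wraps the `grep` test in a yes/no message):
+
+    # prints "AVX2 ok" if this CPU can be used, "no AVX2" otherwise
+    grep -q avx2 /proc/cpuinfo && echo "AVX2 ok" || echo "no AVX2"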
+
+## Quick start
+
+  - Clone this repo and check out the `usenixsec23_artifact` tag
+
+        git clone https://git-crysp.uwaterloo.ca/avadapal/duoram
+        cd duoram
+        git checkout usenixsec23_artifact
+
+  - Build the dockers; this will also download and build the dockers for the Floram and Circuit ORAM systems (approximate compute time: 15 minutes):
+
+        cd repro
+        ./build-all-dockers
+
+  - Run the "kick the tires" test (approximate compute time: about 1 minute):
+
+        ./repro-all-dockers test
+
+  - The output will have a stanza at the bottom labelled `===== Collated output =====`. In it, you will see the test output; for example:
+
+        2PDuoramOnln readwrite 16 1us 100gbit 2 0.86099545 s
+        2PDuoramOnln readwrite 16 1us 100gbit 2 22.046875 KiB
+        2PDuoramTotl readwrite 16 1us 100gbit 2 1.48480395 s
+        2PDuoramTotl readwrite 16 1us 100gbit 2 177.7138671875 KiB
+        3PDuoramOnln readwrite 16 1us 100gbit 2 0.012897 s
+        3PDuoramOnln readwrite 16 1us 100gbit 2 0.104166666666667 KiB
+        3PDuoramTotl readwrite 16 1us 100gbit 2 0.225875 s
+        3PDuoramTotl readwrite 16 1us 100gbit 2 12.7916666666667 KiB
+
+
+        Floram read 16 1us 100gbit 2 0.879635 s
+        Floram read 16 1us 100gbit 2 3837.724609375 KiB
+
+
+        CircuitORAMOnln read 16 1us 100gbit 2 0.313 s
+        CircuitORAMOnln read 16 1us 100gbit 2 710.625 KiB
+        CircuitORAMTotl read 16 1us 100gbit 2 0.753 s
+        CircuitORAMTotl read 16 1us 100gbit 2 4957 KiB
+
+  - What you see here are the four systems (2-party Duoram, 3-party Duoram, 2-party Floram, 3-party Circuit ORAM), with all except Floram showing both the online phase and the total of the preprocessing and the online phase.  (Floram does not have a separate preprocessing phase.) Each of those seven system/phase combinations shows the time taken for the test run, as well as the average bandwidth used per party.
+
+  - The output fields are:
+      - system and phase
+      - mode (reads, writes, or interleaved reads and writes)
+      - log<sub>2</sub> of the number of 64-bit words in the ORAM
+      - one-way latency between the parties (specifying "1us" really means not to add artificial latency, so you end up with the natural latency between dockers, which turns out to be 30 µs)
+      - bandwidth between the parties
+      - number of operations (number of reads or number of writes; interleaved reads and writes do this many reads interleaved with the same number of writes)
+      - the time in seconds or the bandwidth used in KiB, as indicated
+
+  - You should see the same set of 14 lines as shown above, though the exact times of course will vary according to your hardware. The bandwidths you see should match the above, however.
+
+  - If you run the test more than once, you will see means and stddevs of all of your runs.
+
+## Performance tuning
+
+There are a few environment variables you can set to tune the performance to the hardware configuration of your machine.
+
+  - **This one is easy and you should always do it**: if you have more than 16 GB of available RAM, set `DUORAM_MAXGB` to the number of available GB (units of 2<sup>30</sup> bytes, so really GiB).  In `bash` for example:
+
+        export DUORAM_MAXGB=64
+
+  - **These next ones are a bit in the weeds and are definitely optional.**
+
+  - If you have a NUMA machine with at least three NUMA nodes, then for the default (`small`) experiments, you're likely to want to run each of the three Duoram parties on a different NUMA node:
+
+        export DUORAM_NUMA_P0="numactl -N 0 -m 0"
+        export DUORAM_NUMA_P1="numactl -N 1 -m 1"
+        export DUORAM_NUMA_P2="numactl -N 2 -m 2"
+    and set `DUORAM_MAXGB` to _2 times_ the number of GB per NUMA node (counterintuitively, not 3; this is because Duoram's P2 (player 2) uses about twice as much memory as P0 or P1). Note that you do _not_ need to have `numactl` actually installed on your host; it's installed in the dockers, and run from there.
+
+  - If you have a NUMA machine with just two NUMA nodes, then just leave out the `DUORAM_NUMA_P2` setting above, and keep the other two (`DUORAM_NUMA_P0` and `DUORAM_NUMA_P1`).  `DUORAM_MAXGB` should still be twice the size of each NUMA node, which, in your case, will correctly be the size of all of memory; just subtract a bit for memory used by other processes (see the output of `free -g` for what's available).  This is OK because P2's computation largely does not overlap with P0's and P1's computations.  An example is sketched below.
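+
+    For example, on a hypothetical machine with two 64 GiB NUMA nodes (128 GiB total, a few GiB of which are in use by other processes):
+
+        export DUORAM_NUMA_P0="numactl -N 0 -m 0"
+        export DUORAM_NUMA_P1="numactl -N 1 -m 1"
+        # twice the 64 GiB node size, minus a bit for other processes
+        export DUORAM_MAXGB=120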
+
+  - If you have a regular (non-NUMA) machine, then you might want to set P0 and P1 each to use half the cores; as above, P2 can use all of the cores.  You might see a slight improvement if you give both hyperthreads on each core to the same party, rather than one to each party.  So, for example, if you have a 16-core machine (32 hyperthreads), with `/proc/cpuinfo` showing that processors 0 to 15 are the "A sides" of each core, and processors 16 to 31 are the corresponding "B sides", then something like this would be appropriate:
+
+        export DUORAM_NUMA_P0="numactl -C 0-7,16-23"
+        export DUORAM_NUMA_P1="numactl -C 8-15,24-31"
+
+  - For the optional `large` experiments, the last mode above (P0 and P1 each get half the cores, P2 gets all the cores) is likely to be the best, even on a NUMA machine.
+
+## Running the reproduction experiments
+
+### Figures 7, 8, and 9
+
+Reproducing the data for the graphs in Figures 7, 8, and 9 of the paper is then simple.  There are two batches of experiments.  The majority of the experiments, and the ones essential to our major claims, are the `small` experiments, with ORAMs of size up to 2<sup>26</sup> 64-bit words (so 512 MiB).  These experiments reproduce all the data points in Figures 7, 8, and 9, except for the rightmost three datapoints in Figure 9.
+
+The `large` experiments use ORAMs of size 2<sup>28</sup>, 2<sup>30</sup>, and 2<sup>32</sup> 64-bit words (so 2 GiB, 8 GiB, and 32 GiB respectively), only for Figure 9. Again, the `large` experiments are not essential to our major claims, and are optional to run.  They require a `DUORAM_MAXGB` setting of at least 660.
+
+To do the main (`small`) experiments (still in the `repro` directory):
+
+  - <code>./repro-all-dockers small _numops_</code>
+
+Here, <code>_numops_</code> is the number of read and write operations to do in each experiment.  The default is 128, which is the number we used for the graphs in the paper, but if you use a smaller number, the experiments will run faster, and the qualitative results should still be evident.
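+
+For example, to run the full set of `small` experiments with the same number of operations as used for the paper's graphs:
+
+    ./repro-all-dockers small 128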
+
+Similarly, to do the optional `large` experiments:
+
+  - <code>./repro-all-dockers large _numops_</code>
+
+
+Note that for <code>_numops_</code> of 128, a complete run of the `small` experiments will take about 10 hours (depending on your hardware, of course), and a complete run of the `large` experiments will take about 40 hours.
+
+Instead of `small` or `large` on the above command lines, you can also use `all` (which runs both `small` and `large`) or `none`, which doesn't run any new experiments, but still outputs the datapoints for the figures (discussed next). You should use the same value of <code>_numops_</code> for all runs (including `none`) that are meant to produce data for the same figures.
+
+### The output
+
+Runs of `repro-all-dockers` other than the `test` run will output the data for Figures 7(a), 7(b), 7(c), 8(a), 8(b), 8(c), 9(a), 9(b), and 9(c), clearly labelled, and separated into the data for each line in each subfigure.  Running `repro-all-dockers` multiple times will accumulate data, and means and stddevs will be output for all data points when more than one run has completed.  From this data, you should be able to verify our major claims (though depending on your hardware, the exact numbers will surely vary).
+
+### Figure 10
+
+Reproducing Figure 10 (the effect of scaling the number of cores) is
+slightly more work, because it depends more on your hardware
+configuration.  The top of the script `repro-scaling` in the `repro`
+directory has instructions.  Basically, set the variables (in the
+script, not environment variables) `BASE_DUORAM_NUMA_P0` and
+`BASE_DUORAM_NUMA_P1` to numactl commands (examples are given in the
+comments) to divide your system into two partitions as separate as
+possible: separate NUMA nodes if you have them, otherwise separate CPUs
+if you have them, otherwise separate cores.  If each of your two
+partitions has _n_ cores, ensure that the elements of `CORESLIST` do not
+exceed _n_ (but of course you will not be able to replicate those data
+points in that case). The paper uses values of _n_ up to 32 cores in
+each partition, so 64 cores in total; as above, P2 can reuse the cores
+of P0, since P2's main work is done after P0 and P1's main work has
+finished.
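+
+As a sketch, on a hypothetical two-socket machine with 32 cores per
+socket, the two partitions could simply be the two sockets.  The exact
+variable syntax is given in the comments at the top of `repro-scaling`;
+the values below (in particular the format of `CORESLIST`) are just an
+illustration, not the script's definitive contents:
+
+    # partition P0 onto NUMA node 0 and P1 onto NUMA node 1 (hypothetical)
+    BASE_DUORAM_NUMA_P0="numactl -N 0 -m 0"
+    BASE_DUORAM_NUMA_P1="numactl -N 1 -m 1"
+    # numbers of cores per party to test; at most 32 per partition here
+    CORESLIST="1 2 4 8 16 32"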
+
+So the steps are:
+
+  - `cd repro`
+  - Edit `repro-scaling` to set `BASE_DUORAM_NUMA_P0`,
+    `BASE_DUORAM_NUMA_P1`, and `CORESLIST` according to your hardware
+    configuration
+  - Run <code>./repro-scaling _numops_</code>, where
+    <code>_numops_</code> as before defaults to 128.
+
+For <code>_numops_</code> of 128, this should take around 4 to 5 hours to
+run.  The output will be similar to that [described above](#the-output),
+with clearly labelled data for each of Figures 10(a), 10(b), 10(c).
+
+## Running only the Duoram reproduction scripts
+
+The above instructions will run the reproduction scripts for all of Duoram, Floram, and Circuit ORAM.  If you just want to run the Duoram experiments, the instructions follow.  If you just want to run the Floram or Circuit ORAM experiments, see the README.md files in our [floram-docker](https://git-crysp.uwaterloo.ca/iang/floram-docker/src/usenixsec23_artifact) and [circuit-oram-docker](https://git-crysp.uwaterloo.ca/iang/circuit-oram-docker/src/usenixsec23_artifact) repos respectively.
+
+Follow these instructions to reproduce the Duoram data points (timings
+and bandwidth usage of 2-party and 3-party Duoram operations for various ORAM sizes and
+network settings) for the plots in our paper.  See
+[below](#manual-instructions) if you want to run experiments of your
+choosing.
+
+  - `cd Docker`
+  - Build the docker image with `./build-docker`
+  - Start the dockers with `./start-docker`
+    - This will start three dockers, each running one of the parties.
+  - Run the reproduction script `./repro` with one of the following
+    arguments:
+    - <code>./repro test</code>: Run a short (about one minute) "kick-the-tires" test.
+      You should see output like the following:
+
+            2PDuoramOnln readwrite 16 1us 100gbit 2 0.492821624 s
+            2PDuoramOnln readwrite 16 1us 100gbit 2 22.046875 KiB
+            2PDuoramTotl readwrite 16 1us 100gbit 2 1.234094624 s
+            2PDuoramTotl readwrite 16 1us 100gbit 2 177.7138671875 KiB
+            3PDuoramOnln readwrite 16 1us 100gbit 2 0.0037378 s
+            3PDuoramOnln readwrite 16 1us 100gbit 2 0.104166666666667 KiB
+            3PDuoramTotl readwrite 16 1us 100gbit 2 0.2907178 s
+            3PDuoramTotl readwrite 16 1us 100gbit 2 12.7916666666667 KiB
+
+      These lines are the output data points, telling you that a
+      2-party Duoram interleaved read/write test on an ORAM of size 2<sup>16</sup>, with a network
+      configuration of 1us latency and 100gbit bandwidth, performing 2
+      read operations, took 0.492821624 s of time and 22.046875 KiB of
+      bandwidth in the online phase, and 1.234094624 s of time and
+      177.7138671875 KiB of bandwidth for the total of the preprocessing
+      and online phases.  (And similarly for the 3-party Duoram
+      datapoints in the next four lines.) If you've run the test before,
+      you will see means and stddevs of all of the output data points.
+      When you run it, the time of course will depend on the particulars
+      of your hardware, but the bandwidth used should be exactly the
+      values quoted above.
+
+    - <code>./repro small _numops_</code>: Run the "small" tests.  These
+      are the tests up to size 2<sup>26</sup>, and produce all the data
+      points for Figures 7 and 8, and most of Figure 9.
+      <code>_numops_</code> is the number of operations to run for each
+      test; we used the default of 128 for the figures in the paper, but
+      you can use a lower number to make the tests run faster.  For the
+      default of 128, these tests should complete in about 4 to 5 hours,
+      and require 16 GB of available RAM.
+
+    - <code>./repro large _numops_</code>: Run the "large" tests.  These
+      are the rightmost 3 data points in Figure 9. They are not
+      essential to our major claims, so they are optional to run,
+      and you will definitely require a larger machine to run them.
+      For the default <code>_numops_</code> of 128, these experiments
+      will require about 28 hours to run and 660 GB of available RAM.
+      Reducing <code>_numops_</code> will reduce the runtime, but will
+      not change the RAM requirements.
+
+    - <code>./repro all _numops_</code>: Run both the "small" and
+      "large" tests.
+
+    - <code>./repro none _numops_</code>: Run no tests. This command is
+      nonetheless useful in order to parse the output logs and display
+      the data points for the graphs (see below).
+
+    - <code>./repro single _mode_ _size_ _latency_ _bandwidth_
+      _numops_ _phase_ _numparties_</code>: Run a single manually
+      selected test with the given parameters.
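+
+      For example, a hypothetical single run, using the parameter
+      values that are the defaults elsewhere in this README and one of
+      the sensible combinations listed under "Manual instructions"
+      below:
+
+            ./repro single readwrite 20 30ms 100mbit 128 online 3P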
+
+    - After `small`, `large`, `all`, or `none`, the script will parse
+      all of the outputs that have been collected with the specified
+      <code>_numops_</code> (in this run or previous runs), and output
+      them as they would appear in each of the subfigures of Figures 7,
+      8, and 9.
+
+  - When you're done, `./stop-docker`
+
+## Manual instructions
+
+Use these instructions if you want to run individual Duoram experiments
+of your own choosing.
+
+  - `cd Docker`
+  - `./build-docker`
+  - `./start-docker`
+    - This will start three dockers, each running one of the parties.
+
+Then to simulate network latency and capacity (optional):
+
+  - `./set-networking 30ms 100mbit`
+
+To turn that off again:
+
+  - `./unset-networking`
+
+Run experiments:
+
+  - <code>./run-experiment _mode_ _size_ _numops_ _phase_ _numparties_ &gt;&gt; _outfile_</code>
+    - <code>_mode_</code> is one of `read`, `write`, or `readwrite`.
+      Defaults to `read`.
+    - <code>_size_</code> is the base-2 log of the number of 64-bit entries in the ORAM (so <code>_size_</code> = 20 is an ORAM with 1048576 entries, for example).  Defaults to 20.
+    - <code>_numops_</code> is the number of operations to perform; one setup will be followed by <code>_numops_</code> operations, where each operation is a read, a write, or a read plus a write, depending on the <code>_mode_</code>. Defaults to 128.
+    - <code>_phase_</code> is `preproc` or `online`. Default is `online`.
+    - <code>_numparties_</code> is `2P` or `3P`. Default is `3P`.
+
+  - Note that the 3-party experiments actually do all of the `read`,
+    `write`, and `readwrite` modes at once, and the 2-party `read`
+    experiment does the `preproc` and `online` phases at once.  Also,
+    the 2-party `read` and `write` modes are completely independent, so
+    the 2-party `readwrite` is just the sum of `read` and `write`. So the
+    following five combinations of settings are the ones that make
+    sense:
+      - <code>./run-experiment read _size_ _numops_ online 2P</code>
+      - <code>./run-experiment write _size_ _numops_ preproc 2P</code>
+      - <code>./run-experiment write _size_ _numops_ online 2P</code>
+      - <code>./run-experiment read _size_ _numops_ preproc 3P</code>
+      - <code>./run-experiment readwrite _size_ _numops_ online 3P</code>
+
+  - <code>./parse_logs _outfile_</code>
+    - Parses the file output by one or more executions of `./run-experiment` to extract, for each experiment, the average number of bytes sent by each party and the time taken.
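+
+For example, a hypothetical end-to-end manual run (the output filename `3p.out` is arbitrary):
+
+    ./run-experiment readwrite 20 128 online 3P >> 3p.out
+    ./parse_logs 3p.out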
+
+When you're all done:
+
+  - `./stop-docker`
+
+## Under the hood
+
+The directories in this repo have the following roles:
+
+  - `repro`: the scripts to reproduce the graphs from the paper, including Duoram, Floram, and Circuit ORAM
+  - `Docker`: the scripts to run just the Duoram experiments
+  - `2p-preprocessing`: the 2-party write protocol preprocessing phase
+  - `cpir-read`: the 2-party read protocol preprocessing and online phases
+  - `preprocessing`: the 3-party read, write, and readwrite preprocessing phases
+  - `duoram-online`: the 2-party write and 3-party read, write, and readwrite online phases
+
+## Important warning
+
+This software is for performance testing ONLY! The purpose of this software is to evaluate the performance of the system, NOT to be used in a deployment scenario. The software is full of security vulnerabilities, and is not highly optimized.