Adithya Vadapalli, adithya.vadapalli@uwaterloo.ca
Ryan Henry, ryan.henry@ucalgary.ca
Ian Goldberg, iang@uwaterloo.ca
This repo contains scripts to run the code and reproduce the graphs in our paper:
Adithya Vadapalli, Ryan Henry, Ian Goldberg. Duoram: A Bandwidth-Efficient Distributed ORAM for 2- and 3-Party Computation. USENIX Security Symposium 2023. https://eprint.iacr.org/2022/1747
It also loads two other repos to compare our work to two other works:
To use this repo, you will need:
A system with an x86 processor that supports AVX2 instructions. This
instruction set was released in 2013, so most recent processors should
be fine. We have tested it on both Intel and AMD processors. On Linux, grep avx2 /proc/cpuinfo
to see if your CPU can be used (if the output shows you CPU flags, then it can be; if the output is empty, it cannot).
A basic Linux installation, with git
and docker
installed. We
have tested it on Ubuntu 20.04 and Ubuntu 22.04, with
apt install git docker.io
At least 16 GB of available RAM. To run the optional "large" tests, you will require at least 660 GB of available RAM (an atypical machine, to be sure, which is why the large tests are optional, and not essential to our major claims).
Clone this repo and check out the usenixsec23_artifact
tag
git clone https://git-crysp.uwaterloo.ca/avadapal/duoram
cd duoram
git checkout usenixsec23_artifact
Build the dockers; this will also download and build the dockers for the Floram and Circuit ORAM systems (approximate compute time: 15 minutes):
cd repro
./build-all-dockers
Run the "kick the tires" test (approximate compute time: about 1 minute):
./repro-all-dockers test
The output will have a stanza at the bottom labelled ===== Collated output =====
. In it, you will see the test output; for example:
2PDuoramOnln readwrite 16 1us 100gbit 2 0.86099545 s
2PDuoramOnln readwrite 16 1us 100gbit 2 22.046875 KiB
2PDuoramTotl readwrite 16 1us 100gbit 2 1.48480395 s
2PDuoramTotl readwrite 16 1us 100gbit 2 177.7138671875 KiB
3PDuoramOnln readwrite 16 1us 100gbit 2 0.012897 s
3PDuoramOnln readwrite 16 1us 100gbit 2 0.104166666666667 KiB
3PDuoramTotl readwrite 16 1us 100gbit 2 0.225875 s
3PDuoramTotl readwrite 16 1us 100gbit 2 12.7916666666667 KiB
Floram read 16 1us 100gbit 2 0.879635 s
Floram read 16 1us 100gbit 2 3837.724609375 KiB
CircuitORAMOnln read 16 1us 100gbit 2 0.313 s
CircuitORAMOnln read 16 1us 100gbit 2 710.625 KiB
CircuitORAMTotl read 16 1us 100gbit 2 0.753 s
CircuitORAMTotl read 16 1us 100gbit 2 4957 KiB
What you see here are the four systems (2-party Duoram, 3-party Duoram, 2-party Floram, 3-party Circuit ORAM), with all except Floram showing both the online phase and the total of the preprocessing and the online phase. (Floram does not have a separate preprocessing phase.) Each of those seven system/phase combinations shows the time taken for the test run, as well as the average bandwidth used per party.
The output fields are:
You should see the same set of 14 lines as shown above, though the exact times of course will vary according to your hardware. The bandwidths you see should match the above, however.
If you run the test more than once, you will see means and stddevs of all of your runs.
There are a few environment variables you can set to tune the performance to the hardware configuration of your machine.
This one is easy and you should always do it: if you have more than 16 GB of available RAM, set DUORAM_MAXGB
to the number of available GB (units of 230 bytes, so really GiB). In bash
for example:
export DUORAM_MAXGB=64
These next ones are a bit in the weeds and are definitely optional.
If you have a NUMA machine with at least three NUMA nodes, then for the default (small
) experiments, you're likely to want to run each of the three Duoram parties on a different NUMA node:
export DUORAM_NUMA_P0="numactl -N 0 -m 0"
export DUORAM_NUMA_P1="numactl -N 1 -m 1"
export DUORAM_NUMA_P2="numactl -N 2 -m 2"
and set DUORAM_MAXGB
to 2 times the number of GB per NUMA node (counterintuitively, not 3; this is because Duoram's P2 (player 2) uses about twice the memory as P0 or P1). Note that you do not need to have numactl
actually installed on your host; it's installed in the dockers, and run from there.
If you have a NUMA machine with just two NUMA nodes, then just leave out the DUORAM_NUMA_P2
setting above, and keep the other two (DUORAM_NUMA_P0
and DUORAM_NUMA_P1
). DUORAM_MAXGB
should still be twice the size of each NUMA node, which in your case, will correctly be the size of all of memory, but subtract a bit for memory used by other processes (see the output of free -g
to see what's available). This is OK because P2's computation largely does not overlap with P0's and P1's computations.
If you have a regular (non-NUMA) machine, then you might want to set P0 and P1 each to use half the cores; as above, P2 can use all of the cores. You might see a slight improvement if you give both hyperthreads on each core to the same party, rather than one to each party. So, for example, if you have a 16-core machine (32 hyperthreads), with /proc/cpuinfo
showing that processors 0 to 15 are the "A sides" of each core, and processors 16 to 31 are the corresponding "B sides", then something like this would be appropriate:
export DUORAM_NUMA_P0="numactl -C 0-7,16-23"
export DUORAM_NUMA_P1="numactl -C 8-15,24-31"
large
experiments, the last mode above (P0 and P1 each get half the cores, P2 gets all the cores) is likely to be the best, even on a NUMA machine.Reproducing the data for the graphs in Figures 7, 8, and 9 of the paper is then simple. There are two batches of experiments. The majority of the experiments, and the ones essential to our major claims, are the small
experiments, with ORAMs of size up to 226 64-bit words (so 512 MiB). These experiments reproduce all the data points in Figures 7, 8, and 9, except for the rightmost three datapoints in Figure 9.
The large
experiments use ORAMs of size 228, 230, and 232 64-bit words (so 2 GiB, 8 GiB, and 32 GiB respectively), only for Figure 9. Again, the large
experiments are not essential to our major claims, and are optional to run. They require a DUORAM_MAXGB
setting of at least 660.
To do the main (small
) experiments (still in the repro
directory):
./repro-all-dockers small numops
Here, numops
is the number of read and write operations to do in each experiment. The default is 128, which is the number we used for the graphs in the paper, but if you use a smaller number, the experiments will run faster, and the qualitative results should still be evident.
Similarly, to do the optional large
experiments:
./repro-all-dockers large numops
Note that for numops
of 128, a complete run of the small
experiments will take about 10 hours (depending on your hardware, of course), and a complete run of the large
experiments will take about 40 hours.
Instead of small
or large
on the above command lines, you can also use all
(which runs both small
and large
) or none
, which doesn't run any new experiments, but still outputs the datapoints for the figures (discussed next). You should use the same value of numops
for all runs (including none
) that are meant to produce data for the same figures.
Runs of repro-all-dockers
other than the test
run will output the data for Figures 7(a), 7(b), 7(c), 8(a), 8(b), 8(c), 9(a), 9(b), and 9(c), clearly labelled, and separated into the data for each line in each subfigure. Running repro-all-dockers
multiple times will accumulate data, and means and stddevs will be output for all data points when more than one run has completed. From this data, you should be able to verify our major claims (though depending on your hardware, the exact numbers will surely vary).
Reproducing Figure 10 (the effect of scaling the number of cores) is
slightly more work, because it depends more on your hardware
configuration. The top of the script repro-scaling
in the repro
directory has instructions. Basically, set the variables (in the
script, not environment variables) BASE_DUORAM_NUMA_P0
and
BASE_DUORAM_NUMA_P1
to numactl commands (examples are given in the
comments) to divide your system into two partitions as separate as
possible: separate NUMA nodes if you have them, otherwise separate CPUs
if you have them, otherwise separate cores. If each of your two
partitions has n cores, ensure that the elements of CORESLIST
do not
exceed n (of course you will not be able to replicate those data
points in that case, but the trend should still be apparent). The paper
uses values of n up to 32 cores in each partition, so 64 cores in
total; as above, P2 can reuse the cores of P0, since P2's main work is
done after P0 and P1's main work has finished.
So the steps are:
cd repro
repro-scaling
to set BASE_DUORAM_NUMA_P0
,
BASE_DUORAM_NUMA_P1
, and CORESLIST
according to your hardware
configuration./repro-scaling numops
, where
numops
as before defaults to 128.For numops
of 128, this should take around 4 to 5 hours to
run. The output will be similar to that described above,
with clearly labelled data for each of Figures 10(a), 10(b), 10(c).
The above instructions will run the reproduction scripts for all of Duoram, Floram, and Circuit ORAM. If you just want to run the Duoram experiments, the instructions follow. If you just want to run the Floram or Circuit ORAM experiments, see the README.md files in our floram-docker and circuit-oram-docker repos respectively.
Follow these instructions to reproduce the Duoram data points (timings and bandwidth usage of 2-party and 3-party Duoram operations for various ORAM sizes and network settings) for the plots in our paper. See below if you want to run experiments of your choosing.
cd Docker
./build-docker
./start-docker
Run the reproduction script ./repro
with one of the following
arguments:
./repro test
: Run a short (about one minute) "kick-the-tires" test.
You should see output like the following:
2PDuoramOnln readwrite 16 1us 100gbit 2 0.492821624 s
2PDuoramOnln readwrite 16 1us 100gbit 2 22.046875 KiB
2PDuoramTotl readwrite 16 1us 100gbit 2 1.234094624 s
2PDuoramTotl readwrite 16 1us 100gbit 2 177.7138671875 KiB
3PDuoramOnln readwrite 16 1us 100gbit 2 0.0037378 s
3PDuoramOnln readwrite 16 1us 100gbit 2 0.104166666666667 KiB
3PDuoramTotl readwrite 16 1us 100gbit 2 0.2907178 s
3PDuoramTotl readwrite 16 1us 100gbit 2 12.7916666666667 KiB
These lines are the output data points, telling you that a 2-party Duoram interleaved read/write test on an ORAM of size 216, with a network configuration of 1us latency and 100gbit bandwidth, performing 2 read operations, took 0.492821624 s of time and 22.046875 KiB of bandwidth in the online phase, and 1.234094624 s of time and 177.7138671875 KiB of bandwidth for the total of the preprocessing and online phases. (And similarly for the 3-party Duoram datapoints in the next four lines.) If you've run the test before, you will see means and stddevs of all of the output data points. When you run it, the time of course will depend on the particulars of your hardware, but the bandwidth used should be exactly the values quoted above.
./repro small numops
: Run the "small" tests. These
are the tests up to size 226, and produce all the data
points for Figures 7 and 8, and most of Figure 9.
numops
is the number of operations to run for each
test; we used the default of 128 for the figures in the paper, but
you can use a lower number to make the tests run faster. For the
default of 128, these tests should complete in about 4 to 5 hours,
and require 16 GB of available RAM.
./repro large numops
: Run the "large" tests. These
are the rightmost 3 data points in Figure 9. They are not
essential to our major claims, so they are optional to run,
and you will definitely require a larger machine to run them.
For the default numops
of 128, these experiments
will require about 28 hours to run and 660 GB of available RAM.
Reducing numops
will reduce the runtime, but will
not change the RAM requirements.
./repro all numops
: Run both the "small" and
"large" tests.
./repro none numops
: Run no tests. This command is
nonetheless useful in order to parse the output logs and display
the data points for the graphs (see below).
./repro single mode size latency bandwidth
numops phase numparties
: run a single manually
selected test with the given parameters.
After small
, large
, all
, or none
, the script will parse
all of the outputs that have been collected with the specified
numops
(in this run or previous runs), and output
them as they would appear in each of the subfigures of Figures 7,
8, and 9.
When you're done, ./stop-docker
Use this instructions if you want to run individual Duoram experiments of your own choosing.
cd Docker
./build-docker
./start-docker
Then to simulate network latency and capacity (optional):
./set-networking 30ms 100mbit
To turn that off again:
./unset-networking
Run experiments:
./run-experiment mode size numops phase numparties >> outfile
mode
is one of read
, write
, or readwrite
.
Defaults to read
.size
is the base-2 log of the number of 64-bit entries in the ORAM (so size
= 20 is an ORAM with 1048576 entries, for example). Defaults to 20.numops
is the number of operations to perform; one setup will be followed by numops
operations, where each operation is a read, a write, or a read plus a write, depending on the mode
. Defaults to 128.phase
is preproc
or online
. Default is online
.numparties
is 2P
or 3P
. Default is 3P
.Note that the 3-party experiments actually do all of the read
,
write
, and readwrite
modes at once, and the 2-party read
experiment does the preproc
and online
phases at once. Also,
the 2-party read
and write
modes are completely independent, so
the 2-party readwrite
is just the sum of read
and write
. So the
following five combinations of settings are the ones that make
sense:
./run_experiment read size numops online 2P
./run_experiment write size numops preproc 2P
./run_experiment write size numops online 2P
./run_experiment read size numops preproc 3P
./run_experiment readwrite size numops online 3P
./parse_logs outfile
./run-experiment
to extract, for each experiment, the average number of bytes sent by each party and the time taken.When you're all done:
./stop-docker
The directories in this repo have the following roles:
repro
: the scripts to reproduce the graphs from the paper, including Duoram, Floram, and Circuit ORAMDocker
: the scripts to run just the Duoram experiments2p-preprocessing
: the 2-party write protocol preprocessing phasecpir-read
: the 2-party read protocol preprocessing and online phasespreprocessing
: the 3-party read, write, and readwrite preprocessing phasesduoram-online
: the 2-party write and 3-party read, write, and readwrite online phasesThis software is for performance testing ONLY! The purpose of this software is to evaluate the performance of the system, NOT to be used in a deployment scenario. The software is full of security vulnerabilities, and is not highly optimized.