Cryptography and Machine Learning

Growing TF Encrypted

2019-05-17T10:00:00+00:00

What started out as a side project less than two years ago is growing up and moving into its own organization on GitHub!

The tremendous growth we have seen would not have been possible without partner contributors, and with this move TF Encrypted is being cemented as an independent community project that can encourage participation and remain focused on its mission: getting privacy-enhancing tools into the hands of machine learning practitioners.

This is a cross-posting of work done at Dropout Labs. A big thank you to Gavin Uhma, Ian Livingstone, Jason Mancuso, and Matt Maclellan for help with this post.

A Framework for Encrypted Deep Learning

TF Encrypted makes it easy to apply machine learning to data that remains encrypted at all times. It builds on, and integrates heavily, with TensorFlow, providing a familiar interface and encouraging mixing ordinary and encrypted computations. Together this ensures a pragmatic and gradual approach to a maturing technology.

The core consists of secure computation optimized for deep learning, as well as standard deep learning components adapted to work more efficiently on encrypted data. However, the whole purpose is to abstract all of this away.

As an example, the following code snippet shows how one can serve predictions on encrypted inputs, in this case using a small neural network. It closely resembles traditional TensorFlow code, with the exception of tfe.define_private_input and tfe.define_output that are used to express our desired privacy policy: that only the client should be able to see the input and the result in plaintext, and everyone else must only see them in an encrypted state.

import tensorflow as tf
import tf_encrypted as tfe

def provide_weights(): """Load model weight from disk using TensorFlow."""
def provide_input(): """Load and preprocess input data locally on the client."""
def receive_output(logits): return tf.print(tf.argmax(logits))

w0, b0, w1, b1, w2, b2 = provide_weights()

# run provide_input locally on the client and encrypt
x = tfe.define_private_input("prediction-client", provide_input)

# compute prediction on the encrypted input
layer0 = tfe.relu(tfe.matmul(x, w0) + b0)
layer1 = tfe.relu(tfe.matmul(layer0, w1) + b1)
logits = tfe.matmul(layer1, w2) + b2

# send results back to client, decrypt, and run receive_output locally
prediction_op = tfe.define_output("prediction-client", receive_output, logits)

with tfe.Session() as sess:
    sess.run(prediction_op)

Below we can see that TF Encrypted is also a natural fit for secure aggregation in federated learning. Here, in each iteration, gradients are computed locally by data owners using ordinary TensorFlow. They are then given as encrypted inputs to a secure computation of their mean, which in turn is revealed to the model owner who updates the model.

# compute and collect all model gradients as private inputs
model_grads = zip(*[
    tfe.define_private_input(
        data_owner.player_name,
        data_owner.compute_gradient)
    for data_owner in data_owners
])

# compute mean gradient securely
aggregated_model_grads = [
    tfe.add_n(grads) / len(grads)
    for grads in model_grads
]

# reveal only aggregated gradients to model owner
iteration_op = tfe.define_output(
    model_owner.player_name,
    model_owner.update_model,
    aggregated_model_grads)

with tfe.Session() as sess:
    for _ in range(num_iterations):
        sess.run(iteration_op)

Because of tight integration with TensorFlow, this process can easily be profiled and visualized using TensorBoard, as shown in the full example.

Finally, is it also possible to perform encrypted training on joint data sets. In the snippet below, two data owners provide encrypted training data that is merged and subsequently used as any other data set.

x_train_0, y_train_0 = tfe.define_private_input(
    data_owner_0.player_name,
    data_owner_0.provide_training_data)

x_train_1, y_train_1 = tfe.define_private_input(
    data_owner_1.player_name,
    data_owner_1.provide_training_data)

x_train = tfe.concat([x_train_0, x_train_1], axis=0)
y_train = tfe.concat([y_train_0, y_train_1], axis=0)

The GitHub repository contains several more examples, including notebooks to help you get started.

Moving Forward as a Community

Since the beginning, the motivation behind TF Encrypted has been to explore and unlock the impact of privacy-preserving machine learning; and the approach taken is to help practitioners get their hands dirty and experiment.

A sub-goal of this is to help improve communication between people within different areas of expertise, including creating a common vocabulary for more efficient knowledge sharing.

To really scale this up we need to bring as many people as possible together, as this means a better collective understanding, more exploration, and more identified use cases. And the only natural place for this to happen is where you feel comfortable and encouraged to contribute.

Use cases under constraints

Getting data scientists involved is key, as the technology has reached a maturity where it can be applied to real-world problems, yet is still not ready to simply be treated as a black box; even for solving problems that on paper may otherwise seem like a perfect fit.

Instead, to further bring the technology out from research circles, and find the right use cases given current constraints, we need people with domain knowledge to benchmark on the problems they face, and report on their findings.

Helping them get started quickly, and reducing their learning curve, is a key goal of TF Encrypted.

Cross-disciplinary research

At the same time, it is important that the runtime performance of the underlying technology continues to improve, as this makes more use cases practical.

The most obvious way of doing that is for researchers in cryptography to continue the development of secure computation and its adaptation to deep learning. However, this currently requires them to gain an intuition into machine learning that most do not have.

Orthogonal to improving how the computations are performed, another direction is to improve what functions are computed. This means adapting machine learning models to the encrypted setting and essentially treating it as a new type of computing device with its own characteristics; for which some operations, or even model types, are more suitable. However, this currently requires an understanding of cryptography that most do not have.

Forming a bridge that helps these two fields collaborate, yet stay focused on their area of expertise, is another key goal of TF Encrypted.

Common platform

Frameworks like TensorFlow have shown the benefits of bringing practitioners together on the same software platform. It makes everything concrete, including vocabulary, and shortens the distance from research to application. It makes everyone move towards the same target, yet via good abstractions allows each to focus on what they do best while still benefiting from the contributions of others. In other words, it facilitates taking a modular approach to the problem, lowering the overhead of everyone first developing expertise across all domains.

All of this leads to the core belief behind TF Encrypted: that we can push the field of privacy-preserving machine learning forward by building a common and integrated platform that makes tools and techniques for encrypted deep learning easily accessible.

To do this we welcome partners and contributors from all fields, including companies that want to leverage the accumulated expertise while keeping their focus on all the remaining questions around for instance taking this all the way to production.

Challenges and Roadmap

Building the current version of TF Encrypted was only the first step, with many interesting challenges on the road ahead. Below are a select few with more up-to-date status in the GitHub issues.

High-level API

As seen earlier, the interface of TF Encrypted has so far been somewhat low-level, roughly matching that of TensorFlow 1.x. This ensured user familiarity and gave us a focal point for adapting and optimizing cryptographic techniques.

However, it also has shortcomings.

One is that expressing models in this way has simply become outdated in light of high-level APIs such as Keras. This is also evident in the upcoming TensorFlow 2.x which fully embraces Keras and similar abstractions.

The second is related to why Keras has likely become so popular, namely its ability to express complex models succinctly and closely to how we think about them. This management of complexity only becomes more relevant when you add notions of distributed data with explicit ownership and privacy policies.

Thirdly, with a low-level API it is easy for users to shoot themselves in the foot and accidentally use operations that are very expensive in the encrypted space. Obtaining good results and figuring out which cryptographic techniques work best for a particular model typically requires some expertise, yet with a low-level API it is hard to incorporate and distribute such knowledge.

As a way of mitigating these issues, we are adding a high-level API to TF Encrypted closely matching Keras, but extended to work nicely with the concepts and constraints inherent in privacy-preserving machine learning. Although still a work in progress, one might imagine rewriting the first example from above as follows.

import tensorflow as tf
import tf_encrypted as tfe

class PredictionClient:

    @tfe.private_input
    def provide_input(self):
        """Load and preprocess input data."""

    @tfe.private_output
    def receive_output(self, logits):
        return tf.print(tf.argmax(logits))

model = tfe.keras.models.Sequential([
    tfe.keras.layers.Dense(activation='relu'),
    tfe.keras.layers.Dense(activation='relu'),
    tfe.keras.layers.Dense(activation=None)
])

prediction_client = PredictionClient()
x = prediction_client.provide_input()

y = model.predict(x)

prediction_client.receive_output(y)

We believe that distilling concepts in this way will improve the ability to accumulate knowledge while retaining a large degree of flexibility.

Pre-trained models

Taking the above mindset further, we also want to encourage the use of pre-trained models and fine-tuning when possible. These provide the least flexibility for users but offer great ways for accumulating expertise and lower user investments.

We plan on providing several well-known models adapted to the encrypted space, thus offering good trade-offs between accuracy and speed.

Tighter TensorFlow integration

Being in the TensorFlow ecosystem has been a huge advantage, providing not only the familiarity and hybrid approach already mentioned, but also allowing us to benefit from an efficient distributed platform with extensive support tools.

As such, it is no surprise that we want full support for one of the most exciting changes coming with TensorFlow 2.x, and the improvements to debugging and exploration that comes with it: eager evaluation by default. While completely abandoning static dataflow graphs would likely have a significant impact on performance, we expect to find reasonable compromises through the new tf.function and static sub-components.

We are also very excited to explore how TF Encrypted can work together with other projects such as TensorFlow Federated and TensorFlow Privacy by adding secure computation to the mix. For instance, TF Encrypted can be used to realize secure aggregation for the former, and can provide a complementary approach to privacy with respect to the latter.

More cryptographic techniques

TF Encrypted has been focused almost exclusively on secure computation based on secret sharing up until this point. However, in certain scenarios and models there are several other techniques that fit more naturally or offer better performance.

We are keen on incorporating these by providing wrappers of some of the excellent projects that already exist, making it easier to experiment and benchmark various combinations of techniques and parameters, and define good defaults.

Push the boundaries

Most research on encrypted deep learning has so far focused on relatively simple models, typically with fewer than a handful of layers.

Moving forward, we need to move beyond toy-like examples and tackle more models commonly used in real-world image analysis and in other domains such as natural language processing. Having the community settle on a few such models will help increase outside interest and bring the field forward by providing a focal point for research.

Data science workflow

While some constraints are currently due to technical maturity, others seem inherent from the fact we now want to keep data private. In other words, even if we had perfect secure computation, with the same performance and scalability properties as plaintext, then we still need to figure out and potentially adapt how we do e.g. data exploration, feature engineering, and production monitoring in the encrypted space.

This area remains largely unexplored and we are excited about digging in further.

Conclusion

Having seen TF Encrypted grow and create interest over the past two years has been an amazing experience, and it is only becoming increasingly clear that the best way to push the field of privacy-preserving machine learning forward is to bring together practitioners from different domains.

As a result, development of the project is now officially by The TF Encrypted Authors with specific attribution given via the Git commit history. For situations where someone needs to take the final decision I remain benevolent dictator, working towards the core beliefs outlined here.

Learn more and become part of the development on GitHub! 🚀

Experimenting with TF Encrypted

2018-10-19T12:00:00+00:00

Privacy-preserving machine learning offers many benefits and interesting applications: being able to train and predict on data while it remains in encrypted form unlocks the utility of data that were previously inaccessible due to privacy concerns. But to make this happen several technical fields must come together, including cryptography, machine learning, distributed systems, and high-performance computing.

The TF Encrypted open source project aims at bringing researchers and practitioners together in a familiar framework in order to accelerate exploration and adaptation. By building directly on TensorFlow it provides a high performance framework with an easy-to-use interface that abstracts away most of the underlying complexity, allowing users with only a basic familiarity with machine learning and TensorFlow to apply state-of-the-art cryptographic techniques without first becoming cross-disciplinary experts.

In this blog post we apply the library to a traditional machine learning example, providing a good starting point for anyone wishing to get into this rapidly growing field.

This is a cross-posting of work done at Dropout Labs with Jason Mancuso.

TensorFlow and TF Encrypted

We start by looking at how our task can be solved in standard TensorFlow and then go through the changes needed to make the predictions private via TF Encrypted. Since the interface of the latter is meant to simulate the simple and concise expression of common machine learning operations that TensorFlow is well-known for, this requires only a small change that highlights what one must inherently think about when moving to the private setting.

Following standard practice, the following script shows our two-layer feedforward network with ReLU activations (more details in the preprint).

Concretely, we consider the classic MNIST digit classification task. To keep things simple we use a small neural network and train it in the traditional way in TensorFlow using an unencrypted training set. However, for making predictions with the trained model we turn to TF Encrypted, and show how two servers can perform predictions for a client without learning anything about its input. While this is a basic yet somewhat standard benchmark in the literature, the techniques used carry over to many different use cases, including medical image analysis.

import tensorflow as tf

# generic functions for loading model weights and input data
def provide_weights(): """Load model weights as TensorFlow objects."""
def provide_input(): """Load input data as TensorFlow objects."""
def receive_output(logits): return tf.print(tf.argmax(logits))

# get model weights/input data (both unencrypted)
w0, b0, w1, b1, w2, b2 = provide_weights()
x = provide_input()

# compute prediction
layer0 = tf.nn.relu((tf.matmul(x, w0) + b0))
layer1 = tf.nn.relu((tf.matmul(layer0, w1) + b1))
logits = tf.matmul(layer2, w2) + b2

# get result of prediction and print
prediction_op = receive_output(logits)

# run graph execution in a tf.Session
with tf.Session() as sess:
    sess.run(prediction_op)

Note that the concrete implementation of provide_weights and provide_input have been left out for the sake of readability. These two methods simply load their respective values from NumPy arrays stored on disk, and return them as tensor objects.

We next turn to making the predictions private, where for the notion of privacy and encryption to even make sense we first need to recast our setting to consider more than the single party implicit in the script above. As seen below, expressing our intentions about who should get to see which values is the biggest difference between the two scripts.

We can naturally identify two of the parties: the prediction client who knows its own input and a model owner who knows the weights. Moreover, for the secure computation protocol chosen here we also need two servers that will be doing the actual computation on encrypted values; this is often desirable in applications where the clients may be mobile devices that have significant restraints on computational power and networking bandwidth.

In summary, our data flow and privacy assumptions are as illustrated in the diagram above. Here a model owner first gives encryptions of the model weights to the two servers in the middle (known as a private input), the prediction client then gives encryptions of its input to the two servers (another private input), who can execute the model and send back encryptions of the prediction result to the client, who can finally decrypt; at no point can the two servers decrypt any values. Below we see our script expressing these privacy assumptions.

import tensorflow as tf
import tf_encrypted as tfe

# generic functions for loading model weights and input data on each party
def provide_weights(): """Loads the model weights on the model-owner party."""
def provide_input(): """Loads the input data on the prediction-client party."""
def receive_output(): return tf.print(tf.argmax(logits))

# get model weights/input data as private tensors from each party
w0, b0, w1, b1, w2, b2 = tfe.define_private_input("model-owner", provide_weights)
x = tfe.define_private_input("prediction-client", provide_input)

# compute secure prediction
layer0 = tfe.relu((tfe.matmul(x, w0) + b0))
layer1 = tfe.relu((tfe.matmul(layer0, w1) + b1))
logits = tfe.matmul(layer1, w2) + b2

# send prediction output back to client
prediction_op = tfe.define_output("prediction-client", receive_output, logits)

# run secure graph execution in a tfe.Session
with tfe.Session() as sess:
    sess.run(prediction_op)

Note that most of the code remains essentially identical to the traditional TensorFlow code, using tfe instead of tf:

The provide_weights method for loading model weights is now wrapped in a call to tfe.define_private_input in order to specify they should be owned and restricted to the model owner; by wrapping the method call, TF Encrypted will encrypt them before sharing with other parties in the computation.
As with the weights, the prediction input is now also only accessible to the prediction client, who is also the only receiver of the output. Here the tf.print statement has been moved into receive_output as this is now the only point where the result is known in plaintext.
We also tie the name of parties to their network hosts. Although omitted here, this information also needs to be available on these hosts, as typically shared via a simple configuration file.

What’s the Point?

user-friendly: very little boilerplate, very similar to traditional TensorFlow.
abstract and modular: it integrates secure computation tightly with machine learning code, hiding advanced cryptographic operations underneath normal tensor operations.
extensible: new protocols and techniques can be added under the hood, and the high-level API won’t change. Similarly, new machine learning layers can be added and defined on top of each underlying protocol as needed, just like in normal TensorFlow.
fast: all of this is computed efficiently since it gets compiled down to ordinary TensorFlow graphs, and can hence take advantage of the optimized primitives for distributed computation that the TensorFlow backend provides.

These properties also make it easy to benchmark a diverse set of combinations of machine learning models and secure computation protocols. This allows for more fair comparisons, more confident experimental results, and a more rigorous empirical science, all while lowering the barrier to entry to private machine learning.

Finally, by operating directly in TensorFlow we also benefit from its ecosystem and can take advantage of existing tools such as TensorBoard. For instance, one can profile which operations are most expensive and where additional optimizations should be applied, and one can inspect where values reside and ensure correctness and security during implementation of the cryptographic protocols as shown below.

Here, we visualize the various operations that make up a secure operation on two private values. Each of the nodes in the underlying computation graph are shaded according to which machine aided that node’s execution, and it comes with handy information about data flow and execution time. This gives the user a completely transparent yet effective way of auditing secure computations, while simultaneously allowing for program debugging.

Conclusion

TF Encrypted is about providing researchers and practitioners with the open-source tools they need to quickly experiment with secure protocols and primitives for private machine learning.

The hope is that this will aid and inspire the next generation of researchers to implement their own novel protocols and techniques for secure computation in a fraction of the time, so that machine learning engineers can start to apply these techniques for their own use cases in a framework they’re already intimately familiar with.

To find out more have a look at the recent preprint or dive into the examples on GitHub!

Secure Computations as Dataflow Programs

2018-03-01T12:00:00+00:00

TL;DR: using TensorFlow as a distributed computation framework for dataflow programs we give a full implementation of the SPDZ protocol with networking, in turn enabling optimised machine learning on encrypted data.

Unlike earlier where we focused on the concepts behind secure computation as well as potential applications, here we build a fully working (passively secure) implementation with players running on different machines and communicating via typical network stacks. And as part of this we investigate the benefits of using a modern distributed computation platform when experimenting with secure computations, as opposed to building everything from scratch.

Additionally, this can also be seen as a step in the direction of getting private machine learning into the hands of practitioners, where integration with existing and popular tools such as TensorFlow plays an important part. Concretely, while we here only do a relatively shallow integration that doesn’t make use of some of the powerful tools that comes with TensorFlow (e.g. automatic differentiation), we do show how basic technical obstacles can be overcome, potentially paving the way for deeper integrations.

Jumping ahead, it is clear in retrospect that TensorFlow is an obvious candidate framework for quickly experimenting with secure computation protocols, at the very least in the context of private machine learning.

All code is available to play with, either locally or on the Google Cloud. To keep it simple our running example throughout is private prediction using logistic regression, meaning that given a private (i.e. encrypted) input x we wish to securely compute sigmoid(dot(w, x) + b) for private but pre-trained weights w and bias b (private training of w and b is considered in a follow-up post). Experiments show that for a model with 100 features this can be done in TensorFlow with a latency as low as 60ms and at a rate of up to 20,000 prediction per second.

A big thank you goes out to Andrew Trask, Kory Mathewson, Jan Leike, and the OpenMined community for inspiration and interesting discussions on this topic!

Disclaimer: this implementation is meant for experimentation only and may not live up to required security. In particular, TensorFlow does not currently seem to have been designed with this application in mind, and although it does not appear to be the case right now, may for instance in future versions perform optimisations behind that scene that break the intended security properties. More notes below.

Motivation

As hinted above, implementing secure computation protocols such as SPDZ is a non-trivial task due to their distributed nature, which is only made worse when we start to introduce various optimisations (but it can be done). For instance, one has to consider how to best orchestrate the simultanuous execution of multiple programs, how to minimise the overhead of sending data across the network, and how to efficient interleave it with computation so that one server only rarely waits on the other. On top of that, we might also want to support different hardware platforms, including for instance both CPUs and GPUs, and for any serious work it is highly valuable to have tools for visual inspection, debugging, and profiling in order to identify issues and bottlenecks.

It should furthermore also be easy to experiment with various optimisations, such as transforming the computation for improved performance, reusing intermediate results and masked values, and supplying fresh “raw material” in the form of triples during the execution instead of only generating a large batch ahead of time in an offline phase. Getting all this right can be overwhelming, which is one reason earlier blog posts here focused on the principles behind secure computation protocols and simply did everything locally.

Luckily though, modern distributed computation frameworks such as TensorFlow are receiving a lot of research and engineering attention these days due to their use in advanced machine learning on large data sets. And since our focus is on private machine learning there is a natural large fundamental overlap. In particular, the secure operations we are interested in are tensor addition, subtraction, multiplication, dot products, truncation, and sampling, which all have insecure but highly optimised counterparts in TensorFlow.

Prerequisites

We make the assumption that the main principles behind both TensorFlow and the SPDZ protocol are already understood – if not then there are plenty of good resources for the former (including whitepapers) and e.g. previous blog posts for the latter. As for the different parties involved, we also here assume a setting with two server, a crypto producer, an input provider, and an output receiver.

One important note though is that TensorFlow works by first constructing a static computation graph that is subsequently executed in a session. For instance, inspecting the graph we get from sigmoid(dot(w, x) + b) in TensorBoard shows the following.

This means that our efforts in this post are concerned with building such a graph, as opposed to actual execution as earlier: we are to some extend making a small compiler that translates secure computations expressed in a simple language into TensorFlow programs. As a result we benefit not only from working at a higher level of abstraction but also from the large amount of efforts that have already gone into optimising graph execution in TensorFlow.

See the experiments for a full code example.

Basics

Our needs fit nicely with the operations already provided by TensorFlow as seen next, with one main exception: to match typical precision of floating point numbers when instead working with fixedpoint numbers in the secure setting, we end up encoding into and operating on integers that are larger than what fits in the typical word sizes of 32 or 64 bits, yet today these are the only sizes for which TensorFlow provides operations (a constraint that may have something to do with current support on GPUs).

Luckily though, for the operations we need there are efficient ways around this that allow us to simulate arithmetic on tensors of ~120 bit integers using a list of tensors with identical shape but of e.g. 32 bit integers. And this decomposition moreover has the nice property that we can often operate on each tensor in the list independently, so in addition to enabling the use of TensorFlow, this also allows most operations to be performed in parallel and can actually increase efficiency compared to operating on single larger numbers, despite the fact that it may initially sound more expensive.

We discuss the details of this elsewhere and for the rest of this post simply assume operations crt_add, crt_sub, crt_mul, crt_dot, crt_mod, and sample that perform the expected operations on lists of tensors. Note that crt_mod, crt_mul, and crt_sub together allow us to define a right shift operation for fixedpoint truncation.

Private tensors

Each private tensor is determined by two shares, one of each server. And for the reasons mentioned above, each share is given by a list of tensors, which is represented by a list of nodes in the graph. To hide this complexity we introduce a simple class as follows.

class PrivateTensor:
    
    def __init__(self, share0, share1):
        self.share0 = share0
        self.share1 = share1
    
    @property
    def shape(self):
        return self.share0[0].shape
        
    @property
    def unwrapped(self):
        return self.share0, self.share1

And thanks to TensorFlow we can know the shape of tensors at graph creation time, meaning we don’t have to keep track of this ourselves.

Simple operations

Since a secure operation will often be expressed in terms of several TensorFlow operations, we use abstract operations such as add, mul, and dot as a convenient way of constructing the computation graph. The first one is add, where the resulting graph simply instructs the two servers to locally combine the shares they each have using a subgraph constructed by crt_add.

def add(x, y):
    assert isinstance(x, PrivateTensor)
    assert isinstance(y, PrivateTensor)
    
    x0, x1 = x.unwrapped
    y0, y1 = y.unwrapped
    
    with tf.name_scope('add'):
    
        with tf.device(SERVER_0):
            z0 = crt_add(x0, y0)

        with tf.device(SERVER_1):
            z1 = crt_add(x1, y1)

    z = PrivateTensor(z0, z1)
    return z

Notice how easy it is to use tf.device() to express which server is doing what! This command ties the computation and its resulting value to the specified host, and instructs TensorFlow to automatically insert appropiate networking operations to make sure that the input values are available when needed!

As an example, in the above, if x0 was previous on, say, the input provider then TensorFlow will insert send and receive instructions that copies it to SERVER_0 as part of computing add. All of this is abstracted away and the framework will attempt to figure out the best strategy for optimising exactly when to perform sends and receives, including batching to better utilise the network and keeping the compute units busy.

The tf.name_scope() command on the other hand is simply a logical abstraction that doesn’t influence computations but can be used to make the graphs much easier to visualise in TensorBoard by grouping subgraphs as single components as also seen earlier.

Note that by selecting Device coloring in TensorBoard as done above we can also use it to verify where the operations were actually computed, in this case that addition was indeed done locally by the two servers (green and turquoise).

Dot products

We next turn to dot products. This is more complex, not least since we now need to involve the crypto producer, but also since the two servers have to communicate with each other as part of the computation.

def dot(x, y):
    assert isinstance(x, PrivateTensor)
    assert isinstance(y, PrivateTensor)

    x0, x1 = x.unwrapped
    y0, y1 = y.unwrapped

    with tf.name_scope('dot'):

        # triple generation

        with tf.device(CRYPTO_PRODUCER):
            a = sample(x.shape)
            b = sample(y.shape)
            ab = crt_dot(a, b)
            a0, a1 = share(a)
            b0, b1 = share(b)
            ab0, ab1 = share(ab)
        
        # masking after distributing the triple
        
        with tf.device(SERVER_0):
            alpha0 = crt_sub(x0, a0)
            beta0  = crt_sub(y0, b0)

        with tf.device(SERVER_1):
            alpha1 = crt_sub(x1, a1)
            beta1  = crt_sub(y1, b1)

        # recombination after exchanging alphas and betas

        with tf.device(SERVER_0):
            alpha = reconstruct(alpha0, alpha1)
            beta  = reconstruct(beta0, beta1)
            z0 = crt_add(ab0,
                 crt_add(crt_dot(a0, beta),
                 crt_add(crt_dot(alpha, b0),
                         crt_dot(alpha, beta))))

        with tf.device(SERVER_1):
            alpha = reconstruct(alpha0, alpha1)
            beta  = reconstruct(beta0, beta1)
            z1 = crt_add(ab1,
                 crt_add(crt_dot(a1, beta),
                         crt_dot(alpha, b1)))
        
    z = PrivateTensor(z0, z1)
    z = truncate(z)
    return z

However, with tf.device() we see that this is still relatively straight-forward, at least if the protocol for secure dot products is already understood. We first construct a graph that makes the crypto producer generate a new dot triple. The output nodes of this graph is a0, a1, b0, b1, ab0, ab1

With crt_sub we then build graphs for the two servers that mask x and y using a and b respectively. TensorFlow will again take care of inserting networking code that sends the value of e.g. a0 to SERVER_0 during execution.

In the third step we reconstruct alpha and beta on each server, and compute the recombination step to get the dot product. Note that we have to define alpha and beta twice, one for each server, since although they contain the same value, if we had instead define them only on one server but used them on both, then we would implicitly have inserted additional networking operations and hence slowed down the computation.

Returning to TensorBoard we can verify that the nodes are indeed tied to the correct players, with yellow being the crypto producer, and green and turquoise being the two servers. Note the convenience of having tf.name_scope() here.

Configuration

To fully claim that this has made the distributed aspects of secure computations much easier to express, we also have to see what is actually needed for td.device() to work as intended. In the code below we first define an arbitrary job name followed by identifiers for our five players. More interestingly, we then simply specify their network hosts and wrap this in a ClusterSpec. That’s it!

JOB_NAME = 'spdz'

SERVER_0        = '/job:{}/task:0'.format(JOB_NAME)
SERVER_1        = '/job:{}/task:1'.format(JOB_NAME)
CRYPTO_PRODUCER = '/job:{}/task:2'.format(JOB_NAME)
INPUT_PROVIDER  = '/job:{}/task:3'.format(JOB_NAME)
OUTPUT_RECEIVER = '/job:{}/task:4'.format(JOB_NAME)

HOSTS = [
    '10.132.0.4:4440',
    '10.132.0.5:4441',
    '10.132.0.6:4442',
    '10.132.0.7:4443',
    '10.132.0.8:4444',
]

CLUSTER = tf.train.ClusterSpec({
    JOB_NAME: HOSTS
})

Note that in the screenshots we are actually running the input provider and output receiver on the same host, and hence both show up as 3/device:CPU:0.

Finally, the code that each player executes is about as simple as it gets.

server = tf.train.Server(CLUSTER, job_name=JOB_NAME, task_index=ROLE)
server.start()
server.join()

Here the value of ROLE is the only thing that differs between the programs the five players run and typically given as a command-line argument.

Improvements

With the basics in place we can look at a few optimisations.

Tracking nodes

Our first improvement allows us to reuse computations. For instance, if we need the result of dot(x, y) twice then we want to avoid computing it a second time and instead reuse the first. Concretely, we want to keep track of nodes in the graph and link back to them whenever possible.

To do this we simply maintain a global dictionary of PrivateTensor references as we build the graph, and use this for looking up already existing results before adding new nodes. For instance, dot now becomes as follows.

def dot(x, y):
    assert isinstance(x, PrivateTensor)
    assert isinstance(y, PrivateTensor)

    node_key = ('dot', x, y)
    z = nodes.get(node_key, None)

    if z is None:

        # ... as before ...

        z = PrivateTensor(z0, z1)
        z = truncate(z)
        nodes[node_key] = z

    return z

While already significant for some applications, this change also opens up for our next improvement.

Reusing masked tensors

We have already mentioned that we’d ideally want to mask every private tensor at most once to primarily save on networking. For instance, if we are computing both dot(w, x) and dot(w, y) then we want to use the same masked version of w in both. Specifically, if we are doing many operations with the same masked tensor then the cost of masking it can be amortised away.

But with the current setup we mask every time we compute e.g. dot or mul since masking is baked into these. So to avoid this we simply make masking an explicit operation, additionally allowing us to also use the same masked version across different operations.

def mask(x):
    assert isinstance(x, PrivateTensor)
    
    node_key = ('mask', x)
    masked = nodes.get(node_key, None)

    if masked is None:
      
        x0, x1 = x.unwrapped
        shape = x.shape
      
        with tf.name_scope('mask'):

            with tf.device(CRYPTO_PRODUCER):
                a = sample(shape)
                a0, a1 = share(a)

            with tf.device(SERVER_0):
                alpha0 = crt_sub(x0, a0)

            with tf.device(SERVER_1):
                alpha1 = crt_sub(x1, a1)

            # exchange of alphas

            with tf.device(SERVER_0):
                alpha_on_0 = reconstruct(alpha0, alpha1)

            with tf.device(SERVER_1):
                alpha_on_1 = reconstruct(alpha0, alpha1)

        masked = MaskedPrivateTensor(a, a0, a1, alpha_on_0, alpha_on_1)
        nodes[node_key] = masked
        
    return masked

Note that we introduce a MaskedPrivateTensor class as part of this, which is again simply a convenient way of abstracting over the five lists of tensors we get from mask(x).

class MaskedPrivateTensor(object):

    def __init__(self, a, a0, a1, alpha_on_0, alpha_on_1):
        self.a  = a
        self.a0 = a0
        self.a1 = a1
        self.alpha_on_0 = alpha_on_0
        self.alpha_on_1 = alpha_on_1

    @property
    def shape(self):
        return self.a[0].shape

    @property
    def unwrapped(self):
        return self.a, self.a0, self.a1, self.alpha_on_0, self.alpha_on_1

With this we may rewrite dot as below, which is now only responsible for the recombination step.

def dot(x, y):
    assert isinstance(x, PrivateTensor) or isinstance(x, MaskedPrivateTensor)
    assert isinstance(y, PrivateTensor) or isinstance(y, MaskedPrivateTensor)

    node_key = ('dot', x, y)
    z = nodes.get(node_key, None)

    if z is None:
      
        if isinstance(x, PrivateTensor): x = mask(x)
        if isinstance(y, PrivateTensor): y = mask(y)

        a, a0, a1, alpha_on_0, alpha_on_1 = x.unwrapped
        b, b0, b1,  beta_on_0,  beta_on_1 = y.unwrapped

        with tf.name_scope('dot'):

            with tf.device(CRYPTO_PRODUCER):
                ab = crt_dot(a, b)
                ab0, ab1 = share(ab)

            with tf.device(SERVER_0):
                alpha = alpha_on_0
                beta = beta_on_0
                z0 = crt_add(ab0,
                     crt_add(crt_dot(a0, beta),
                     crt_add(crt_dot(alpha, b0),
                             crt_dot(alpha, beta))))

            with tf.device(SERVER_1):
                alpha = alpha_on_1
                beta = beta_on_1
                z1 = crt_add(ab1,
                     crt_add(crt_dot(a1, beta),
                             crt_dot(alpha, b1)))

        z = PrivateTensor(z0, z1)
        z = truncate(z)
        nodes[node_key] = z

    return z

As a verification we can see that TensorBoard shows us the expected graph structure, in this case inside the graph for sigmoid.

Here the value of square(x) is first computed, then masked, and finally reused in four multiplications.

There is an inefficiency though: while the dataflow nature of TensorFlow will in general take care of only recomputing the parts of the graph that have changed between two executions, this does not apply to operations involving sampling via e.g. tf.random_uniform, which is used in our sharing and masking. Consequently, masks are not being reused across executions.

Caching values

To get around the above issue we can introduce caching of values that survive across different executions of the graph, and an easy way of doing this is to store tensors in variables. Normal executions will read from these, while an explicit cache_populators set of operations allow us to populated them.

For example, wrapping our two tensors w and b with such cache operation gets us the following graph.

When executing the cache population operations TensorFlow automatically figures out which subparts of the graph it needs to execute to generate the values to be cached, and which can be ignored.

And likewise when predicting, in this case skipping sharing and masking.

Buffering triples

Recall that a main purpose of triples is to move the computation of the crypto producer to an offline phase and distribute its results to the two servers ahead of time in order to speed up their computation later during the online phase.

So far we haven’t done anything to specify that this should happen though, and from reading the above code it’s not unreasonable to assume that the crypto producer will instead compute in synchronisation with the two servers, injecting idle waiting periods throughout their computation. However, from experiments it seems that TensorFlow is already smart enough to optimise the graph to do the right thing and batch triple distribution, presumably to save on networking. We still have an initial waiting period though, that we could get rid of by introducing a separate compute-and-distribute execution that fills up buffers.

We’ll skip this issue for now and instead return to it when looking at private training since it is not unreasonable to expect significant performance improvements there from distributing the training data ahead of time.

Profiling

As a final reason to be excited about building dataflow programs in TensorFlow we also look at the built-in runtime statistics. We have already seen the built-in detailed tracing support above, but in TensorBoard we can also easily see how expensive each operation was both in terms of compute and memory. The numbers reported here are from the experiments below.

The heatmap above e.g. shows that sigmoid was the most expensive operation in the run and that the dot product took roughly 30ms to execute. Moreover, in the below figure we have navigated further into the dot block and see that sharing in this particular run taking about 3ms.

This way we can potentially identify bottlenecks and compare performance of different approaches. And if needed we can of course switch to tracing for even more details.

Experiments

The GitHub repository contains the code needed for experimentation, including examples and instructions for setting up either a local configuration or a GCP configuration of hosts. For the running example of private prediciton using a logistic regression model we use the GCP configuration, i.e. the parties are running on different virtual hosts located in the same Google Cloud zone, here on some of the weaker instances, namely dual core and 10GB memory.

A slightly simplified version of our program is as follows, where we first train a model in public, build a graph for the private prediction computation, and then run it in a fresh session. The model was somewhat arbitrarily picked to have 100 features.

from config import session
from tensorspdz import (
    define_input, define_variable,
    add, dot, sigmoid, cache, mask,
    encode_input, decode_output
)

# publicly train `weights` and `bias`
weights, bias = train_publicly()

# define shape of unknown input
shape_x = X.shape

# construct graph for private prediction
input_x, x = define_input(shape_x, name='x')

init_w, w = define_variable(weights, name='w')
init_b, b = define_variable(bias, name='b')

if use_caching:
    w = cache(mask(w))
    b = cache(b)

y = sigmoid(add(dot(x, w), b))

# start session between all players
with session() as sess:

    # share and distribute `weights` and `bias` to the two servers
    sess.run([init_w, init_b])
    
    if use_caching:
        # compute and store cached values
        sess.run(cache_populators)

    # prepare to use `X` as private input for prediction
    feed_dict = encode_input([
        (input_x, X)
    ])

    # run secure computation and reveal output
    y_pred = sess.run(reveal(y), feed_dict=feed_dict)
    
    print decode_output(y_pred)

Running this a few times with different sizes of X gives the timings below, where the entire computation is considered including triple generation and distribution; slightly surprisingly there were no real difference between caching masked values or not.

Processing batches of size 1, 10, and 100 took roughly the same time, ~60ms on average, which might suggest a lower latency bound due to networking. At 1000 the time jumps to ~110ms, at 10,000 to ~600ms, and finally at 100,000 to ~5s. As such, if latency is important than we can perform ~1600 predictions per second, while if more flexible then at least ~20,000 per second.

This however is measuring only timings reported by profiling, with actual execution time taking a bit longer; hopefully some of the production-oriented tools such as tf.serving that come with TensorFlow can improve on this.

Thoughts

After private prediction it’ll of course also be interesting to look at private training. Caching of masked training data might be more relevant here since it remains fixed throughout the process.

The serving of models can also be improved, using for instance the production-ready tf.serving one might be able to avoid much of the current initial overhead for orchestration, as well as having endpoints that can be safely exposed to the public.

Finally, there are security improvements to be made on e.g. communication between the five parties. In particular, in the current version of TensorFlow all communication is happening over unencrypted and unauthenticated gRPC connections, which means that someone listening in on the network traffic in principle could learn all private values. Since support for TLS is already there in gRPC it might be straight-forward to make use of it in TensorFlow without a significant impact on performance. Likewise, TensorFlow does not currently use a strong pseudo-random generator for tf.random_uniform and hence sharing and masking are not as secure as they could be; adding an operation for cryptographically strong randomness might be straight-forward and should give roughly the same performance.

Private Image Analysis with MPC

2017-09-19T12:00:00+00:00

TL;DR: we take a typical CNN deep learning model and go through a series of steps that enable both training and prediction to instead be done on encrypted data.

Using deep learning to analyse images through convolutional neural networks (CNNs) has gained enormous popularity over the last few years due to their success in out-performing many other approaches on this and related tasks.

One recent application took the form of skin cancer detection, where anyone can quickly take a photo of a skin lesion using a mobile phone app and have it analysed with “performance on par with [..] experts” (see the associated video for a demo). Having access to a large set of clinical photos played a key part in training this model – a data set that could be considered sensitive.

Which brings us to privacy and eventually secure multi-party computation (MPC): how many applications are limited today due to the lack of access to data? In the above case, could the model be improved by letting anyone with a mobile phone app contribute to the training data set? And if so, how many would volunteer given the risk of exposing personal health related information?

With MPC we can potentially lower the risk of exposure and hence increase the incentive to participate. More concretely, by instead performing the training on encrypted data we can prevent anyone from ever seeing not only individual data, but also the learned model parameters. Further techniques such as differential privacy could additionally be used to hide any leakage from predictions as well, but we won’t go into that here.

In this blog post we’ll look at a simpler use case for image analysis but go over all required techniques. A few notebooks are presented along the way, with the main one given as part of the proof of concept implementation.

Slides from a more recent presentation at the Paris Machine Learning meetup are now also available.

A big thank you goes out to Andrew Trask, Nigel Smart, Adrià Gascón, and the OpenMined community for inspiration and interesting discussions on this topic! Jakukyo Friel has also very kindly made a Chinese translation.

Setting

We will assume that the training data set is jointly held by a set of input providers and that the training is performed by two distinct servers (or parties) that are trusted not to collaborate beyond what our protocol specifies. In practice, these servers could for instance be virtual instances in a shared cloud environment operated by two different organisations.

The input providers are only needed in the very beginning to transmit their (encrypted) training data; after that all computations involve only the two servers, meaning it is indeed plausible for the input providers to use e.g. mobile phones. Once trained, the model will remain jointly held in encrypted form by the two servers where anyone can use it to make further encrypted predictions.

For technical reasons we also assume a distinct crypto producer that generates certain raw material used during the computation for increased efficiency; there are ways to eliminate this additional entity but we won’t go into that here.

Finally, in terms of security we aim for a typical notion used in practice, namely honest-but-curious (or passive) security, where the servers are assumed to follow the protocol but may otherwise try to learn as much possible from the data they see. While a slightly weaker notion than fully malicious (or active) security with respect to the servers, this still gives strong protection against anyone who may compromise one of the servers after the computations, despite what they do. Note that for the purpose of this blog post we will actually allow a small privacy leakage during training as detailed later.

Image Analysis with CNNs

Our use case is the canonical MNIST handwritten digit recognition, namely learning to identify the Arabic numeral in a given image, and we will use the following CNN model from a Keras example as our base.

feature_layers = [
    Conv2D(32, (3, 3), padding='same', input_shape=(28, 28, 1)),
    Activation('relu'),
    Conv2D(32, (3, 3), padding='same'),
    Activation('relu'),
    MaxPooling2D(pool_size=(2,2)),
    Dropout(.25),
    Flatten()
]

classification_layers = [
    Dense(128),
    Activation('relu'),
    Dropout(.50),
    Dense(NUM_CLASSES),
    Activation('softmax')
]

model = Sequential(feature_layers + classification_layers)

model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'])

model.fit(
    x_train, y_train,
    epochs=1,
    batch_size=32,
    verbose=1,
    validation_data=(x_test, y_test))

We won’t go into the details of this model here since the principles are already well-covered elsewhere, but the basic idea is to first run an image through a set of feature layers that transforms the raw pixels of the input image into abstract properties that are more relevant for our classification task. These properties are then subsequently combined by a set of classification layers to yield a probability distribution over the possible digits. The final outcome is then typically simply the digit with highest assigned probability.

As we shall see, using Keras has the benefit that we can perform quick experiments on unencrypted data to get an idea of the performance of the model itself, as well as providing a simple interface to later mimic in the encrypted setting.

Secure Computation with SPDZ

With CNNs in place we next turn to MPC. For this we will use the state-of-the-art SPDZ protocol as it allows us to only have two servers and to improve online performance by moving certain computations to an offline phase as described in detail in earlier blog posts.

As typical in secure computation protocols, all computations take place in a field, here identified by a prime Q. This means we need to encode the floating-point numbers used by the CNNs as integers modulo a prime, which puts certain constraints on Q and in turn has an affect on performance.

Moreover, recall that in interactive computations such as the SPDZ protocol it becomes relevant to also consider communication and round complexity, in addition to the typical time complexity. Here, the former measures the number of bits sent across the network, which is a relatively slow process, and the latter the number of synchronisation points needed between the two servers, which may block one of them with nothing to do until the other catches up. Both hence also have a big impact on overall executing time.

Most importantly however, is that the only “native” operations we have in these protocols is addition and multiplication. Division, comparison, etc. can be done, but are more expensive in terms of our three performance measures. Later we shall see how to mitigate some of the issues raised due to this, but here we first recall the basic SPDZ protocol.

Tensor operations

When we introduced the SPDZ protocol earlier we did so in the form of classes PublicValue and PrivateValue representing respectively a (scalar) value known in clear by both servers and an encrypted value known only in secret shared form. In this blog post, we now instead present it more naturally via classes PublicTensor and PrivateTensor that reflect the heavy use of tensors in our deep learning setting.

class PrivateTensor:
    
    def __init__(self, values, shares0=None, shares1=None):
        if not values is None:
            shares0, shares1 = share(values)
        self.shares0 = shares0
        self.shares1 = shares1
    
    def reconstruct(self):
        return PublicTensor(reconstruct(self.shares0, self.shares1))
        
    def add(x, y):
        if type(y) is PublicTensor:
            shares0 = (x.values + y.shares0) % Q
            shares1 =             y.shares1
            return PrivateTensor(None, shares0, shares1)
        if type(y) is PrivateTensor:
            shares0 = (x.shares0 + y.shares0) % Q
            shares1 = (x.shares1 + y.shares1) % Q
            return PrivateTensor(None, shares0, shares1)
            
    def mul(x, y):
        if type(y) is PublicTensor:
            shares0 = (x.shares0 * y.values) % Q
            shares1 = (x.shares1 * y.values) % Q
            return PrivateTensor(None, shares0, shares1)
        if type(y) is PrivateTensor:
            a, b, a_mul_b = generate_mul_triple(x.shape, y.shape)
            alpha = (x - a).reconstruct()
            beta  = (y - b).reconstruct()
            return alpha.mul(beta) + \
                   alpha.mul(b) + \
                   a.mul(beta) + \
                   a_mul_b

As seen, the adaptation is pretty straightforward using NumPy and the general form of for instance PrivateTensor is almost exactly the same, only occationally passing a shape around as well. There are a few technical details however, all of which are available in full in the associated notebook.

def share(secrets):
    shares0 = sample_random_tensor(secrets.shape)
    shares1 = (secrets - shares0) % Q
    return shares0, shares1

def reconstruct(shares0, shares1):
    secrets = (shares0 + shares1) % Q
    return secrets

def generate_mul_triple(x_shape, y_shape):
    a = sample_random_tensor(x_shape)
    b = sample_random_tensor(y_shape)
    c = np.multiply(a, b) % Q
    return PrivateTensor(a), PrivateTensor(b), PrivateTensor(c)

As such, perhaps the biggest difference is in the above base utility methods where this shape is used.

Adapting the Model

While it is in principle possible to compute any function securely with what we already have, and hence also the base model from above, in practice it is relevant to first consider variants of the model that are more MPC friendly, and vice versa. In slightly more picturesque words, it is common to open up our two black boxes and adapt the two technologies to better fit each other.

The root of this comes from some operations being surprisingly expensive in the encrypted setting. We saw above that addition and multiplication are relatively cheap, yet comparison and division with private denominator are not. For this reason we make a few changes to the model to avoid these.

The various changes presented in this section as well as their simulation performances are available in full in the associated Python notebook.

Optimizer

The first issue involves the optimizer: while Adam is a preferred choice in many implementations for its efficiency, it also involves taking a square root of a private value and using one as the denominator in a division. While it is theoretically possible to compute these securely, in practice it could be a significant bottleneck for performance and hence relevant to avoid.

A simple remedy is to switch to the momentum SGD optimizer, which may imply longer training time but only uses simple operations.

model.compile(
    loss='categorical_crossentropy', 
    optimizer=SGD(clipnorm=10000, clipvalue=10000),
    metrics=['accuracy'])

An additional caveat is that many optimizers use clipping to prevent gradients from growing too small or too large. This requires a comparison on private values, which again is a somewhat expensive operation in the encrypted setting, and as a result we aim to avoid using this technique altogether. To get realistic results from our Keras simulation we increase the bounds as seen above.

Layers

Speaking of comparisons, the ReLU and max-pooling layers poses similar problems. In CryptoNets the former is replaced by a squaring function and the latter by average pooling, while SecureML implements a ReLU-like activation function by adding complexity that we wish to avoid to keep things simple. As such, we here instead use higher-degree sigmoid activation functions and average-pooling layers. Note that average-pooling also uses a division, yet this time the denominator is a public value, and hence division is simply a public inversion followed by a multiplication.

feature_layers = [
    Conv2D(32, (3, 3), padding='same', input_shape=(28, 28, 1)),
    Activation('sigmoid'),
    Conv2D(32, (3, 3), padding='same'),
    Activation('sigmoid'),
    AveragePooling2D(pool_size=(2,2)),
    Dropout(.25),
    Flatten()
]

classification_layers = [
    Dense(128),
    Activation('sigmoid'),
    Dropout(.50),
    Dense(NUM_CLASSES),
    Activation('softmax')
]

model = Sequential(feature_layers + classification_layers)

Simulations indicate that with this change we now have to bump the number of epochs, slowing down training time by an equal factor. Other choices of learning rate or momentum may improve this.

model.fit(
    x_train, y_train,
    epochs=15,
    batch_size=32,
    verbose=1,
    validation_data=(x_test, y_test))

The remaining layers are easily dealt with. Dropout and flatten do not care about whether we’re in an encrypted or unencrypted setting, and dense and convolution are matrix dot products which only require basic operations.

Softmax and loss function

The final softmax layer also causes complications for training in the encrypted setting as we need to compute both an exponentiation using a private exponent as well as normalisation in the form of a division with a private denominator.

While both remain possible we here choose a much simpler approach and allow the predicted class likelihoods for each training sample to be revealed to one of the servers, who can then compute the result from the revealed values. This of course results in a privacy leakage that may or may not pose an acceptable risk.

One heuristic improvement is for the servers to first permute the vector of class likelihoods for each training sample before revealing anything, thereby hiding which likelihood corresponds to which class. However, this may be of little effect if e.g. “healthy” often means a narrow distribution over classes while “sick” means a spread distribution.

Another is to introduce a dedicated third server who only does this small computation, doesn’t see anything else from the training data, and hence cannot relate the labels with the sample data. Something is still leaked though, and this quantity is hard to reason about.

Finally, we could also replace this one-vs-all approach with an one-vs-one approach using e.g. sigmoids. As argued earlier this allows us to fully compute the predictions without decrypting. We still need to compute the loss however, which could be done by also considering a different loss function.

Note that none of the issues mentioned here occur when later performing predictions using the trained network, as there is no loss to be computed and the servers can there simply skip the softmax layer and let the recipient of the prediction compute it himself on the revealed values: for him it’s simply a question of how the values are interpreted.

Transfer Learning

At this point it seems that we can actually train the model as-is and get decent results. But as often done in CNNs we can get significant speed-ups by employing transfer learning; in fact, it is somewhat well-known that “very few people train their own convolutional net from scratch because they don’t have sufficient data” and that “it is always recommended to use transfer learning in practice”.

A particular application to our setting here is that training may be split into a pre-training phase using non-sensitive public data and a fine-tuning phase using sensitive private data. For instance, in the case of a skin cancer detector, the researchers may choose to pre-train on a public set of photos and then afterwards ask volunteers to improve the model by providing additional photos.

Moreover, besides a difference in cardinality, there is also room for differences in the two data sets in terms of subjects, as CNNs have a tendency to first decompose these into meaningful subcomponents, the recognition of which is what is being transferred. In other words, the technique is strong enough for pre-training to happen on a different type of images than fine-tuning.

Returning to our concrete use-case of character recognition, we will let the “public” images be those of digits 0-4 and the “private” images be those of digits 5-9. As an alternative, it doesn’t seem unreasonable to instead have used for instance characters a-z as the former and digits 0-9 as the latter.

Pre-train on public dataset

In addition to avoiding the overhead of training on encrypted data for the public dataset, we also benefit from being able to train with more advanced optimizers. Here for instance, we switch back to the Adam optimizer for the public images and can take advantage of its improved training time. In particular, we see that we can again lower the number of epochs needed.

(x_train, y_train), (x_test, y_test) = public_dataset

model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

model.fit(
    x_train, y_train,
    epochs=1,
    batch_size=32,
    verbose=1,
    validation_data=(x_test, y_test))

Once happy with this the servers simply shares the model parameters and move on to training on the private dataset.

Fine-tune on private dataset

While we now begin encrypted training on model parameters that are already “half-way there” and hence can be expected to require fewer epochs, another benefit of transfer learning, as mentioned above, is that recognition of subcomponents tend to happen in the lower layers of the network and may in some cases be used as-is. As a result, we now freeze the parameters of the feature layers and focus training efforts exclusively on the classification layers.

for layer in feature_layers:
    layer.trainable = False

Note however that we still need to run all private training samples forward through these layers; the only difference is that we skip them in the backward step and that there are few parameters to train.

Training is then performed as before, although now using a lower learning rate.

(x_train, y_train), (x_test, y_test) = private_dataset

model.compile(
    loss='categorical_crossentropy',
    optimizer=SGD(clipnorm=10000, clipvalue=10000, lr=0.1, momentum=0.0),
    metrics=['accuracy'])

model.fit(
    x_train, y_train,
    epochs=5,
    batch_size=32,
    verbose=1,
    validation_data=(x_test, y_test))

In the end we go from 25 epochs to 5 epochs in the simulations.

Preprocessing

There are few preprocessing optimisations one could also apply but that we won’t consider further here.

The first is to move the computation of the frozen layers to the input provider so that it’s the output of the flatten layer that is shared with the servers instead of the pixels of the images. In this case the layers are said to perform feature extraction and we could potentially also use more powerful layers. However, if we want to keep the model proprietary then this adds significant complexity as the parameters now have to be distributed to the clients in some form.

Another typical approach to speed up training is to first apply dimensionality reduction techniques such as a principal component analysis. This approach is taken in the encrypted setting in BSS+’17.

Adapting the Protocol

Having looked at the model we next turn to the protocol: as well shall see, understanding the operations we want to perform can help speed things up.

In particular, a lot of the computation can be moved to the crypto provider, who’s generated raw material is independent of the private inputs and to some extend even the model. As such, its computation may be done in advance whenever it’s convenient and at large scale.

Recall from earlier that it’s relevant to optimise both round and communication complexity, and the extensions suggested here are often aimed at improving these at the expense of additional local computation. As such, practical experiments are needed to validate their benefits under concrete conditions.

Dropout

Starting with the easiest type of layer, we notice that nothing special related to secure computation happens here, and the only thing is to make sure that the two servers agree on which values to drop in each training iteration. This can be done by simply agreeing on a seed value.

Average pooling

The forward pass of average pooling only requires a summation followed by a division with a public denominator. Hence, it can be implemented by a multiplication with a public value: since the denominator is public we can easily find its inverse and then simply multiply and truncate. Likewise, the backward pass is simply a scaling, and hence both directions are entirely local operations.

Dense layers

The dot product needed for both the forward and backward pass of dense layers can of course be implemented in the typical fashion using multiplication and addition. If we want to compute dot(x, y) for matrices x and y with shapes respectively (m, k) and (k, n) then this requires m * n * k multiplications, meaning we have to communicate the same number of masked values. While these can all be sent in parallel so we only need one round, if we allow ourselves to use another kind of preprocessed triple then we can reduce the communication cost by an order of magnitude.

For instance, the second dense layer in our model computes a dot product between a (32, 128) and a (128, 5) matrix. Using the typical approach requires sending 32 * 5 * 128 == 22400 masked values per batch, but by using the preprocessed triples described below we instead only have to send 32 * 128 + 5 * 128 == 4736 values, almost a factor 5 improvement. For the first dense layer it is even greater, namely slightly more than a factor 25.

As also noted previously, the trick is to ensure that each private value in the matrices is only sent masked once. To make this work we need triples (a, b, c) of random matrices a and b with the appropriate shapes and such that c == dot(a, b).

def generate_dot_triple(x_shape, y_shape):
    a = sample_random_tensor(x_shape)
    b = sample_random_tensor(y_shape)
    c = np.dot(a, b) % Q
    return PrivateTensor(a), PrivateTensor(b), PrivateTensor(c)

Given such a triple we can instead communicate the values of alpha = x - a and beta = y - b followed by a local computation to obtain dot(x, y).

class PrivateTensor:
    
    ...
        
    def dot(x, y):
      
        if type(y) is PublicTensor:
            shares0 = x.shares0.dot(y.values) % Q
            shares1 = x.shares1.dot(y.values) % Q
            return PrivateTensor(None, shares0, shares1)
            
        if type(y) is PrivateTensor:
            a, b, a_dot_b = generate_dot_triple(x.shape, y.shape)
            alpha = (x - a).reconstruct()
            beta  = (y - b).reconstruct()
            return alpha.dot(beta) + \
                   alpha.dot(b) + \
                   a.dot(beta) + \
                   a_dot_b

Security of using these triples follows the same argument as for multiplication triples: the communicated masked values perfectly hides x and y while c being an independent fresh sharing makes sure that the result cannot leak anything about its constitutes.

Note that this kind of triple is used in SecureML, which also give techniques allowing the servers to generate them without the help of the crypto provider.

Convolutions

Like dense layers, convolutions can be treated either as a series of scalar multiplications or as a matrix multiplication, although the latter only after first expanding the tensor of training samples into a matrix with significant duplication. Unsurprisingly this leads to communication costs that in both cases can be improved by introducing another kind of triple.

As an example, the first convolution maps a tensor with shape (m, 28, 28, 1) to one with shape (m, 28, 28, 32) using 32 filters of shape (3, 3, 1) (excluding the bias vector). For batch size m == 32 this means 7,225,344 communicated elements if we’re using only scalar multiplications, and 226,080 if using a matrix multiplication. However, since there are only (32*28*28) + (32*3*3) == 25,376 private values involved in total (again not counting bias since they only require addition), we see that there is roughly a factor 9 overhead. In other words, each private value is being masked and sent several times. With a new kind of triple we can remove this overhead and save on communication cost: for 64 bit elements this means 200KB per batch instead of respectively 1.7MB and 55MB.

The triples (a, b, c) we need here are similar to those used in dot products, with a and b having shapes matching the two inputs, i.e. (m, 28, 28, 1) and (32, 3, 3, 1), and c matching output shape (m, 28, 28, 32).

Sigmoid activations

As done earlier, we may use a degree-9 polynomial to approximate the sigmoid activation function with a sufficient level of accuracy. Evaluating this polynomial for a private value x requires computing a series of powers of x, which of course may be done by sequential multiplication – but this means several rounds and corresponding amount of communication.

As an alternative we can again use a new kind of preprocessed triple that allows us to compute all required powers in a single round. As shown previously, the length of these “triples” is not fixed but equals the highest exponent, such that a triple for e.g. squaring consists of independent sharings of a and a**2, while one for cubing consists of independent sharings of a, a**2, and a**3.

Once we have these powers of x, evaluating a polynomial with public coefficients is then just a local weighted sum. The security of this again follows from the fact that all powers in the triple are independently shared.

def pol_public(x, coeffs, triple):
    powers = pows(x, triple)
    return sum( xe * ce for xe, ce in zip(powers, coeffs) )

We have the same caveat related to fixed-point precision as earlier though, namely that we need more room for the higher precision of the powers: x**n has n times the precision of x and we want to make sure that it does not wrap around modulo Q since then we cannot decode correctly anymore. As done there, we can solve this by introducing a sufficiently larger field P to which we temporarily switch while computing the powers, at the expense of two extra rounds of communication.

Practical experiments can show whether it best to stay in Q and use a few more multiplication rounds, or perform the switch and pay for conversion and arithmetic on larger numbers. Specifically, for low degree polynomials the former is likely better.

Proof of Concept Implementation

A proof-of-concept implementation without networking is available for experimentation and reproducibility. Still a work in progress, the code currently supports training a new classifier from encrypted features, but not feature extraction on encrypted images. In other words, it assumes that the input providers themselves run their images through the feature extraction layers and send the results in encrypted form to the servers; as such, the weights for that part of the model are currently not kept private. A future version will address this and allow training and predictions directly from images by enabling the feature layers to also run on encrypted data.

from pond.nn import Sequential, Dense, Sigmoid, Dropout, Reveal, Softmax, CrossEntropy
from pond.tensor import PrivateEncodedTensor

classifier = Sequential([
    Dense(128, 6272),
    Sigmoid(),
    Dropout(.5),
    Dense(5, 128),
    Reveal(),
    Softmax()
])

classifier.initialize()

classifier.fit(
    PrivateEncodedTensor(x_train_features), 
    PrivateEncodedTensor(y_train), 
    loss=CrossEntropy(), 
    epochs=3
)

The code is split into several Python notebooks, and comes with a set of precomputed weights that allows for skipping some of the steps:

The first one deals with pre-training on the public data using Keras, and produces the model used for feature extraction. This step can be skipped by using the repository’s precomputed weights instead.
The second one applies the above model to do feature extraction on the private data, thereby producing the features used for training the new encrypted classifier. In future versions this will be done by first encrypting the data. This step cannot be skipped as the extracted data is too large.
The third takes the extracted features and trains a new encrypted classifier. This is by far the most expensive step and may be skipped by using the repository’s precomputed weights instead.
Finally, the fourth notebook uses the new classifier to perform encrypted predictions from new images. Again feature extraction is currently done unencrypted.

Running the code is a matter of cloning the repository

$ git clone https://github.com/mortendahl/privateml.git && \
  cd privateml/image-analysis/

installing the dependencies

$ pip3 install jupyter numpy tensorflow keras h5py

launching a notebook

$ jupyter notebook

and navigating to either of the four notebooks mentioned above.

Thoughts

As always, when previous thoughts and questions have been answered there is already a new batch waiting.

Generalised triples

When seeking to reduce communication, one may also wonder how much can be pushed to the preprocessing phase in the form of additional types of triples.

As mentioned several times (and also suggested in e.g. BCG+’17), we typically seek to ensure that each private value is only sent masked once. So if we are e.g. computing both dot(x, y) and dot(x, z) then it might make sense to have a triple (r, s, t, u, v) where r is used to mask x, s to mask y, u to mask z, and t and u are used to compute the result. This pattern happens during training for instance, where values computed during the forward pass are sometimes cached and reused during the backward pass.

Perhaps more importantly though is when we are only making predictions with a model, i.e. computing with fixed private weights. In this case we only want to mask the weights once and then reuse these for each prediction. Doing so means we only have to mask and communicate proportionally to the input tensor flowing through the model, as opposed to propotionally to both the input tensor and the weights, as also done in e.g. JVC’18. More generally, we ideally want to communicate proportionally only to the values that change, which can be achieved (in an amortised sense) using tailored triples.

Finally, it is in principle also possible to have triples for more advanced functions such as evaluating both a dense layer and its activation function with a single round of communication, but the big obstacle here seems to be scalability in terms of triple storage and amount of computation needed for the recombination step, especially when working with tensors.

Activation functions

A natural question is which of the other typical activation functions are efficient in the encrypted setting. As mentioned above, SecureML makes use of ReLU by temporarily switching to garbled circuits, and CryptoDL gives low-degree polynomial approximations to both Sigmoid, ReLU, and Tanh (using Chebyshev polynomials for better accuracy).

It may also be relevant to consider non-typical but simpler activations functions, such as squaring as in e.g. CryptoNets, if for nothing else than simplifying both computation and communication.

Garbled circuits

While mentioned above only as a way of securely evaluating more advanced activation functions, garbled circuits could in fact also be used for larger parts, including as the main means of secure computation as done in for instance DeepSecure.

Compared to e.g. SPDZ this technique has the benefit of using only a constant number of communication rounds. The downside is that operations are now often happening on bits instead of on larger field elements, meaning more computation is involved.

Precision

A lot of the research around federated learning involve gradient compression in order to save on communication cost. Closer to our setting we have BMMP’17 which uses quantization to apply homomorphic encryption to deep learning, and even unencrypted production-ready systems often consider this technique as a way of improving performance also in terms of learning.

Floating point arithmetic

Above we used a fixed-point encoding of real numbers into field elements, yet unencrypted deep learning is typically using a floating point encoding. As shown in ABZS’12 and the reference implementation of SPDZ, it is also possible to use the latter in the encrypted setting, apparently with performance advantages for certain operations.

GPUs

Since deep learning is typically done on GPUs today for performance reasons, it is natural to consider whether similar speedups can be achieved by applying them in MPC computations. Some work exist on this topic for garbled circuits, yet it seems less popular in the secret sharing setting of e.g. SPDZ.

Biggest problem here might be maturity and availability of arbitrary precision arithmetic on GPUs (but see e.g. this and that) as needed for computations on field elements larger than e.g. 64 bits. Two things might be worth keeping in mind here though: firstly, while the values we compute on are larger than those natively supported, they are still bounded by the modulus; and secondly, we can do our secure computations over a ring instead of a field.

The SPDZ Protocol, Part 2

2017-09-10T12:00:00+00:00

This post is still very much a work in progress.

TL;DR: …

Triples

Underlying principle

(we turn our operation into a linear operation between private shares and public information; illustrate with mul; SageMath?)

Squaring

def generate_square_triple():
    a = random.randrange(Q)
    aa = pow(a, 2, Q)
    return PrivateValue(a), PrivateValue(aa)

class PrivateValue:
    
    ...
    
    def square(x):
        a, aa = generate_square_triple()
        alpha = (x - a).reconstruct()
        return alpha.square() + \
               (a * alpha) * 2 + \
               aa

Dot

Powering

As an alternative we can again use a new kind of preprocessed triple that allows exponentiation to all required powers to be done in a single round. The length of these “triples” is not fixed but equals the highest exponent, such that a triple for squaring, for instance, consists of independent sharings of a and a^2, while one for cubing consists of independent sharings of a, a^2, and a^3.

def generate_pows_triple(exponent, shape):
    a = np.random.randint(Q, size=shape)
    return [ share(np.power(a, e) % Q) for e in range(1, exponent+1) ]

To use these we notice that if epsilon = x - a then x^n == (epsilon + a)^n, which by the binomal theorem may be expressed as a weighted sum of epsilon^n * a^0, …, epsilon^0 * a^n using the binomial coefficients as weights. For instance, we have x^3 == (c0 * epsilon^3) + (c1 * epsilon^2 * a) + (c2 * epsilon * a^2) + (c3 * a^3) with ck = C(3, k).

Moreover, a triple for e.g. cubing x can also simultaneously be used for squaring x simply by skipping some powers and computing different binomial coefficients. Hence, all intermediate powers may be computed using a single triple and communication of one field element. The security of this again follows from the fact that all powers in the triple are independently shared.

def pows(x, triple):
    # local masking
    a = triple[0]
    v = sub(x, a)
    # communication: the players simultanously send their share to the other
    epsilon = reconstruct(v)
    # local combination to compute all powers
    x_powers = []
    for exponent in range(1, len(triple)+1):
        # prepare all term values
        a_powers = [ONE] + triple[:exponent]
        e_powers = [ pow(epsilon, e, Q) for e in range(exponent+1) ]
        coeffs   = [ binom(exponent, k) for k in range(exponent+1) ]
        # compute and sum terms
        terms = ( mul_public(a,e*c) for a,e,c in zip(a_powers,reversed(e_powers),coeffs) )
        x_powers.append(reduce(lambda x,y: add(x, y), terms))
    return x_powers

There is one caveat however, and that is that we now need room for the higher precision of the powers: x^n has n times the precision of x and we want to make sure that this value does not wrap around modulo Q.

One way around this is to temporarily switch to a larger field and compute the powers and truncation there. The conversion to and from this larger field P each take one round of communication, so polynomial evaluation ends up taking a total of three rounds.

Security wise we also have to pay a small price, although from a practical perspective there is little difference. In particular, for this operation we rely on statistical security instead of perfect security: since r is not an uniform random element here, there’s a tiny risk that something will be leaked about x.

def generate_statistical_mask():
    return random.randrange(2*BOUND * 10**KAPPA)

def generate_zero_triple(field):
    return share(0, field)

def convert(x, from_field, to_field, zero_triple):
    # local mapping to positive representation
    x = add_public(x, BOUND, from_field)
    # local masking and conversion by player 0
    r = generate_statistical_mask()
    y0 = (zero_triple[0] - r) % to_field
    # exchange of masked share: one round of communication
    e = (x[0] + r) % from_field
    # local conversion by player 1
    xr = (e + x[1]) % from_field
    y1 = (zero_triple[1] + xr) % to_field
    # local mapping back from positive representation
    y = [y0, y1]
    y = sub_public(y, BOUND, to_field)
    return y

def upshare(x, large_zero_triple):
    return convert(x, Q, P, large_zero_triple)

def downshare(x, small_zero_triple):
    return convert(x, P, Q, small_zero_triple)

Note that we could of course decide to simply do all computations in the larger field P, thereby avoiding the conversion steps. This will likely slow down the local computations by a non-trivial factor however, as we may need arbitrary precision arithmetic for P as opposed to e.g. 64 bit native arithmetic for Q.

Practical experiments will show whether it best to stay in Q and use a few more rounds, or switch temporarily to P and pay for conversion and arbitrary precision arithmetic. Specifically, for low degree polynomials the former is likely better.

Generalised triples

TODO only send the masking of a value once; reuse masked version until the variable is updated

When seeking to reduce communication, one may also wonder how much can be pushed to the preprocessing phase in the form of additional types of triples.

As mentioned earlier, we might seek to ensure that each private value is only sent masked once. So if we are e.g. computing both dot(X, Y) and dot(X, Z) then it might make sense to have a triple (R, S, T, U, V) that allows us to compute both results yet only send X masked once, as done in e.g. BCG+’17.

One relevant case is training, where some values are used to compute both the output of the layer during the forward phase, but also typically cached and used again to update the weights during the backward phase (for instance in dense layers).

Another, perhaps more important case, is if we are only interested in during prediction: TODO TODO TODO

Additionally, it might also be possible to have triples for more advanced functions such as evaluating both a dense layer and its activation function with a single round of communication. Main question here again seems to be efficiency, this time in terms of triple storage and amount of computation needed for the recombination step.

The SPDZ Protocol, Part 1

2017-09-03T12:00:00+00:00

This post is still very much a work in progress.

TL;DR: this is the first in a series of posts explaining a state-of-the-art protocol for secure computation.

In this blog post we’ll go through the state-of-the-art SPDZ protocol for secure computation. Unlike the protocol used in a previous blog post, SPDZ allows us to have as few as two parties computing on private values. Moreover, it has received significant scientific attention over the last few years and as a result several optimisations are known that can used to speed up our computation.

In this series we’ll go through and describe the state-of-the-art SPDZ protocol for secure computation. Unlike the protocol used in a previous blog post, SPDZ allows us to have as few as two parties computing on private values and it allows us to move parts of the computation to an offline phase in order to gain a more performant online phase. Moreover, it has received significant scientific attention over the last few years that resulted in various optimisations and efficient implementations.

The code for this section is available in this associated notebook.

Background

The protocol was first described in SPZD’12 and DKLPSS’13, but have also been the subject of at least one series of blog posts. Several implementations exist, including one from the cryptography group at the University of Bristol providing both high performance and full active security.

As usual, all computations take place in a finite ring, often identified by a prime modulus Q. As we will see, this means we also need a way to encode the fixed-point numbers used by the CNNs as integers modulo a prime, and we have to take care that these never “wrap around” as we then may not be able to recover the correct result.

Moreover, while the computational resources used by a procedure is often only measured in time complexity, i.e. the time it takes the CPU to perform the computation, with interactive computations such as the SPDZ protocol it also becomes relevant to consider communication and round complexity. The former measures the number of bits sent across the network, which is a relatively slow process, and the latter the number of synchronisation points needed between the two parties, which may block one of them with nothing to do until the other catches up. Both hence also have a big impact on overall executing time.

Concretely, we have an interest in keeping Q is small as possible, not only because we can then do arithmetic operations using only a single word sized operations (as opposed to arbitrary precision arithmetic which is significantly slower), but also because we have to transmit less bits when sending field elements across the network.

Note that while the protocol in general supports computations between any number of parties we here present it for the two-party setting only. Moreover, as mentioned earlier, we aim only for passive security and assume a crypto provider that will honestly generate the needed triples.

Note that while the protocol in general supports computations between any number of parties we here use and specialise it for the two-party setting only. Moreover, as mentioned earlier, we aim only for passive security and assume a crypto provider that will honestly generate the needed triples.

Setting

The input providers are only needed in the very beginning to transmit their training data; after that all computations involve only the two servers, meaning it is indeed plausible for the input providers to use e.g. mobile phones. Once trained, the model will remain jointly held in encrypted form by the two servers where anyone can use it to make further encrypted predictions.

Secure Computation with SPDZ

Sharing a private value between the two servers is done using the simple additive scheme. This may be performed by anyone, including an input provider, and keeps the value perfectly private as long as the servers are not colluding.

def share(secret):
    share0 = random.randrange(Q)
    share1 = (secret - share0) % Q
    return [share0, share1]

And when specified by the protocol, the private value can be reconstruct by a server sending his share to the other.

def reconstruct(share0, share1):
    return (share0 + share1) % Q

Of course, if both parties are to learn the private value then they can send their share simultaneously and hence still only use one round of communication.

Note that the use of an additive scheme means the servers are required to be highly robust, unlike e.g. Shamir’s scheme which may handle some servers dropping out. If this is a reasonable assumption though, then additive sharing provides significant advantages.

class PrivateValue:

    def __init__(self, value, share0=None, share1=None):
        if not value is None:
            share0, share1 = share(value)
        self.share0 = share0
        self.share1 = share1

    def reconstruct(self):
        return PublicValue(reconstruct(self.share0, self.share1))

Linear operations

Having obtained sharings of private values we may next perform certain operations on these. The first set of these is what we call linear operations since they allow us to form linear combinations of private values.

The first are addition and subtraction, which are simple local computations on the shares already held by each server. And if one of the values is public then we may simplify.

class PrivateValue:

    ...

    def add(x, y):
        if type(y) is PublicValue:
            share0 = (x.share0 + y.value) % Q
            share1 =  x.share1
            return PrivateValue(None, share0, share1)    
        if type(y) is PrivateValue:
            share0 = (x.share0 + y.share0) % Q
            share1 = (x.share1 + y.share1) % Q
            return PrivateValue(None, share0, share1)

    def sub(x, y):
        if type(y) is PublicValue:
            share0 = (x.share0 - y.value) % Q
            share1 =  x.share1
            return PrivateValue(None, share0, share1)
        if type(y) is PrivateValue:
            share0 = (x.share0 - y.share0) % Q
            share1 = (x.share1 - y.share1) % Q
            return PrivateValue(None, share0, share1)

x = PrivateValue(5)
y = PrivateValue(3)

z = x + y
assert z.reconstruct() == 8

Next we may also perform multiplication with a public value by again only performing a local operation on the share already held by each server.

class PrivateValue:

    ...

    def mul(x, y):
        if type(y) is PublicValue:
            share0 = (x.share0 * y.value) % Q
            share1 = (x.share1 * y.value) % Q
            return PrivateValue(None, share0, share1)

Note that the security of these operations is straight-forward since no communication is taking place between the two parties and hence nothing new could have been revealed.

x = PrivateValue(5)
y = PublicValue(3)

z = x * y
assert z.reconstruct() == 15

Multiplication

Multiplication of two private values is where we really start to deviate from the protocol used previously. The techniques used there inherently need at least three parties so won’t be much help in our two party setting.

Perhaps more interesting though, is that the new techniques used here allow us to shift parts of the computation to an offline phase where raw material that doesn’t depend on any of the private values can be generated at convenience. As we shall see later, this can be used to significantly speed up the online phase where training and prediction is taking place.

This raw material is popularly called a multiplication triple (and sometimes Beaver triple due to their introduction in Beaver’91) and consists of independent sharings of three values a, b, and c such that a and b are uniformly random values and c == a * b % Q. Here we assume that these triples are generated by the crypto provider, and the resulting shares distributed to the two parties ahead of running the online phase. In other words, when performing a multiplication we assume that Pi already knows a[i], b[i], and c[i].

def generate_mul_triple():
    a = random.randrange(Q)
    b = random.randrange(Q)
    c = (a * b) % Q
    return PrivateValue(a), PrivateValue(b), PrivateValue(c)

Note that a large portion of efforts in current research and the full reference implementation is spent on removing the crypto provider and instead letting the parties generate these triples on their own; we won’t go into that here but see the resources pointed to earlier for details.

To use multiplication triples to compute the product of two private values x and y we proceed as follows. The idea is simply to use a and b to respectively mask x and y and then reconstruct the masked values as respectively alpha and beta. As public values, alpha and beta may then be combined locally by each server to form a sharing of z == x * y.

class PrivateValue:

    ...

    def mul(x, y):
        if type(y) is PublicValue:
            ...
        if type(y) is PrivateValue:
            a, b, a_mul_b = generate_mul_triple()
            # local masking followed by communication of the reconstructed values
            alpha = (x - a).reconstruct()
            beta  = (y - b).reconstruct()
            # local re-combination
            return alpha.mul(beta) + \
                   alpha.mul(b) + \
                   a.mul(beta) + \
                   a_mul_b

If we write out the equations we see that alpha * beta == xy - xb - ay + ab, a * beta == ay - ab, and b * alpha == bx - ab, so that the sum of these with c cancels out everything except xy. In terms of complexity we see that communication of two field elements in one round is required.

Finally, since x and y are perfectly hidden by a and b, neither server learns anything new as long as each triple is only used once. Moreover, the newly formed sharing of z is “fresh” in the sense that it contains no information about the sharings of x and y that were used in its construction, since the sharing of c was independent of the sharings of a and b.

Next Steps

Secret Sharing, Part 3

2017-08-13T12:00:00+00:00

TL;DR: due to redundancy in the way shares are generated, we can compensate not only for some of them being lost but also for some being manipulated; here we look at how to do this using decoding methods for Reed-Solomon codes.

Returning to our motivation in part one for using secret sharing, namely to distribute trust, we recall that the generated shares are given to shareholders that we may not trust individually. As such, if we later ask for the shares back in order to reconstruct the secret then it is natural to consider how reasonable it is to assume that we will receive the original shares back.

Specifically, what if some shares are lost, or what if some shares are manipulated to differ from the initially ones? Both may happen due to simple systems failure, but may also be the result of malicious behaviour on the part of shareholders. Should we in these two cases still expect to be able to recover the secret?

In this blog post we will see how to handle both situations. We will use simpler algorithms, but note towards the end how techniques like those used in part two can be used to make the process more efficient.

As usual, all code is available in the associated Python notebook.

Robust Reconstruction

In the first part we saw how Lagrange interpolation can be used to answer the first question, in that it allows us to reconstruct the secret as long as only a bounded number of shares are lost. As mentioned in the second part, this is due to the redundancy that comes with point-value presentations of polynomials, namely that the original polynomial is uniquely defined by any large enough subset of the shares. Concretely, if D is the degree of the original polynomial then we can reconstruct given R = D + 1 shares in case of Shamir’s scheme and R = D + K shares in the packed variant; if N is the total number of shares we can hence afford to loose N - R shares.

But this is assuming that the received shares are unaltered, and the second question concerning recovery in the face of manipulated shares is intuitively harder as we now cannot easily identify when and where something went wrong. (Note that it is also harder in a more formal sense, namely that a solution for manipulated shares can be used as a solution for lost shares, since dummy values, e.g. a constant, may be substituted for the lost shares and then instead treated as having been manipulated. This however, is not optimal.)

To solve this issue we will use techniques from error-correction codes, specifically the well-known Reed-Solomon codes. The reason we can do this is that share generation is very similar to (non-systemic) message encoding in these codes, and hence their decoding algorithms can be used to reconstruct even in the face of manipulated shares.

The robust reconstruct method for Shamir’s scheme we end up with is as follows, with a straight forward generalisation to the packed scheme. The input is a complete list of length N of received shares, where missing shares are represented by None and manipulated shares by their new value. And if reconstruction goes well then the output is not only the secret, but also the indices of the shares that were manipulated.

def shamir_robust_reconstruct(shares):
  
    # filter missing shares
    points_values = [ (p,v) for p,v in zip(POINTS, shares) if v is not None ]
    
    # decode remaining faulty
    points, values = zip(*points_values)
    polynomial, error_locator = gao_decoding(points, values, R, MAX_MANIPULATED)
    
    # check if recovery was possible
    if polynomial is None:
        # there were more errors than assumed by `MAX_ERRORS`
        raise Exception("Too many errors, cannot reconstruct")
    else:
        # recover secret
        secret = poly_eval(polynomial, 0)
    
        # find roots of error locator polynomial
        error_indices = [ i 
            for i,v in enumerate( poly_eval(error_locator, p) for p in POINTS ) 
            if v == 0 
        ]

        return secret, error_indices

Having the error indices may be useful for instance as a deterrent: since we can identify malicious shareholders we may also be able to e.g. publicly shame them, and hence incentivise correct behaviour in the first place. Formally this is known as covert security, where shareholders are willing to cheat only if they are not caught.

Finally note that reconstruction may however fail, yet it can be shown that this only happens when there indeed isn’t enough information left to correctly identify the result; in other words, our method will never give a false negative. Parameters MAX_MISSING and MAX_MANIPULATED are used to characterise when failure can happen, giving respectively an upper bound on the number of lost and manipulated shares supported. What must hold in general is that the number of “redundancy shares” N - R must satisfy N - R >= MAX_MISSING + 2 * MAX_MANIPULATED, from which we see that we are paying a double price for manipulated shares compared to missing shares.

Outline of decoding algorithm

The specific decoding procedure we use here works by first finding an erroneous polynomial in coefficient representation that matches all received shares, including the manipulated ones. Hence we must first find a way to interpolate not only values but also coefficients from a polynomial given in point-value representation; in other words, we must find a way to convert from point-value representation to coefficient representation. We saw in part two how the backward FFT can do this in specific cases, but to handle missing shares we here instead adapt Lagrange interpolation as used in part one.

Given the erroneous polynomial we then extract a corrected polynomial from it to get our desired result. Surprisingly, this may simply be done by running the extended Euclidean algorithm on polynomials as shown below.

Finally, since both of these two steps are using polynomials as objects of computation, similarly to how one typically uses integers as objects of computation, we must first also give algorithms for polynomial arithmetic such as adding and multiplying.

Computing on Polynomials

We assume we already have various functions base_add, base_sub, base_mul, etc. for computing in the base field; concretely this simply amounts to integer arithmetic modulo a fixed prime in our case.

We then represent polynomials over this base field by their list of coefficients: A(x) = (a0) + (a1 * x) + ... + (aD * x^D) is represented by A = [a0, a1, ..., aD]. Furthermore, we keep as an invariant that aD != 0 and enforce this below through a canonical procedure that removes all trailing zeros.

def canonical(A):
    for i in reversed(range(len(A))):
        if A[i] != 0:
            return A[:i+1]
    return []

However, as an intermediate step we will sometimes first need to expand one of two polynomials to ensure they have the same length. This is done by simply appending zero coefficients to the shorter list.

def expand_to_match(A, B):
    diff = len(A) - len(B)
    if diff > 0: 
        return A, B + [0] * diff
    elif diff < 0:
        diff = abs(diff)
        return A + [0] * diff, B
    else: 
        return A, B

With this we can perform arithmetic on polynomials by simply using the standard definitions. Specifically, to add two polynomials A and B given by coefficient lists [a0, ..., aM] and [b0, ..., bN] we perform component-wise addition of the coefficients ai + bi. For example, adding A(x) = 2x + 3x^2 to B(x) = 1 + 4x^3 we get A(x) + B(x) = (0+1) + (2+0)x + (3+0)x^2 + (0+4)x^3; the first two are represented by [0,2,3] and [1,0,0,4] respectively, and their sum by [1,2,3,4]. Subtraction is similarly done component-wise.

def poly_add(A, B):
    F, G = expand_to_match(A, B)
    return canonical([ base_add(f, g) for f, g in zip(F, G) ])
    
def poly_sub(A, B):
    F, G = expand_to_match(A, B)
    return canonical([ base_sub(f, g) for f, g in zip(F, G) ])

We also do scalar multiplication component-wise, i.e. by scaling every coefficient of a polynomial by an element from the base field. For instance, with A(x) = 1 + 2x + 3x^2 we have 2 * A(x) = 2 + 4x + 6x^2, which as expected is the same as A(x) + A(x).

def poly_scalarmul(A, b):
    return canonical([ base_mul(a, b) for a in A ])

def poly_scalardiv(A, b):
    return canonical([ base_div(a, b) for a in A ])

Multiplication of two polynomials is only slightly more complex, with coefficient cK of the product being defined by cK = sum( aI * bJ for i,aI in enumerate(A) for j,bJ in enumerate(B) if i + j == K ), and by changing the computation slightly we avoid iterating over K.

def poly_mul(A, B):
    C = [0] * (len(A) + len(B) - 1)
    for i in range(len(A)):
        for j in range(len(B)):
            C[i+j] = base_add(C[i+j], base_mul(A[i], B[j]))
    return canonical(C)

We also need to be able to divide a polynomial A by another polynomial B, effectively finding a quotient polynomial Q and a remainder polynomial R such that A == Q * B + R with degree(R) < degree(B). The procedure works like long-division for integers and is explained in details elsewhere.

def poly_divmod(A, B):
    t = base_inverse(lc(B))
    Q = [0] * len(A)
    R = copy(A)
    for i in reversed(range(0, len(A) - len(B) + 1)):
        Q[i] = base_mul(t, R[i + len(B) - 1])
        for j in range(len(B)):
            R[i+j] = base_sub(R[i+j], base_mul(Q[i], B[j]))
    return canonical(Q), canonical(R)

Note that we have used basic algorithms for these operations here but that more efficient versions exist. Some pointers to these are given at the end.

Interpolating Polynomials

We next turn to the task of converting a polynomial given in (implicit) point-value representation to its (explicit) coefficient representation. Several procedures exist for this, including efficient algorithms for specific cases such as the backward FFT seen earlier, and general ones based e.g. on Newton’s method that seem popular in numerical analysis due to its better efficiency and ability to handle new data points. However, for this post we’ll use Lagrange interpolation and see that although it’s perhaps typically see as a procedure for interpolating the values of polynomials, it also works just as well for interpolating their coefficients.

Recall that we are given points x0, x1, ..., xD and values y0, y1, ..., yD implicitly defining a polynomial F. Earlier we then used Lagrange’s method to find value F(x) at a potentially different point x. This works due to the constructive nature of Lagrange’s proof, where a polynomial H is defined as H(X) = y0 * L0(X) + ... + yD * LD(X) for indeterminate X and Lagrange basis polynomials Li, and then shown identical to F. To find F(x) we then simply evaluated H(x), although we precomputed Li(x) as the Lagrange constants ci so that this step simply reduced to a weighted sum y1 * c1 + ... yD * cD.

def lagrange_constants_for_point(points, point):
    constants = []
    for i, xi in enumerate(points):
        numerator = 1
        denominator = 1
        for j, xj in enumerate(points):
            if i == j: continue
            numerator   = base_mul(numerator, base_sub(point, xj))
            denominator = base_mul(denominator, base_sub(xi, xj))
        constant = base_div(numerator, denominator)
        constants.append(constant)
    return constants

Now, when we want the coefficients of F instead of just its value F(x) at x, we see that while H is identical to F it only gives us a semi-explicit representation, made worse by the fact that the Li polynomials are also only given in a semi-explicit representation: Li(X) = (X - x0) * ... * (X - xD) / (xi - x0) * ... * (xi - xD). However, since we developed algorithms for using polynomials as objects in computations, we can simply evaluate these expression with indeterminate X to find the reduced explicit form! See for instance the examples here.

def lagrange_polynomials(points):
    polys = []
    for i, xi in enumerate(points):
        numerator = [1]
        denominator = 1
        for j, xj in enumerate(points):
            if i == j: continue
            numerator   = poly_mul(numerator, [base_sub(0, xj), 1])
            denominator = base_mul(denominator, base_sub(xi, xj))
        poly = poly_scalardiv(numerator, denominator)
        polys.append(poly)
    return polys

Doing this also for H gives us the interpolated polynomial in explicit coefficient representation.

def lagrange_interpolation(points, values):
    ls = lagrange_polynomials(points)
    poly = []
    for i, yi in enumerate(values):
        term = poly_scalarmul(ls[i], yi)
        poly = poly_add(poly, term)
    return poly

While this may not be the most efficient way (see notes later), it is hard to beat its simplicity.

Correcting Errors

In the non-systemic variants of Reed-Solomon codes, a message m represented by a vector [m0, ..., mD] is encoded by interpreting it as a polynomial F(X) = (m0) + (m1 * X) + ... + (mD * X^D) and then evaluating F at a fixed set of points to get the code word. Unlike share generation, no randomness is used in this process since the purpose is only to provide redundancy and not privacy (in fact, in the systemic variants, the message is directly readable from the code word), yet this doesn’t change the fact that we can use decoding procedures to correct errors in shares.

Several such decoding procedures exist, some of which are explained here and there, yet the one we’ll use here is conceptually simple and has a certain beauty to it. Also keep in mind that some of the typical optimizations used in implementations of the alternative approaches get their speed-up by relying on properties of the more common setting over binary extension fields, while we here are interested in the setting over prime fields as we would like to simulate (bounded) integer arithmetic in our application of secret sharing to secure computation – which is straight forward in prime fields but less clear in binary extension fields.

The approach we will use was first described in SKHN’75, yet we’ll follow the algorithm given in Gao’02 (see also Section 17.5 in Shoup’08). It works by first interpolating a potentially faulty polynomial H from all the available shares and then running the extended Euclidean algorithm to either extract the original polynomial G or (rightly) declare it impossible. That the algorithm can be used for this is surprising and is strongly related to rational reconstruction.

Extended Euclidean algorithm on polynomials

Assume that we have two polynomials H and F and we would like to find linear combinations of these in the form of triples (R, T, S) of polynomials such that R == H * T + F * S. This may of course be done in many different ways, but one particular interesting approach is to consider the list of triples (R0, T0, S0), ..., (RM, TM, SM) generated by the extended Euclidean algorithm (EEA).

def poly_eea(F, H):
    R0, R1 = F, H
    S0, S1 = [1], []
    T0, T1 = [], [1]
    
    triples = []
    
    while R1 != []:
        Q, R2 = poly_divmod(R0, R1)
        
        triples.append( (R0, S0, T0) )
        
        R0, S0, T0, R1, S1, T1 = \
            R1, S1, T1, \
            R2, poly_sub(S0, poly_mul(S1, Q)), poly_sub(T0, poly_mul(T1, Q))
            
    return triples

The reason for this is that this list turns out to represent all triples up to a certain size that satisfy the equation, in the sense that every “small” triple (R, T, S) for which R == T * H + S * F is actually just a scaled version of a triple (Ri, Ti, Si) occurring in the list generated by the EEA: for some constant a we have R == a * Ri, T == a * Ti, and S == a * Si. Moreover, given a concrete interpretation of “small” in the form of a degree bound on R and T, we may find the unique (Ri, Ti, Si) that this holds for.

Why this is useful in decoding becomes apparent next.

Euclidean decoding

Say that T is the unknown error locator polynomial, i.e. T(xi) == 0 exactly when share yi has been manipulated. Say also that R = T * G where G is the original polynomial that was used to generate the shares. Clearly, if we actually knew T and R then we could get what we’re after by a simple division R / T – but since we don’t we have to do something else.

Because we’re only after the ratio R / T, we see that knowing Ri and Ti such that R == a * Ri and T == a * Ti actually gives us the same result: R / T == (a * Ri) / (a * Ti) == Ri / Ti, and these we could potentially get from the EEA! The only obstacles are that we need to define polynomials H and F, and we need to be sure that there is a “small” triple with the R and T as defined here that satisfies the linear equation, which in turn means making sure there exists a suitable S. Once done, the output of poly_eea(H, F) will give us the needed Ri and Ti.

Perhaps unsurprisingly, H is the polynomial interpolated using all available values, which may potentially be faulty in case some of them have been manipulated. F = F1 * ... * FN is the product of polynomials Fi(X) = X - xi where X it the indeterminate and x1, ..., xN are the points.

Having defined H and F like this, we can then show that our R and T as defined above are “small” when the number of errors that have occurred are below the bounds discussed earlier. Likewise it can be shown that there is an S such that R == T * H + S * F; this involves showing that R - T * H == S * F, which follows from R == H * T mod F and in turn R == H * T mod Fi for all Fi. See standard textbooks for further details.

With this in place we have our decoding algorithm!

def gao_decoding(points, values, max_degree, max_error_count):

    # interpolate faulty polynomial
    H = lagrange_interpolation(points, values)
    
    # compute f
    F = [1]
    for xi in points:
        Fi = [base_sub(0, xi), 1]
        F = poly_mul(F, Fi)
    
    # run EEA-like algorithm on (F,H) to find EEA triple
    R0, R1 = F, H
    S0, S1 = [1], []
    T0, T1 = [], [1]
    while True:
        Q, R2 = poly_divmod(R0, R1)
        
        if deg(R0) < max_degree + max_error_count:
            G, leftover = poly_divmod(R0, T0)
            if leftover == []:
                decoded_polynomial = G
                error_locator = T0
                return decoded_polynomial, error_locator
            else:
                return None
        
        R0, S0, T0, R1, S1, T1 = \
            R1, S1, T1, \
            R2, poly_sub(S0, poly_mul(S1, Q)), poly_sub(T0, poly_mul(T1, Q))

Note however that it actually does more than promised above: it breaks down gracefully, by returning None instead of a wrong result, in case our assumption on the maximum number of errors turns out to be false. The intuition behind this is that if the assumption is true then T by definition is “small” and hence the properties of the EEA triple kick in to imply that the division is the same as R / T, which by definition of R has a zero remainder. And vice versa, if the remainder was zero then the returned polynomial is in fact less than the assumed number of errors away from H and hence T by definition is “small”. In other words, None is returned if and only if our assumption was false, which is pretty neat. See Gao’02 for further details.

Finally, note that it also gives us the error locations in the form of the roots of T. As mentioned earlier this is very useful from an application point of view, but could also have been obtained by simply comparing the received shares against a re-sharing based on the decoded polynomial.

Efficiency Improvements

The algorithms presented above have time complexity Oh(N^2) but are not the most efficient. Based on the second part we may straight away see how interpolation can be sped up by using the Fast Fourier Transform instead of Lagrange’s method. One downside is that we then need to assume that x1, ..., xN are Fourier points, i.e. with a special structure, and we need to fill in dummy values for the missing shares and hence pay the double price. Newton’s method alternatively avoids this constraint while potentially giving better concrete performance than Lagrange’s.

However, there are also other fast interpolation algorithms without these constraints, as detailed in for instance Modern Computer Algebra or this thesis, which also reduces the asymptotic complexity to Oh(N * log N). This former reference also contains fast Oh(N * log N) methods for arithmetic and the EEA.

Next Steps

The first three posts have been a lot of theory and it’s now time to turn to applications.

Recent Talks on Privacy

2017-08-12T12:00:00+00:00

During winter and spring I was fortunate enough to have a few occasions to talk about some of the work done at Snips on applying privacy-enhancing technologies in a start-up building privacy-aware machine learning systems for mobile devices.

These were mainly centered around the Secure Distributed Aggregator (SDA) for learning from user data distributed on mobile devices in a privacy-preserving manner, i.e. without learning any individual data only the final aggregation, but there was also room for discussion around privacy from a broader perspective, including how it has played into decisions made by the company.

What Privacy Has Meant For Snips

Given at the workshop on Privacy in Statistical Analysis (PSA’17), this invited talk aimed at giving an industrial perspective on privacy, including how it has played a role at Snips from its beginning. To this end the talk was divided into four areas where privacy had been involved, three of which briefly discussed below.

Accessing Data

Access to personal data was essential for the success of its first mobile app, so to ensure that this was given the company decided to earn users’ trust by focusing on privacy. To this end, it was decided to keep all data locally on users’ devices and do the processing there instead of on company servers.

These on-device privacy solutions have the extra benefit of being easy to explain, and may have accounted for the high percentage of users willing to give the mobile app access to sensitive information such as emails, chats, location tracking, and even screen content.

Protecting the Company

By the principle of Data is a Toxic Asset, not storing any user data means less to worry about if company servers are ever compromised. However, some services hosted by third parties, including the company, may build up a set of metadata that in itself could reveal something about the users and e.g. damage reputation. One such example is point-of-interest services where a user reveals his location in order to obtain e.g. a list of nearby restaurants.

Powerful cryptographic techniques, such as the Tor network and private information retrieval, may make it possible for companies to make private versions of these services, yet also impose a significant overhead. Instead, by assuming that the company is generally honest, a more efficient compromise can be reached by shifting the focus from deliberate malicious behaviour to easier problems such as accidental storing or logging.

One concrete approach taken for this was to strip sensitive information at the server entry point so that it was never exposed to subcomponents.

Learning from Data

While it is great for user privacy to only have locally stored data sets, it is also relevant for both users and the company to get insights from these, for instance as a way of making cross-user recommendations or getting model feedback.

The key to this contradiction is that often there is no need to share individual data as long as a global view can be computed. A brief comparison between techniques was made, including:

sensor networks: high performance but requires a lot of coordination between users
differential privacy: high performance and strong privacy guarantees, but a lot of data is needed for the signal to overcome the noise
homomorphic encryption: flexible and explainable, but still not very efficient and has the issue of who’s holding the decryption keys
multi-party computation: flexible and decent performance, but requires several players to distribute trust to

and concluding with the specialised multi-party computation protocol underlying SDA and further detailed below.

Private Data Aggregation on a Budget

Given at the workshop on Theory and Practice of Multi-Party Computation (TPMPC’17), this talk was technical in nature in that it presented the SDA protocol, but also aimed at illustrating the problem that a company may experience when wanting to solve a privacy problem by employing a secure multi-party computation (MPC) protocol: namely, that it may find itself to be the only party that is naturally motivated to invest resources into it.

Moreover, to remain open to as many potential other parties as possible, it is interesting to minimise the requirements on these in terms of computation, communication, and coordination. By doing so parties running e.g. mobile devices or web browsers may be considered. These concerns however, are not always considered in typical MPC protocols.

Community-based MPC

To this end SDA presents a simple but concrete proposal in a community-based model where members from a community are used as parties.

These parties only have to make a minimum of investment as most of the computation is out-sourced to the company and very little coordination is required between the selected members. Furthermore, a mechanism for distributing work is also presented that allows for lowering the individual load by involving more members.

The result is a practical protocol for aggregating high-dimensional vectors that is suitable for a single company with a community of sporadic members.

Applications

Concrete and realistic applications was also considered, including analytics, surveys, and place discovery based on users’ location history.

As illustrated, the load on community members in these applications were low enough to be reasonably run on mobile phones and even web browsers.

This work was also presented at Private Multi-Party Machine Learning (PMPML’16) in the form of a poster.

Secret Sharing, Part 2

2017-06-24T12:00:00+00:00

TL;DR: efficient secret sharing requires fast polynomial evaluation and interpolation; here we go through what it takes to use the well-known Fast Fourier Transform for this.

In the first part we looked at Shamir’s scheme, as well as its packed variant where several secrets are shared together. We saw that polynomials lie at the core of both schemes, and that implementation is basically a question of (partially) converting back and forth between two different representations of these. We also gave typical algorithms for doing this.

For this part we will look at somewhat more complex algorithms in an attempt to speed up the computations needed for generating shares. Specifically, we will implement and apply the Fast Fourier Transform, detailing all the essential steps. Performance measurements performed with our Rust implementation shows that this yields orders of magnitude of efficiency improvements when either the number of shares or the number of secrets is high.

There is also an associated Python notebook to better see how the code samples fit together in the bigger picture.

Polynomials

If we look back at Shamir’s scheme we see that it’s all about polynomials: a random polynomial embedding the secret is sampled and the shares are taken as its values at a certain set of points.

def shamir_share(secret):
    polynomial = sample_shamir_polynomial(secret)
    shares = [ evaluate_at_point(polynomial, p) for p in SHARE_POINTS ]
    return shares

The same goes for the packed variant, where several secrets are embedded in the sampled polynomial.

def packed_share(secrets):
    polynomial = sample_packed_polynomial(secrets)
    shares = [ interpolate_at_point(polynomial, p) for p in SHARE_POINTS ]
    return shares

Notice however that they differ slightly in the second steps where the shares are computed: Shamir’s scheme uses evaluate_at_point while the packed uses interpolate_at_point. The reason is that the sampled polynomial in the former case is in coefficient representation while in the latter it is in point-value representation.

Specifically, we often represent a polynomial f of degree D == L-1 by a list of L coefficients a0, ..., aD such that f(x) = (a0) + (a1 * x) + (a2 * x^2) + ... + (aD * x^D). This representation is convenient for many things, including efficiently evaluating the polynomial at a given point using e.g. Horner’s method.

However, every such polynomial may also be represented by a set of L point-value pairs (p1, v1), ..., (pL, vL) where vi == f(pi) and all the pi are distinct. Evaluating the polynomial at a given point is still possible, yet now requires a more involved interpolation procedure that may be less efficient.

But the point-value representation also has several advantages, most importantly that every element intuitively contributes with the same amount of information, unlike the coefficient representation where, in the case of secret sharing, a few elements are the actual secrets; this property gives us the privacy guarantee we are after. Moreover, a degree L-1 polynomial may also be represented by more than L pairs; in this case there is some redundancy in the representation that we may for instance take advantage of in secret sharing (to reconstruct even if some shares are lost) and in coding theory (to decode correctly even if some errors occur during transmission).

The reason this works is that the result of interpolation on a point-value representation with L pairs is technically speaking defined with respect to the least degree polynomial g such that g(pi) == vi for all pairs in the set, which is unique and has at most degree L-1. This means that if two point-value representations are generated using the same polynomial g then interpolation on these will yield identical results, even when the two sets are of different sizes or use different points, since the least degree polynomial is the same.

It is also why we can use the two representations somewhat interchangeably: if a point-value representation with L pairs where generated by a degree L-1 polynomial f, then the unique least degree polynomial agreeing with these must be f. And since, for a fixed set of points, the set of coefficient lists of length L and the set of value lists of length L has the same cardinality (in our case Q^L) we must have a bijection between them.

Fast Fourier Transform

With the two presentation of polynomials in mind we move on to how the Fast Fourier Transform (FFT) over finite fields – also known as the Number Theoretic Transform (NTT) – can be used to perform efficient conversion between them. And for me the best way of understanding this is through an example that can later be generalised into an algorithm.

Walk-through example

Recall that all our computations happen in a prime field determined by a fixed prime Q, i.e. using the numbers 0, 1, ..., Q-1. In this example we will use Q = 433, who’s order Q-1 is divisible by 4: Q-1 == 432 == 4 * k with k = 108.

Assume then that we have a polynomial A(x) = 1 + 2x + 3x^2 + 4x^3 over this field of with L == 4 coefficients and degree L-1 == 3.

A_coeffs = [ 1, 2, 3, 4 ]

Our goal is to turn this list of coefficients into a list of values [ A(w0), A(w1), A(w2), A(w3) ] of equal length, for points w = [w0, w1, w2, w3].

The standard way of evaluating polynomials is of course one way of during this, which using Horner’s rule can be done in a total of Oh(L * L) operations.

A = lambda x: horner_evaluate(A_coeffs, x)

assert([ A(wi) for wi in w ] 
    == [ 10, 73, 431, 356 ])

But as we will see, the FFT allows us to do so more efficiently when the length is sufficiently large and the points are chosen with a certain structure; asymptotically we can compute the values in Oh(L * log L) operations.

The first insight we need is that there is an alternative evaluation strategy that breaks A into two smaller polynomials. In particular, if we define polynomials B(y) = 1 + 3y and C(y) = 2 + 4y by taking every other coefficient from A then we have A(x) == B(x * x) + x * C(x * x), which is straight-forward to verify by simply writing out the right-hand side.

This means that if we know values of B(y) and C(y) at the squares v of the w points, then we can use these to compute the values of A(x) at the w points using table look-ups: A_values[i] = B_values[i] + w[i] * C_values[i].

# split A into B and C
B_coeffs = A_coeffs[0::2] # == [ 1,    3,   ]
C_coeffs = A_coeffs[1::2] # == [    2,    4 ]

# square the w points
v = [ wi * wi % Q for wi in w ]

# somehow compute the values of B and C at the v points
# ...
assert( B_values == [ B(vi) for vi in v ] )
assert( C_values == [ C(vi) for vi in v ] )

# combine results into values of A at the w points
A_values = [ ( B_values[i] + w[i] * C_values[i] ) % Q for i,_ in enumerate(w) ]

assert( A_values == [ A(wi) for wi in w ] )

So far we haven’t saved much, but the second insight fixes that: by picking the points w to be the elements of a subgroup of order 4, the v points used for B and C will form a subgroup of order 2 due to the squaring; hence, we will have v[0] == v[2] and v[1] == v[3] and so only need the first halves of B_values and C_values – as such we have cut the subproblems in half!

Such subgroups are typically characterized by a generator, i.e. an element of the field that when raised to powers will take on exactly the values of the subgroup elements. Historically such generators are denoted by the omega symbol so let’s follow that convention here as well.

# generator of subgroup of order 4
omega4 = 179

w = [ pow(omega4, e, Q) for e in range(4) ]
assert( w == [1, 179, 432, 254] )

We shall return to how to find such generator below, but note that once we know one of order 4 then it’s easy to find one of order 2: we simply square.

# generator of subgroup of order 2
omega2 = omega4 * omega4 % Q

v = [ pow(omega2, e, Q) for e in range(2) ]
assert( v == [1, 432] )

As a quick test we may also check that the orders are indeed as claimed. Specifically, if we keep raising omega4 to higher powers then we except to keep visiting the same four numbers, and likewise we expect to keep visiting the same two numbers for omega2.

assert( [ pow(omega4, e, Q) for e in range(8) ] == [1, 179, 432, 254, 1, 179, 432, 254] )
assert( [ pow(omega2, e, Q) for e in range(8) ] == [1, 432,   1, 432, 1, 432,   1, 432] )

Using generators we also see that there is no need to explicitly calculate the lists w and v anymore as they are now implicitly defined by the generator. So, with these change we come back to our mission of computing the values of A at the points determined by the powers of omega4, which may then be done via A_values[i] = B_values[i % 2] + pow(omega4, i, Q) * C_values[i % 2].

The third and final insight we need is that we can of course continue this process of diving the polynomial in half: to compute e.g. B_values we break B into two polynomials D and E and then follow the same procedure; in this case D and E will be simple constants but it works in the general case as well. The only requirement is that the length L is a power of 2 and that we can find a generator omegaL of a subgroup of this size.

Algorithm for powers of 2

Putting the above into an algorithm we get the following, where omega is assumed to be a generator of order len(A_coeffs). Note that some typical optimizations are omitted for clarity (but see e.g. the Python notebook).

def fft2_forward(A_coeffs, omega):
    if len(A_coeffs) == 1:
        return A_coeffs

    # split A into B and C such that A(x) = B(x^2) + x * C(x^2)
    B_coeffs = A_coeffs[0::2]
    C_coeffs = A_coeffs[1::2]
    
    # apply recursively
    omega_squared = pow(omega, 2, Q)
    B_values = fft2_forward(B_coeffs, omega_squared)
    C_values = fft2_forward(C_coeffs, omega_squared)
        
    # combine subresults
    A_values = [0] * len(A_coeffs)
    L_half = len(A_coeffs) // 2
    for i in range(L_half):
        
        j = i
        x = pow(omega, j, Q)
        A_values[j] = (B_values[i] + x * C_values[i]) % Q
        
        j = i + L_half
        x = pow(omega, j, Q)
        A_values[j] = (B_values[i] + x * C_values[i]) % Q
      
    return A_values

With this procedure we may convert a polynomial in coefficient form to its point-value form, i.e. evaluate the polynomial, in Oh(L * log L) operations.

The freedom we gave up to achieve this is that the number of coefficients L must now be a power of 2; but of course, some of the them may be zero so we are still free to choose the degree of the polynomial as we wish up to L-1. Also, we are no longer free to choose any set of evaluation points but have to choose a set with a certain subgroup structure.

Finally, it turns out that we can also use the above procedure to go in the opposite direction from point-value form to coefficient form, i.e. interpolate the least degree polynomial. We see that this is simply done by essentially treating the values as coefficients followed by a scaling, but won’t go into the details here.

def fft2_backward(A_values, omega):
    L_inv = inverse(len(A_values))
    A_coeffs = [ (a * L_inv) % Q for a in fft2_forward(A, inverse(omega)) ]
    return A_coeffs

Here however we may feel a stronger impact of the constraints implied by the FFT: while we can use zero coefficients to “patch up” the coefficient representation of a lower degree polynomial to make its length match our target length L but keeping its identity, we cannot simply add e.g. zero pairs to a point-value representation as it may change the implicit least degree polynomial; as we will see in the next blog post this has implications for our application to secret sharing if we also want to use the FFT for reconstruction.

Algorithm for powers of 3

Unsurprisingly there is nothing in the principles behind the FFT that means it will only work for powers of 2, and other bases can indeed be used as well. Luckily perhaps, since this plays a big part in our application to secret sharing as we will see below.

To adapt the FFT algorithm to powers of 3 we instead assume that the list of coefficients of A has such a length, and split it into three polynomials B, C, and D such that A(x) = B(x^3) + x * C(x^3) + x^2 * D(x^3), and we use the cube of omega in the recursive calls instead of the square. Here omega is again assumed be a generator of order len(A_coeffs), but this time a power of 3.

def fft3_forward(A_coeffs, omega):
    if len(A_coeffs) == 1:
        return A_coeffs

    # split A into B, C, and D such that A(x) = B(x^3) + x * C(x^3) + x^2 * D(x^3)
    B_coeffs = A_coeffs[0::3]
    B_coeffs = A_coeffs[1::3]
    B_coeffs = A_coeffs[2::3]
    
    # apply recursively
    omega_cubed = pow(omega, 3, Q)
    B_values = fft3_forward(B_coeffs, omega_cubed)
    C_values = fft3_forward(B_coeffs, omega_cubed)
    D_values = fft3_forward(B_coeffs, omega_cubed)
        
    # combine subresults
    A_values = [0] * len(A_coeffs)
    L_third = len(A_coeffs) // 3
    for i in range(L_third):
        
        j  = i
        x  = pow(omega, j, Q)
        xx = pow(x, 2, Q)
        A_values[j] = (B_values[i] + x * C_values[i] + xx * D_values[i]) % Q
        
        j  = i + L_third
        x  = pow(omega, j, Q)
        xx = pow(x, 2, Q)
        A_values[j] = (B_values[i] + x * C_values[i] + xx * D_values[i]) % Q
        
        j  = i + L_third + L_third
        x  = pow(omega, j, Q)
        xx = pow(x, 2, Q)
        A_values[j] = (B_values[i] + x * C_values[i] + xx * D_values[i]) % Q

    return A_values

And again we may go in the opposite direction and perform interpolation by simply treating the values as coefficients and performing a scaling.

Optimizations

For easy of presentation we have omitted some typical optimizations here, perhaps most typically the fact that for powers of 2 we have the property that pow(omega, i, Q) == -pow(omega, i + L/2, Q), meaning we can cut the number of exponentiations in fft2 in half compared to what we did above.

More interestingly, the FFTs can be also run in-place and hence reusing the list in which the input is provided. This saves memory allocations and has a significant impact on performance. Likewise, we may gain improvements by switching to another number representation such as Montgomery form. Both of these approaches are described in further detail elsewhere.

We can now return to applying the FFT to the secret sharing schemes. As mentioned earlier, using this instead of the more traditional approaches makes most sense when the vectors we are dealing with are above a certain size, such as if we are generating many shares or sharing many secrets together.

Shamir’s scheme

In this scheme we can easily sample our polynomial directly in coefficient representation, and hence the FFT is only relevant in the second step where we generate the shares. Concretely, we can directly sample the polynomial with the desired number of coefficients to match our privacy threshold, and add extra zeros to get a number of coefficients matching the number of shares we want; below the former list is denoted as small and the latter as large. We then apply the forward FFT to turn this into a list of values that we take as the shares.

def shamir_share(secret):
    small_coeffs = [secret] + [random.randrange(Q) for _ in range(T)]
    large_coeffs = small_coeffs + [0] * (ORDER_LARGE - len(small_coeffs))
    large_values = fft3_forward(large_coeffs, OMEGA_LARGE)
    shares = large_values
    return shares

Besides the privacy threshold T and the number of shares N, the parameters needed for the scheme is hence a prime Q and a generator OMEGA_LARGE of order ORDER_LARGE == N + 1.

Note that we’ve used the FFT for powers of 3 here to be consistent with the next scheme; the FFT for powers of 2 would of course also have worked.

Packed scheme

Recall that for this scheme it is less obvious how we can sample our polynomial directly in coefficient representation, and hence we instead do so in point-value representation. Specifically, we first use the backward FFT for powers of 2 to turn such a polynomial into coefficient representation, and then as above use the forward FFT for powers of 3 on this to generate the shares.

We are hence dealing with two sets of points: those used during sampling, and those used during share generation – and these cannot overlap! If they did the privacy guarantee would no longer be satisfied and some of the shares might literally equal some of the secrets.

Preventing this from happening is the reason we use the two different bases 2 and 3: by picking co-prime bases, i.e. gcd(2, 3) == 1, the subgroups will only have the point 1 in common (as the two generators raised to the zeroth power). As such we are safe if we simply make sure to exclude the value at point 1 from being used. Recalling our walk-through example, this is the reason we used prime Q == 433 since its order Q-1 == 432 == 4 * 9 * k is divided by both a power of 2 and a power of 3.

So to do sharing we first sample the values of the polynomial, fixing the value at point 1 to be a constant (in this case zero). Using the backward FFT we then turn this into a small list of coefficients, which we then as in Shamir’s scheme extend with zero coefficients to get a large list of coefficients suitable for running through the forward FFT. Finally, since the first value obtained from this corresponds to point 1, and hence is the same as the constant used before, we remove it before returning the values as shares.

def packed_share(secrets):
    small_values = [0] + secrets + [random.randrange(Q) for _ in range(T)]
    small_coeffs = fft2_backward(small_values, OMEGA_SMALL)
    large_coeffs = small_coeffs + [0] * (ORDER_LARGE - ORDER_SMALL)
    large_values = fft3_forward(large_coeffs, OMEGA_LARGE)
    shares = large_values[1:]
    return shares

For this scheme, besides T, N, and the number K of secrets packed together, the parameters for this scheme is hence the prime Q and the two generators OMEGA_SMALL and OMEGA_LARGE of order respectively ORDER_SMALL == T + K + 1 and ORDER_LARGE == N + 1.

We will talk more about how to do efficient reconstruction in the next blog post, but note that if all the shares are known then the above sharing procedure can efficiently be run backwards by simply running the two FFTs in their opposite direction.

def packed_reconstruct(shares):
    large_values = [0] + shares
    large_coeffs = fft3_backward(large_values, OMEGA_LARGE)
    small_coeffs = large_coeffs[:ORDER_SMALL]
    small_values = fft2_forward(small_coeffs, OMEGA_SMALL)
    secrets = small_values[1:K+1]
    return secrets

However this only works if all shares are known and correct: any loss or tampering will get in the way of using the FFT for reconstruction, unless we add an additional ingredient. Fixing this is the topic of the next blog post.

Performance evaluation

To test the performance impact of using the FFT for share generation in Shamir’s scheme, we let the number of shares N take on values 2, 8, 26, 80 and 242, and for each of them compare against the typical approach of using Horner’s rule. For the former we have an asymptotic complexity of Oh(N * log N) while for the latter we have Oh(N * T), and as such it is also interesting to vary T; we do so with T = N/2 and T = N/4, representing respectively a medium and low privacy threshold.

All measures are in nanoseconds (1/1,000,000 milliseconds) and performed with our Rust implementation.

plt.figure(figsize=(20,10))

shares    = [   2,   8,   26,    80 ] #,    242 ]

n2_fft    = [ 214, 402, 1012,  2944 ] #,  10525 ]
n2_horner = [  51, 289, 2365, 22278 ] #, 203630 ]

n4_fft    = [ 227, 409, 1038,  3105 ] #,  10470 ]
n4_horner = [  54, 180, 1380, 11631 ] #, 104388 ]

plt.plot(shares, n2_fft,    'ro--', color='b', label='T = N/2: FFT')
plt.plot(shares, n2_horner, 'rs--', color='r', label='T = N/2: Horner')
plt.plot(shares, n4_fft,    'ro--', color='c', label='T = N/4: FFT')
plt.plot(shares, n4_horner, 'rs--', color='y', label='T = N/4: Horner')

plt.legend(loc=2)
plt.show()

Note that the numbers for N = 242 are omitted in the graph to avoid hiding the results for the smaller values.

For the packed scheme we keep T = N/4 and K = N/2 fixed (meaning R = 3N/4) and let N vary as above. We then compare three different approaches for generating shares, all starting out with sampling a polynomial in point-value representation:

FFT + FFT: Backward FFT to convert into coefficient representation, followed by forward FFT for evaluation
FFT + Horner: Backward FFT to convert into coefficient representation, followed by Horner’s rule for evaluation
Lagrange: Use precomputed Lagrange constants for share points to directly obtain shares

where the third option requires additional storage for the precomputed constants (computing them on the fly increases the running time significantly but can of course be amortized away if processing a large number of batches).

plt.figure(figsize=(20,10))

shares =       [    8,   26,    80,    242 ]

fft_fft      = [  840, 1998,  5288,  15102 ]
fft_horner   = [  898, 3612, 37641, 207087 ]
lagrange_pre = [  246, 1367, 16510, 102317 ]

plt.plot(shares, fft_fft,      'ro--', color='b', label='FFT + FFT')
plt.plot(shares, fft_horner,   'ro--', color='r', label='FFT + Horner')
plt.plot(shares, lagrange_pre, 'rs--', color='y', label='Lagrange (precomp.)')

plt.legend(loc=2)
plt.show()

We note that the Lagrange approach remains superior up to the setting with 26 shares, after which it’s interesting to use the two step FFT.

From this small amount of empirical data the FFT seems like the obvious choice as soon as the number of shares is sufficiently high. Question of course, is in which applications this is the case. We will explore this further in a future blog post (or see e.g. our paper).

Parameter Generation

Since there are no security implications in re-using the same fixed set of parameters (i.e. Q, OMEGA_SMALL, and OMEGA_LARGE) across applications, parameter generation is perhaps less important compared to for instance key generation in encryption schemes. Nonetheless, one of the benefits of secret sharing schemes is their ability to avoid big expansion factors by using parameters tailored to the use case; concretely, to pick a field of just the right size. As such we shall now fill in this final piece of the puzzle and see how a set of parameters fitting with the FFTs used in the packed scheme can be generated.

Our main abstraction is the generate_parameters function which takes a desired minimum field size in bits, as well as the number of secrets k we which to packed together, the privacy threshold t we want, and the number n of shares to generate. Accounting for the value at point 1 that we are throwing away (see earlier), to be suitable for the two FFTs, we must then have that k + t + 1 is a power of 2 and that n + 1 is a power of 3.

To next make sure that our field has two subgroups with those number of elements, we simply need to find a field whose order is divided by both numbers. Specifically, since we’re considering prime fields, we need to find a prime q such that its order q-1 is divided by both sizes. Finally, we also need a generator g of the field, which can be turned into generators omega_small and omega_large of the subgroups.

def generate_parameters(min_bitsize, k, t, n):
    order_small = k + t + 1
    order_large = n + 1
    
    order_divisor = order_small * order_large
    q, g = find_prime_field(min_bitsize, order_divisor)
    
    order = q - 1
    omega_small = pow(g, order // order_small, q)
    omega_large = pow(g, order // order_large, q)
    
    return q, omega_small, omega_large

Finding our q and g is done by find_prime_field, which works by first finding a prime of the right size and with the right order. To then also find the generator we need a piece of auxiliary information, namely the prime factors in the order.

def find_prime_field(min_bitsize, order_divisor):
    q, order_prime_factors = find_prime(min_bitsize, order_divisor)
    g = find_generator(q, order_prime_factors)
    return q, g

The reason for this is that we can use the prime factors of the order to efficiently test whether an arbitrary candidate element in the field is in fact a generator with that order. This follows from Lagrange’s theorem as detailed in standard textbooks on the matter.

def find_generator(q, order_prime_factors):
    order = q - 1
    for candidate in range(2, q):
        for factor in order_prime_factors:
            exponent = order // factor
            if pow(candidate, exponent, q) == 1:
                break
        else:
            return candidate

This leaves us with only a few remaining question regarding finding prime numbers as explained next.

Finding primes

To find a prime q with the desired structure (i.e. of a certain minimum size and whose order q-1 has a given divisor) we may either do rejection sampling of primes until we hit one that satisfies our need, or we may construct it from smaller parts so that it by design fits with what we need. The latter appears more efficient so that is what we will do here.

Specifically, given min_bitsize and order_divisor we will do rejection sampling over two values k1 and k2 until q = k1 * k2 * order_divisor + 1 is a probable prime. The k1 is used to ensure that the minimum size is met, and k2 is used to give us a bit of wiggle room – it can in principle be omitted, but empirical tests show that it doesn’t have to be very large it give an efficiency boost, at the expense of potentially overshooting the desired field size by a few bits. Finally, since we also need to know the prime factorization of q - 1, and since this in general is believed to be an inherently slow process, we by construction ensure that k1 is a prime so that we only have to factor k2 and order_divisor, which we assume to be somewhat small and hence doable.

def find_prime(min_bitsize, order_divisor):
    while True:
        k1 = sample_prime(min_bitsize)
        for k2 in range(128):
            q = k1 * k2 * order_divisor + 1
            if is_prime(q):
                order_prime_factors  = [k1]
                order_prime_factors += prime_factor(k2)
                order_prime_factors += prime_factor(order_divisor)
                return q, order_prime_factors

Sampling primes are done using a standard randomized primality test.

def sample_prime(bitsize):
    lower = 1 << (bitsize-1)
    upper = 1 << (bitsize)
    while True:
        candidate = random.randrange(lower, upper)
        if is_prime(candidate):
            return candidate

And factoring a number is done by simply trying a fixed set of all small primes in sequence; this will of course not work if the input is too large, but that is not likely to happen in real-world applications.

def prime_factor(x):
    factors = []
    for prime in SMALL_PRIMES:
        if prime > x: break
        if x % prime == 0:
            factors.append(prime)
            x = remove_factor(x, prime)
    assert(x == 1)
    return factors

Putting these pieces together we end up with an efficient procedure for generating parameters for use with FFTs: finding large fields of size e.g. 128bits is a matter of milliseconds.

Next Steps

While we have seen that the Fast Fourier Transform can be used to greatly speed up the sharing process, it has a serious limitation when it comes to speeding up the reconstruction process: in its current form it requires all shares to be present and untampered with. As such, for some applications we may be forced to resort to the more traditional and slower approaches of Newton or Laplace interpolation.

In the next blog post we will look at a technique for also using the Fast Fourier Transform for reconstruction, using techniques from error correction codes to account for missing or faulty shares, yet get similar speedup benefits to what we achieved here.

Secret Sharing, Part 1

2017-06-04T12:00:00+00:00

TL;DR: first part in a series where we look at secret sharing schemes, including the lesser known packed variant of Shamir’s scheme, and give full and efficient implementations; here we start with the textbook approaches, with follow-up posts focusing on improvements from more advanced techniques for sharing and reconstruction.

Secret sharing is an old well-known cryptographic primitive, with existing real-world applications in e.g. Bitcoin signatures and password management. But perhaps more interestingly, secret sharing also has strong links to secure computation and may for instance be used for private machine learning.

The essence of the primitive is that a dealer wants to split a secret into several shares given to shareholders, in such a way that each individual shareholder learns nothing about the secret, yet if sufficiently many re-combine their shares then the secret can be reconstructed. Intuitively, the question of trust changes from being about the integrity of a single individual to the non-collaboration of several parties: it becomes distributed.

Secret sharing schemes are also interesting from a performance point of view, as they typically rely on a bare minimum of cryptographic assumptions. In particular, by not having to make any assumptions about the hardness of certain problems such as factoring integers, computing discrete logarithms, or finding short vectors, secret sharing schemes can provide a computational advantage compared to other cryptographic tools such as homomorphic encryption.

In this post we’ll look at a few concrete secret sharing schemes, as well as hints on how to implement them efficiently (with a later post going into more detail). We won’t focus too much on applications but simply use private aggregation of large vectors as a running example – see e.g. our paper for more use cases.

There is a Python notebook containing the code samples, yet for better performance our open source Rust library is recommended.

Parts of this blog post are derived from work done at Snips and originally appearing in another blog post. That work also included parts of the Rust implementation.

Let’s first assume that we have fixed a finite field to which all secrets and shares belong, and in which all computation take place; this could for instance be the integers modulo a prime number, i.e. { 0, 1, ..., Q-1 } for a prime Q.

An easy way to split a secret x from this field into say three shares x1, x2, x3, is to simply pick x1 and x2 at random and let x3 = x - x1 - x2. As argued below, this hides the secret as long as no one knows more than two shares, yet if all three shares are known then x can be reconstructed by simply computing x1 + x2 + x3. More generally, this scheme is known as additive sharing and works for any N number of shares by picking T = N - 1 random values.

def additive_share(secret):
    shares  = [ random.randrange(Q) for _ in range(N-1) ]
    shares += [ (secret - sum(shares)) % Q ]
    return shares

def additive_reconstruct(shares):
    return sum(shares) % Q

That the secret remains hidden as long as at most T = N - 1 shareholders collaborate follows from the marginal distribution of the view of up to T shareholders being independent of the secret. More intuitively, given at most T shares, any guess one may make at what the secret could be, can be explained by the remaining unseen share, and is hence an equally valid guess.

def explain(seen_shares, guess):
    # compute the unseen share that justifies the seen shares and the guess
    simulated_unseen_share = (guess - sum(seen_shares)) % Q
    # and the would-be sharing by combining seen and unseen shares
    simulated_shares = seen_shares + [simulated_unseen_share]
    if additive_reconstruct(simulated_shares) == guess:
        # found an explanation
        return simulated_unseen_share

seen_shares = shares[:N-1]

for guess in range(Q):
    explanation = explain(seen_shares, guess)
    if explanation is not None: 
        print("guess %d can be explained by %d" % (guess, explanation))

guess 0 can be explained by 28
guess 1 can be explained by 29
guess 2 can be explained by 30
guess 3 can be explained by 31
guess 4 can be explained by 32
guess 5 can be explained by 33
...

And since all we need for this argument to go through is the ability to sample random field elements, with no additional constraints on the size of the field due to e.g. hardness assumptions, this scheme is highly efficient both in terms of time and space.

Homomorphic addition

While it is also about as simple as it gets, notice that the scheme already has a homomorphic property that allows for certain degrees of secure computation: we can add secrets together, so if e.g. x1, x2, x3 is a sharing of x and y1, y2, y3 is a sharing of y, then x1+y1, x2+y2, x3+y3 is a sharing of x + y, which can be computed individually by the three shareholders simply by adding the shares they already have (respectively x1 and y1, x2 and y2, and x3 and y3). Then, once added, these new shares can be used reconstruct the result of the addition but not the addends.

def additive_add(x, y):
    return [ (xi + yi) % Q for xi, yi in zip(x, y) ]

More generally, we can ask the shareholders to compute linear functions of secret inputs without them seeing anything but the shares, and without learning anything besides the final output of the function.

Comparing schemes

While the above scheme is particularly simple, below are two examples of slightly more advanced schemes. One way to compare these is through the following four parameters:

N: the number of shares that each secret is split into
R: the minimum number of shares needed to reconstruct the secret
T: the maximum number of shares that may be seen without learning nothing about the secret, also known as the privacy threshold
K: the number of secrets shared together

where, logically, we must have R <= N since otherwise reconstruction is never possible, and we must have T < R since otherwise privacy makes little sense.

For the additive scheme we have R = N, K = 1, and T = R - K, but below we will get rid of the first two of these constraints so that in the end we are free to choose the parameters any way we like as long as T + K = R <= N.

Shamir’s Scheme

The additive scheme lacks some robustness by the constraint that R = N, meaning that if one of the shareholders for some reason becomes unavailable or losses his share then reconstruction is no longer possible. By moving to a different scheme we can remove this constraint and let R (and hence also T) be free to choose for any particular application.

In Shamir’s scheme, instead of picking random field elements that sum up to the secret x as we did above, to share x we sample a random polynomial f with the condition that f(0) = x and evaluate this polynomial at N non-zero points to obtain the shares as f(1), f(2), …, f(N).

def shamir_share(secret):
    polynomial = sample_shamir_polynomial(secret)
    shares = [ evaluate_at_point(polynomial, p) for p in SHARE_POINTS ]
    return shares

And by varying the degree of f we can choose how many shares are needed before reconstruction is possible, thereby removing the R = N constraint. More specifically, if the degree of f is T then we know from interpolation that it is uniquely identified by either its T+1 coefficients or by its value at T+1 points, so that R = T+1 shares allow us to reliably reconstruct.

def shamir_reconstruct(shares):
    polynomial = [ (p, v) for p, v in zip(SHARE_POINTS, shares) if v is not None ]
    secret = interpolate_at_point(polynomial, 0)
    return secret

And at the same time, given at most T shares, the secret is again guaranteed to be hidden since we also here can find an explanation for any guess: a guess is the value of f at point zero, so together with the T known shares, interpolation allows us to find a polynomial with the right degree that matches all values.

Before discussing how these operations can be done efficiently, let’s first see the properties this scheme has in terms of secure computation.

Homomorphic addition and multiplication

Since it holds for polynomials in general that f(i) + g(i) = (f + g)(i), we also here have an additive homomorphic property that allows us to compute linear functions of secrets by simply adding the individual shares.

def shamir_add(x, y):
    return [ (xi + yi) % Q for xi, yi in zip(x, y) ]

And because it also holds that f(i) * g(i) = (f * g)(i), we in fact now have an additional multiplicative property that allows us to compute products in the same fashion.

def shamir_mul(x, y):
    return [ (xi * yi) % Q for xi, yi in zip(x, y) ]

But while this is in principle enough to perform any computation without seeing the inputs (since addition and multiplication can be used to express any boolean circuit), it also comes with a caveat: unlike addition, every multiplication doubles the degree of the polynomial, so we need 2T+1 shares to reconstruct a product instead of T+1.

As a result, when used in secure computation, additional steps must be taken to reduce the degree after even a small number of multiplications, which typically involve some level of interaction between the shareholders. In this light, when compared to homomorphic encryption, secret sharing in some respect replaces heavy computation with interaction.

The missing pieces

Above we ignored the questions of how to efficiently sample, evaluate, and interpolate polynomials. The first one is easy. We want a random T degree polynomial with the constraint that f(0) = x, and we may obtain that by simply letting the zero-degree coefficient be x and picking the remaining T coefficients at random: f(X) = (x) + (r1 * X^1) + (r2 * X^2) + ... + (rT * X^T) where x is the secret, X the indeterminate, and r1, …, rT the random coefficients.

def sample_shamir_polynomial(zero_value):
    coefs = [zero_value] + [ random.randrange(Q) for _ in range(T) ]
    return coefs

This gives us the polynomial in coefficient representation, which means we can perform the second task of evaluating the polynomial at N points somewhat efficiently using e.g. Horner’s rule.

def evaluate_at_point(coefs, point):
    result = 0
    for coef in reversed(coefs):
        result = (coef + point * result) % Q
    return result

The interpolation step needed in reconstruction is slightly trickier. Here the polynomial is instead given in a point-value representation consisting of T+1 pairs (pi, vi) that is less obviously suitable for computing f(0).

However, using Lagrange interpolation we can express the value of a polynomial at any point by a weighted sum of a set of constants and its value at T+1 other points.

def interpolate_at_point(points_values, point):
    points, values = zip(*points_values)
    constants = lagrange_constants_for_point(points, point)
    return sum( ci * vi for ci, vi in zip(constants, values) ) % Q

Moreover, since these Lagrange constants depend only on the points and not on the values, their computation can be amortized away in case we have to perform several interpolations, as in our running example with large vectors of secrets.

def lagrange_constants_for_point(points, point):
    constants = [0] * len(points)
    for i in range(len(points)):
        xi = points[i]
        num = 1
        denum = 1
        for j in range(len(points)):
            if j != i:
                xj = points[j]
                num = (num * (xj - point)) % Q
                denum = (denum * (xj - xi)) % Q
        constants[i] = (num * inverse(denum)) % Q
    return constants

Looking back at the sharing and reconstruction operations, we then see that the former takes Oh(N * T) steps (for each secret) and the latter Oh(T) steps (for each secret) if precomputation is allowed.

Packed Variant

While Shamir’s scheme gets rid of the R = N constraint and gives us flexibility in choosing T or R, it still has the limitation that K = 1. This means that each shareholder receives one share per secret, so a large number of secrets means a large number of shares for each shareholder. By using a generalised variant of Shamir’s scheme known as packed or ramp sharing, we can remove this limitation and reduce the load on each individual shareholder.

To share a vector of K secrets x = [x1, x2, ..., xK], the shares are still computed as f(1), f(2), …, f(N) but the random polynomial is now sampled such that it satisfies f(-1) = x1, f(-2) = x2, …, f(-K) = xK.

Since it’s less obvious how to sample such a polynomial in coefficient representation as we did before, to achieve the desires privacy threshold we instead add T additional constraints f(-K-1) = r1, …, f(-K-T) = rT and simply use a point-value representation of the degree T+K-1 polynomial.

def sample_packed_polynomial(secrets):
    points = SECRET_POINTS + RANDOMNESS_POINTS
    values = secrets + [ random.randrange(Q) for _ in range(T) ]
    return list(zip(points, values))

This however means that we now have to perform interpolation instead of evaluation during sharing, which has an impact on efficiency, even when using precomputation as it now means storing N different sets of constants.

def packed_share(secrets):
    polynomial = sample_packed_polynomial(secrets)
    shares = [ interpolate_at_point(polynomial, p) for p in SHARE_POINTS ]
    return shares

As we will see in the next blog post it is in fact also possible to sample a packed polynomial in the coefficient representation and regain efficient sharing, but it requires slightly more advanced techniques.

def packed_reconstruct(shares):
    points = SHARE_POINTS
    values = shares
    polynomial = [ (p,v) for p,v in zip(points, values) if v is not None ]
    return [ interpolate_at_point(polynomial, p) for p in SECRET_POINTS ]

Leaving computational efficiency aside, with this scheme we have reduced the number of shares each shareholder gets by a factor of K, which is useful in our running example of aggregating large vectors.

However there’s another a caveat: since the degree of the polynomial increased, from T to T + K - 1, we also have to adjust either the privacy threshold or the number of shares needed to reconstruct.

For example, say we use Shamir’s scheme to share a secret between N = 10 shareholders and want a privacy guarantee against up to half of them collaborating, i.e. T = 5. Plugging this into our equation we get 5 + 1 = 6 <= 10 for Shamir’s scheme, meaning we can tolerate that up to N - R = 4, or 40%, of them go missing. However, if we use the packed scheme to share K = 3 secrets together then we get 5 + 3 = 8 <= 10 and the tolerance drops to 20%.

One remedy is to simply multiply all parameters by K; in the example we get 15 + 3 = 18 <= 30 and we are back to the original privacy threshold of half the shareholders and tolerance of 40%. The cost is that we now also need K times as many shareholders, so we have effectively kept the same number of shares but distributed them across a larger population.

(Note that a similar distribution may be achieved by partitioning the secrets and shareholders into K groups; this however has a negative effect on overall tolerance as we need R shares from each group.)

Homomorphic addition and multiplication

The scheme has the same homomorphic properties as Shamir’s, yet now operate in a SIMD fashion where each addition or multiplication is simultaneously performed on every secret shared together. This in itself can have benefits if it fits naturally with the application.

Next Steps

Although an old and simple primitive, secret sharing has several properties that makes it interesting as a way of delegating trust and computation to e.g. a community of users, even if the devices of these users are somewhat inefficient and unreliable.

In this post we have seen a few classical schemes as well as a typical textbook algorithms to implement them. The next blog post will improve on these algorithms and obtain significantly better performance.

Cryptography and Machine Learning

Growing TF Encrypted

A Framework for Encrypted Deep Learning

Moving Forward as a Community

Use cases under constraints

Cross-disciplinary research

Common platform

Challenges and Roadmap

High-level API

Pre-trained models

Tighter TensorFlow integration

More cryptographic techniques

Push the boundaries

Data science workflow

Conclusion

Experimenting with TF Encrypted

TensorFlow and TF Encrypted

What’s the Point?

Conclusion

Secure Computations as Dataflow Programs

Motivation

Prerequisites

Basics

Private tensors

Simple operations

Dot products

Configuration

Improvements

Tracking nodes

Reusing masked tensors

Caching values

Buffering triples

Profiling

Experiments

Thoughts

Private Image Analysis with MPC

Setting

Image Analysis with CNNs

Secure Computation with SPDZ

Tensor operations

Adapting the Model

Optimizer

Layers

Softmax and loss function

Transfer Learning

Pre-train on public dataset

Fine-tune on private dataset

Preprocessing

Adapting the Protocol

Dropout

Average pooling

Dense layers

Convolutions

Sigmoid activations

Proof of Concept Implementation

Thoughts

Generalised triples

Activation functions

Garbled circuits

Precision

Floating point arithmetic

GPUs

The SPDZ Protocol, Part 2

Triples

Underlying principle

Squaring

Dot

Powering

Share conversion

Generalised triples

The SPDZ Protocol, Part 1

Background

Setting

Secure Computation with SPDZ

Sharing and reconstruction

Linear operations

Multiplication

Next Steps

Secret Sharing, Part 3

Robust Reconstruction