10.2. Convolutional Neural Network (CNN)

Convolutional neural networks are quite complicated cases with FHE. Since the encrypted cyphertext is the most atomic form of the input data that we can access and we need to be able to multiply subsets of the data by different amounts we use a sparse n-dimensional array with the weights embedded in different positions and the rest zeroed out (see Kernel-Masquerade and Hadmard Product). This way we can still convolve out filters/ kernels but instead of manipulating \(x\) (normally by selecting a slice of \(x\) that you were interested in) we manipulate the filter instead generating windows where the filter should be placed, and caching these windows for later use in back-propogation so we know exactly what input \(x\) multiplied which weights once \(x\) is finally decrypted. Each window becoming a different branch \(<t>\) and the total number of windows our total branches \(T_x\).

10.2.1. CNN Equations

Standard neuron equation (not compatible with our representations, where \(w_0\) is actually the bias):

(1)\[a = g(\sum_{i=1}^{N}(w_ix_i)+w_0)\]

Kernel-Masquerade, weighted, and biased cross correlation.

(2)\[a^{(i)<t>} = g(\sum_{t=0}^{T_x-1}(k^{<t>}x^{(i)})+b/N)\]

10.2.1.1. CNN Derivatives

The derivative of a CNN (\(f\)) with respect to the bias \(b\):

(3)\[\frac{df(x)}{db} = T_x \frac{dg}{dx}\]

The derivative of a CNN (\(f\)) with respect to the weights multi-dimensional array \(w\) is the sum of all portions of \(x^{(i)}\) unmasked during product calculation:

(4)\[\frac{df(x)}{dw} = \sum_{t=0}^{T_x}(x^{(i)<t>})\frac{dg}{dx}\]

The derivative of a CNN (\(f\)) with respect to the input \(x\):

(5)\[\frac{df(x)}{dx} = \sum_{t=0}^{T_x}(k^{(i)<t>})\frac{dg}{dx}\]

Note

where:

\(x\):
- \(x^{(i)}\) ; the multidimensional-input array used as the \(i\)’th training example / pass of the network. E.G cnn.forward is one whole forward pass.
- \(x_{n}^{(i)<t>}\) ; The \(n\)’th input value of the multi-dimensional input array \(x^{(i)}\). Corresponding to the \(i\)’th training example of the network, and branch/ time-step \(t\).
\(T_x\) and \(t\):
- \(T_x\) ; The total number of branches per input array \(x\). No need for \(T_x^{(i)}\) as branches should be the same every time.
- \(t\) ; The current (relative)/ \(t\)’th timestep/ branch.
\(N\) and \(n\):
- \(N\); the total number of elements in any individual multi-dimensional input array \(x\)
- \(n\); the \(n\)’th input element any individual multi-dimensional input array \(x\), e.g \(x_n\) is the \(n\)’th value \(x\) in the multi-dimensional array \(x\).
\(g\) and \(a\)
- \(g\); some activation function e.g \(\sigma_a\) (see:\sigma_a(x))
- \(a\); the sum of output / activation of this neural network (if the last network then \(a=\hat{y}\))
\(y\) and \(\hat{y}:\)
- \(y\); the (normalized) ground-truth / observed outcome
- \(\hat{y}\); the (normalized) prediction of \(y\)
\(w\), \(k\), and \(b\):
- \(w\); a weight
- \(b\); a bias
- \(k\); a kernel that multiplies over some input data, for us this is the Kernel-Masquerade

Please also note, this is with respect to each network. One networks output activation \(a\) might be another networks input \(x\)

10.2.2. Kernel-Masquerade

The Kernel-Masquerade is the combining of a convolutional kernel and a mask to simultaneously calculate the product of any given kernel/ filter over a dataset. This is to act as though we were capable of normally slicing some input data which is impossible when this data is embedded in the FHE cyphertext. We deploy kernel weights embedded in a sparse/ zeroed multi-dimensional array, thus ignoring undesired values in the input cyphertext when finding the Hadmard Product of the two, and minimising computational depth of cyphertexts in processing.

10.2.3. Hadmard Product

The Hadmard product is simply two equally shaped n-dimensional arrays operated on element wise to produce a third product n-dimensional array of the same size/ shape:

10.2.4. Commuted-Sum

Warning

If you are intending to write your own or extend an additional neural network please pay special attention to the Commuted-Sum, it will change the behavior of the networks drastically and in some unexpected ways if it is not accounted for. CNNs and other single input multiple branch networks are the source of commuted-sums.

In our neural networks operating on fully homomorphically encrypted cyphertexts, the cyphertexts encode into themselves multiple values, I.E for us they include all the values in a single training example \((i)\), to give a more concrete example, for a CNN one cyphertext is the whole of a single image, thats all pixel values of the image. This means computations happen quicker and take up less space as there is less overhead since we do not need different parameters between each pixel value, instead we can encode and encrypted them all using the same parameters, reducing duplication, and allowing us to operate on all of them simultaneously. A consequence to this approach however is that the cyphertext is the most atomic form of the data we have, it cannot be any smaller, it will always be a polynomial with \(N\) number of values encoded in it. In practice this means we cannot sum a cyphertext, as it would mean reducing all these values into one single value I.E the sum of the cyphertext, or to put it a different way, we cannot fold the cyphertext in on itself to get this one single number, as this is homomorphic encryption I.E structure preserving self similar encryption, the form of what comes in is the form of what comes out.

In the case of CNNs however as you can see in CNN Equations there is a sum. As we have stated it is impossible to sum the cyphertext in on itself, thus since almost all our networks have to be/ use abelian operations we can commute this sum until after we have decrypted as long as we pay special attention to the fact that we have big matrices that should be treated as if they were singular values. The most egregious possible violation is broadcasting of a single addition, as this acts differently on a single value than it does on a matrix of values, thus the need to always divide a bias \(b\) by the total number of values being biased \(N\). Assuming our calculus is correct and we succeffully implement a commuted sum until after decryption, this should only ever be a problem in one dimension, all subsequent sums would happen between cyphertexts (since it should have been summed already) meaning there is only ever one commuted sum, we never have to worry about a stack of them.

10.2.4.1. CNN API

Note

You may see some or no content at all on this page that documents the API. If that is the case please build the documentation locally (Docker Build), or view this documentation using a “autodoc”-ed version of this documentation (see: Documentation Variations).

CNN has been split into its constituent operations.

See fhez.nn.operations.cnn -> Cross Correlation (CC)