Towards encrypted large-scale language models with FHE

By versatileai, November 12, 2025

Large Language Models (LLMs) have recently proven to be reliable tools for increasing productivity in many areas, such as programming, content creation, text analysis, web search, and distance learning.

Impact of large-scale language models on user privacy

Despite the appeal of LLMs, privacy concerns remain regarding the user queries processed by these models. While leveraging the capabilities of an LLM is desirable, there is a risk of exposing confidential information to the LLM service provider. In some fields, such as medicine, finance, and law, this privacy risk is significant.

One possible solution to this problem is on-premises deployment, where the LLM owner deploys the model to the client’s machine. However, this is not ideal, because training an LLM can cost millions of dollars (an estimated $4.6 million for GPT-3), and on-premises deployment carries the risk of exposing the model’s intellectual property (IP).

Zama believes you can get the best of both worlds. Our goal is to protect both user privacy and model IP. In this blog, we will show you how to leverage the Hugging Face transformer library to run some of these models on encrypted data. The complete code can be found in this use case.

Fully homomorphic encryption (FHE) can solve LLM privacy challenges

Zama’s solution to the challenges of LLM deployment is to use Fully Homomorphic Encryption (FHE), which enables the execution of functions on encrypted data. This makes it possible to protect the model owner’s IP while maintaining the privacy of the user’s data. This demonstration shows that an LLM implemented in FHE maintains the predictive quality of the original model. To do this, we adapt the GPT2 implementation from the Hugging Face transformers library and rework the inference section using Concrete-Python, which converts Python functions into their FHE equivalents.

Figure 1. GPT2 architecture. Source: https://en.wikipedia.org/wiki/GPT-2

Figure 1 shows the GPT2 architecture with a repeating structure. That is, a series of multi-head attention (MHA) layers are applied in succession. Each MHA layer uses the model weights to project the input, compute the attention mechanism, and reproject the attention output to a new tensor.

In TFHE, model weights and activations are expressed as integers. Nonlinear functions must be implemented using programmable bootstrap (PBS) operations. PBS implements table lookup (TLU) operations on encrypted data while simultaneously updating the ciphertext to enable arbitrary computations. The disadvantage is that the computation time for PBS is longer than that for linear operations. By leveraging these two types of operations, subparts of LLM calculations, or even complete LLM calculations, can be expressed in FHE.
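As an illustration, the table-lookup (TLU) idea behind PBS can be sketched in plain Python. This is a plaintext analogue only (real PBS applies the lookup to a ciphertext), and the GELU nonlinearity, bit width, and quantization scale here are illustrative assumptions, not taken from the post:

```python
# Plaintext sketch of the table-lookup (TLU) idea behind PBS.
# Illustrative only: real PBS operates on encrypted values.
import math

def gelu(x):
    # Tanh approximation of GELU, a typical transformer nonlinearity
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def make_tlu(fn, n_bits, scale):
    """Precompute fn over every representable n-bit quantized input."""
    return [round(fn(q * scale) / scale) for q in range(1 << n_bits)]

n_bits, scale = 4, 0.25            # assumed bit width and quantization step
table = make_tlu(gelu, n_bits, scale)

q_in = 8                           # quantized value representing 8 * 0.25 = 2.0
q_out = table[q_in]                # a single lookup replaces the nonlinearity
```

Because the whole input range of an n-bit integer is small, the entire nonlinear function can be precomputed as a table, which is exactly what makes TLU-style evaluation tractable in the encrypted domain.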

Implementing the LLM layer using FHE

Next, we will show you how to encrypt a single attention head of a multi-head attention (MHA) block. You can also find a complete MHA block example in this use case.

Figure 2. Execution part of a large language model in FHE.

Figure 2 shows a simplified overview of the underlying implementation. The client runs inference locally up to the first layer, which has been removed from the shared model. The client encrypts these intermediate values and sends them to the server. The server applies part of the attention mechanism, and the results are returned to the client, which decrypts them and continues local inference.
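The hybrid client/server flow can be mocked schematically in plain Python. The encrypt/decrypt placeholders and the toy layer functions are hypothetical stand-ins for illustration; a real deployment would use Concrete's FHE primitives for the encrypted step:

```python
# Schematic plaintext mock of the hybrid client/server flow.
# encrypt/decrypt and the layer functions are illustrative stand-ins only.

def client_prefix(x):
    return x * 2.0                 # local layers before the removed attention

def encrypt(x):
    return ("enc", x)              # placeholder "ciphertext"

def decrypt(ct):
    return ct[1]                   # placeholder decryption

def server_attention(ct):
    tag, x = ct                    # in real FHE the server never sees plaintext
    return (tag, x + 1.0)          # server computes on the encrypted value

def client_suffix(x):
    return x - 0.5                 # remaining local layers

hidden = client_prefix(3.0)
result = client_suffix(decrypt(server_attention(encrypt(hidden))))
```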

Quantization

First, to perform model inference on encrypted values, the model weights and activations must be quantized and converted to integers. Ideally, we use post-training quantization, which does not require retraining the model. We then implement an FHE-compatible attention mechanism using integers and PBS, and investigate its impact on LLM accuracy.
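A minimal sketch of symmetric post-training quantization, assuming a single per-tensor scale (the actual Concrete-ML quantizer is more involved):

```python
import numpy as np

def quantize(values, n_bits):
    """Symmetric per-tensor post-training quantization to n-bit integers."""
    q_max = 2 ** (n_bits - 1) - 1                  # e.g. 7 for 4 bits
    scale = np.abs(values).max() / q_max           # one scale for the tensor
    q = np.clip(np.round(values / scale), -q_max - 1, q_max)
    return q.astype(np.int64), scale

def dequantize(q, scale):
    return q * scale

weights = np.array([-0.9, -0.1, 0.0, 0.35, 0.7])
q_w, scale = quantize(weights, n_bits=4)
w_hat = dequantize(q_w, scale)                     # approximate reconstruction
```

The reconstruction error is bounded by the scale, which is why accuracy degrades as the bit width (and hence the number of representable levels) shrinks.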

To evaluate the impact of quantization, we run the full GPT2 model with a single quantized attention head and measure the accuracy obtained when varying the number of quantization bits for both weights and activations.

Average top-k accuracy of a single quantized attention head

This graph shows that 4-bit quantization retains 96% of the original accuracy. The experiments were conducted using a dataset of approximately 80 sentences. The metric compares the logit predictions of the original model to those of the model with the quantized attention head.
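The metric can be sketched as the average overlap between the top-k token sets of the two models' logits. This is a plausible reconstruction of the comparison, not the exact evaluation code, and the data below is synthetic:

```python
import numpy as np

def average_topk_accuracy(ref_logits, quant_logits, k=5):
    """Average fraction of the reference top-k tokens that the quantized
    model also places in its top k, over all evaluated positions."""
    scores = []
    for ref, quant in zip(ref_logits, quant_logits):
        ref_topk = set(np.argsort(ref)[-k:])
        quant_topk = set(np.argsort(quant)[-k:])
        scores.append(len(ref_topk & quant_topk) / k)
    return float(np.mean(scores))

rng = np.random.default_rng(0)
ref = rng.normal(size=(10, 50))                 # 10 positions, 50-token vocab
identical = average_topk_accuracy(ref, ref)     # perfect agreement -> 1.0
noisy = average_topk_accuracy(ref, ref + rng.normal(scale=0.1, size=ref.shape))
```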

Applying FHE to the Hugging Face GPT2 model

Based on Hugging Face’s transformers library, we rewrite the forward pass of the modules we want to encrypt to include quantized operators. First, load GPT2LMHeadModel to build a SingleHeadQGPT2Model instance, then manually replace the first multi-head attention module with a QGPT2SingleHeadAttention module as follows. The complete implementation can be found here.

self.transformer.h[0].attn = QGPT2SingleHeadAttention(config, n_bits=n_bits)

The forward pass is then overwritten so that the first head of the multi-head attention mechanism, including the projections used to construct the query, key, and value matrices, is executed with FHE-friendly operators. The QGPT2 module can be found here.

class SingleHeadAttention(QGPT2):
    """Class representing a single attention head implemented using quantization methods."""

    def run_numpy(self, q_hidden_states: np.ndarray):
        # Convert the input to a DualArray instance
        q_x = DualArray(
            float_array=self.x_calib,
            int_array=q_hidden_states,
            quantizer=self.quantizer,
        )

        # Extract the attention base module name
        mha_weights_name = f"transformer.h.{self.layer}.attn."

        # Extract the first head's query, key and value weights and biases
        head_0_indices = [
            list(range(i * self.n_embd, i * self.n_embd + self.head_dim))
            for i in range(3)
        ]
        q_qkv_weights = ...
        q_qkv_bias = ...

        # Apply the first projection to extract Q, K and V as a single array
        q_qkv = q_x.linear(
            weight=q_qkv_weights,
            bias=q_qkv_bias,
            key=f"attention_qkv_proj_layer_{self.layer}",
        )

        # Extract the queries, keys and values
        q_qkv = q_qkv.expand_dims(axis=1, key=f"unsqueeze_{self.layer}")
        q_q, q_k, q_v = q_qkv.enc_split(
            3, axis=-1, key=f"qkv_split_layer_{self.layer}"
        )

        # Compute the attention mechanism
        q_y = self.attention(q_q, q_k, q_v)

        return self.finalize(q_y)

The remaining computations in the model stay in floating point, unencrypted, and are expected to be performed locally by the client.

Once you have loaded the pre-trained weights into your modified GPT2 model, you can call the generate method.

qgpt2_model = SingleHeadQGPT2Model.from_pretrained(
    "gpt2_model", n_bits=4, use_cache=False
)

output_ids = qgpt2_model.generate(input_ids)

For example, you can ask the quantized model to complete the phrase “Cryptography is a”. With sufficient quantization precision when running the model in FHE, the generated output is:

“Cryptography is a very important part of computer security.”

If the quantization precision is too low, you will get the following results:

“Cryptography is a great way to learn about the world around you.”

Compiling to FHE

Now you can compile the attention head using the following Concrete-ML code.

circuit_head = qgpt2_model.compile(input_ids)

When you run this, you will see the output: “Circuit compiled to be 8-bit wide.” This configuration is compatible with FHE and indicates the maximum bit width required to perform operations on FHE.

Complexity

The most computationally intensive operation in a transformer model is the attention mechanism, which multiplies queries, keys, and values. In FHE, this cost is further increased by the expense of multiplication in the encrypted domain. Moreover, as the sequence length increases, the number of these costly multiplications grows quadratically.
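To see the quadratic growth, one can count the multiplications in the two matrix products of a single attention head. The head dimension below is an illustrative assumption, not a figure from the post:

```python
def attention_mults(seq_len, head_dim):
    """Multiplications in Q @ K.T plus scores @ V for one attention head."""
    qk = seq_len * seq_len * head_dim   # query-key dot products
    av = seq_len * seq_len * head_dim   # attention-weighted sum over values
    return qk + av

# Doubling the sequence length roughly quadruples the multiplications,
# each of which is far more expensive in the encrypted domain.
short = attention_mults(seq_len=6, head_dim=64)
longer = attention_mults(seq_len=12, head_dim=64)
```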

For the encrypted head, a sequence of length 6 requires 11,622 PBS operations. This first experiment is not optimized for performance; it runs in seconds but requires significant computing power. Fortunately, dedicated hardware is expected to improve latency by 1,000x to 10,000x, bringing computation times down from minutes on CPU to well under a second.

Conclusion

Although large language models are great assistance tools for a variety of use cases, their deployment raises significant questions regarding user privacy. In this blog, we have described the first steps toward making an entire LLM work on encrypted data, where the model runs entirely in the cloud while user privacy is fully respected.

This step involves converting certain parts in the model, such as GPT2, to the FHE realm. This implementation leverages the transformer library and allows you to evaluate the impact on accuracy when parts of your model run on encrypted data. This approach not only protects user privacy, but also allows model owners to keep large portions of their models private. The complete code can be found in this use case.

The Zama libraries Concrete and Concrete-ML (don’t forget to star the repositories on GitHub ⭐️💛) make it easy to build ML models and convert them to FHE equivalents, enabling computation and prediction over encrypted data.

I hope you enjoyed this post. Please feel free to share your thoughts/feedback.
