A running document that shows how to deploy and fine-tune DeepSeek R1 models with Hugging Face on AWS.
What is DeepSeek-R1?
If you have ever struggled with a tough math problem, you know how useful it is to think a little longer and work through it carefully. OpenAI's o1 model showed that when LLMs are trained to do the same, by using more compute during inference, they get significantly better at solving reasoning tasks such as mathematics, coding, and logic.
However, the recipe behind OpenAI's reasoning models has remained a well-kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).
DeepSeek AI open-sourced DeepSeek-R1 along with six dense models distilled from DeepSeek-R1, based on the Llama and Qwen architectures. You can find all of them in the DeepSeek R1 collection.
We collaborate with Amazon Web Services to make it easier for developers to deploy the latest Hugging Face models on AWS services, so they can build better generative AI applications.
Let's review how you can deploy and fine-tune DeepSeek R1 models with Hugging Face on AWS.
Deploy DeepSeek R1 models
Deploy on AWS with Hugging Face Inference Endpoints
Hugging Face Inference Endpoints offers an easy and secure way to deploy machine learning models on dedicated compute on AWS. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure, simplifying the deployment process to a few clicks.
With Inference Endpoints, you can deploy any of the six distilled models from DeepSeek-R1, as well as a quantized version of DeepSeek R1 made by Unsloth: https://huggingface.co/unsloth/DeepSeek-R1-GGUF. On the model page, click Deploy, then HF Inference Endpoints. You will be redirected to the Inference Endpoints page, where we have pre-selected an optimized inference container and the recommended hardware to run the model. Once you have created your endpoint, you can send queries to DeepSeek R1 on AWS for $8.3 per hour.
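Once the endpoint reports Running, you can query it from Python with the huggingface_hub client. A minimal sketch: the endpoint URL and token below are placeholders you would copy from the Inference Endpoints UI and your account settings.

```python
from huggingface_hub import InferenceClient

# Placeholder URL: copy the real one from your endpoint's overview page.
client = InferenceClient(
    "https://your-endpoint-name.us-east-1.aws.endpoints.huggingface.cloud",
    token="hf_xxx",  # hypothetical access token placeholder
)

# Send a simple text-generation request to the deployed model.
output = client.text_generation(
    "What is the meaning of life?",
    max_new_tokens=256,
)
print(output)
```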
You can find DeepSeek R1 and the distilled models, as well as other popular open LLMs, ready to be deployed on optimized configurations in the Inference Endpoints Model Catalog.
Note: The team is working on enabling DeepSeek model deployment on the recommended instances. Stay tuned!
Deploy on Amazon SageMaker AI with Hugging Face LLM DLCs
DeepSeek R1 on GPUs
Note: The team is working on enabling DeepSeek-R1 deployment with the Hugging Face LLM DLCs on GPU. Stay tuned!
Distilled models on GPUs
Let's walk through the deployment of DeepSeek-R1-Distill-Llama-70B.
Code snippets are available on the model page, under the Deploy button!
First, let's go through a few prerequisites. Make sure you have a SageMaker Domain configured, sufficient quota in SageMaker, and a JupyterLab space. For DeepSeek-R1-Distill-Llama-70B, you will need to raise the default quota for ml.g6.48xlarge for endpoint usage.
For reference, here are the hardware configurations we recommend for each of the distilled variants:
| Model | Instance Type | # of GPUs |
|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | ml.g6.48xlarge | 8 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | ml.g6.12xlarge | 4 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | ml.g6.12xlarge | 4 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | ml.g6.2xlarge | 1 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | ml.g6.2xlarge | 1 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | ml.g6.2xlarge | 1 |
Once in a JupyterLab space, make sure to install the latest version of the SageMaker SDK.
```
!pip install sagemaker --upgrade
```
Next, instantiate a SageMaker session and determine the current region and execution role.
```python
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
```
Create the SageMaker Model object with the Python SDK.
```python
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_id.split("/")[-1].lower()

# Hub model configuration
hub = {
    "HF_MODEL_ID": model_id,
    "SM_NUM_GPUS": json.dumps(8),
}

# Create the Hugging Face Model class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.0.1"),
    env=hub,
    role=role,
)
```
Deploy the model to a SageMaker endpoint and test it.
```python
endpoint_name = f"{model_name}-ep"

# Deploy the model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.g6.48xlarge",
    container_startup_health_check_timeout=2400,
)

# Send a test request
predictor.predict({"inputs": "What is the meaning of life?"})
```
That's it, you have deployed a Llama 70B reasoning model!
Because a TGI v3 container is used under the hood, the most performant parameters for the given hardware are automatically selected.
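You can still pass explicit generation parameters per request if you want to control the output. A minimal sketch using the same predictor as above; the parameter values are illustrative rather than tuned recommendations.

```python
# Request with explicit generation parameters (values are illustrative).
response = predictor.predict(
    {
        "inputs": "Explain the Pythagorean theorem step by step.",
        "parameters": {
            "max_new_tokens": 512,
            "temperature": 0.6,
            "top_p": 0.95,
            "do_sample": True,
        },
    }
)
print(response)
```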
When the test is completed, delete the endpoint.
```python
predictor.delete_model()
predictor.delete_endpoint()
```
Distilled models on Neuron
Let's walk through the deployment of DeepSeek-R1-Distill-Llama-70B on a Neuron instance, such as AWS Trainium 2 or AWS Inferentia 2.
Code snippets are available on the model page, under the Deploy button!
The prerequisites for deploying to a Neuron instance are the same. Make sure you have a SageMaker Domain configured, sufficient quota in SageMaker, and a JupyterLab space. For DeepSeek-R1-Distill-Llama-70B, you will need to raise the default quota for ml.inf2.48xlarge for endpoint usage.
Next, instantiate a SageMaker session and determine the current region and execution role.
```python
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
```
Create the SageMaker Model object with the Python SDK.
```python
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.25")

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_id.split("/")[-1].lower()

# Hub model configuration
hub = {
    "HF_MODEL_ID": model_id,
    "HF_NUM_CORES": "24",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

# Create the Hugging Face Model class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)
```
Deploy the model to a SageMaker endpoint and test it.
```python
endpoint_name = f"{model_name}-ep"

# Deploy the model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    container_startup_health_check_timeout=3600,
    volume_size=512,
)

# Send a test request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        },
    }
)
```
That's it, you have deployed a Llama 70B reasoning model on a Neuron instance! Under the hood, a pre-compiled model was downloaded from Hugging Face to speed up the endpoint start time.
When the test is completed, delete the endpoint.
```python
predictor.delete_model()
predictor.delete_endpoint()
```
Deploy on EC2 Neuron with the Hugging Face Neuron Deep Learning AMI
This guide details how to export, deploy, and run DeepSeek-R1-Distill-Llama-70B on an inf2.48xlarge AWS EC2 instance.
First, let's go through a few prerequisites. Make sure you have subscribed to the Hugging Face Neuron Deep Learning AMI on the AWS Marketplace. It provides all the dependencies you need to train and deploy Hugging Face models on Trainium & Inferentia. Then, launch an inf2.48xlarge instance with the AMI and connect to it through SSH. If you have never done this before, you can check our step-by-step guide.
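If you prefer to script the launch instead of using the EC2 console, here is a minimal sketch using boto3. The AMI ID, key pair, and security group below are placeholders: look up the actual Hugging Face Neuron Deep Learning AMI ID for your region on the AWS Marketplace.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# All identifiers below are hypothetical placeholders for illustration.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # HF Neuron DL AMI ID for your region
    InstanceType="inf2.48xlarge",
    KeyName="my-key-pair",                       # your SSH key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],   # a group allowing SSH from your IP
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        # Extra root volume space for model weights; device name depends on the AMI.
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 512}}
    ],
)
print(response["Instances"][0]["InstanceId"])
```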
Once connected to the instance, you can deploy the model on an endpoint with the following command:
```bash
docker run -p 8080:80 \
  -v $(pwd)/data:/data \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  --device=/dev/neuron6 \
  --device=/dev/neuron7 \
  --device=/dev/neuron8 \
  --device=/dev/neuron9 \
  --device=/dev/neuron10 \
  --device=/dev/neuron11 \
  -e HF_BATCH_SIZE=4 \
  -e HF_SEQUENCE_LENGTH=4096 \
  -e HF_AUTO_CAST_TYPE="bf16" \
  -e HF_NUM_CORES=24 \
  ghcr.io/huggingface/neuronx-tgi:latest \
  --model-id deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --max-batch-size 4 \
  --max-total-tokens 4096
```
It will take a few minutes to download the compiled model from the Hugging Face cache and launch the TGI endpoint.
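If you want to script the wait, you can poll the server before sending traffic. A minimal sketch, assuming the container exposes the standard TGI health route on the mapped port:

```python
import time
import requests

# Poll the (assumed) TGI /health route until the server reports ready.
url = "http://localhost:8080/health"
while True:
    try:
        if requests.get(url, timeout=5).status_code == 200:
            print("Endpoint is ready.")
            break
    except requests.exceptions.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(30)
```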
Next, you can test the endpoint.
```bash
curl localhost:8080/generate \
  -X POST \
  -d '{"inputs":"Why is the sky dark at night?"}' \
  -H 'Content-Type: application/json'
```
When the test is completed, pause the EC2 instance.
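For example, a minimal sketch for stopping the instance from Python with boto3 (the instance ID is a placeholder); stopping ends compute billing while keeping the EBS volume:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
# Stop (not terminate) so the volume and downloaded model data are preserved.
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])  # hypothetical instance ID
```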
Note: The team is working on enabling DeepSeek R1 deployment on Trainium & Inferentia with the Hugging Face Neuron Deep Learning AMI. Stay tuned!
Fine-tune DeepSeek R1 models
Fine-tune on Amazon SageMaker AI with Hugging Face Training DLCs
Note: The team is working on enabling fine-tuning of all DeepSeek models with the Hugging Face Training DLCs. Stay tuned!
Fine-tune on EC2 Neuron with the Hugging Face Neuron Deep Learning AMI
Note: The team is working on enabling fine-tuning of all DeepSeek models with the Hugging Face Neuron Deep Learning AMI. Stay tuned!