Versa AI hub
Tools

Hugging Face Text Generation Inference available for AWS Inferentia2

By versatileai | July 23, 2025
David Corvoysier

We are excited to announce the general availability of Hugging Face Text Generation Inference (TGI) on AWS Inferentia2 and Amazon SageMaker.

Text Generation Inference (TGI) is a purpose-built solution for deploying and serving large language models (LLMs) in production workloads at scale. TGI enables high-performance text generation using tensor parallelism and continuous batching for the most popular open LLMs, including Llama, Mistral, and more. Text Generation Inference is used in production by companies such as Grammarly, Uber, and Deutsche Telekom.

The integration of TGI with Amazon SageMaker, in combination with AWS Inferentia2, presents a powerful alternative to GPUs for building production LLM applications. The seamless integration ensures easy deployment and maintenance of models, making LLMs accessible and scalable for a wide range of production use cases.

With the new TGI for AWS Inferentia2 on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low-latency LLM experiences such as HuggingChat, OpenAssistant, and serverless endpoints for LLMs on the Hugging Face Hub.

Deploy Zephyr 7B on AWS Inferentia2 using Amazon SageMaker

This tutorial shows how easy it is to deploy a state-of-the-art LLM, such as Zephyr 7B, on AWS Inferentia2 using Amazon SageMaker. Zephyr is a 7B fine-tuned version of mistralai/Mistral-7B-v0.1, trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO), as described in the technical report. The model is released under the Apache 2.0 license, ensuring wide accessibility and use.

We will show you how to:

1. Set up the development environment
2. Retrieve the TGI Neuronx image
3. Deploy Zephyr 7B to Amazon SageMaker
4. Run inference and chat with the model

Let’s get started.

1. Setting up the development environment

We will use the SageMaker Python SDK to deploy Zephyr to Amazon SageMaker. You need a configured AWS account and the SageMaker Python SDK installed.

!pip install transformers "sagemaker>=2.206.0" --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM role with the permissions required by SageMaker. You can learn more about it here.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Retrieve the TGI Neuronx image

You can run inference on AWS Inferentia2 using the new Hugging Face TGI Neuronx Deep Learning Containers (DLCs). Use the get_huggingface_llm_image_uri method of the SageMaker SDK to retrieve the appropriate Hugging Face TGI Neuronx DLC URI based on your desired backend, session, region, and version. You can find all available versions here.

Note: At the time of writing this blog post, the latest version of the Hugging Face LLM DLC was not yet available via the get_huggingface_llm_image_uri method; in that case, you can use the raw container URI instead.

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface-neuronx",
    version="0.0.20"
)

print(f"llm image uri: {llm_image}")

3. Deploy Zephyr 7B to Amazon SageMaker

Text Generation Inference (TGI) on Inferentia2 supports popular open LLMs, including Llama, Mistral, and more. You can find the complete list of supported models (text-generation) here.

Compiling LLMs for Inferentia2

At the time of writing, AWS Inferentia2 does not support dynamic shapes for inference, which means that the sequence length and batch size must be specified in advance. To make it easier for customers to utilize the full power of Inferentia2, we created a neuron model cache containing pre-compiled configurations for the most popular LLMs. A cached configuration is defined by the model architecture (Mistral), model size (7B), neuron version (2.16), number of Inferentia cores (2), batch size (2), and sequence length (2048).
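Because shapes are static, every input must fit the compiled dimensions: shorter inputs are padded and longer inputs are truncated. The helper below is a hypothetical sketch to illustrate this constraint; it is not part of TGI or Optimum Neuron:

```python
def fit_to_static_shape(token_ids, seq_len=2048, pad_id=0):
    """Pad or truncate a list of token ids to the compiled sequence length."""
    if len(token_ids) >= seq_len:
        return token_ids[:seq_len]  # truncate to the static shape
    return token_ids + [pad_id] * (seq_len - len(token_ids))  # right-pad with pad_id

print(fit_to_static_shape([101, 2023, 102], seq_len=8))  # [101, 2023, 102, 0, 0, 0, 0, 0]
print(fit_to_static_shape(list(range(12)), seq_len=8))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

This is why the serving parameters later in this post must match the compilation parameters exactly: the compiled graph only accepts tensors of these fixed sizes.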

This means you don't need to compile the model yourself; instead, you can use a pre-compiled model from the cache. Examples are mistralai/Mistral-7B-v0.1 and HuggingFaceH4/zephyr-7b-beta. You can find the compiled/cached configurations on the Hugging Face Hub. If your desired configuration is not yet cached, you can compile it yourself using the Optimum CLI or open a request in the cache repository.

For this post, we re-compiled HuggingFaceH4/zephyr-7b-beta using the following command and parameters on an inf2.8xlarge instance and pushed it to the Hub at aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2.

# compile zephyr-7b-beta and push it to the hub
optimum-cli export neuron -m HuggingFaceH4/zephyr-7b-beta --batch_size 4 --sequence_length 2048 --num_cores 2 --auto_cast_type bf16 ./zephyr-7b-beta-neuron
huggingface-cli upload aws-neuron/zephyr-7b-seqlen-2048-bs-4 ./zephyr-7b-beta-neuron ./ --exclude "checkpoint/**"

# push the tokenizer to the same repository
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta').push_to_hub('aws-neuron/zephyr-7b-seqlen-2048-bs-4')"

Note: if you are trying to compile an LLM with a configuration that is not yet cached, compilation can take up to 45 minutes.

Deploying TGI Neuron Endpoints

Before deploying the model to Amazon SageMaker, we must define the TGI Neuronx endpoint configuration. We need to make sure the following additional parameters are defined:

  • HF_NUM_CORES: The number of Neuron cores used for the compilation.
  • HF_BATCH_SIZE: The batch size used to compile the model.
  • HF_SEQUENCE_LENGTH: The sequence length used to compile the model.
  • HF_AUTO_CAST_TYPE: The auto cast type used to compile the model.

In addition, we need to define the traditional TGI parameters:

  • HF_MODEL_ID: The Hugging Face model ID.
  • HF_TOKEN: The Hugging Face API token to access gated models.
  • MAX_BATCH_SIZE: The maximum batch size the model can handle, equal to the batch size used for compilation.
  • MAX_INPUT_LENGTH: The maximum input length the model can handle.
  • MAX_TOTAL_TOKENS: The maximum total tokens the model can generate, equal to the sequence length used for compilation.
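Since several of these serving values must agree with the compilation settings, a quick sanity check before deployment can catch mismatches early. The helper below is a hypothetical sketch, not part of the SageMaker SDK:

```python
def validate_tgi_neuronx_config(config):
    """Check that TGI serving parameters match the Neuronx compilation parameters."""
    errors = []
    if config["MAX_BATCH_SIZE"] != config["HF_BATCH_SIZE"]:
        errors.append("MAX_BATCH_SIZE must equal the compilation batch size")
    if config["MAX_TOTAL_TOKENS"] != config["HF_SEQUENCE_LENGTH"]:
        errors.append("MAX_TOTAL_TOKENS must equal the compilation sequence length")
    if int(config["MAX_INPUT_LENGTH"]) >= int(config["MAX_TOTAL_TOKENS"]):
        errors.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    return errors

config = {
    "HF_BATCH_SIZE": "4", "HF_SEQUENCE_LENGTH": "2048",
    "MAX_BATCH_SIZE": "4", "MAX_INPUT_LENGTH": "1512", "MAX_TOTAL_TOKENS": "2048",
}
print(validate_tgi_neuronx_config(config))  # [] -> the configuration is consistent
```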

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.inf2.8xlarge"
health_check_timeout = 1800  # additional time to load the model

# Define Model and Endpoint configuration parameters
config = {
    "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
    "HF_NUM_CORES": "2",
    "HF_BATCH_SIZE": "4",
    "HF_SEQUENCE_LENGTH": "2048",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_LENGTH": "1512",
    "MAX_TOTAL_TOKENS": "2048",
}

# create the HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config)

After creating the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We deploy the model with the ml.inf2.8xlarge instance type.

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)

SageMaker will now create the endpoint and deploy the model to it. This can take 10 to 15 minutes.

4. Run inference and chat with the model

After the endpoint is deployed, we can run inference using the predict method of the predictor. Different parameters can be provided to influence the generation by adding them to the parameters attribute of the payload. The supported parameters can be found in TGI's open API specification in the Swagger documentation.

HuggingFaceH4/zephyr-7b-beta is a conversational chat model, meaning we can chat with it using a prompt structure like this:

<|system|>\nYou are friendly.</s>\n<|user|>\nInstruction</s>\n<|assistant|>
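For illustration, the Zephyr-style format above can be assembled by hand. The helper below is a hypothetical sketch of what the chat template produces; in practice, you should rely on the tokenizer's template instead:

```python
def build_zephyr_prompt(messages, add_generation_prompt=True):
    """Render OpenAI-style messages into the Zephyr chat format sketched above."""
    prompt = ""
    for m in messages:
        prompt += f"<|{m['role']}|>\n{m['content']}</s>\n"
    if add_generation_prompt:
        # leave the assistant turn open so the model continues from here
        prompt += "<|assistant|>\n"
    return prompt

messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a fact about AWS."},
]
print(build_zephyr_prompt(messages))
```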

Manually preparing the prompt is error-prone, so we can use the apply_chat_template method of the tokenizer instead. It expects a messages dictionary in the well-known OpenAI format and converts it into the correct prompt format for the model. Let's see if Zephyr knows some facts about AWS.

from transformers import AutoTokenizer

# load the tokenizer of the compiled model
tokenizer = AutoTokenizer.from_pretrained("aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2")

# prompt to generate, in the OpenAI messages format
messages = [
    {"role": "system", "content": "You are an AWS expert."},
    {"role": "user", "content": "Can you tell me any interesting facts about AWS?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# generation arguments
payload = {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
    "return_full_text": False,
    "stop": ["</s>"],
}

chat = llm.predict({"inputs": prompt, "parameters": payload})

print(chat[0]["generated_text"][len(prompt):])

Awesome! We have deployed Zephyr to Amazon SageMaker on Inferentia2 and chatted with it.

5. Clean up

To clean up, we can delete the model and the endpoint.

llm.delete_model()
llm.delete_endpoint()

Conclusion

Hugging Face Text Generation Inference (TGI) on AWS Inferentia2, together with its integration with Amazon SageMaker, provides a cost-effective alternative for deploying large language models (LLMs).

We are actively working on supporting more models, streamlining the compilation process, and improving the caching system.

Thank you for reading! If you have any questions, please feel free to contact us via Twitter or LinkedIn.
