We are excited to announce that the new Hugging Face Embedding Container for Amazon SageMaker is now generally available (GA). AWS customers can now efficiently deploy embedding models on SageMaker to build generative AI applications, including Retrieval Augmented Generation (RAG) applications.
This blog shows you how to deploy open embedding models such as Snowflake/snowflake-arctic-embed-l, BAAI/bge-large-en-v1.5, or sentence-transformers/all-MiniLM-L6-v2 to Amazon SageMaker. We will deploy Snowflake/snowflake-arctic-embed-m-v1.5, one of the best open embedding models for retrieval. You can check its ranking on the MTEB leaderboard.
The example covers:
1. Setting up the development environment
2. Retrieving the new Hugging Face Embedding Container
3. Deploying Snowflake Arctic to Amazon SageMaker
4. Running and evaluating inference performance
5. Deleting the model and endpoint
What is the Hugging Face Embedding Container?
The Hugging Face Embedding Container is a new purpose-built inference container for deploying embedding models in a secure, managed environment. The DLC is powered by Text Embeddings Inference (TEI), a blazing fast and memory efficient solution for deploying and serving embedding models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5. TEI implements many features, including:
- No model graph compilation step
- Small Docker images and fast boot times
- Token-based dynamic batching
- Optimized transformers code for inference using Flash Attention, Candle, and cuBLASLt
- Safetensors weight loading
TEI supports a wide range of popular embedding model architectures.
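For intuition about what TEI serves, the request and response format is plain JSON. Below is a minimal sketch that queries a TEI container assumed to be running locally on port 8080 (the URL and input text are illustrative; on SageMaker the same payload is sent through the endpoint instead):

import requests

# hypothetical local TEI container; on SageMaker the same JSON payload goes through the endpoint
TEI_URL = "http://localhost:8080/embed"

payload = {"inputs": "What is Text Embeddings Inference?"}
response = requests.post(TEI_URL, json=payload, timeout=30)
response.raise_for_status()

embeddings = response.json()  # one embedding vector per input
print(len(embeddings[0]))     # dimensionality of the embedding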
Let’s get started!
1. Setting up the development environment
We will use the SageMaker Python SDK to deploy Snowflake Arctic to Amazon SageMaker. You need an AWS account configured and the SageMaker Python SDK installed.
!pip install "sagemaker>=2.221.1" --upgrade --quiet
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can learn more about IAM roles for SageMaker in the documentation.
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
2. Retrieve the new Hugging Face Embedding Container
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with image_uri pointing to the image. To retrieve the new Hugging Face Embedding Container in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the SageMaker SDK. This method allows us to retrieve the URI for the desired Hugging Face Embedding Container. Important to note is that TEI comes in two different versions, one for CPU and one for GPU, so we create a helper function to retrieve the correct image URI based on the instance type.
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the image uri based on instance type
def get_image_uri(instance_type):
    key = "huggingface-tei" if instance_type.startswith("ml.g") or instance_type.startswith("ml.p") else "huggingface-tei-cpu"
    return get_huggingface_llm_image_uri(key, version="1.2.3")
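As a quick check, the helper resolves to the GPU image for GPU instance types and to the CPU image otherwise (the exact URIs printed depend on your region and SDK version):

print(get_image_uri("ml.g5.xlarge"))    # huggingface-tei (GPU) image
print(get_image_uri("ml.c6i.2xlarge"))  # huggingface-tei-cpu image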
3. Deploy Snowflake Arctic to Amazon SageMaker
To deploy Snowflake/snowflake-arctic-embed-m-v1.5 to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including HF_MODEL_ID and instance_type.
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.c6i.2xlarge"

# Define Model and Endpoint configuration parameter
config = {
    'HF_MODEL_ID': "Snowflake/snowflake-arctic-embed-m-v1.5",  # model_id from hf.co/models
}

# create HuggingFaceModel with the image uri
emb_model = HuggingFaceModel(
    role=role,
    image_uri=get_image_uri(instance_type),
    env=config,
)
After we create the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.c6i.2xlarge instance type.
emb = emb_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)
SageMaker will now create the endpoint and deploy the model to it. This can take about 5 minutes.
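If your notebook kernel restarts while the endpoint is still running, you can re-attach to it instead of redeploying. A minimal sketch, assuming you substitute the endpoint name shown in the SageMaker console (the name below is a placeholder):

from sagemaker.huggingface import HuggingFacePredictor

# re-attach to an already running endpoint instead of calling deploy() again
emb = HuggingFacePredictor(endpoint_name="<your-endpoint-name>", sagemaker_session=sess)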
4. Run and evaluate inference performance
After our endpoint is deployed, we can run inference on it using the predict method of the predictor.
data = {
    "inputs": "Reid's fascinating performance keeps the film grounded and keeps the audience riveted.",
}

res = emb.predict(data=data)

# print some results
print(f"length of embeddings: {len(res[0])}")
print(f"first 10 elements of embeddings: {res[0][:10]}")
Awesome! Now that we can generate embeddings, let's test the performance of our model.
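TEI also accepts a list of strings in a single request, which is handy when embedding several documents at once. A short sketch (the sentences are illustrative):

batch = {
    "inputs": [
        "SageMaker makes it easy to deploy embedding models.",
        "Retrieval Augmented Generation combines search with generation.",
    ]
}
batch_res = emb.predict(data=batch)
print(f"number of embeddings returned: {len(batch_res)}")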
We will send 3,900 requests to our endpoint using 10 concurrent threads and measure the average latency and throughput of the endpoint. Each request carries an input of 256 tokens, for a total of roughly 1 million tokens. We chose 256 tokens as the input length to strike a balance between shorter and longer inputs.
Note: When running the load test, the requests were sent from Europe while the endpoint is deployed in us-east-1. This adds network overhead latency to each request.
import threading
import time

number_of_threads = 10
number_of_requests = int(3900 // number_of_threads)
print(f"number of threads: {number_of_threads}")
print(f"number of requests per thread: {number_of_requests}")

def send_requests():
    for _ in range(number_of_requests):
        # input of roughly 256 tokens
        emb.predict(data={"inputs": "Hugging Face is a company and a popular platform in the field of natural language processing (NLP) and machine learning. It is known for its contributions to the development of state-of-the-art models for various NLP tasks and for providing a platform that facilitates the sharing and use of pre-trained models, which are widely used for text generation, summarization, question answering, and more in the development of NLP applications, making cutting-edge models more accessible to the broader community. Hugging Face also offers a model hub where users can discover, share, and download pre-trained models, as well as tools and frameworks that make it easier for developers and machine learning engineers to integrate and use them."})

# create all threads, start them, and wait for them to complete
threads = [threading.Thread(target=send_requests) for _ in range(number_of_threads)]
start = time.time()
[t.start() for t in threads]
[t.join() for t in threads]
print(f"total time: {round(time.time() - start)} seconds")
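To turn the measured wall-clock time into the throughput figures discussed below, you can extend the snippet with a small calculation (this reuses the variables defined above):

elapsed = time.time() - start
total_requests = number_of_threads * number_of_requests  # 3,900 requests
total_tokens = total_requests * 256                      # roughly 1 million tokens
print(f"requests per second: {total_requests / elapsed:.1f}")
print(f"tokens per second: {total_tokens / elapsed:.0f}")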
It took about 841 seconds to send the 3,900 requests, i.e. to embed 1 million tokens. That works out to around 5 requests per second, and note that this includes the network latency from Europe to us-east-1. If we inspect the latency of the endpoint through CloudWatch, we see that the embedding model has a latency of 2 seconds for 10 concurrent requests. This is very impressive for a small, older-generation CPU instance that costs around $150 per month. You can deploy the model to a GPU instance to get faster inference times.
Note: We ran the same test on a ml.g5.xlarge instance with 1x NVIDIA A10G GPU. It took about 30 seconds to embed the 1 million tokens, which means approximately 130 requests per second. The endpoint latency was 4 ms for 10 concurrent requests. The ml.g5.xlarge costs around $1.408 per hour on Amazon SageMaker.
GPU instances are much faster than CPU instances, but they are also more expensive. If you want to bulk process embeddings, you can use a GPU instance. If you want to run a small endpoint at low cost, you can use a CPU instance. We plan to run a dedicated benchmark for the Hugging Face Embedding Container in the future.
The following prints a link to the CloudWatch dashboard for the endpoint's ModelLatency metric:

print(f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{emb.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30)~query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{emb.endpoint_name}")
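If you prefer to pull the latency numbers programmatically instead of opening the console, here is a minimal sketch using boto3 and the standard AWS/SageMaker ModelLatency metric (the time window is illustrative):

from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name=sess.boto_region_name)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": emb.endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

# ModelLatency is reported in microseconds
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average'] / 1000:.1f} ms")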
5. Delete the model and endpoint
To clean up, we can delete the model and endpoint.
emb.delete_model()
emb.delete_endpoint()
Conclusion
The new Hugging Face Embedding Container allows you to easily deploy open embedding models such as Snowflake/snowflake-arctic-embed-l to Amazon SageMaker for inference. We walked through setting up the development environment, retrieving the container, deploying the model, and evaluating its inference performance.
This new container makes it easy for customers to deploy high-performance embedding models and build sophisticated generative AI applications with greater efficiency. We can't wait to see what you will build with the new Hugging Face Embedding Container for Amazon SageMaker. Let us know if you have any questions or feedback.