We are excited to announce that the new Hugging Face Embedding Container for Amazon SageMaker is now generally available (GA). AWS customers can now efficiently deploy embedding models on SageMaker to build generative AI applications, including Retrieval Augmented Generation (RAG) applications.
This blog shows you how to deploy open embedding models such as Snowflake/snowflake-arctic-embed-l, BAAI/bge-large-en-v1.5, or sentence-transformers/all-MiniLM-L6-v2 to Amazon SageMaker. We will deploy Snowflake/snowflake-arctic-embed-m-v1.5, one of the best open embedding models for retrieval. You can check its ranking on the MTEB leaderboard.
The example covers:
1. Setting up the development environment
2. Retrieving the new Hugging Face Embedding Container
3. Deploying Snowflake Arctic to Amazon SageMaker
4. Running and evaluating inference performance
5. Deleting the model and endpoint
What is the Hugging Face Embedding Container?
The Hugging Face Embedding Container is a new purpose-built inference container for deploying embedding models in a secure, managed environment. The DLC is powered by Text Embeddings Inference (TEI), a blazing fast and memory efficient solution for deploying and serving embedding models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5. TEI implements many features, including:
- No model graph compilation step
- Small Docker images and fast boot times
- Token-based dynamic batching
- Optimized transformers code for inference using Flash Attention, Candle, and cuBLASLt
- Safetensors weight loading
TEI supports a wide range of popular embedding model architectures.
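For intuition about what TEI serves, the request and response format is plain JSON. Below is a minimal sketch that queries a TEI container assumed to be running locally on port 8080 (the URL and input text are illustrative; on SageMaker the same payload is sent through the endpoint instead):

import requests

# hypothetical local TEI container; on SageMaker the same JSON payload goes through the endpoint
TEI_URL = "http://localhost:8080/embed"

payload = {"inputs": "What is Text Embeddings Inference?"}
response = requests.post(TEI_URL, json=payload, timeout=30)
response.raise_for_status()

embeddings = response.json()  # one embedding vector per input
print(len(embeddings[0]))     # dimensionality of the embedding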
Let’s get started!
1. Setting up the development environment
We will use the SageMaker Python SDK to deploy Snowflake Arctic to Amazon SageMaker. You need an AWS account configured and the SageMaker Python SDK installed.
!pip install "sagemaker>=2.221.1" --upgrade --quiet
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can learn more about IAM roles for SageMaker in the documentation.
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
2. Retrieve the new Hugging Face Embedding Container
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with image_uri pointing to the image. To retrieve the new Hugging Face Embedding Container in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the SageMaker SDK. This method allows us to retrieve the URI for the desired Hugging Face Embedding Container. Important to note is that TEI comes in two different versions, one for CPU and one for GPU, so we create a helper function to retrieve the correct image URI based on the instance type.
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the image uri based on instance type
def get_image_uri(instance_type):
    key = "huggingface-tei" if instance_type.startswith("ml.g") or instance_type.startswith("ml.p") else "huggingface-tei-cpu"
    return get_huggingface_llm_image_uri(key, version="1.2.3")
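As a quick check, the helper resolves to the GPU image for GPU instance types and to the CPU image otherwise (the exact URIs printed depend on your region and SDK version):

print(get_image_uri("ml.g5.xlarge"))    # huggingface-tei (GPU) image
print(get_image_uri("ml.c6i.2xlarge"))  # huggingface-tei-cpu image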
3. Deploy Snowflake Arctic to Amazon SageMaker
To deploy Snowflake/snowflake-arctic-embed-m-v1.5 to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including HF_MODEL_ID and instance_type.
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.c6i.2xlarge"

# Define Model and Endpoint configuration parameter
config = {
    'HF_MODEL_ID': "Snowflake/snowflake-arctic-embed-m-v1.5",  # model_id from hf.co/models
}

# create HuggingFaceModel with the image uri
emb_model = HuggingFaceModel(
    role=role,
    image_uri=get_image_uri(instance_type),
    env=config,
)
After we create the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.c6i.2xlarge instance type.
emb = emb_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)
SageMaker will now create the endpoint and deploy the model to it. This can take about 5 minutes.
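If your notebook kernel restarts while the endpoint is still running, you can re-attach to it instead of redeploying. A minimal sketch, assuming you substitute the endpoint name shown in the SageMaker console (the name below is a placeholder):

from sagemaker.huggingface import HuggingFacePredictor

# re-attach to an already running endpoint instead of calling deploy() again
emb = HuggingFacePredictor(endpoint_name="<your-endpoint-name>", sagemaker_session=sess)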
4. Run and evaluate inference performance
After our endpoint is deployed, we can run inference on it using the predict method of the predictor.
data = {
    "inputs": "Reid's fascinating performance keeps the film grounded and keeps the audience riveted.",
}

res = emb.predict(data=data)

# print some results
print(f"length of embeddings: {len(res[0])}")
print(f"first 10 elements of embeddings: {res[0][:10]}")
Awesome! Now that we can generate embeddings, let's test the performance of our model.
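TEI also accepts a list of strings in a single request, which is handy when embedding several documents at once. A short sketch (the sentences are illustrative):

batch = {
    "inputs": [
        "SageMaker makes it easy to deploy embedding models.",
        "Retrieval Augmented Generation combines search with generation.",
    ]
}
batch_res = emb.predict(data=batch)
print(f"number of embeddings returned: {len(batch_res)}")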
We will send 3,900 requests to our endpoint using 10 concurrent threads and measure the average latency and throughput of the endpoint. Each request carries an input of 256 tokens, for a total of roughly 1 million tokens. We chose 256 tokens as the input length to strike a balance between shorter and longer inputs.
Note: When running the load test, the requests were sent from Europe while the endpoint is deployed in us-east-1. This adds network overhead latency to each request.
import threading
import time

number_of_threads = 10
number_of_requests = int(3900 // number_of_threads)
print(f"number of threads: {number_of_threads}")
print(f"number of requests per thread: {number_of_requests}")

def send_requests():
    for _ in range(number_of_requests):
        # input of roughly 256 tokens
        emb.predict(data={"inputs": "Hugging Face is a company and a popular platform in the field of natural language processing (NLP) and machine learning. It is known for its contributions to the development of state-of-the-art models for various NLP tasks and for providing a platform that facilitates the sharing and use of pre-trained models, which are widely used for text generation, summarization, question answering, and more in the development of NLP applications, making cutting-edge models more accessible to the broader community. Hugging Face also offers a model hub where users can discover, share, and download pre-trained models, as well as tools and frameworks that make it easier for developers and machine learning engineers to integrate and use them."})

# create all threads, start them, and wait for them to complete
threads = [threading.Thread(target=send_requests) for _ in range(number_of_threads)]
start = time.time()
[t.start() for t in threads]
[t.join() for t in threads]
print(f"total time: {round(time.time() - start)} seconds")
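To turn the measured wall-clock time into the throughput figures discussed below, you can extend the snippet with a small calculation (this reuses the variables defined above):

elapsed = time.time() - start
total_requests = number_of_threads * number_of_requests  # 3,900 requests
total_tokens = total_requests * 256                      # roughly 1 million tokens
print(f"requests per second: {total_requests / elapsed:.1f}")
print(f"tokens per second: {total_tokens / elapsed:.0f}")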
It took about 841 seconds to send the 3,900 requests, i.e. to embed 1 million tokens. That works out to around 5 requests per second, and note that this includes the network latency from Europe to us-east-1. If we inspect the latency of the endpoint through CloudWatch, we see that the embedding model has a latency of 2 seconds for 10 concurrent requests. This is very impressive for a small, older-generation CPU instance that costs around $150 per month. You can deploy the model to a GPU instance to get faster inference times.
Note: We ran the same test on a ml.g5.xlarge instance with 1x NVIDIA A10G GPU. It took about 30 seconds to embed the 1 million tokens, which means approximately 130 requests per second. The endpoint latency was 4 ms for 10 concurrent requests. The ml.g5.xlarge costs around $1.408 per hour on Amazon SageMaker.
GPU instances are much faster than CPU instances, but they are also more expensive. If you want to bulk process embeddings, you can use a GPU instance. If you want to run a small endpoint at low cost, you can use a CPU instance. We plan to run a dedicated benchmark for the Hugging Face Embedding Container in the future.
The following prints a link to the CloudWatch dashboard for the endpoint's ModelLatency metric:

print(f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{emb.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30)~query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{emb.endpoint_name}")
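If you prefer to pull the latency numbers programmatically instead of opening the console, here is a minimal sketch using boto3 and the standard AWS/SageMaker ModelLatency metric (the time window is illustrative):

from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name=sess.boto_region_name)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": emb.endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

# ModelLatency is reported in microseconds
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average'] / 1000:.1f} ms")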
5. Delete the model and endpoint
To clean up, we can delete the model and endpoint.
emb.delete_model()
emb.delete_endpoint()
Conclusion
The new Hugging Face Embedding Container allows you to easily deploy open embedding models such as Snowflake/snowflake-arctic-embed-l to Amazon SageMaker for inference. We walked through setting up the development environment, retrieving the container, deploying the model, and evaluating its inference performance.
This new container makes it easy for customers to deploy high-performance embedding models and build sophisticated generative AI applications with greater efficiency. We can't wait to see what you will build with the new Hugging Face Embedding Container for Amazon SageMaker. Let us know if you have any questions or feedback.