
Make your Llama generation time fly with AWS Inferentia2

By versatileai, August 29, 2025

David Corvoysier

Update (02/2024): Performance has been improved even more! Check out the updated benchmarks.

In a previous post on the Hugging Face blog, we introduced AWS Inferentia2, the second-generation AWS Inferentia accelerator, and explained how to use Optimum Neuron to quickly deploy Hugging Face models for standard text and vision tasks on AWS Inferentia2 instances.

In a further step of integration with the AWS Neuron SDK, you can now use 🤗 Optimum Neuron to deploy LLM models for text generation on AWS Inferentia2.

And what better model to choose for this demonstration than Llama 2, one of the most popular models on the Hugging Face Hub?

Setup 🤗 Optimum Neuron on Inferentia2 instances

Our recommendation is to use the Hugging Face Neuron Deep Learning AMI (DLAMI). The DLAMI comes pre-packaged with all the libraries you need, including Optimum Neuron, the Neuron drivers, Transformers, Datasets, Accelerate, and more.

Alternatively, you can deploy on Amazon SageMaker using the Hugging Face Neuron SDK DLC.

Note: stay tuned for an upcoming post dedicated to SageMaker deployment.

Finally, these components can also be installed manually on a fresh Inferentia2 instance by following the Optimum Neuron installation instructions.

Export the Llama 2 model to Neuron

As explained in the Optimum Neuron documentation, models need to be compiled and exported to a serialized format before they can run on Neuron devices.

Fortunately, Optimum Neuron offers a very simple API to export standard Transformers models to the Neuron format.

>>> from optimum.neuron import NeuronModelForCausalLM
>>> compiler_args = {"num_cores": 24, "auto_cast_type": "fp16"}
>>> input_shapes = {"batch_size": 1, "sequence_length": 2048}
>>> model = NeuronModelForCausalLM.from_pretrained(
...     "meta-llama/Llama-2-7b-hf",
...     export=True,
...     **compiler_args,
...     **input_shapes,
... )

This deserves a bit of explanation:

Using compiler_args, you specify on how many cores the model should be deployed (each Neuron device has two cores), and with which precision (here, float16). Using input_shapes, you set the static input and output dimensions of the model. All model compilers require static shapes, and Neuron is no exception. Note that sequence_length not only constrains the length of the input context, but also the length of the KV cache, and therefore the output length.
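To get a feel for why sequence_length matters, here is a back-of-the-envelope sketch of the KV cache footprint. This is an illustration only, not an Optimum Neuron API: the defaults assume Llama 2 7B's published dimensions (32 layers, 32 attention heads, head size 128) and fp16 storage.

```python
# Rough KV cache size for the static shapes chosen above.
# Defaults assume Llama 2 7B dimensions and fp16 (2 bytes per element).
def kv_cache_bytes(batch_size, sequence_length,
                   num_layers=32, num_heads=32, head_dim=128, dtype_bytes=2):
    # Two tensors (keys and values) per layer, each of shape
    # [batch_size, num_heads, sequence_length, head_dim]
    return 2 * num_layers * batch_size * num_heads * sequence_length * head_dim * dtype_bytes

gib = kv_cache_bytes(batch_size=1, sequence_length=2048) / 2**30
print(gib)  # 1.0 GiB reserved for the cache alone, on top of the weights
```

Doubling either batch_size or sequence_length doubles this reservation, which is why the static shapes must fit the device memory budget.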

Depending on the chosen parameters and the host instance, this can take from a few minutes to more than an hour.

Luckily, you’ll need to do this only once, as you can save the model and reload it later.

>>> model.save_pretrained("a_local_path_for_compiled_neuron_model")

Better yet, you can push it to the Hugging Face Hub.

>>> model.push_to_hub("a_local_path_for_compiled_neuron_model", repository_id="aws-neuron/llama-2-7b-hf-neuron-latency")

Generate text using Llama 2 on AWS Inferentia2

Once the model has been exported, you can generate text using the Transformers library, as explained in detail in the previous post.

>>> from optimum.neuron import NeuronModelForCausalLM
>>> from transformers import AutoTokenizer
>>> model = NeuronModelForCausalLM.from_pretrained("aws-neuron/llama-2-7b-hf-neuron-latency")
>>> tokenizer = AutoTokenizer.from_pretrained("aws-neuron/llama-2-7b-hf-neuron-latency")
>>> inputs = tokenizer("What is deep learning?", return_tensors="pt")
>>> outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_k=50, top_p=0.9)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)

Note: if you pass multiple input prompts to the model, the resulting token sequences must be padded to the left with the end-of-sequence token. The tokenizer saved with the exported model is configured accordingly.
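As a toy illustration of that left-padding scheme (pad_left and the token ids here are made up for the example; in practice the exported tokenizer handles this for you):

```python
# Pad every sequence in a batch on the left to the length of the longest one,
# mirroring what the exported tokenizer does; pad_id stands in for the EOS id.
def pad_left(batch, pad_id):
    longest = max(len(seq) for seq in batch)
    return [[pad_id] * (longest - len(seq)) + seq for seq in batch]

print(pad_left([[5, 6], [7, 8, 9, 10]], pad_id=2))
# [[2, 2, 5, 6], [7, 8, 9, 10]]
```

Left padding keeps the most recent tokens aligned at the end of every row, which is what an autoregressive decoder expects when generating the next token for the whole batch.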

The following generation strategies are supported:

greedy search, and multinomial sampling with top-k and top-p (with temperature).

Most logits pre-processing options and filters (such as the repetition penalty) are supported.

All-in-one with Optimum Neuron pipelines

For those who like to keep it simple, there is an even simpler way to use an LLM model on AWS Inferentia2: Optimum Neuron pipelines.

Using them is as simple as:

>>> from optimum.neuron import pipeline
>>> p = pipeline("text-generation", "aws-neuron/llama-2-7b-hf-neuron-budget")
>>> p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)

Benchmarks

But how efficient is text generation on Inferentia2? Let's find out!

We uploaded precompiled versions of the Llama 2 7B and 13B models with different configurations to the Hub.

Note: All models are compiled with a maximum sequence length of 2048.

The Llama 2 7B “budget” model is meant to be deployed on an inf2.xlarge instance, which has only one Neuron device and enough CPU memory to load the model.

All other models are compiled to use the full range of cores available on the inf2.48xlarge instance.

Note: for details on the available instances, see the Inferentia2 product page.

We created two “latency”-oriented configurations of the Llama 2 7B and 13B models that can serve only one request at a time, but at full speed.

We also created two “throughput”-oriented configurations that serve up to four requests in parallel.

To evaluate the models, we generate tokens up to a total sequence length of 1024, starting from 256 input tokens (i.e., we generate 256, 512, and 768 new tokens).
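The protocol above can be sketched as a small timing loop. This is only a sketch of the bookkeeping: generate_fn is a hypothetical stand-in for the actual model.generate call, not part of any library API.

```python
import time

INPUT_TOKENS = 256
TOTAL_LENGTHS = (512, 768, 1024)  # target total sequence lengths

def benchmark(generate_fn):
    """Time one generation run per target length; returns {new_tokens: seconds}."""
    timings = {}
    for total in TOTAL_LENGTHS:
        new_tokens = total - INPUT_TOKENS  # 256, 512 and 768 generated tokens
        start = time.perf_counter()
        generate_fn(input_tokens=INPUT_TOKENS, max_new_tokens=new_tokens)
        timings[new_tokens] = time.perf_counter() - start
    return timings

# With a no-op stand-in, only the bookkeeping is exercised:
timings = benchmark(lambda input_tokens, max_new_tokens: None)
print(sorted(timings))  # [256, 512, 768]
```

In a real run, the lambda would be replaced with a call into the deployed Neuron model.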

Note: the “budget” model numbers are reported, but are not included in the graphs for better readability.

Encoding time

The encoding time is the time required to process the input tokens and generate the first output token. It is a very important metric, as it corresponds to the latency directly perceived by the user when streaming generated tokens.

We test the encoding time for increasing context sizes: 256 input tokens roughly corresponds to typical Q&A usage, while 768 is more typical of Retrieval Augmented Generation (RAG) use cases.

The “budget” model (Llama 2 7B-B) is deployed on an inf2.xlarge instance, while the other models are deployed on an inf2.48xlarge instance.

The encoding time is expressed in seconds.

Input tokens   Llama 2 7B-L   Llama 2 7B-T   Llama 2 13B-L   Llama 2 13B-T   Llama 2 7B-B
256            0.5            0.9            0.6             1.8             0.3
512            0.7            1.6            1.1             3.0             0.4
768            1.1            3.3            1.7             5.2             0.5

[Figure: Llama 2 encoding time on Inferentia2]

It can be seen that all models deployed exhibit excellent response times, even in long contexts.

End-to-end latency

The end-to-end latency corresponds to the total time to reach a sequence length of 1024 tokens.

Therefore, encoding and generation times are included.

The “budget” model (Llama 2 7B-B) is deployed on an inf2.xlarge instance, while the other models are deployed on an inf2.48xlarge instance.

Latency is expressed in seconds.

New tokens   Llama 2 7B-L   Llama 2 7B-T   Llama 2 13B-L   Llama 2 13B-T   Llama 2 7B-B
256          2.3            2.7            3.5             4.1             15.9
512          4.4            5.3            6.9             7.8             31.7
768          6.2            7.7            10.2            11.1            47.3

[Figure: Llama 2 end-to-end latency on Inferentia2]

All models deployed on high-end instances show good latency, even those actually configured to optimize throughput.

The latency of the “budget” model is significantly higher, but still acceptable.

Throughput

We adopt the same convention as other benchmarks to evaluate throughput: divide the total number of tokens, i.e. batch_size * sequence_length, by the end-to-end latency to obtain the number of tokens generated per second.

The “budget” model (Llama 2 7B-B) is deployed on an inf2.xlarge instance, while the other models are deployed on an inf2.48xlarge instance.

Throughput is expressed in tokens per second.

New tokens   Llama 2 7B-L   Llama 2 7B-T   Llama 2 13B-L   Llama 2 13B-T   Llama 2 7B-B
256          227            750            145             504             32
512          177            579            111             394             24
768          164            529            101             370             22

[Figure: Llama 2 throughput on Inferentia2]

Again, models deployed on high-end instances have very good throughput, even those optimized for latency.
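As a sanity check, this convention can be replayed against the end-to-end latency figures above. The batch sizes (1 for the “latency” models, 4 for the “throughput” models) are taken from the configuration descriptions earlier in the post.

```python
# tokens per second = batch_size * sequence_length / end-to-end latency.
def throughput(batch_size, sequence_length, latency_s):
    return batch_size * sequence_length / latency_s

# 768 new tokens, i.e. a total sequence length of 1024:
tp_7b_l = throughput(batch_size=1, sequence_length=1024, latency_s=6.2)
tp_7b_t = throughput(batch_size=4, sequence_length=1024, latency_s=7.7)
print(round(tp_7b_l), round(tp_7b_t))  # 165 532, close to the reported 164 and 529
```

The small discrepancies come from rounding in the published latency numbers.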

The throughput of the “budget” model is much lower, but it is still fine for streaming use cases, considering that an average reader reads around 5 words per second.

Conclusion

We have shown how easy it is to deploy Llama 2 models from the Hugging Face Hub on AWS Inferentia2 using 🤗 Optimum Neuron.

The deployed model shows excellent performance in terms of encoding time, latency and throughput.

Interestingly, the latency of the deployed models is not very sensitive to the batch size, which opens the way for deployment on inference endpoints serving multiple requests in parallel.

However, there is still plenty of room for improvement:

In the current implementation, the only way to increase throughput is to increase the batch size, but batch size is currently limited by device memory; alternative options such as pipelining are being integrated. Also, the static sequence length limits the model's ability to encode long contexts; it would be interesting to see whether attention sinks are a valid option to address this.
