Deploying large language models (LLMs) and other generative AI models can be challenging due to their computational requirements and latency needs. To provide useful recommendations to companies looking to deploy Llama 2 with Amazon SageMaker, we created a comprehensive benchmark covering more than 60 different deployment configurations for Llama 2.
In this benchmark, we evaluated different sizes of Llama 2 across a range of Amazon EC2 instance types at different load levels. Our goal was to measure latency (ms per token) and throughput (tokens per second) to find the optimal deployment strategies for three common use cases:
- Most cost-effective deployment: for customers looking for good performance at low cost
- Best latency deployment: minimizing latency for real-time services
- Best throughput deployment: maximizing the number of tokens processed per second
We share all the assets, code, and data we used and collected to keep the benchmark fair, transparent, and reproducible.
We want to enable customers to use LLMs and Llama 2 efficiently and optimally for their use case. Before we get into the benchmark and data, let's look at the technologies and methods we used.
What is the Hugging Face LLM Inference Container?
The Hugging Face LLM DLC is a purpose-built inference container for easily deploying LLMs in a secure, managed environment. The DLC is powered by Text Generation Inference (TGI), an open-source, purpose-built solution for deploying and serving LLMs. TGI enables high-performance text generation using tensor parallelism and continuous batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Falcon, Llama, and T5. Text Generation Inference is already used by customers such as VMware, IBM, Grammarly, Open-Assistant, Uber, Scale AI, and many more.
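To make the container configuration concrete, here is a minimal sketch of the environment variables the TGI-powered DLC reads (HF_MODEL_ID, SM_NUM_GPUS, QUANTIZE, etc.). A real deployment would wrap this in `sagemaker.huggingface.HuggingFaceModel` with an IAM role; the GPU-count mapping and token limits below are illustrative assumptions, not official values.

```python
# Sketch: container environment for a TGI deployment via the HF LLM DLC.
# The instance-to-GPU mapping is an illustrative assumption.
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.48xlarge": 8,
    "ml.p4d.24xlarge": 8,
}

def tgi_env(model_id, instance_type, quantize=None):
    """Build the container environment for serving `model_id` on `instance_type`."""
    env = {
        "HF_MODEL_ID": model_id,  # model repository to load from the Hugging Face Hub
        "SM_NUM_GPUS": str(GPUS_PER_INSTANCE[instance_type]),  # tensor-parallel degree
        "MAX_INPUT_LENGTH": "2048",   # max prompt tokens (example value)
        "MAX_TOTAL_TOKENS": "4096",   # prompt + generated tokens (example value)
    }
    if quantize:
        env["QUANTIZE"] = quantize    # e.g. "gptq" for a quantized deployment
    return env

env = tgi_env("meta-llama/Llama-2-13b-hf", "ml.g5.2xlarge", quantize="gptq")
```

The same dictionary would be passed as the `env` argument when creating the SageMaker model.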
What is Llama 2?
Llama 2 is a family of LLMs from Meta, trained on 2 trillion tokens. Llama 2 comes in three sizes (7B, 13B, and 70B parameters) and, compared to Llama 1, introduces important improvements such as a longer context length, a commercial license, and optimized chat abilities. For more information about Llama 2, see this blog post.
What is GPTQ?
GPTQ is a post-training quantization method for compressing LLMs like GPT. GPTQ compresses GPT (decoder) models by reducing the number of bits needed to store each weight, from 32 bits down to just 3-4 bits. This means the model takes up much less memory and can run on less hardware, e.g. a single GPU for the 13B Llama 2 model. GPTQ analyzes each layer of the model separately and approximates its weights in a way that preserves overall accuracy. If you want to learn more about GPTQ and how to use it, check out the post on quantizing open LLMs with GPTQ and Hugging Face Optimum.
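To build intuition for the storage idea, here is a toy round-to-nearest 4-bit quantization sketch. GPTQ itself is considerably more sophisticated (it quantizes layer by layer and compensates the quantization error on the remaining weights), so this is only an illustration of why fewer bits mean less memory, not the GPTQ algorithm.

```python
# Toy round-to-nearest 4-bit quantization. NOT the GPTQ algorithm itself,
# only an illustration of storing a weight in 4 bits plus a shared scale.

def quantize_4bit(weights):
    """Map float weights to 4-bit signed integers (-8..7) plus one scale."""
    scale = max(abs(w) for w in weights) / 7  # use the symmetric +/-7 range
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original weights."""
    return [v * scale for v in q]

w = [0.12, -0.7, 0.33, 0.04]
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
# each weight now needs 4 bits instead of 16/32, i.e. roughly 4-8x less memory
```

The reconstruction error is bounded by half the scale per weight, which is why GPTQ's layer-wise error compensation matters for keeping model accuracy at 3-4 bits.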
Benchmark
To benchmark the real-world performance of Llama 2, we tested three model sizes (7B, 13B, and 70B parameters) on four different instance types at four different load levels, resulting in 60 different configurations:
- Models: We evaluated all currently available model sizes: 7B, 13B, and 70B.
- Concurrent requests: We tested configurations with 1, 5, 10, and 20 concurrent requests to determine the performance in different usage scenarios.
- Instance types: We evaluated a range of GPU instances, including g5.2xlarge, g5.12xlarge, and g5.48xlarge with NVIDIA A10G GPUs, and p4d.24xlarge with NVIDIA A100 40GB GPUs.
- Quantization: We compared performance with and without quantization, using GPTQ 4-bit as the quantization technique.
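The grid above can be sketched as a simple Cartesian product. Not every combination is feasible (for example, an unquantized 70B model does not fit on the smallest instance), which is why roughly 60 rather than all 96 candidate configurations end up in the benchmark; the filtering rule here is illustrative.

```python
# Enumerate the benchmark configuration grid described above.
from itertools import product

models = ["llama-2-7b", "llama-2-13b", "llama-2-70b"]
instances = ["ml.g5.2xlarge", "ml.g5.12xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge"]
concurrency = [1, 5, 10, 20]
quantization = [None, "gptq"]

# 3 models x 4 instances x 4 load levels x 2 quantization settings
grid = list(product(models, instances, concurrency, quantization))
print(len(grid))  # 96 candidates before filtering out infeasible combinations
```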
As metrics, we used throughput and latency, defined as follows:
- Throughput (tokens/sec): the number of tokens generated per second
- Latency (ms/token): the time it takes to generate a single token
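Both metrics can be derived from raw timing data: the wall-clock duration and token count of each request. The numbers below are made up for illustration.

```python
# Deriving the two benchmark metrics from raw timing data.

def latency_ms_per_token(duration_s, n_tokens):
    """Per-token latency of one request, in milliseconds."""
    return duration_s * 1000 / n_tokens

def throughput_tokens_per_s(token_counts, window_s):
    """Tokens generated per second across all concurrent requests in a window."""
    return sum(token_counts) / window_s

# Example: one request generated 256 tokens in 4.2 s
lat = latency_ms_per_token(duration_s=4.2, n_tokens=256)            # ~16.4 ms/token
# Example: five concurrent requests finished within a 10 s window
thr = throughput_tokens_per_s([256, 251, 260, 249, 244], window_s=10)  # 126 tokens/s
```

In the benchmark, latency is reported as the median across requests, which is more robust to outliers than the mean.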
Using these, we evaluated the performance of Llama 2 across the different setups to understand the benefits and trade-offs. If you want to run the benchmark yourself, we created a GitHub repository.
You can find the complete benchmark data in the Amazon SageMaker Benchmark: TGI 1.0.3 Llama 2 sheet. The raw data is available on GitHub.
If you are interested in all the details, we recommend digging deeper into the raw data provided.
Recommendations and Insights
Based on the benchmark, we provide specific recommendations for the optimal LLM deployment depending on your priorities between cost, throughput, and latency, for all Llama 2 model sizes.
Note: The recommendations are based on the configurations we tested. In the future, other environments or hardware offerings may be even more cost-effective.
Most cost-effective deployment
The most cost-effective configuration focuses on the right balance between performance (latency and throughput) and cost. The goal is to maximize the output per dollar spent. We looked at the performance during 5 concurrent requests. We can see that GPTQ offers the best cost-effectiveness, allowing customers to deploy Llama 2 13B on a single GPU.
| Model | Quantization | Instance | Concurrent requests | Latency (ms/token) median | Throughput (tokens/sec) | On-demand cost ($/h) in us-west-2 | Time to generate 1M tokens (min) | Cost to generate 1M tokens ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B | GPTQ | g5.2xlarge | 5 | 34.245736 | 120.0941633 | $1.52 | 138.78 | — |
| Llama 2 13B | GPTQ | g5.2xlarge | 5 | 56.237484 | 71.70560104 | $1.52 | 232.43 | $5.87 |
| Llama 2 70B | GPTQ | ml.g5.12xlarge | 5 | — | — | — | — | — |
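The time and cost columns follow directly from the measured throughput and the instance's on-demand price, as the sketch below shows for the Llama 2 13B row (small deviations from the table come from rounding in the source data).

```python
# How the "time to generate 1M tokens" and "cost per 1M tokens" columns
# are derived from throughput and the hourly on-demand price.

def time_to_1m_tokens_min(throughput_tps):
    """Minutes needed to generate one million tokens at a given throughput."""
    return 1_000_000 / throughput_tps / 60

def cost_per_1m_tokens(throughput_tps, price_per_hour):
    """Dollar cost to generate one million tokens on a given instance."""
    hours = 1_000_000 / throughput_tps / 3600
    return hours * price_per_hour

# Llama 2 13B (GPTQ) on g5.2xlarge: 71.7 tokens/s at $1.52/h
minutes = time_to_1m_tokens_min(71.70560104)      # ~232.4 min
dollars = cost_per_1m_tokens(71.70560104, 1.52)   # ~$5.89
```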
Best throughput deployment
The best throughput configuration maximizes the number of tokens generated per second. This can come at some expense of overall latency, since more tokens are processed at the same time. We looked at the highest tokens-per-second performance during 20 concurrent requests, with some respect for the cost of the instance. The highest throughput was Llama 2 13B on the ml.p4d.24xlarge instance with 668 tokens/sec.
| Model | Quantization | Instance | Concurrent requests | Latency (ms/token) median | Throughput (tokens/sec) | On-demand cost ($/h) in us-west-2 | Time to generate 1M tokens (min) | Cost to generate 1M tokens ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B | None | ml.g5.12xlarge | 20 | — | — | — | — | — |
| Llama 2 13B | None | ml.p4d.24xlarge | 20 | 67.4027465 | 668.0204881 | $37.69 | 24.95 | $15.67 |
| Llama 2 70B | None | ml.p4d.24xlarge | 20 | 59.798591 | 321.5369158 | $37.69 | 51.83 | — |
Best latency deployment
The best latency configuration minimizes the time it takes to generate a single token. Low latency is important for real-time use cases, such as chat applications, and for providing a good experience to users. We looked at the lowest median milliseconds per token during 1 concurrent request. The lowest overall latency was Llama 2 7B on the ml.g5.12xlarge instance with 16.8 ms/token.
| Model | Quantization | Instance | Concurrent requests | Latency (ms/token) median | Throughput (tokens/sec) | On-demand cost ($/h) in us-west-2 | Time to generate 1M tokens (min) | Cost to generate 1M tokens ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B | None | ml.g5.12xlarge | 1 | 16.812526 | 61.4533054 | $7.09 | — | $32.05 |
| Llama 2 13B | None | ml.g5.12xlarge | 1 | 21.002715 | 47.15736567 | $7.09 | 353.43 | $41.76 |
| Llama 2 70B | None | — | 1 | — | — | — | — | — |
Conclusion
In this benchmark, we tested 60 configurations of Llama 2 on Amazon SageMaker. For cost-effective deployments, we found Llama 2 13B with GPTQ on g5.2xlarge delivering 71 tokens/sec at about $1.52 per hour. For maximum throughput, Llama 2 13B on ml.p4d.24xlarge reached 668 tokens/sec at around $15.67 per 1M tokens generated. And for minimum latency, Llama 2 7B achieved 16.8 ms per token on ml.g5.12xlarge.
We hope this benchmark helps companies deploy Llama 2 optimally for their needs. If you want to get started deploying Llama 2 on Amazon SageMaker, check out the Introduction to the Hugging Face LLM Inference Container for Amazon SageMaker blog post.
Thank you for reading! If you have any questions, please feel free to contact us via Twitter or LinkedIn.