The generative AI (GenAI) revolution is in full swing, and text generation with open-source transformer models like Llama 2 has become the talk of the town. AI enthusiasts and developers are looking to leverage the generation capabilities of such models for their own use cases and applications. This article shows how easy it is to generate text with the Llama 2 family of models (7b, 13b, and 70b) using Optimum Habana and a custom pipeline class. You can run the models with just a few lines of code.
This custom pipeline class is designed to provide great flexibility and ease of use. Moreover, it offers a high level of abstraction and performs end-to-end text generation, including pre- and post-processing. There are several ways to use the pipeline: you can run the run_pipeline.py script from the Optimum Habana repository, add the pipeline class to your own Python scripts, or initialize LangChain classes with it.
Prerequisites
Since the Llama 2 models are part of a gated repo, you need to request access if you haven't done so already. First, visit the Meta website and accept the terms and conditions. After you are granted access by Meta (which can take one to two days), request access on Hugging Face using the same email address you provided in the Meta form.
Once access is granted, run the following command to log in to your Hugging Face account (you will need an access token, which you can obtain from your user profile page):
huggingface-cli login
You will also need to install the latest version of Optimum Habana and clone the repository to access the pipeline scripts. Here are the commands to do so:
pip install optimum-habana==1.10.4
git clone -b v1.10-release https://github.com/huggingface/optimum-habana.git
If you plan to run distributed inference, install DeepSpeed according to your SynapseAI version. In this case, we are using SynapseAI 1.14.0.
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.14.0
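If you want to sanity-check the installations before proceeding (a suggested check on our part, not one of the original setup steps), the following commands should report the installed package versions:

pip show optimum-habana
python -c "import deepspeed; print(deepspeed.__version__)"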
You are now all set to run text generation with the pipeline!
Using the pipeline
First, go to the following directory in your optimum-habana checkout, where the pipeline scripts are located, and follow the instructions in the README to update your PYTHONPATH:
cd optimum-habana/examples/text-generation
pip install -r requirements.txt
cd text-generation-pipeline
If you want to generate a sequence of text from a prompt of your choice, here is a sample command:
python run_pipeline.py --model_name_or_path meta-llama/Llama-2-7b-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "This is my prompt."
You can also pass multiple prompts as input and change the temperature and top_p values used for generation, as follows:
python run_pipeline.py --model_name_or_path meta-llama/Llama-2-13b-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --temperature 0.5 --top_p 0.95 --prompt "Hello World" "How are you?"
Below is a sample command to launch the pipeline with DeepSpeed and generate text with a large model such as Llama-2-70b:
python ../../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py --model_name_or_path meta-llama/Llama-2-70b-hf --max_new_tokens 100 --bf16 --use_hpu_graphs --use_kv_cache --do_sample --temperature 0.5 --top_p 0.95 --prompt "Hello World" "How are you?" "This is my prompt." "Once upon a time"
Usage in Python scripts
You can also use the pipeline class in your own scripts, as shown in the example below. Run the following sample script from optimum-habana/examples/text-generation/text-generation-pipeline:
import argparse
import logging

from pipeline import GaudiTextGenerationPipeline
from run_generation import setup_parser

# Define a logger
logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
                    datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO)
logger = logging.getLogger(__name__)

# Set up an argument parser and define the pipeline arguments
parser = argparse.ArgumentParser()
args = setup_parser(parser)
args.num_return_sequences = 1
args.model_name_or_path = "meta-llama/Llama-2-7b-hf"
args.max_new_tokens = 100
args.use_hpu_graphs = True
args.use_kv_cache = True
args.do_sample = True

# Initialize the pipeline and generate text from string prompts
pipe = GaudiTextGenerationPipeline(args, logger)
prompts = ["He is working on", "Once upon a time", "Far far away"]
for prompt in prompts:
    print(f"Prompt: {prompt}")
    output = pipe(prompt)
    print(f"Generated text: {repr(output)}")
You must run the above script with python <name_of_script>.py --model_name_or_path a_model_name, since --model_name_or_path is a required command-line argument. However, the model name can also be changed programmatically, as shown in the snippet above.
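For example, assuming the script above is saved as sample_script.py (a hypothetical filename), the invocation could look like the command below. Since the script assigns args.model_name_or_path after parsing, the value passed on the command line just needs to be a valid model name; it is overridden programmatically:

python sample_script.py --model_name_or_path meta-llama/Llama-2-7b-hf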
This shows that the pipeline class operates on string inputs and takes care of data pre- and post-processing for you.
LangChain Compatibility
The text-generation pipeline can be fed as input to LangChain classes via the use_with_langchain constructor argument. You can install LangChain as follows:
pip install langchain==0.0.191
Below is a sample script that shows how the pipeline class can be used with LangChain:
import argparse
import logging

from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

from pipeline import GaudiTextGenerationPipeline
from run_generation import setup_parser

# Define a logger
logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
                    datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO)
logger = logging.getLogger(__name__)

# Set up an argument parser and define the pipeline arguments
parser = argparse.ArgumentParser()
args = setup_parser(parser)
args.num_return_sequences = 1
args.model_name_or_path = "meta-llama/Llama-2-13b-chat-hf"
args.max_input_tokens = 2048
args.max_new_tokens = 1000
args.use_hpu_graphs = True
args.use_kv_cache = True
args.do_sample = True
args.temperature = 0.2
args.top_p = 0.95

# Initialize the pipeline with LangChain support and wrap it in a LangChain object
pipe = GaudiTextGenerationPipeline(args, logger, use_with_langchain=True)
llm = HuggingFacePipeline(pipeline=pipe)

template = """Use the following context to answer the question at the end. If you don't know the answer,\
just say that you don't know; don't try to make up an answer.

Context: Large Language Models (LLMs) are the latest models used in NLP.
Their superior performance over smaller models has made them incredibly
useful for developers building NLP-enabled applications. These models
can be accessed via Hugging Face's `transformers` library, via OpenAI
using the `openai` library, and via Cohere using the `cohere` library.

Question: {question}
Answer: """

prompt = PromptTemplate(input_variables=["question"], template=template)
llm_chain = LLMChain(prompt=prompt, llm=llm)

# Use the LangChain object to answer questions about the context
question = "Which libraries and model providers offer LLMs?"
response = llm_chain(prompt.format(question=question))
print(f"Question 1: {question}")
print(f"Response 1: {response['text']}")

question = "What context was provided?"
response = llm_chain(prompt.format(question=question))
print(f"\nQuestion 2: {question}")
print(f"Response 2: {response['text']}")
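As with the previous example, a model name must be supplied on the command line even though it is overridden programmatically. Assuming the script is saved as langchain_script.py (a hypothetical filename), the invocation could look like:

python langchain_script.py --model_name_or_path meta-llama/Llama-2-13b-chat-hf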
Note that the pipeline class has only been validated with LangChain version 0.0.191 and may not work with other versions of the package.
Conclusion
We presented a custom text-generation pipeline on the Intel® Gaudi® 2 AI accelerator that accepts single or multiple prompts as input. This pipeline offers great flexibility in terms of model size as well as the parameters affecting text-generation quality. Furthermore, it is very easy to use, simple to plug into your scripts, and compatible with LangChain.
Use of the pretrained model is subject to compliance with third-party licenses, including the Llama 2 Community License Agreement (LLAMAV2). For guidance on the intended use of the LLAMA2 model, who the intended users are, and what is considered misuse, out-of-scope use, and additional terms, please read the instructions at this link: https://ai.meta.com/llama/license/. Users bear sole liability and responsibility for following and complying with any third-party licenses, and Habana Labs disclaims any liability with respect to users' use of, or compliance with, third-party licenses. To be able to run a gated model like this Llama-2-70b-hf, you need to:
- Accept the terms of use of the model in its model card on the HF Hub
- Set up a read access token
- Log in to your account using the HF CLI: run huggingface-cli login before launching your script