Today, we're excited to introduce the Messages API, bringing OpenAI compatibility to Text Generation Inference (TGI) and Inference Endpoints.
Starting with version 1.4.0, TGI offers an API compatible with the OpenAI Chat Completion API. The new Messages API allows customers and users to transition seamlessly from OpenAI models to open LLMs. The API can be used directly with OpenAI's client libraries or with third-party tools, like LangChain or LlamaIndex.
“The new Messages API with OpenAI compatibility allows Ryght’s real-time GenAI orchestration platform to easily switch LLM use cases from OpenAI to open models. Migrating from GPT-4 to Mixtral/Llama 2 on Inference Endpoints is effortless.” – Johnny Crupi, CTO of Ryght
The new Messages API is now available on Inference Endpoints, in both dedicated and serverless flavors. To get you started quickly, this post includes detailed examples of how to:

- Create an Inference Endpoint
- Use Inference Endpoints with the OpenAI client libraries
- Integrate with LangChain and LlamaIndex
Limitations: The Messages API does not currently support function calling, and it only works with LLMs that have a chat_template defined in their tokenizer configuration, as is the case with Mixtral 8x7B Instruct.
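Before deploying a model, you can quickly verify that it ships a chat template by checking its tokenizer_config.json for a chat_template entry. Below is a minimal sketch of that check; the config dict is a stand-in, not the real Mixtral file (in practice you would first download the file from the model repository, e.g. with huggingface_hub):

```python
import json

# Stand-in for a model's tokenizer_config.json (NOT the real Mixtral file);
# in practice, download it from the model repository first.
tokenizer_config = json.loads(
    '{"model_max_length": 32768, '
    '"chat_template": "{% for message in messages %}...{% endfor %}"}'
)

def supports_messages_api(config: dict) -> bool:
    # The Messages API only works when the tokenizer defines a chat template.
    return bool(config.get("chat_template"))

print(supports_messages_api(tokenizer_config))  # → True
```

A config without the chat_template key would fail this check, and the Messages API would not work for that model.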
Create an inference endpoint
Inference Endpoints offers a secure, production-ready solution for easily deploying any machine learning model from the Hub on dedicated infrastructure managed by Hugging Face.
In this example, we will deploy Nous-Hermes-2-Mixtral-8x7B-DPO, a fine-tuned Mixtral model, to Inference Endpoints using Text Generation Inference.
We can deploy the model in just a few clicks from the UI, or create and manage Inference Endpoints programmatically with the huggingface_hub Python library. Here's how to use the Hub library.
In our API call, we need to specify the endpoint name and model repository, along with the task of text-generation. In this example, we use the protected type, so access to the deployed endpoint will require a valid Hugging Face token. We also need to configure hardware requirements such as vendor, region, accelerator, instance type, and size. You can check out the list of available resource options with this API call, and view recommended configurations for select models in the catalog here.
Note: You may need to request a quota upgrade by emailing api-enterprise@huggingface.co
```python
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "nous-hermes-2-mixtral-8x7b-demo",
    repository="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="nvidia-a100",
    instance_size="x2",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_INPUT_LENGTH": "4096",
            "MAX_BATCH_PREFILL_TOKENS": "4096",
            "MAX_TOTAL_TOKENS": "32000",
            "MAX_BATCH_TOTAL_TOKENS": "1024000",
            "MODEL_ID": "/repository",
        },
        "url": "ghcr.io/huggingface/text-generation-inference:sha-1734540",
    },
)

endpoint.wait()
print(endpoint.status)
```
It will take a few minutes for the deployment to spin up. We can use the .wait() utility to block the running thread until the endpoint reaches a final "running" state. Once running, we can confirm its status and take it for a spin via the UI playground.
Great, we have a working endpoint now!
When deployed with huggingface_hub, your endpoint scales to zero after 15 minutes of idle time by default to optimize cost during periods of inactivity. Check out the Hub Python Library documentation to see all the functionality available for managing your endpoint lifecycle.
Using Inference Endpoints with OpenAI client libraries
With Messages support in TGI, Inference Endpoints is directly compatible with the OpenAI Chat Completion API. This means that any existing script that uses OpenAI models via the OpenAI client libraries can be directly swapped out to use any open LLM running on a TGI endpoint.
This seamless transition allows you to take advantage of the many benefits that open models offer right away.
- Full control and transparency over models and data
- No more worrying about rate limits
- The ability to fully customize systems for your specific needs
Let’s see how.
With the Python client
The example below shows how to make this transition using the OpenAI Python library. Simply replace the <ENDPOINT_URL> with your endpoint URL (be sure to include the v1/ suffix) and populate the <HF_API_TOKEN> field with a valid Hugging Face user token. The <ENDPOINT_URL> can be gathered from the Inference Endpoints UI, or from the endpoint object created above with endpoint.url.
We can then use the client as usual, passing a list of messages and streaming the response from our Inference Endpoint.
```python
from openai import OpenAI

# initialize the client, pointing it at the TGI endpoint
client = OpenAI(
    base_url="<ENDPOINT_URL>" + "/v1/",  # replace with your endpoint url
    api_key="<HF_API_TOKEN>",  # replace with your token
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500,
)

# iterate and print the stream
for message in chat_completion:
    print(message.choices[0].delta.content, end="")
```
Behind the scenes, TGI's Messages API automatically converts the list of messages into the model's required instruction format using its chat template.
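To make that conversion concrete, here is a toy sketch of what a chat template does. The [INST] markers below mimic Mistral-style formatting but are only an approximation for illustration, not the model's actual template:

```python
# Toy illustration of what a chat template does: flatten role-tagged
# messages into the single prompt string a base LLM expects.
# The [INST] ... [/INST] markers only approximate Mistral-style formatting.
def render_prompt(messages: list[dict]) -> str:
    system = "".join(m["content"] for m in messages if m["role"] == "system")
    parts = []
    for m in messages:
        if m["role"] == "user":
            # Fold the system prompt into the first user turn.
            prefix = f"{system}\n" if system and not parts else ""
            parts.append(f"[INST] {prefix}{m['content']} [/INST]")
        elif m["role"] == "assistant":
            parts.append(m["content"])
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is open-source software important?"},
]
print(render_prompt(messages))
```

The real templates are Jinja snippets stored in the tokenizer configuration, which is why models without a chat_template cannot be used with the Messages API.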
Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: stream, max_tokens, frequency_penalty, logprobs, seed, temperature, and top_p.
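For reference, a hypothetical request body that sticks to the supported parameter set might look like this (the values are arbitrary examples, not recommendations):

```python
# Chat-completion parameters the Messages API currently supports.
SUPPORTED_PARAMS = {
    "stream", "max_tokens", "frequency_penalty",
    "logprobs", "seed", "temperature", "top_p",
}

# An illustrative request body using only supported tuning knobs.
payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
    "max_tokens": 100,
    "frequency_penalty": 0.5,
    "seed": 42,
    "temperature": 0.7,
    "top_p": 0.95,
}

# Every key besides "model"/"messages" should be in the supported set.
unsupported = set(payload) - SUPPORTED_PARAMS - {"model", "messages"}
assert not unsupported, f"unsupported parameters: {unsupported}"
```

Unsupported OpenAI parameters (for example, tools for function calling) would simply not take effect on a TGI endpoint.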
With the JavaScript client
Here's the same streaming example as above, but using the OpenAI JavaScript/TypeScript library.
```javascript
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "<ENDPOINT_URL>" + "/v1/", // replace with your endpoint url
  apiKey: "<HF_API_TOKEN>", // replace with your token
});

async function main() {
  const stream = await openai.chat.completions.create({
    model: "tgi",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Why is open-source software important?" },
    ],
    stream: true,
    max_tokens: 500,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

main();
```
Integrate with LangChain and LlamaIndex
So let’s take a look at how to use this newly created endpoint in your preferred RAG framework.
How to use with LangChain
To use it with LangChain, simply create an instance of ChatOpenAI and pass your endpoint as follows:
```python
from langchain_community.chat_models.openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key="<HF_API_TOKEN>",
    openai_api_base="<ENDPOINT_URL>" + "/v1/",
)
llm.invoke("Why is open-source software important?")
```
We can directly leverage the same ChatOpenAI class that we would use with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code. Let's now use the LLM declared this way in a simple RAG pipeline to answer a question about the contents of an HF blog post.
```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# load the blog post and split it into chunks
loader = WebBaseLoader(
    web_paths=("https://huggingface.co/blog/open-source-llms-as-agents",),
)
docs = loader.load()

# declare an HF embedding model and vector store
hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)

# retrieve and parse docs
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

rag_chain_with_source.invoke(
    "According to this article, which open-source model is best for agents?"
)
```
```
{
    "context": [...],
    "question": "According to this article, which open-source model is best for agents?",
    "answer": "According to the article, the best open-source model for agents is Mixtral-8x7B, since a properly fine-tuned Mixtral surpasses GPT-3.5.",
}
```
How to use with LlamaIndex
Similarly, you can also use a TGI endpoint in LlamaIndex. We'll use the OpenAILike class, and instantiate it with some additional arguments (i.e. is_local, is_function_calling_model, is_chat_model, context_window). Note that the context_window argument should match the value previously set for MAX_TOTAL_TOKENS on your endpoint.
```python
from llama_index.llms import OpenAILike

llm = OpenAILike(
    model="tgi",
    api_key="<HF_API_TOKEN>",
    api_base="<ENDPOINT_URL>" + "/v1/",
    is_chat_model=True,
    is_local=False,
    is_function_calling_model=False,
    context_window=32000,
)

llm.complete("Why is open-source software important?")
```
We can now use it in a similar RAG pipeline. Keep in mind that the earlier choice of MAX_INPUT_LENGTH on your Inference Endpoint will directly influence the number of retrieved chunks (similarity_top_k) the model can process.
```python
from llama_index import (
    ServiceContext,
    VectorStoreIndex,
)
from llama_index import download_loader
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.query_engine import CitationQueryEngine

SimpleWebPageReader = download_loader("SimpleWebPageReader")

documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://huggingface.co/blog/open-source-llms-as-agents"]
)

# declare an HF embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=2,
)
response = query_engine.query(
    "According to this article, which open-source model is best for agents?"
)
```

According to the article, the best open-source model for agents is Mixtral-8x7B [5]. It even beats GPT-3.5 on this task. However, note that Mixtral's performance could be further improved with fine-tuning for function calling and task planning skills [5].
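As a rough sanity check on the MAX_INPUT_LENGTH consideration mentioned above, you can estimate how many retrieved chunks fit in the input budget. Token counts are approximate, and the 200-token allowance for the question plus prompt template is an assumption:

```python
# Rough estimate of how many retrieved chunks fit into the endpoint's
# input budget. MAX_INPUT_LENGTH mirrors the endpoint setting above;
# the chunk size and prompt overhead are illustrative assumptions.
MAX_INPUT_LENGTH = 4096   # set when creating the endpoint
CHUNK_SIZE = 512          # approximate tokens per retrieved chunk
PROMPT_OVERHEAD = 200     # assumed budget for the question + template

max_chunks = (MAX_INPUT_LENGTH - PROMPT_OVERHEAD) // CHUNK_SIZE
print(max_chunks)  # → 7, so similarity_top_k=2 fits comfortably
```

If you raise similarity_top_k or chunk size past this budget, the prompt will be truncated or rejected by the endpoint.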
Clean up
Once you're done using the endpoint, you can pause or delete it. This step can be completed via the UI, or programmatically:
```python
# pause our running endpoint
endpoint.pause()

# or delete it entirely
endpoint.delete()
```
Conclusion
The new Messages API in Text Generation Inference provides a smooth transition path from OpenAI models to open LLMs. We can't wait to see what use cases you will power with open LLMs running on TGI!
See this notebook for a runnable version of the code outlined in this post.