Versa AI hub
From OpenAI to Open LLMs with the Hugging Face Messages API

By versatileai | July 17, 2025 | 7 min read

We are excited to introduce the Messages API, which provides OpenAI compatibility for Text Generation Inference (TGI) and Inference Endpoints.

Starting with version 1.4.0, TGI offers an API that is compatible with the OpenAI Chat Completion API. The new Messages API lets customers and users migrate seamlessly from OpenAI models to open LLMs. The API can be used directly with the OpenAI client libraries, or with third-party tools such as LangChain and LlamaIndex.

“The new Messages API with OpenAI compatibility makes it easy for Ryght’s real-time GenAI orchestration platform to switch LLM use cases from OpenAI to open models. Our migration from GPT-4 to Mixtral/Llama 2 on Inference Endpoints is effortless.” – Johnny Crupi, CTO of Ryght

The new Messages API is now available on Inference Endpoints, in both dedicated and serverless flavors. To help you get started quickly, we have included detailed examples below.

Limitations: The Messages API does not currently support function calling, and it only works with LLMs that have a chat_template defined in their tokenizer configuration, as is the case for Mixtral 8x7B Instruct.

Create an inference endpoint

Inference Endpoints provide a secure, production-ready solution for easily deploying machine learning models from the Hub onto dedicated infrastructure managed by Hugging Face.

In this example, we use Text Generation Inference to deploy a fine-tuned Mixtral model, Nous-Hermes-2-Mixtral-8x7B-DPO, to an Inference Endpoint.

You can create and manage Inference Endpoints with just a few clicks in the UI, or programmatically with the huggingface_hub Python library. Here's how to do it with the hub library.

In the API call below, we specify the endpoint name and model repository, along with the text-generation task. Since we use a protected endpoint in this example, access to the deployed endpoint requires a valid Hugging Face token. We also configure hardware requirements such as vendor, region, accelerator, instance type, and size. You can check out a list of available resource options with the API, and view recommended configurations for selected models in the catalog.

Note: You may need to request a quota upgrade by emailing api-enterprise@huggingface.co

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "nous-hermes-2-mixtral-8x7b-demo",
    repository="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="nvidia-a100",
    instance_size="x2",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_INPUT_LENGTH": "4096",
            "MAX_BATCH_PREFILL_TOKENS": "4096",
            "MAX_TOTAL_TOKENS": "32000",
            "MAX_BATCH_TOTAL_TOKENS": "1024000",
            "MODEL_ID": "/repository",
        },
        "url": "ghcr.io/huggingface/text-generation-inference:sha-1734540",
    },
)

endpoint.wait()
print(endpoint.status)

The deployment will take a few minutes to spin up. You can use the .wait() utility to block the execution thread until the endpoint reaches its final "running" state. Once it is running, you can confirm its status and take it for a spin via the UI Playground.
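A .wait()-style helper is essentially a poll loop. The sketch below is a generic, self-contained illustration of that pattern, not the huggingface_hub implementation; the fetch_status callable is a hypothetical stand-in for an API call that refreshes the endpoint's status.

```python
import time


def wait_until_running(fetch_status, timeout_s=600, poll_s=5):
    """Poll fetch_status() until it reports 'running', fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()  # stand-in for refreshing the endpoint status
        if status == "running":
            return status
        if status == "failed":
            raise RuntimeError("deployment failed")
        time.sleep(poll_s)
    raise TimeoutError("endpoint did not reach 'running' in time")


# Usage with a fake status source that becomes ready on the third poll:
states = iter(["pending", "initializing", "running"])
print(wait_until_running(lambda: next(states), poll_s=0))
```

The timeout guard matters in practice: without it, a deployment stuck in a failed-but-unreported state would block the calling thread forever.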

Great, we have a working endpoint now!

When deployed with huggingface_hub, the endpoint scales to zero after 15 minutes of idle time by default, to optimize cost during periods of inactivity. Check out the Hub Python Library documentation for all the functionality available to manage your endpoint lifecycle.

Using Inference Endpoints with OpenAI client libraries

With Messages support in TGI, Inference Endpoints are directly compatible with the OpenAI Chat Completion API. This means that any existing script that uses OpenAI models via the OpenAI client libraries can be switched directly to use open LLMs running on a TGI endpoint.

This seamless transition allows you to immediately take advantage of the many benefits offered by open models:

  • Full control and transparency over models and data
  • No more worrying about rate limits
  • The ability to fully customize systems for your specific needs

Let’s see how.

With the Python client

The following example shows how to make this transition using the OpenAI Python library. Replace <ENDPOINT_URL> with your endpoint URL (be sure to include the v1/ suffix), and populate the <HF_API_TOKEN> field with a valid Hugging Face user token. The endpoint URL can be gathered from the Inference Endpoints UI, or from the endpoint object created above via endpoint.url.

Then, we can use the client as usual, passing a list of messages to stream responses from the Inference Endpoint.

from openai import OpenAI

client = OpenAI(
    base_url="<ENDPOINT_URL>" + "/v1/",
    api_key="<HF_API_TOKEN>",
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500,
)

for message in chat_completion:
    print(message.choices[0].delta.content, end="")

Behind the scenes, TGI's Messages API uses chat templates to automatically convert the list of messages into the model's required instruction format.
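To make that conversion concrete, here is a minimal, purely illustrative sketch of what a chat template does, using an assumed Mistral-style [INST] format. Real templates are Jinja strings stored in the tokenizer configuration and vary by model, so apply_mistral_style_template below is a hypothetical helper, not TGI's actual code.

```python
def apply_mistral_style_template(messages):
    """Flatten role/content messages into a single [INST]-style prompt string.

    Assumed format for illustration: the system message is prepended to the
    next user turn, user turns are wrapped in [INST]...[/INST], and assistant
    turns are appended followed by an end-of-sequence marker.
    """
    prompt = "<s>"
    system = ""
    for m in messages:
        if m["role"] == "system":
            system = m["content"] + "\n"
        elif m["role"] == "user":
            prompt += f"[INST] {system}{m['content']} [/INST]"
            system = ""
        elif m["role"] == "assistant":
            prompt += f" {m['content']}</s>"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is open-source software important?"},
]
print(apply_mistral_style_template(messages))
```

This is why the Messages API only works for models with a chat_template defined: without it, there is no reliable way to map the message list onto the prompt format the model was trained on.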

Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: stream, max_tokens, frequency_penalty, logprobs, seed, temperature, and top_p.

With the JavaScript client

Here is the same streaming example as above, but using the OpenAI JavaScript/TypeScript library.

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "<ENDPOINT_URL>" + "/v1/",
  apiKey: "<HF_API_TOKEN>",
});

async function main() {
  const stream = await openai.chat.completions.create({
    model: "tgi",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Why is open-source software important?" },
    ],
    stream: true,
    max_tokens: 500,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

main();

Integrate with LangChain and LlamaIndex

So let’s take a look at how to use this newly created endpoint in your preferred RAG framework.

How to use with LangChain

To use it with LangChain, simply create an instance of ChatOpenAI and pass your endpoint URL and token as follows:

from langchain_community.chat_models.openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key="<HF_API_TOKEN>",
    openai_api_base="<ENDPOINT_URL>" + "/v1/",
)
llm.invoke("Why is open-source software important?")

You can directly use the same ChatOpenAI class you would use with the OpenAI models. This allows all previous code to work with the endpoint by changing just one line. Next, let's use the LLM declared above to answer a question about the content of an HF blog post.

from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

loader = WebBaseLoader(
    web_paths=("https://huggingface.co/blog/open-source-llms-as-agents",),
)
docs = loader.load()

hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

rag_chain_with_source.invoke(
    "According to this article, which open-source model is the best for agent behaviour?"
)

{
    "context": [...],
    "question": "According to this article, which open-source model is the best for agent behaviour?",
    "answer": "According to the article, Mixtral-8x7B is the best open-source model for agent behaviour; the article recommends fine-tuning Mixtral, which surpasses GPT-3.5.",
}

How to use with LlamaIndex

Similarly, you can use a TGI endpoint with LlamaIndex. Use the OpenAILike class and instantiate it with some additional arguments (i.e. is_local, is_function_calling_model, is_chat_model, context_window). Note that the context_window argument must match the value previously set for MAX_TOTAL_TOKENS on the endpoint.

from llama_index.llms import OpenAILike

llm = OpenAILike(
    model="tgi",
    api_key="<HF_API_TOKEN>",
    api_base="<ENDPOINT_URL>" + "/v1/",
    is_chat_model=True,
    is_local=False,
    is_function_calling_model=False,
    context_window=32000,
)
llm.complete("Why is open-source software important?")

This can be used in a similar RAG pipeline. Note that the earlier choice of MAX_INPUT_LENGTH on the Inference Endpoint directly limits the number of retrieved chunks (similarity_top_k) the model can process.

from llama_index import ServiceContext, VectorStoreIndex
from llama_index import download_loader
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.query_engine import CitationQueryEngine

SimpleWebPageReader = download_loader("SimpleWebPageReader")
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://huggingface.co/blog/open-source-llms-as-agents"]
)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=2,
)
response = query_engine.query(
    "According to this article, which open-source model is the best for agent behaviour?"
)

According to the article, Mixtral-8x7B is the open-source model best suited to running agents [5], surpassing GPT-3.5 on this task. However, note that Mixtral's performance could be further improved with fine-tuning for function calling and task planning skills [5].
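As a rough sanity check on the MAX_INPUT_LENGTH note above, you can estimate an upper bound for similarity_top_k from the token budget. All numbers below are illustrative assumptions matching the example configuration, and the prompt-overhead allowance in particular is a guess rather than a measured value.

```python
# Back-of-the-envelope sketch: how many retrieved chunks fit into the
# endpoint's input window once we reserve room for the template and question.
MAX_INPUT_LENGTH = 4096   # tokens per request, as set on the endpoint above
CHUNK_SIZE = 512          # tokens per retrieved chunk (text splitter setting)
PROMPT_OVERHEAD = 512     # assumed allowance for prompt template + question

budget = MAX_INPUT_LENGTH - PROMPT_OVERHEAD
max_top_k = budget // CHUNK_SIZE
print(max_top_k)  # 7
```

So with this configuration, a similarity_top_k of 2 is comfortably within budget, but pushing it toward 7 or beyond risks truncated context or rejected requests.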

Clean up

Once you are done with the endpoint, you can pause or delete it. This step can be completed via the UI or programmatically:

endpoint.pause()
endpoint.delete()

Conclusion

The new Messages API in Text Generation Inference provides a smooth migration path from OpenAI models to open LLMs. We can't wait to see what use cases you will power with open LLMs running on TGI!

See this notebook for a runnable version of the code outlined in this post.
