
Llama 2 is here – available at Hugging Face

By versatileai · November 29, 2025 · 9 Mins Read

Llama 2, released by Meta today, is the latest family of open-access large language models, and we are pleased to fully support the launch with comprehensive integration into Hugging Face. Llama 2 is released under a very permissive community license and is available for commercial use. The code, pretrained models, and fine-tuned models are all being released today 🔥

We worked with Meta to ensure a smooth integration into the Hugging Face ecosystem. You’ll find 12 open-access models on the Hub: 3 base models and 3 fine-tuned models, each available both as the original Meta checkpoints and in the transformers format. The accompanying features and integrations are described in the sections below.


Why Llama 2?

The Llama 2 release introduces a family of pretrained and fine-tuned LLMs ranging from 7B to 70B parameters (7B, 13B, 70B). The pretrained models offer significant improvements over Llama 1, including training on 40% more tokens, a much longer context length (4k tokens 🤯), and grouped-query attention for fast inference of the 70B model 🔥!
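Grouped-query attention speeds up inference by letting several query heads share a single key/value head, shrinking the KV cache. As a rough, framework-free illustration (the head counts are the commonly reported 70B settings, 64 query heads and 8 KV heads; the mapping function is a simplified sketch, not Meta's implementation):

```python
def kv_head_for_query_head(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Map a query head to the key/value head it shares under grouped-query attention."""
    assert n_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_heads // n_kv_heads  # query heads per shared KV head
    return q_head // group_size

# Llama 2 70B reportedly uses 64 query heads and 8 KV heads (8 queries per group).
mapping = [kv_head_for_query_head(h, 64, 8) for h in range(64)]
print(mapping[:9])  # → [0, 0, 0, 0, 0, 0, 0, 0, 1]
```

With 8 KV heads instead of 64, the cache of keys and values is 8x smaller, which is what makes serving the 70B model practical.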

However, the most exciting part of this release is the fine-tuned models (Llama 2-Chat), which have been optimized for dialogue applications using Reinforcement Learning from Human Feedback (RLHF). Across a wide range of helpfulness and safety benchmarks, the Llama 2-Chat models perform better than most open models and achieve performance comparable to ChatGPT according to human evaluations. You can read the paper here.

[Image: Llama 2: Open Foundation and Fine-Tuned Chat Models]

If you’ve been waiting for an open alternative to closed-source chatbots, Llama 2-Chat is probably your best choice.

Model            | License         | Commercial use? | Pretraining length (tokens) | Leaderboard score
Falcon-7B        | Apache 2.0      | ✅              | 1,500B                      | 44.17
MPT-7B           | Apache 2.0      | ✅              | 1,000B                      | 47.24
Llama-7B         | Llama license   | ❌              | 1,000B                      | 45.65
Llama-2-7B       | Llama 2 license | ✅              | 2,000B                      | 50.97
Llama-33B        | Llama license   | ❌              | 1,500B                      | –
Llama-2-13B      | Llama 2 license | ✅              | 2,000B                      | 55.69
mpt-30B          | Apache 2.0      | ✅              | 1,000B                      | 52.77
Falcon-40B       | Apache 2.0      | ✅              | 1,000B                      | 58.07
Llama-65B        | Llama license   | ❌              | 1,500B                      | 61.19
Llama-2-70B      | Llama 2 license | ✅              | 2,000B                      | 67.87
Llama-2-70B-chat | Llama 2 license | ✅              | 2,000B                      | 62.4

Note: The performance scores shown in the table above have been updated to account for the new methodology introduced in November 2023, and new benchmarks have been added. See this post for more information.

demo

You can easily try the 13B Llama 2 model in this Space or in the playground embedded below.

To learn more about how this demo works, read below about how to perform inference on Llama 2 models.

inference

This section describes different approaches for performing inference on Llama 2 models. Before using these models, make sure you request access to one of the models in the official Meta Llama 2 repository.

Note: Be sure to also fill out the official Meta form. Access to the repository is granted a few hours after both forms have been filled out.

Using transformers

With Transformers release 4.31, you can already use Llama 2 and take advantage of all the tools in the HF ecosystem, including:

  • training and inference scripts and examples
  • safe file format (safetensors)
  • integrations with tools such as bitsandbytes (4-bit quantization) and PEFT (parameter-efficient fine-tuning)
  • utilities and helpers to run generation with the model
  • mechanisms to export the models for deployment

Be sure to use the latest Transformers release and log in to your Hugging Face account.

pip install transformers
huggingface-cli login

The following code snippet shows how to use transformers to perform inference. As long as you select the GPU runtime, it will run on Colab’s free tier.

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any other show recommendations that I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: I liked "Breaking Bad" and "Band of Brothers". Do you have any other show recommendations that I might like?
Answer: Of course! If you liked "Breaking Bad" and "Band of Brothers," here are some other TV shows you might enjoy:
1. "The Sopranos" – This HBO series is a crime drama that follows the life of New Jersey mob boss Tony Soprano as he navigates the criminal underworld and deals with personal and family issues.
2. "The Wire" – This HBO series is a realistic depiction of the drug trade in Baltimore, examining the effects of drugs on individuals, communities, and the criminal justice system.
3. "Mad Men" – Set in the 1960s, this AMC series follows the lives of advertising executives on Madison Avenue…

Also, although the model has a context of only 4k tokens, you can use techniques supported by transformers, such as rotary position embedding (RoPE) scaling (Tweet), to push it further.
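The idea behind linear RoPE scaling is simple: divide each position index by a scaling factor before computing the rotary angles, so a longer sequence is compressed into the position range the model saw during training. A minimal sketch of the position arithmetic (illustrative only; in real models this happens inside the attention layers, and the `head_dim`/`base` defaults here mirror typical Llama settings):

```python
import math

def rotary_angle(position: int, dim_pair: int, head_dim: int = 128,
                 base: float = 10000.0, scaling_factor: float = 1.0) -> float:
    """Angle used by rotary position embeddings for one (position, frequency) pair.

    With linear scaling, the position is divided by `scaling_factor`,
    squeezing long sequences into the trained position range.
    """
    inv_freq = base ** (-2 * dim_pair / head_dim)
    return (position / scaling_factor) * inv_freq

# With factor 2, position 8000 is treated like position 4000 was during training,
# so an 8k-token prompt stays inside a 4k-token model's trained range.
assert rotary_angle(8000, 3, scaling_factor=2.0) == rotary_angle(4000, 3)
```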

Using text generation inference and inference endpoints

Text Generation Inference is a production-ready inference container developed by Hugging Face that enables easy deployment of large language models at scale. Features include continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs, and production-ready logging and tracing.

You can try Text Generation Inference on your own infrastructure, or use Hugging Face’s Inference Endpoints. To deploy a Llama 2 model, go to the model page and click the Deploy -> Inference Endpoints widget.

For 7B models, we recommend selecting “GPU (medium) – 1x Nvidia A10G”. For 13B models, we recommend selecting “GPU (xlarge) – 1x Nvidia A100”. For 70B models, we recommend selecting “GPU (2xlarge) – 2x Nvidia A100” or “GPU (4xlarge) – 4x Nvidia A100” with bitsandbytes quantization enabled.

Note: To access A100, you may need to request a quota upgrade by emailing api-enterprise@huggingface.co.

To learn more about how to deploy LLMs with Hugging Face Inference Endpoints, check out our blog post. It covers supported hyperparameters and how to stream responses using Python and JavaScript.
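When streaming from a Text Generation Inference endpoint, tokens arrive as server-sent events whose `data:` payload is a JSON object containing the token text (the field names below follow the TGI streaming schema as we understand it; check the TGI docs for your version). A minimal, dependency-free parser for such a stream might look like:

```python
import json

def collect_streamed_text(sse_lines):
    """Assemble generated text from TGI-style server-sent-event lines.

    Each event looks like 'data:{"token": {"text": "..."}}'; blank
    keep-alive lines and anything else are ignored.
    """
    pieces = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = json.loads(line[len("data:"):])
        token = payload.get("token", {})
        if not token.get("special"):  # skip special tokens like </s>
            pieces.append(token.get("text", ""))
    return "".join(pieces)

# Fabricated example stream, for illustration:
stream = [
    'data:{"token": {"text": "Hello", "special": false}}',
    "",
    'data:{"token": {"text": " world", "special": false}}',
    'data:{"token": {"text": "</s>", "special": true}}',
]
print(collect_streamed_text(stream))  # → Hello world
```

In practice you would feed this the response lines from an HTTP client connected to the endpoint's streaming route.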

Fine-tuning with PEFT

LLM training can be technically and computationally challenging. In this section, we take a look at the tools available in the Hugging Face ecosystem to efficiently train Llama 2 on simple hardware, and show you how to fine-tune the 7B version of Llama 2 on a single NVIDIA T4 (16GB – Google Colab). For more information, see the blog Making LLM More Accessible.

We wrote a script that instruction-tunes Llama 2 using QLoRA and the SFTTrainer from trl.

Below is an example command to fine-tune Llama 2 7B on the timdettmers/openassistant-guanaco dataset. The script can take a merge_and_push argument, which merges the LoRA weights into the model weights and saves them as safetensors weights. This lets you deploy the fine-tuned model after training with Text Generation Inference and Inference Endpoints.

First run pip install trl and clone the script.

pip install trl
git clone https://github.com/lvwerra/trl

Then you can run the script.

python trl/examples/scripts/sft_trainer.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --dataset_name timdettmers/openassistant-guanaco \
    --load_in_4bit \
    --use_peft \
    --batch_size 4 \
    --gradient_accumulation_steps 2
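What the merge step does is fold the low-rank adapter back into the base weights: a LoRA adapter stores two small matrices A and B, and merging computes W + (alpha / r) · B · A so the result is a single ordinary weight matrix. A toy, framework-free illustration of that arithmetic (in practice PEFT's merge utilities do this across all adapted layers):

```python
def matmul(X, Y):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged full-rank weight."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Rank-1 toy example: W is 2x2, B is 2x1, A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.5]]
print(merge_lora(W, A, B, alpha=2, r=1))  # → [[2.0, 1.0], [2.0, 3.0]]
```

After merging, inference needs no adapter machinery at all, which is why the merged safetensors weights deploy directly on Text Generation Inference.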

How to prompt Llama 2

One of the unsung advantages of open-access models is that you have full control over your chat application’s system prompt. This is essential to specify the behavior of your chat assistant, and even imbue it with some personality, but it is unreachable with models served behind an API.

We added this section just a few days after the initial release of Llama 2, because we received a lot of questions from the community about how to prompt the models and how to change the system prompt. We hope this helps.

The first turn prompt template looks like this:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]

This template follows the model training steps described in the Llama 2 paper. Any system_prompt can be used, but it is important that its format matches the one used during training.

To be completely clear, when a user starts a chat by entering text in the 13B chat demo (There’s a llama in my garden 😱 What should I do?), here’s what is actually sent to the language model:

<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don’t know the answer to a question, please don’t share false information.
<</SYS>>

There's a llama in my garden 😱 What should I do? [/INST]

As you can see, the instructions between the special <<SYS>> tokens provide context for the model, so it knows what kind of response we expect. This works because exactly the same format was used during training, with a variety of system prompts intended for different tasks.

As the conversation progresses, all the interactions between the human and the “bot” are appended to the previous prompt, enclosed between [INST] delimiters. The template used during multi-turn conversations follows this structure (🎩 hat tip to Arthur Zucker for some final clarifications):

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]

The model is stateless and does not “remember” previous fragments of the conversation; we must always supply it with all the context so the conversation can continue. This is why context length is a very important parameter to maximize: it allows for longer conversations and the use of larger amounts of information.
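The template above can be assembled mechanically in code. Below is a small helper that builds the full multi-turn prompt string from the system prompt and the conversation so far (a sketch based on the template in this post; the <s>/</s> markers are the tokenizer's BOS/EOS strings):

```python
def build_llama2_prompt(system_prompt, history, user_msg):
    """Build a Llama 2 chat prompt string.

    history: list of (user_message, model_answer) pairs for completed turns.
    The system prompt is embedded only in the first [INST] block.
    """
    first_user = history[0][0] if history else user_msg
    prompt = (f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
              f"{first_user} [/INST]")
    for i, (user, answer) in enumerate(history):
        if i > 0:
            prompt += f"<s>[INST] {user} [/INST]"
        prompt += f" {answer} </s>"
    if history:
        prompt += f"<s>[INST] {user_msg} [/INST]"
    return prompt

# One completed turn plus a new user message:
print(build_llama2_prompt("Be helpful.", [("Hi!", "Hello!")], "Tell me a joke."))
```

Each time the user sends a new message, the whole string is rebuilt from scratch and sent to the model, which is exactly what "stateless" means in practice.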

ignore previous instructions

API-based models sometimes resort to tricks that attempt to change the default model behavior by overriding the system prompt. As imaginative as these solutions are, they are not necessary with open-access models: anyone can use a different prompt, as long as it follows the format above. We believe this will be an important tool for researchers to study the impact of prompts on both desired and unwanted traits. For example, if you are surprised by a generation that is exceedingly cautious, you can experiment with whether a different system prompt would help. (🎩 hat tip to Clémentine Fourrier for the link to this example.)

Our 13B and 7B demos make it easy to explore this feature: expand the “Advanced Options” UI and simply write your desired instructions. You can also duplicate those demos and use them privately for fun or research.


conclusion

We are very excited about the release of Llama 2. In the coming days, be ready to learn more about how to run your own fine-tuning, how to run the smallest models on-device, and many other exciting updates we have in store.
