Falcon Mamba is a new model by the Technology Innovation Institute (TII) in Abu Dhabi, released under the TII Falcon Mamba 7B License 1.0. The model is open access and available within the Hugging Face ecosystem for anyone to use for research or application purposes.
In this blog post, we go through the design decisions behind the model, how it is competitive with respect to other existing SOTA models, and how to use it within the Hugging Face ecosystem.
First general-purpose large-scale pure Mamba model
Transformers, based on the attention mechanism, are the dominant architecture used in all the strongest language models today. However, the attention mechanism is fundamentally limited in processing large sequences due to the increase in compute and memory costs with sequence length. Various alternative architectures, in particular State Space Language Models (SSLMs), have tried to address this sequence scaling limitation but have fallen behind SOTA transformers in performance.
With Falcon Mamba, we demonstrate that the sequence scaling limitation can indeed be overcome without loss in performance. Falcon Mamba is based on the original Mamba architecture, proposed in Mamba: Linear-Time Sequence Modeling with Selective State Spaces, with the addition of extra RMS normalization layers to ensure stable training at scale. This choice of architecture ensures that Falcon Mamba can process sequences of arbitrary length without any increase in memory storage, in particular fitting on a single A10 24GB GPU, and takes a constant amount of time to generate a new token regardless of the size of the context (see this section).
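To make this architectural choice concrete, below is a minimal, hypothetical sketch of where such extra RMS normalization layers could sit inside a Mamba-style block. This is not Falcon Mamba's actual implementation: the layer names, dimensions, and the choice of which projections to normalize are illustrative assumptions only.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Root-mean-square normalization over the last dimension.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

class MambaBlockSketch(nn.Module):
    # Toy block: an input projection produces (hypothetical) SSM parameters,
    # each passed through an extra RMSNorm before the recurrence.
    def __init__(self, d_model: int = 512, d_state: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.b_proj = nn.Linear(d_model, d_state)
        self.c_proj = nn.Linear(d_model, d_state)
        self.dt_proj = nn.Linear(d_model, d_model)
        # Extra normalization layers, added for training stability at scale.
        self.b_norm = RMSNorm(d_state)
        self.c_norm = RMSNorm(d_state)
        self.dt_norm = RMSNorm(d_model)

    def forward(self, x: torch.Tensor):
        h, gate = self.in_proj(x).chunk(2, dim=-1)
        B = self.b_norm(self.b_proj(h))
        C = self.c_norm(self.c_proj(h))
        dt = self.dt_norm(self.dt_proj(h))
        # ... the selective state-space recurrence would consume (B, C, dt) here ...
        return h * torch.sigmoid(gate), (B, C, dt)

block = MambaBlockSketch()
y, _ = block(torch.randn(2, 16, 512))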
Model Training
Falcon Mamba was trained with approximately 5500GT of data, mainly composed of refined web data supplemented with high-quality technical and code data from public sources. We used a constant learning rate for most of the training, followed by a relatively short learning rate decay stage. In this last stage, we also added a small portion of high-quality curated data to further enhance model performance.
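As a rough illustration of such a schedule (the step counts, base learning rate, and decay shape below are made-up values, not the actual Falcon Mamba training recipe), a constant-then-decay learning rate can be expressed with a simple lambda scheduler:

import torch

total_steps = 100_000          # hypothetical training budget
decay_start = 90_000           # constant learning rate for most of training
base_lr = 3e-4                 # illustrative value

model = torch.nn.Linear(10, 10)            # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

def lr_factor(step: int) -> float:
    # Constant learning rate, then a short linear decay at the end of training.
    if step < decay_start:
        return 1.0
    return max(0.0, 1.0 - (step - decay_start) / (total_steps - decay_start))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)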
Evaluation
We evaluate our model on all the benchmarks of the new leaderboard version using the lm-evaluation-harness package, and then normalize the evaluation results with the Hugging Face score normalization.
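For reference, a rough sketch of how such an evaluation could be launched through the lm-evaluation-harness Python API is shown below. The task names and arguments are assumptions on our side and may differ from the exact leaderboard configuration.

import lm_eval

# Assumed task names for the leaderboard v2 benchmarks; adjust to your installed version.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tiiuae/falcon-mamba-7b,dtype=bfloat16",
    tasks=["leaderboard_bbh", "leaderboard_gpqa", "leaderboard_musr"],
    batch_size=8,
)
print(results["results"])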
Model name | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-PRO | Average
Pure SSM models:
Falcon Mamba-7B* | 33.36 | 19.88 | 3.63 | 8.05 | 10.86 | 14.47 | 15.04
TRI-ML/mamba-7b-rw* | 22.46 | 6.71 | 0.45 | | | |
Hybrid SSM-attention models:
RecurrentGemma-9B** | 30.76 | 14.80 | 4.83 | 4.70 | 6.60 | 17.88 | 13.20
Zyphra/Zamba-7B-v1* | 24.06 | 21.12 | 3.32 | 3.03 | 7.74 | 16.02 | 12.55
Transformer models:
Falcon2-11B | 32.61 | 21.94 | 2.34 | 2.80 | 7.53 | 15.44 | 13.78
Meta-Llama-3-8B | | | | 7.38 | 6.24 | 24.55 | 13.41
Meta-Llama-3.1-8B | 12.70 | 25.29 | 4.61 | 6.15 | 8.98 | 24.95 | 13.78
Mistral-7B-v0.1 | 23.86 | 22.02 | 2.49 | 5.59 | 10.68 | 22.36 | 14.50
Mistral-Nemo-2407 (12B) | | 29.37 | | | 6.52 | 27.46 | 15.08
gemma-7B | 26.59 | 21.12 | 6.42 | 4.92 | 10.98 | 21.64 | 15.28
We also evaluate our model with lighteval on the benchmarks of the first version of the LLM leaderboard.
Model name | ARC | HellaSwag | MMLU | Winogrande | TruthfulQA | GSM8K | Average
Pure SSM models:
Falcon Mamba-7B* | 62.03 | 80.82 | 62.11 | 73.64 | 53.42 | 52.54 | 64.09
TRI-ML/mamba-7b-rw* | 51.25 | 80.85 | 33.41 | 71.11 | 32.08 | |
Hybrid SSM-attention models:
RecurrentGemma-9B** | 52.00 | 80.40 | 60.50 | 73.60 | 38.60 | 42.60 | 57.95
Zyphra/Zamba-7B-v1* | 56.14 | | | | | |
Transformer models:
Falcon2-11B | | | | 78.30 | 52.56 | 53.83 | 64.28
Meta-Llama-3-8B | 60.24 | 82.23 | 66.70 | 78.45 | 42.93 | 45.19 | 62.62
Meta-Llama-3.1-8B | 58.53 | 82.13 | 66.43 | 74.35 | 44.29 | 47.92 | 62.28
Mistral-7B-v0.1 | 59.98 | 83.31 | 64.16 | 78.37 | 42.15 | 37.83 | 60.97
gemma-7B | 61.09 | 82.20 | 64.56 | 79.01 | 44.79 | 50.87 | 63.75
For the models marked with one star, we evaluated the tasks internally, while for the models marked with two stars, the results were taken from the corresponding paper or model card.
Processing large sequences
Following the theoretical efficiency of SSM models in processing large sequences, we compare memory usage and generation throughput between Falcon Mamba and popular transformer models using the optimum-benchmark library. For a fair comparison, we rescaled the vocabulary size of all transformer models to match Falcon Mamba, since it has a big impact on the memory requirements of the model.
Before going to the results, let us first discuss the difference between the prompt (prefill) and generated (decode) parts of the sequence. As we will see, the details of prefill are more important for state space models than for transformer models. When a transformer generates the next token, it needs to attend to the keys and values of all previous tokens in the context. This implies linear scaling of both memory requirements and generation time with context length. A state space model attends to and stores only its recurrent state, and therefore needs no additional memory or time to generate large sequences. While this explains the claimed advantage of SSMs over transformers in the decoding stage, the prefill stage requires additional effort to fully utilize the SSM architecture.
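As a back-of-the-envelope illustration of this scaling difference, the snippet below compares the memory a transformer needs for its KV cache against the fixed-size recurrent state of an SSM. All hyperparameters are made-up round numbers, not the real configurations of Falcon Mamba or of any specific transformer.

def kv_cache_bytes(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Keys and values for every past token, in every layer: grows linearly with context.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

def ssm_state_bytes(n_layers: int = 64, d_model: int = 4096, d_state: int = 16,
                    bytes_per_value: int = 2) -> int:
    # One fixed-size recurrent state per layer: independent of context length.
    return n_layers * d_model * d_state * bytes_per_value

for context_len in (1_000, 10_000, 100_000):
    print(f"{context_len:>7} tokens | "
          f"KV cache ~{kv_cache_bytes(context_len) / 1e9:.2f} GB | "
          f"SSM state ~{ssm_state_bytes() / 1e9:.3f} GB")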
A standard approach to prefill is to process the whole prompt in parallel to fully utilize the GPU. This approach is used in the optimum-benchmark library and we will refer to it as parallel prefill. Parallel prefill requires storing in memory the hidden states of each token of the prompt. For transformers, this additional memory is dominated by the memory of the stored KV cache. For SSM models, no caching is required, and the memory for storing hidden states becomes the only component proportional to the prompt length. As a result, the memory requirement scales with prompt length, and SSM models lose the ability to process arbitrarily long sequences, similar to transformers.
An alternative to parallel prefill is to process the prompt token by token, which we will refer to as sequential prefill. Similar to sequence parallelism, it can also be done on larger chunks of the prompt instead of individual tokens for better GPU usage. While sequential prefill makes little sense for transformers, it brings back the possibility of processing arbitrarily long prompts with SSM models.
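To illustrate the idea independently of any particular library API, here is a toy sketch of sequential prefill: the prompt is consumed chunk by chunk and only a fixed-size recurrent state is carried forward, so peak memory is bounded by the chunk size rather than the prompt length. The state update below is a placeholder, not the actual selective-SSM recurrence.

import torch

def toy_sequential_prefill(prompt_embeddings: torch.Tensor, chunk_size: int = 512) -> torch.Tensor:
    # prompt_embeddings: (seq_len, d_model) tensor standing in for the tokenized prompt.
    d_model = prompt_embeddings.shape[-1]
    state = torch.zeros(d_model)                     # fixed-size recurrent state
    for start in range(0, prompt_embeddings.shape[0], chunk_size):
        chunk = prompt_embeddings[start:start + chunk_size]
        # Placeholder update: a real SSM would run its selective scan over the chunk.
        state = 0.9 * state + 0.1 * chunk.mean(dim=0)
        # Only `state` survives the iteration; the chunk activations can be freed.
    return state

final_state = toy_sequential_prefill(torch.randn(100_000, 64), chunk_size=1024)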
With these remarks in mind, we first test the largest sequence length that can fit on a single 24GB A10 GPU, with the results shown in the figure below. The batch size is fixed at 1, and we use float32 precision. Even for parallel prefill, Falcon Mamba can fit larger sequences than a transformer, while in sequential prefill it unlocks its full potential and can process arbitrarily long prompts.
Next, we measure the generation throughput in a setting with a prompt of length 1 and up to 130k generated tokens, using batch size 1 and an H100 GPU. The results are reported in the figure below. We observe that Falcon Mamba generates all the tokens at constant throughput and without any increase in CUDA peak memory. For the transformer models, the peak memory grows and the generation speed slows down as the number of generated tokens increases.
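A much simplified version of this measurement (the reported numbers were obtained with the optimum-benchmark library, not with this snippet) could look as follows; the prompt, token count, and timing granularity are simplifying assumptions.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)   # short prompt
torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
elapsed = time.perf_counter() - start

generated = output.shape[1] - inputs["input_ids"].shape[1]
print(f"throughput: {generated / elapsed:.1f} tokens/s")
print(f"CUDA peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")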
How to use it within Hugging Face transformers?
The Falcon Mamba architecture will be available in the next release of the Hugging Face transformers library (>4.45.0). To use the model, make sure to install the latest version of Hugging Face transformers or install the library from source.
Falcon Mamba is compatible with most of the Hugging Face APIs you are already familiar with, such as AutoModelForCausalLM or pipeline:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Hello world, today", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=100, do_sample=True)

print(tokenizer.decode(output[0], skip_special_tokens=True))
As the model is large, it also supports features such as bitsandbytes quantization to run the model under smaller GPU memory constraints:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-mamba-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

inputs = tokenizer("Hello world, today", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=100, do_sample=True)

print(tokenizer.decode(output[0], skip_special_tokens=True))
We are also pleased to introduce an instruction-tuned version of Falcon Mamba, fine-tuned with an additional 5 billion tokens of supervised fine-tuning (SFT) data. This extended training enhances the model's ability to follow instructions with better precision and effectiveness. You can experience the capabilities of the instruct model through the demo available here. The chat template uses the following format:
<|im_start|>user
prompt<|im_end|>
<|im_start|>assistant
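In practice, the template does not need to be written by hand: it can be applied through the tokenizer. In the sketch below, the instruct checkpoint name tiiuae/falcon-mamba-7b-instruct is an assumption based on the naming of the base model.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b-instruct"   # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain state space language models in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))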
You can also directly use the 4-bit converted versions of both the base model and the instruct model. Make sure you have access to a GPU that is compatible with the bitsandbytes library to run the quantized models.
You can also benefit from faster inference using torch.compile; simply call model = torch.compile(model) once you have loaded the model.
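For example, assuming the model has been loaded as in the snippets above:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", torch_dtype="auto", device_map="auto")
model = torch.compile(model)   # subsequent forward passes run through the compiled graph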
Acknowledgments
The authors of this blog post would like to thank the Hugging Face team for their smooth support and the integration of the model within their ecosystem.
The authors would also like to thank Tri Dao and Albert Gu for implementing and open-sourcing the Mamba architecture to the community.