CodeGemma is a family of open-access versions of Gemma specialized in code, and we're excited to collaborate with Google on its release to make it as accessible as possible.
CodeGemma comes in three flavors:

- A 2B base model specialized in infilling and open-ended generation.
- A 7B base model trained with both code infilling and natural language.
- A 7B instruct model that users can chat with about code.
We've collaborated with Google to ensure the best integration into the Hugging Face ecosystem. You can find the three open-access models ready to use on the Hub. Among the features and integrations being released, we have:

- Models on the Hub, with their model cards and licenses. There are versions for the Transformers library, checkpoints for use with Google's original codebases, and full-precision GGUF files that the community can quantize.
- Transformers integration
- Integration with Google Cloud
- Integration with Inference Endpoints
- Code benchmarks
Table of contents

- What is CodeGemma?
- Evaluation results
- Prompt format
- Using CodeGemma
  - Demo
  - Using with Transformers
- Integration with Google Cloud
- Integration with Inference Endpoints
- Additional resources
What is CodeGemma?
CodeGemma is a family of code-specialist LLMs by Google, based on the pre-trained 2B and 7B Gemma checkpoints. CodeGemma is further trained on an additional 500 billion tokens of primarily English data, mathematics, and code to improve logical and mathematical reasoning, making it suitable for code completion and generation.
CodeGemma 2B is trained exclusively on code infilling and is intended for fast code completion and generation, especially in settings where latency and/or privacy are important. The CodeGemma 7B training mix includes code infilling data (80%) and natural language, so it can be used for code completion as well as for understanding and generating code and language. CodeGemma 7B Instruct was fine-tuned for instruction following on top of CodeGemma 7B. It's meant for conversational use, especially around code, programming, and mathematical reasoning topics. All the models have the same 8K token context size as their Gemma predecessors.
This image is from the original report
Evaluation results
CodeGemma-7B outperforms similarly-sized 7B models except DeepSeek-Coder-7B on HumanEval, a popular benchmark for evaluating code models on Python. The same goes for the evaluation of other programming languages such as Java, JavaScript, and C++ from MultiPL-E, a translation of HumanEval. According to the technical report, the model performs best on GSM8K among 7B models. The instruct version, CodeGemma-7B-it, improves on the most popular languages on both HumanEval and MBPP (cf. paper table 5). For more details, you can check the BigCode leaderboard or the metrics below:
Model | Pretraining size (tokens) | Python | JavaScript
10B+ models | | |
StarCoder 2 15B | 4,000B+ | 44.15 | 44.24
Code Llama 13B | 2,500B | 35.07 | 38.26
7B models | | |
DeepSeek Coder 7B | 2,000B | 45.83 | 45.9
CodeGemma 7B | 500B of extra training | |
StarCoder 2 7B | 3,500B+ | 34.09 | 35.35
StarCoderBase 7B | 3,000B+ | 28.37 | 27.35
<3B models | | |
CodeGemma 2B | 500B of extra training | 27.28 | 29.94
Stable Code 3B | 1,300B | 30.72 | 28.75
StarCoder 2 3B | 3,000B+ | 31.44 | 35.37
Model | Pretraining size (tokens) | Python | JavaScript
10B+ models | | |
Code Llama 13B | 2,620B | 50.6 | 40.92
Code Llama 13B | 2,620B | 42.89 | 40.66
7B models | | |
CodeGemma 7B | | |
Below is a table from the original report with a breakdown per language.
Prompt format
CodeGemma 2B and CodeGemma 7B use infilling (code, comments, docstrings, import statements) for code completion. CodeGemma was trained for this task using the fill-in-the-middle (FIM) objective, where you provide a prefix and a suffix as context for the completion. The following tokens are used to separate the different parts of the input:

- <|fim_prefix|> precedes the context before the completion we want to run.
- <|fim_suffix|> precedes the suffix. You must place this token exactly where the cursor would be positioned in an editor, as this is the location where the model will complete the code.
- <|fim_middle|> is the prompt that invites the model to run the generation.

In addition to these, there's also <|file_separator|>, which is used to provide multi-file contexts. We'll show a full example of use in the Using with Transformers section.
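As a quick illustration of the format itself, here is a minimal sketch of how a FIM prompt can be assembled from the text surrounding the cursor. The before_cursor and after_cursor variables are only illustrative names; the special tokens are the ones listed above.

# Minimal sketch: build a fill-in-the-middle prompt from the code around the cursor.
# before_cursor / after_cursor are illustrative names, not part of any API.
before_cursor = "def add(a, b):\n    return "   # everything before the cursor
after_cursor = "\n"                             # everything after the cursor (may be empty)

fim_prompt = (
    "<|fim_prefix|>" + before_cursor
    + "<|fim_suffix|>" + after_cursor
    + "<|fim_middle|>"
)
# The model is expected to generate the missing middle part right after <|fim_middle|>.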
CodeGemma 7B Instruct uses the same prompt format as the instruction-tuned versions of the base Gemma models, following this conversation structure:
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
LaMDA<end_of_turn>
<start_of_turn>model
LaMDA who?<end_of_turn>
As is the case with Gemma, the easiest way to reproduce this format is with the chat template available in transformers.
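For example, here is a minimal sketch using the tokenizer's chat template; the checkpoint id google/codegemma-7b-it and the example message are assumptions for illustration only.

from transformers import AutoTokenizer

# Sketch: let the tokenizer's built-in chat template add the turn markers for us.
# The checkpoint id below is an assumption; use the instruct model you actually need.
tokenizer = AutoTokenizer.from_pretrained("google/codegemma-7b-it")

chat = [
    {"role": "user", "content": "Write a function that checks if a year is a leap year."},
]

# Produces the prompt string with <start_of_turn>/<end_of_turn> markers applied,
# ready to be tokenized and passed to model.generate().
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)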
Using CodeGemma
Demo
You can easily try the CodeGemma model (7 billion parameters!) in this Space, or with the chatbot embedded below.
Under the hood, this playground uses a Transformers implementation. You can also duplicate the Space for your own use; it's self-contained, so you can examine the source code and adapt it as you wish.
Using with Transformers
With Transformers release 4.39, you can use CodeGemma and leverage all the tools within the Hugging Face ecosystem, such as:
- training and inference scripts and examples
- safe file format (safetensors)
- integrations with tools such as bitsandbytes (4-bit quantization), PEFT (parameter-efficient fine-tuning), and Flash Attention 2
- utilities and helpers to run generation with the model
- mechanisms to export the models to deploy
Like the Gemma models, CodeGemma is compatible with torch.compile() for an important inference speedup.
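As a rough sketch, this is one way to apply it; the checkpoint id, device, and prompt are assumptions, and the first calls are slower while kernels are compiled.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: compile the forward pass for faster repeated decoding.
model_id = "google/codegemma-2b"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

model.forward = torch.compile(model.forward)  # compilation happens lazily on first use

inputs = tokenizer("<|fim_prefix|>def hello():\n    <|fim_suffix|>\n<|fim_middle|>", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))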
Bonus: we created a Colab notebook for you to try out the models at the touch of a button.
To use CodeGemma with transformers, make sure to use the latest release:

pip install --upgrade transformers
The following snippet shows how to use CodeGemma-2B for code completion with transformers. It requires about 6 GB of RAM using float16 precision, making it perfectly suitable for consumer GPUs and on-device applications.
from transformers import GemmaTokenizer, AutoModelForCausalLM
import torch

model_id = "google/codegemma-2b"
tokenizer = GemmaTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda:0")

prompt = '''\
<|fim_prefix|>import datetime
def calculate_age(birth_year):
    """Calculates a person's age based on the year of birth."""
    current_year = datetime.date.today().year
    <|fim_suffix|>
    return age<|fim_middle|>\
'''

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][prompt_len:]))
Observe that the <|fim_suffix|> token appears in the position where the cursor would be placed in an editor, marking the location for the generation. <|fim_prefix|> provides the context that precedes the cursor, and the remaining text up to <|fim_middle|> is additional context after the cursor. Either of them can be empty if the cursor is located at the beginning or end of the file.
The previous code might return something like this:
age = current_year - birth_year<|file_separator|>test_calculate_age.py
<|fim_suffix|>
assert calculate_age(1990) == 33
assert calculate_age(1980) == 43
assert calculate_age(1970) == 53
assert calculate_age
Observe that there is additional content after the correct completion. This is particularly the case for CodeGemma 7B, which is more verbose and tends to provide additional code or comments after completion. We must ignore everything that appears after the FIM tokens. We can stop generation early in transformers by providing a list of terminators to the generate function, like this:
fim_prefix = '<|fim_prefix|>'
fim_suffix = '<|fim_suffix|>'
fim_middle = '<|fim_middle|>'
fim_file_separator = '<|file_separator|>'

terminators = tokenizer.convert_tokens_to_ids(
    [fim_prefix, fim_middle, fim_suffix, fim_file_separator]
)
terminators += [tokenizer.eos_token_id]

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    eos_token_id=terminators,
)
In this case, the generation will halt as soon as the first delimiter is found.
age = current_year - birth_year<|file_separator|>
Precision notes
The original CodeGemma checkpoints are released in bfloat16 precision. If you load the models without indicating a torch_dtype, PyTorch will upcast them to float32. Casting to float16 is perfectly fine to use, and it can be much faster than bfloat16 on certain hardware. For maximum precision, we recommend you use bfloat16 rather than float32.
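For example, here is a minimal sketch of loading in bfloat16; the checkpoint id and device_map setting are assumptions.

import torch
from transformers import AutoModelForCausalLM

# Sketch: load the checkpoint in its native bfloat16 precision.
# Omitting torch_dtype would silently upcast the weights to float32.
model = AutoModelForCausalLM.from_pretrained(
    "google/codegemma-7b",        # assumed checkpoint id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)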
You can also automatically quantize the model and load it in 8-bit or 4-bit mode. Loading CodeGemma 7B in 4-bit takes about 9 GB of memory to run, making it compatible with many consumer cards and all the GPUs in Google Colab. This is how you'd load the generation pipeline in 4-bit:
from transformers import pipeline
import torch

pipeline = pipeline(
    "text-generation",
    model=model,    # model is the checkpoint id, e.g. "google/codegemma-7b"
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True}
    },
)
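Once created, the pipeline can be called like any other text-generation pipeline; the prompt and generation settings below are just an illustration.

# Illustrative call; any code prompt works.
output = pipeline("def fibonacci(n):", max_new_tokens=64, do_sample=False)
print(output[0]["generated_text"])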
Integration with Google Cloud
You can deploy and train Gemma on Google Cloud through Vertex AI or Google Kubernetes Engine (GKE), using Text Generation Inference and Transformers.
To deploy the CodeGemma model from Hugging Face, go to the model page and click on Deploy -> Google Cloud. This will bring you to the Google Cloud Console, where you can deploy CodeGemma on Vertex AI or GKE in one click, powered by Text Generation Inference.
You can also access CodeGemma directly through the Vertex AI Model Garden.
Integration with Inference Endpoints
You can deploy CodeGemma on Hugging Face's Inference Endpoints, which uses Text Generation Inference as the backend. Text Generation Inference is a production-ready inference container developed by Hugging Face to enable easy deployment of large language models. It has features such as continuous batching, token streaming, and tensor parallelism for fast inference on multiple GPUs, as well as production-ready logging and tracing, and it is distributed under the Apache 2 license.
To deploy a CodeGemma model, go to the model page and click on the Deploy -> Inference Endpoints widget. You can learn more about deploying LLMs with Hugging Face Inference Endpoints in a previous blog post. Note that T4s do not support the bfloat16 format, so you will need to use a different GPU option.
from huggingface_hub import InferenceClient

# IE_ENDPOINT is the URL of your deployed Inference Endpoint.
client = InferenceClient(model=IE_ENDPOINT)

prompt = """\
<|fim_prefix|>import <|fim_suffix|>

if __name__ == '__main__':
    sys.exit(0)<|fim_middle|>\
"""

client.text_generation(prompt=prompt)
Additional resources