
Florence-2, released by Microsoft in June 2024, is a foundation vision-language model. It is very attractive because of its small size (0.2B and 0.7B parameter versions) and its strong performance on a wide variety of computer vision and vision-language tasks.
Florence-2 supports many tasks out of the box: captioning, object detection, OCR, and more. However, your task or domain of interest may not be supported, or you may want to better control the model's output for your task. That's when you will need to fine-tune.
In this post, we walk through an example of fine-tuning Florence-2 on DocVQA. The authors report that Florence-2 can perform visual question answering (VQA), but the released models do not include VQA capability. Let's see what we can do!
Pre-training details and architecture
Florence-2 architecture
Regardless of the computer vision task being performed, Florence-2 formulates the problem as a sequence-to-sequence task: it takes an image and text as input, and generates text as output. The model has a simple structure. It uses a DaViT vision encoder to convert images into visual embeddings, and BERT to convert text prompts into text and location embeddings. The resulting embeddings are then processed by a standard encoder-decoder transformer architecture, which generates text and location tokens. Florence-2's strength does not stem from its architecture, but from the massive dataset it was pre-trained on. The authors noted that leading computer vision datasets typically contain limited information – WIT only includes image/caption pairs, while SA-1B only has images and their associated segmentation masks. Therefore, they decided to build a new FLD-5B dataset containing a wide range of information about each image – boxes, masks, captions, and grounding. The dataset creation process was largely automated: the authors used off-the-shelf task-specific models, along with a set of heuristics and quality checks, to clean the obtained results. The outcome is a new dataset with over 5 billion annotations for 126 million images, which was used to pre-train the Florence-2 model.
Out-of-the-box VQA performance
We experimented with various methods to adapt the model to produce VQA (visual question answering) responses. The most effective workaround we found was caption-style prompting, although it does not align perfectly with the VQA task: captions provide descriptive information about the image, but they don't allow a question to be passed in directly. We also tested several "unsupported" prompts. Unfortunately, these attempts produced unusable results.
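For reference, here is a minimal sketch of how an out-of-the-box task prompt can be run. The checkpoint, the "<MORE_DETAILED_CAPTION>" task token, and the image URL below are illustrative choices, not necessarily the exact setup we used:

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True
)

# Any image works here; this URL is only a placeholder.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Supported task tokens such as "<CAPTION>", "<OD>" or "<OCR>" follow the same pattern.
task_prompt = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(
    generated_text, task=task_prompt, image_size=(image.width, image.height)
)
print(parsed)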
Fine-tuned performance on DocVQA
We measure performance using Levenshtein similarity, the standard metric for the DocVQA dataset. Before fine-tuning, the similarity between the model's predictions and the ground truth on the validation set was 0, since the outputs were nowhere near the ground-truth answers. After fine-tuning on the training set for seven epochs, the similarity score on the validation set improved to 57.0. We created a 🤗 Space to demo the fine-tuned model. The model performs well on DocVQA, although there is room for improvement in general document understanding. Still, it completes the task successfully, which demonstrates Florence-2's potential for fine-tuning on downstream tasks. To develop an exceptional VQA model, we recommend further fine-tuning Florence-2 on The Cauldron; we already provide the code needed for that on our GitHub page.
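As a rough sketch of the metric, the helper below is our own simplification; the official DocVQA metric (ANLS) additionally takes the best match over all reference answers and zeroes out scores below 0.5:

import Levenshtein  # pip install python-Levenshtein

def levenshtein_similarity(prediction: str, ground_truth: str) -> float:
    # Normalized similarity in [0, 1]; 1.0 means an exact match.
    prediction, ground_truth = prediction.strip().lower(), ground_truth.strip().lower()
    if not prediction and not ground_truth:
        return 1.0
    distance = Levenshtein.distance(prediction, ground_truth)
    return 1.0 - distance / max(len(prediction), len(ground_truth))

print(levenshtein_similarity("sept 14, 1995", "september 14, 1995"))  # ≈ 0.72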
To give a concrete example, we show two inference results below, before and after fine-tuning. You can also try out the model here.
Before and after fine-tuning
Fine-tuning details
For pre-training, the authors used a batch size of 2048 for the base model and 3072 for the large one. They also report a performance improvement when fine-tuning with an unfrozen image encoder compared to keeping it frozen.
We conducted our experiments with lower-resource setups to see what the model can do under more constrained fine-tuning conditions. We froze the vision encoder and used a batch size of 6 on a single A100 GPU in Colab, or a batch size of 1 with a T4. In parallel, we also ran an experiment with more resources, fine-tuning the entire model with a batch size of 64. That training run took 70 minutes on a cluster with 8 H100 GPUs. The resulting model can be found here.
In every case, we found a small learning rate of 1e-6 to be beneficial for training. With larger learning rates, the model quickly overfits the training set.
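To make the frozen-encoder setup concrete, here is a minimal sketch (using the same checkpoint and vision_tower attribute as the walkthrough below) that shows how many parameters remain trainable once the vision encoder is frozen:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True
)

# Freeze the DaViT vision encoder; only the rest of the model is updated.
for param in model.vision_tower.parameters():
    param.requires_grad = False

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable / 1e6:.0f}M / {total / 1e6:.0f}M parameters")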
Code Walkthrough
If you want to follow along, you can find our Colab notebook for fine-tuning on the DocVQA dataset. Let's start by installing the dependencies.
! pip install -q datasets flash_attn timm einops
We load the DocVQA dataset from the Hugging Face Hub.
import torch
from datasets import load_dataset

data = load_dataset("HuggingFaceM4/DocumentVQA")
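As a quick sanity check, you can inspect the splits and one example; the field names below (question, answers, image) are the ones we rely on in the rest of the walkthrough:

print(data)                 # train / validation / test splits and their sizes

example = data['train'][0]
print(example['question'])  # the question string
print(example['answers'])   # a list of acceptable answers
example['image']            # a PIL image of the document page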
We can load the model and processor using the AutoModelForCausalLM and AutoProcessor classes from the transformers library. We need to pass trust_remote_code=True because the model uses custom code – it has not been natively integrated into transformers yet. We will also freeze the vision encoder to make fine-tuning cheaper.
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft",
    trust_remote_code=True,
    revision='refs/pr/6'
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base-ft",
    trust_remote_code=True,
    revision='refs/pr/6'
)

# Freeze the vision encoder so only the rest of the model is updated.
for param in model.vision_tower.parameters():
    param.requires_grad = False
Now, let's fine-tune the model! We'll build a training PyTorch Dataset in which we prepend a "<DocVQA>" prefix to each question from the dataset.
import torch
from torch.utils.data import Dataset

class DocVQADataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        question = "<DocVQA>" + example['question']
        first_answer = example['answers'][0]
        image = example['image'].convert("RGB")
        return question, first_answer, image
Next, we build a data collator that assembles training batches from dataset samples, and we start training. With 40 GB of memory, an A100 can fit batches of 6 examples. If you are training on a T4, use a batch size of 1.
import os
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AdamW, get_scheduler

def collate_fn(batch):
    questions, answers, images = zip(*batch)
    inputs = processor(text=list(questions), images=list(images),
                       return_tensors="pt", padding=True).to(device)
    return inputs, answers

train_dataset = DocVQADataset(data['train'])
val_dataset = DocVQADataset(data['validation'])

batch_size = 6
num_workers = 0

train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn,
                          num_workers=num_workers, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate_fn,
                        num_workers=num_workers)
You can now train the model.
epochs = 7

optimizer = AdamW(model.parameters(), lr=1e-6)
num_training_steps = epochs * len(train_loader)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

for epoch in range(epochs):
    model.train()
    train_loss = 0
    i = -1
    for inputs, answers in tqdm(train_loader, desc=f"Training Epoch {epoch + 1}/{epochs}"):
        i += 1
        input_ids = inputs["input_ids"]
        pixel_values = inputs["pixel_values"]
        labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True,
                                     return_token_type_ids=False).input_ids.to(device)
        outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        train_loss += loss.item()

    avg_train_loss = train_loss / len(train_loader)
    print(f"Average Training Loss: {avg_train_loss}")

    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in tqdm(val_loader, desc=f"Validation Epoch {epoch + 1}/{epochs}"):
            inputs, answers = batch
            input_ids = inputs["input_ids"]
            pixel_values = inputs["pixel_values"]
            labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True,
                                         return_token_type_ids=False).input_ids.to(device)
            outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()

    print(val_loss / len(val_loader))
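Once training finishes, we can do a quick qualitative check on a validation example before saving. The index below is arbitrary, and we reuse the same "<DocVQA>" prefix as in the training dataset above:

model.eval()
example = data['validation'][0]
prompt = "<DocVQA>" + example['question']
image = example['image'].convert("RGB")

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=128,
        num_beams=3,
    )
prediction = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Question:    ", example['question'])
print("Prediction:  ", prediction)
print("Ground truth:", example['answers'])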
You can save the model and processor by calling save_pretrained() on both objects, as sketched below. The fully fine-tuned model is here, and the demo is here.
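A minimal sketch, where the local directory and Hub repository id are placeholders:

# Save the fine-tuned weights and the processor locally.
model.save_pretrained("./florence2-docvqa-ft")
processor.save_pretrained("./florence2-docvqa-ft")

# Optionally, push both to the Hugging Face Hub (repository id is a placeholder):
# model.push_to_hub("your-username/florence2-docvqa-ft")
# processor.push_to_hub("your-username/florence2-docvqa-ft")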
Conclusion
In this post, we demonstrated that Florence-2 can be fine-tuned effectively on a custom dataset, achieving impressive performance on a completely new task in a short amount of time. This capability is especially valuable for those who want to deploy this small model on devices, or use it cost-effectively in production environments. We encourage the open-source community to take advantage of this fine-tuning tutorial and explore the remarkable potential of Florence-2 for a wide range of new tasks! We can't wait to see your models on the 🤗 Hub!
Useful resources
We would like to thank Pedro Cuenca for reviewing this blog post.