Hugging Face provides a Hub platform where you can easily upload, share, and deploy your models, saving developers the time and computational resources required to train models from scratch. However, challenges can still arise when deploying models in real production environments or in a cloud-native way.
This is where BentoML comes into play. BentoML is an open source platform for serving and deploying machine learning models. It is a unified framework for building, shipping, and scaling production-ready AI applications, covering traditional pre-trained models as well as generative AI and large language models. Here is how the BentoML workflow looks from a high-level perspective:
1. Define your model: To use BentoML, you need a machine learning model (or multiple models), trained with a library such as TensorFlow or PyTorch.
2. Save the model: Once you have a trained model, save it to the BentoML local model store, which manages all your trained models locally and makes them accessible for serving.
3. Create a BentoML service: Create a service.py file to wrap the model and define the serving logic. It specifies runners to run model inference at scale and exposes an API that defines how inputs and outputs are processed. A minimal sketch of such a service is shown after this list.
4. Build a Bento: Package the models and the service into a Bento, a deployable artifact that contains all your code and dependencies, by creating a configuration YAML file.
5. Deploy the Bento: Once the Bento is ready, you can containerize it to create a Docker image and run it on Kubernetes. Alternatively, deploy the Bento directly to Yatai, an open source, end-to-end solution for automating and running machine learning deployments on Kubernetes at scale.
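To make step 3 concrete, here is a minimal sketch of what a service.py can look like. The model name my_model, the service name, and the use of a scikit-learn model are hypothetical; the actual service for this project is described later in this post.

```python
import bentoml
from bentoml.io import NumpyNdarray

# Fetch a previously saved model from the local model store
# ("my_model" is a hypothetical name) and wrap it in a runner.
runner = bentoml.sklearn.get("my_model:latest").to_runner()

# A service groups one or more runners behind a set of APIs.
svc = bentoml.Service("my_service", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(input_array):
    # Delegate inference to the runner, which is scheduled and
    # scaled independently of the API server.
    return await runner.predict.async_run(input_array)
```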
In this blog post, we will show you how to integrate DeepFloyd IF with BentoML by following the above workflow.
A brief introduction to DeepFloyd IF
DeepFloyd IF is a state-of-the-art open source text-to-image model. It distinguishes itself from latent diffusion models such as Stable Diffusion through its unique operational strategy and architecture.
DeepFloyd IF provides advanced photorealism and sophisticated language understanding. Unlike Stable Diffusion, DeepFloyd IF operates directly in pixel space and uses a modular structure that consists of a frozen text encoder and three cascading pixel diffusion modules. Each module plays a unique role in the process: stage 1 creates a base 64×64 pixel image, which is then gradually upscaled to 1024×1024 pixels in stages 2 and 3. Another key aspect of DeepFloyd IF’s uniqueness is its integration of a large language model (T5-XXL-1.1) for encoding prompts, which enables superior understanding of complex prompts. For more information, see Stability AI’s blog post on DeepFloyd IF.
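To make the cascade concrete, here is a rough sketch of how the three stages chain together using the diffusers library. The model IDs come from the DeepFloyd and Stability AI organizations on Hugging Face; treat the exact arguments as illustrative rather than definitive.

```python
import torch
from diffusers import DiffusionPipeline

# Stage 1: text-conditioned 64x64 base image, with the T5 text encoder.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
# Stage 2: super-resolution to 256x256, reusing stage 1's prompt embeddings.
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
# Stage 3: x4 upscaling to 1024x1024.
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
)

prompt = "a photo of a red panda wearing a top hat"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images
image = stage_2(
    image=image, prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds, output_type="pt"
).images
image = stage_3(prompt=prompt, image=image).images[0]
image.save("output.png")
```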
To ensure that your DeepFloyd IF application performs well in production, you may need to allocate and manage resources wisely. In this regard, BentoML lets you scale the runners independently for each stage. For example, you can use more pods for the Stage 1 runners or assign them more powerful GPU servers.
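One way to express this when serving locally is through BentoML’s runner-level resource configuration. The sketch below is an assumption-laden example: the runner names are hypothetical, and the exact configuration schema varies across BentoML versions, so consult the configuration documentation for your release.

```yaml
# bentoml_configuration.yaml (hypothetical runner names; schema may vary
# by BentoML version). Pass it via the BENTOML_CONFIG environment variable.
runners:
  stage1_runner:
    resources:
      nvidia.com/gpu: 1   # dedicate a GPU to stage 1
  stage2_runner:
    resources:
      nvidia.com/gpu: 1   # stages 2 and 3 can share another GPU
```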
Preparing the environment
This GitHub repository stores all the files needed for this project. To run this project locally, make sure you have the following:
A GPU with at least 2×16 GB of VRAM, or a single GPU with 40 GB of VRAM; Python 3.8+; pip installed. For this project, we used a Google Cloud n1-standard-16 machine with 64 GB of RAM and two NVIDIA T4 GPUs. Note that while it is possible to run IF on a single T4, this is not recommended for production-grade serving.
Once the prerequisites are met, clone the project repository to your local machine and navigate to the project directory.
git clone https://github.com/bentoml/IF-multi-GPUs-demo.git
cd IF-multi-GPUs-demo
Before building the application, let’s quickly examine the main files in this directory.
- import_models.py: Defines the models for each stage of the IFPipeline. Use this file to download all the models to your local machine so they can be packaged into a single Bento.
- requirements.txt: Defines all the packages and dependencies required for this project.
- service.py: Defines the BentoML service. It contains three runners created with the to_runner method and exposes an API for generating images. The API takes a JSON object as input (i.e., prompt and negative prompt), runs it through the sequence of models, and returns an image as output.
- start-server.py: Starts a BentoML HTTP server through the service defined in service.py and creates a Gradio web interface where users can enter prompts to generate images.
- bentofile.yaml: Defines the metadata for the Bento to be built, such as the service, Python packages, and models.
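For reference, a bentofile.yaml for a project like this might look roughly as follows; the file in the repository is authoritative, so treat this as a sketch.

```yaml
service: "service:svc"    # the Service object defined in service.py
include:
  - "*.py"                # source files packaged into the Bento
python:
  requirements_txt: "./requirements.txt"
models:                   # models pulled in from the local model store
  - if-stage1:latest
  - if-stage2:latest
  - sd-upscaler:latest
```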
We recommend creating a virtual environment to isolate dependencies. Run the following commands to create and activate one.
python -m venv venv
source venv/bin/activate
Install the required dependencies.
pip install -r requirements.txt
If you have never downloaded a model from Hugging Face using the command line, you must first log in.
pip install -U huggingface_hub
huggingface-cli login
Downloading a model to the BentoML model store
As mentioned above, you need to download all the models used in each stage of DeepFloyd IF. After setting up the environment, run the following command to download the models to your local model store. This process may take some time.
python import_models.py
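Under the hood, import_models.py downloads each pipeline from Hugging Face and saves it into the local model store. Below is a rough sketch of what it does, assuming the bentoml.diffusers integration that this project uses; the exact names and options in the repository’s import_models.py are authoritative.

```python
import bentoml

# Download each pipeline from Hugging Face and save it into the
# local BentoML model store under the names expected by service.py.
bentoml.diffusers.import_model("if-stage1", "DeepFloyd/IF-I-XL-v1.0")
bentoml.diffusers.import_model("if-stage2", "DeepFloyd/IF-II-L-v1.0")
bentoml.diffusers.import_model("sd-upscaler", "stabilityai/stable-diffusion-x4-upscaler")
```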
Once the download is complete, view the model in the model store.
```
$ bentoml models list

Tag                            Module             Size       Creation Time
sd-upscaler:bb2ckpa3uoypynry   bentoml.diffusers  16.29 GiB  2023-07-06 10:15:53
if-stage2:v1.0                 bentoml.diffusers  13.63 GiB  2023-07-06 09:55:49
if-stage1:v1.0                 bentoml.diffusers  19.33 GiB  2023-07-06 09:37:59
```
Starting the BentoML service
The application’s entry point, start-server.py, starts the BentoML HTTP server and a Gradio-powered web UI. It provides several options for customizing execution and managing GPU allocation across the stages. Use the command that matches your GPU setup.
For GPUs with more than 40GB VRAM, run all models on the same GPU.
python start-server.py
For two Tesla T4s with 15 GB VRAM each, assign the stage 1 model to the first GPU, and assign the stage 2 and stage 3 models to the second GPU.
python start-server.py --stage1-gpu=0 --stage2-gpu=1 --stage3-gpu=1
For one Tesla T4 with 15GB VRAM and two additional GPUs with smaller VRAM sizes, assign the stage 1 model to the T4, and assign the stage 2 and stage 3 models to the second and third GPUs, respectively.
python start-server.py --stage1-gpu=0 --stage2-gpu=1 --stage3-gpu=2
To see all customizable options (such as server ports), run:
python start-server.py --help
Testing the server
Once the server is started, you can access the web UI at http://localhost:7860. The BentoML API endpoint can also be accessed at http://localhost:3000. Here are some examples of prompts and negative prompts:
Prompt:
orange and black, headshot of a woman standing under a street lamp, dark theme, Frank Miller, cinematic, surreal, atmospheric, highly detailed and intricate, 8k resolution, photorealistic, highly textured, intricately detailed
Negative prompt:
tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eyed, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy
Result:

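You can also call the HTTP API directly instead of going through the web UI. Below is a sketch using curl; the route name /generate is an assumption for illustration, so check the API defined in service.py for the actual endpoint.

```
curl -X POST http://localhost:3000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "a photo of a red panda wearing a top hat", "negative_prompt": "blurry, bad anatomy"}' \
  --output output.png
```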
Building and serving a Bento
Now that you have successfully run DeepFloyd IF locally, you can package it into a Bento by running the following command in your project directory:
```
$ bentoml build

Converting 'IF-stage1' to lowercase: 'if-stage1'.
Converting 'IF-stage2' to lowercase: 'if-stage2'.
Converting 'DeepFloyd-IF' to lowercase: 'deepfloyd-if'.
Building BentoML service "deepfloyd-if:6ufnybq3vwszgnry" from build context "/Users/xxx/Documents/github/IF-multi-GPUs-demo".
Packing model "sd-upscaler:bb2ckpa3uoypynry"
Packing model "if-stage1:v1.0"
Packing model "if-stage2:v1.0"
Locking PyPI package versions.

Successfully built Bento(tag="deepfloyd-if:6ufnybq3vwszgnry").
```
View the Bento in your local Bento store.
```
$ bentoml list

Tag                             Size       Creation Time
deepfloyd-if:6ufnybq3vwszgnry   49.25 GiB  2023-07-06 11:34:52
```
The Bento is now ready for serving in production.
bentoml serve deepfloyd-if:6ufnybq3vwszgnry
To deploy the Bento in a more cloud-native way, run the following command to generate a Docker image:
bentoml containerize deepfloyd-if:6ufnybq3vwszgnry
You can then deploy your model to Kubernetes.
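As a sketch, a minimal Kubernetes Deployment for the containerized Bento might look like the following. The image tag assumes the image built by the containerize command above, and the replica count and GPU limits are illustrative only.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepfloyd-if
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepfloyd-if
  template:
    metadata:
      labels:
        app: deepfloyd-if
    spec:
      containers:
        - name: deepfloyd-if
          image: deepfloyd-if:6ufnybq3vwszgnry  # image from `bentoml containerize`
          ports:
            - containerPort: 3000               # BentoML HTTP server port
          resources:
            limits:
              nvidia.com/gpu: 2                 # GPUs for the model runners
```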
What’s next?
BentoML provides a powerful and easy way to deploy Hugging Face models into production. With support for a wide range of ML frameworks and easy-to-use APIs, you can ship your models to production quickly. Whether you’re using the DeepFloyd IF model or any other model on the Hugging Face Model Hub, BentoML can help bring your model to life.
To see what you can build with BentoML and its ecosystem of tools, check out the BentoML documentation and example projects, and stay tuned for more information about BentoML.

