The ability to quickly generate high-quality images is important for creating realistic, simulated environments that can be used to train self-driving cars to avoid unpredictable dangers.
However, there are drawbacks to the generative artificial intelligence technologies increasingly used to create such images. Popular diffusion models can create strikingly realistic images, but they are too slow and computationally intensive for many applications. On the other hand, the autoregressive models that power LLMs like ChatGPT are much faster, but they produce lower-quality images that are often riddled with errors.
Researchers at MIT and Nvidia have developed a new approach that brings together the best of both methods. Their hybrid image-generation tool uses an autoregressive model to quickly capture the big picture, then a small diffusion model to refine the image's details.
The tool, known as HART (short for hybrid autoregressive transformer), can produce images that match or exceed the quality of state-of-the-art diffusion models, but about nine times faster.
The generation process consumes fewer computational resources than typical diffusion models, so HART can run locally on a commercial laptop or smartphone. A user simply enters a natural language prompt into the HART interface to generate an image.
HART could have a wide range of applications, such as helping researchers train robots to complete complex real-world tasks or helping designers create striking scenes for video games.
“If you are painting a landscape and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better,” he says.
He is joined by co-lead author Yecheng Wu, an undergraduate student at Tsinghua University; senior author Song Han, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and a distinguished scientist at Nvidia; as well as others at MIT, Tsinghua University, and Nvidia. The research will be presented at the International Conference on Learning Representations.
The best of both worlds
Popular diffusion models, such as Stable Diffusion and DALL-E, are known to produce highly detailed images. These models generate images through an iterative process: they predict some amount of random noise on each pixel, subtract the noise, then repeat this “denoising” process many times until a new, completely noise-free image is produced.
The process is slow and computationally expensive because the diffusion model denoises every pixel in the image at each step, and there can be 30 or more steps. But the images are high quality because the model gets multiple chances to correct details it has gotten wrong.
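The iterative denoising loop described above can be sketched in a few lines. This is a toy illustration, not HART's or any real sampler's code: the hypothetical `predict_noise` function stands in for the trained network, and for the demo it is given oracle knowledge of the clean target so the loop visibly converges.

```python
import numpy as np

def toy_denoise(noisy, predict_noise, steps=30):
    """Repeatedly predict and subtract a fraction of the noise on every
    pixel, as diffusion samplers do over 30 or more steps."""
    img = noisy.copy()
    for k in range(steps):
        # Remove 1/(steps - k) of the currently predicted noise, so the
        # image is fully denoised by the final step.
        img = img - predict_noise(img) / (steps - k)
    return img

rng = np.random.default_rng(0)
target = np.zeros((8, 8))                  # stand-in "clean image"
noisy = target + rng.normal(size=(8, 8))   # start from pure noise
# Oracle noise predictor (an assumption for this demo): noise = img - target.
clean = toy_denoise(noisy, lambda img: img - target, steps=30)
```

Note that every one of the 30 iterations touches all 64 pixels, which is where the cost the article describes comes from.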
Autoregressive models, commonly used to predict text, can generate images by predicting patches of an image sequentially, a few pixels at a time. They cannot go back and correct their mistakes, but this sequential prediction process is much faster than diffusion.
These models use representations known as tokens to make predictions. An autoregressive model uses an autoencoder to compress raw image pixels into discrete tokens, and to reconstruct the image from predicted tokens. This boosts the model's speed, but the information loss that occurs during compression causes errors when the model generates a new image.
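A minimal sketch can show why discrete tokenization loses information. Everything here is an assumption for illustration, with a two-entry codebook and scalar "patches"; HART's actual autoencoder is far larger and is not described at this level in the article.

```python
import numpy as np

def quantize(patch, codebook):
    """Map a continuous patch value to the index of its nearest
    codebook entry -- the lossy step of a discrete autoencoder."""
    return int(np.argmin(np.abs(codebook - patch)))

codebook = np.array([0.0, 1.0])          # assumed tiny codebook
patches = np.array([0.1, 0.9, 0.4])      # "raw pixels" to compress

tokens = [quantize(p, codebook) for p in patches]  # -> [0, 1, 0]
recon = codebook[tokens]                 # reconstruction from tokens
residual = patches - recon               # the detail lost to compression
```

The reconstruction snaps every patch to the nearest codebook entry, so the `residual` array (here `[0.1, -0.1, 0.4]`) is exactly the fine detail the discrete tokens cannot carry.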
With HART, the researchers developed a hybrid approach that uses an autoregressive model to predict compressed, discrete image tokens, then a small diffusion model to predict residual tokens. Residual tokens compensate for the model's information loss by capturing the details left out by the discrete tokens.
“We get a big boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like the edges of an object, or a person's hair, eyes, or mouth. These are the places where discrete tokens can make mistakes.”
Because the diffusion model only predicts the remaining details after the autoregressive model has done its job, it can accomplish the task in eight steps, rather than the 30 or more a standard diffusion model needs to generate an entire image. This minimal overhead from the additional diffusion model allows HART to retain the speed advantage of the autoregressive model while significantly improving its ability to generate intricate image details.
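Putting the two stages together, HART's division of labor can be caricatured as follows. This is an assumption-laden toy, not the paper's method: the autoregressive stage is reduced to a nearest-codebook lookup, and the true residual stands in for the small diffusion model's learned prediction, refined over only eight steps as the article describes.

```python
import numpy as np

def hybrid_generate(patches, codebook, refine_steps=8):
    """Stage 1: discrete tokens capture the big picture.
    Stage 2: a few cheap diffusion-style steps recover the residual."""
    tokens = [int(np.argmin(np.abs(codebook - p))) for p in patches]
    coarse = codebook[tokens]              # autoregressive stage output
    true_residual = patches - coarse       # oracle stand-in for the
                                           # learned residual predictor
    residual = np.zeros_like(coarse)
    for k in range(refine_steps):          # only 8 steps, not 30+
        residual += (true_residual - residual) / (refine_steps - k)
    return coarse + residual

codebook = np.array([0.0, 1.0])
patches = np.array([0.1, 0.9, 0.4])
out = hybrid_generate(patches, codebook)
```

The point of the split is that the expensive iterative loop only has to close the small gap between the coarse tokenized image and the target, so far fewer steps suffice.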
“The diffusion model does a much easier job, which makes it more efficient,” he adds.
Outperforming larger models
During HART's development, the researchers encountered challenges in effectively integrating the diffusion model to enhance the autoregressive model. They found that incorporating the diffusion model in the early stages of the autoregressive process led to an accumulation of errors. Instead, their final design, which applies the diffusion model to predict only residual tokens as the final step, significantly improved generation quality.
Their method, which uses a combination of an autoregressive transformer model with 700 million parameters and a lightweight diffusion model with 37 million parameters, can generate images of the same quality as those created by a diffusion model with 2 billion parameters, but about nine times faster. It uses about 31 percent less computation than state-of-the-art models.
Moreover, because HART uses an autoregressive model to do the bulk of the work, the same type of model that powers LLMs, it is better suited for integration with the new class of unified vision-language generative models. In the future, one could interact with such a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture.
“LLMs are a good interface for all sorts of models, like multimodal models and models that can reason. This is a way to push the intelligence to a new frontier. An efficient image-generation model would unlock a lot of possibilities,” he says.
In the future, the researchers want to build on this approach and develop vision-language models on top of the HART architecture. Since HART is scalable and generalizable to multiple modalities, they would also like to apply it to video generation and audio prediction tasks.
This study was funded in part by the MIT-IBM Watson AI Lab, MIT and Amazon Science Hub, the MIT AI Hardware Program, and the National Science Foundation. The GPU infrastructure to train this model was donated by Nvidia.