🚀 FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner


Wenliang Zhao*   Minglei Shi*   Xumin Yu   Jie Zhou   Jiwen Lu

 Tsinghua University

[Paper (arXiv)]      [Code (GitHub)]


We propose FlowTurbo, a framework that accelerates the sampling of flow-based models while also improving the sampling quality. Our primary observation is that the velocity predictor's outputs in flow-based models become stable during sampling, enabling the velocity to be estimated by a lightweight velocity refiner. FlowTurbo is efficient in both training (<6 GPU hours) and inference (~40 ms / img).

Figure 1: Visualization of the curvatures of the sampling trajectories of different models. We compare the curvatures of the model predictions of a standard diffusion model (DiT) and several flow-based models (SiT, SD3-Medium, FLUX.1-dev, and Open-Sora) during sampling. We observe that the velocity prediction vθ in flow-based models is much more stable during sampling than the noise prediction ϵ in diffusion models, which motivates us to seek a more lightweight estimation model to reduce the sampling cost of flow-based generative models.
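To make this observation more concrete, below is a minimal sketch (our own illustration, not code from the paper) of how such stability could be measured: run a plain Euler solver and record the relative change between consecutive model outputs along the trajectory. The names model and prediction_stability, and the step count, are hypothetical.

import torch

@torch.no_grad()
def prediction_stability(model, x, num_steps=50):
    """Relative L2 change of consecutive predictions along an Euler trajectory."""
    dt = 1.0 / num_steps
    changes, prev = [], None
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        v = model(x, t)                  # velocity (or noise) prediction at (x_t, t)
        if prev is not None:
            changes.append((v - prev).norm() / prev.norm())
        prev = v
        x = x + dt * v                   # Euler update along the flow ODE
    return torch.stack(changes)          # small values => stable predictions

A nearly constant prediction along the trajectory (small values everywhere) corresponds to the near-zero curvature observed for the flow-based models above.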

Abstract

Building on the success of diffusion models in visual generation, flow-based models have reemerged as another prominent family of generative models, achieving competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow matching, flow-based models tend to produce straighter sampling trajectories, which is advantageous during sampling. However, unlike diffusion models, for which fast samplers are well developed, efficient sampling of flow-based generative models has rarely been explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor's outputs in flow-based models become stable during sampling, enabling the velocity to be estimated by a lightweight velocity refiner. Additionally, we introduce several techniques, including a pseudo corrector and sample-aware compilation, to further reduce the inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied to various tasks such as image editing and inpainting. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1%∼58.3% on class-conditional generation and 29.8%∼38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet at 100 ms / img and an FID of 3.93 at 38 ms / img, achieving real-time image generation and establishing a new state of the art.

Approach

Figure 2: Overview of FlowTurbo. (a) Motivated by the stability of the velocity predictor's outputs during sampling, we propose to learn a lightweight velocity refiner that regresses the offset of the velocity field. (b)(c) We propose the pseudo corrector, which leverages a velocity cache to reduce the number of model evaluations while maintaining the same convergence order as Heun's method. (d) During sampling, we employ a combination of Heun's method, the pseudo corrector, and the velocity refiner, where each sample block is processed with the proposed sample-aware compilation.
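To make the sampling schedule in (d) concrete, here is a minimal sketch of how such a loop might look. It is an assumption-laden illustration, not the released implementation: we assume the pseudo corrector reuses the velocity cached from the previous model evaluation as its predictor velocity, and that the refiner predicts an offset on top of the cached velocity; velocity, refiner, and flowturbo_like_sample are placeholder names, and sample-aware compilation is omitted.

import torch

@torch.no_grad()
def flowturbo_like_sample(velocity, refiner, x, schedule):
    """schedule: list of 'H' / 'P' / 'R' blocks; time runs from 0 to 1."""
    h = 1.0 / len(schedule)
    v_cache = None
    for i, kind in enumerate(schedule):
        t = torch.full((x.shape[0],), i * h, device=x.device)
        if kind == 'H' or v_cache is None:       # full Heun step: two model evaluations
            v1 = velocity(x, t)
            x_pred = x + h * v1
            v2 = velocity(x_pred, t + h)
            x = x + 0.5 * h * (v1 + v2)
            v_cache = v2
        elif kind == 'P':                        # pseudo corrector: reuse the cached velocity
            x_pred = x + h * v_cache
            v2 = velocity(x_pred, t + h)
            x = x + 0.5 * h * (v_cache + v2)
            v_cache = v2
        else:                                    # 'R': lightweight refiner predicts an offset
            v = v_cache + refiner(x, t)
            x = x + h * v
            v_cache = v
    return x

Under this reading, an H block costs two full network evaluations, a P block costs one, and an R block only a lightweight refiner call, which is where the reported speed-up would come from.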

Results

Table 1: Main results. We apply FlowTurbo to SiT-XL and the 2-RF of InstaFlow for class-conditional image generation and text-to-image generation, respectively. Image quality is measured by FID 50K↓ on ImageNet (256×256) and FID 5K↓ on MS COCO 2017 (512×512). The suffix denotes the numbers of Heun's method blocks (H), pseudo corrector blocks (P), and velocity refiner blocks (R). Our results demonstrate that FlowTurbo can significantly accelerate the inference of flow-based models while achieving better sampling quality.

Table 2: Comparison with the state of the art. We compare the sampling quality and speed of different methods on ImageNet 256×256 class-conditional generation. FlowTurbo improves significantly over the baseline SiT-XL and, across its configurations, achieves the fastest sampling (38 ms / img) and the best quality (2.12 FID).


BibTeX

@article{zhao2024flowturbo,
  title={FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner},
  author={Zhao, Wenliang and Shi, Minglei and Yu, Xumin and Zhou, Jie and Lu, Jiwen},
  journal={NeurIPS},
  year={2024}
}