🚀 FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner
Wenliang Zhao*  Minglei Shi*  Xumin Yu   Jie Zhou  Jiwen Lu
 Tsinghua University
[Paper (arXiv)]   [Code (GitHub)]
We propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our key observation is that the velocity predictor's outputs in flow-based models become stable during sampling, enabling the estimation of the velocity via a lightweight velocity refiner. FlowTurbo is efficient in both training (<6 GPU hours) and inference (~40 ms/img).
Figure 1: Visualization of the curvatures of the sampling trajectories of different models. We compare the curvatures of the model predictions of a standard diffusion model (DiT) and several flow-based models (SiT, SD3-Medium, FLUX.1-dev, and Open-Sora) during sampling. We observe that the velocity vθ predicted by flow-based models is much more stable during sampling than the noise ϵ predicted by diffusion models, which motivates us to seek a more lightweight estimation model to reduce the sampling costs of flow-based generative models.
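To make this observation concrete, here is a minimal sketch of how the step-to-step stability of the velocity predictions could be measured along an Euler sampling trajectory. The `velocity_model` callable and the measurement itself are illustrative assumptions, not the released FlowTurbo code:

```python
import torch

@torch.no_grad()
def velocity_stability(velocity_model, x, num_steps=100):
    """Euler-sample a flow-based model and record how much the
    predicted velocity changes between consecutive steps.

    `velocity_model(x, t)` is a hypothetical callable returning v_theta;
    small relative changes indicate a near-straight trajectory.
    """
    dt = 1.0 / num_steps
    changes, v_prev = [], None
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        v = velocity_model(x, t)                       # v_theta(x_t, t)
        if v_prev is not None:
            rel = (v - v_prev).norm() / v_prev.norm()  # step-to-step drift
            changes.append(rel.item())
        x = x + dt * v                                 # Euler update
        v_prev = v
    return changes  # for flow models, these stay small across most steps
```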
Building on the success of diffusion models in visual generation, flow-based models have reemerged as another prominent family of generative models, achieving competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow matching, flow-based models tend to produce straighter sampling trajectories, which is advantageous during sampling. However, unlike diffusion models, for which fast samplers are well-developed, efficient sampling of flow-based generative models has been rarely explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor's outputs in flow-based models become stable during sampling, enabling the estimation of the velocity via a lightweight velocity refiner. Additionally, we introduce several techniques, including a pseudo corrector and sample-aware compilation, to further reduce inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied to various tasks such as image editing, inpainting, etc. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1%∼58.3% on class-conditional generation and 29.8%∼38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet at 100 ms/img and an FID of 3.93 at 38 ms/img, achieving real-time image generation and establishing a new state of the art.
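As a rough sketch of the velocity-refiner idea from the abstract, the snippet below interleaves full velocity predictions with a lightweight refiner that regresses only the offset from the cached velocity. The `model` and `refiner` signatures and the interleaving schedule are assumptions for illustration; the actual design is described in the paper:

```python
import torch

@torch.no_grad()
def sample_with_refiner(model, refiner, x, num_steps=8, refine_every=2):
    """Euler sampling where most steps use a lightweight refiner.

    `model(x, t)` is the heavy velocity predictor; `refiner(x, t)` is a
    small network trained to regress the offset of the velocity field,
    so v ~= v_cached + refiner(x, t). Both callables are hypothetical.
    """
    dt = 1.0 / num_steps
    v_cached = None
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        if v_cached is None or i % refine_every == 0:
            v_cached = model(x, t)               # full, expensive prediction
        else:
            v_cached = v_cached + refiner(x, t)  # cheap residual update
        x = x + dt * v_cached                    # Euler step
    return x
```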
Figure 2: Overview of FlowTurbo. (a) Motivated by the stability of the velocity predictor’s outputs during the sampling, we propose to learn a lightweight velocity refiner to regress the offset of the velocity field. (b)(c) We propose the pseudo corrector which leverages a velocity cache to reduce the number of model evaluations while maintaining the same convergence order as Heun’s method. (d) During sampling, we employ a combination of Heun’s method, the pseudo corrector, and the velocity refiner, where each sample block is processed with the proposed sample-aware compilation.
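The pseudo corrector in (b)(c) can be pictured as a Heun-style step that reuses the previous step's corrector velocity from a cache, so each step costs one model evaluation instead of Heun's two. The sketch below is a minimal reading of that idea; the exact mixing of Heun, pseudo-corrector, and refiner steps and the sample-aware compilation are omitted, and `model(x, t)` is a hypothetical velocity predictor:

```python
import torch

@torch.no_grad()
def pseudo_corrector_sample(model, x, num_steps=8):
    """Heun-style sampling with a velocity cache.

    Each step reuses the previous step's corrector velocity as the
    predictor slope, so only one new model evaluation is needed per
    step while keeping a Heun-like two-slope update.
    """
    dt = 1.0 / num_steps
    t0 = torch.zeros(x.shape[0], device=x.device)
    v_cache = model(x, t0)                       # one call to warm the cache
    for i in range(num_steps):
        t_next = torch.full((x.shape[0],), (i + 1) * dt, device=x.device)
        x_pred = x + dt * v_cache                # predictor step (cached slope)
        v_next = model(x_pred, t_next)           # single new evaluation
        x = x + dt / 2 * (v_cache + v_next)      # Heun-style corrector update
        v_cache = v_next                         # cache for the next step
    return x
```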
In our experiments, we consider two widely used benchmarks: class-conditional image generation and text-to-image generation. For class-conditional image generation, we adopt SiT-XL, a transformer-based flow model pre-trained on ImageNet 256×256. For text-to-image generation, we utilize InstaFlow as the flow-based model, whose backbone is a U-Net similar to Stable Diffusion.
Notably, FlowTurbo significantly improves over the baseline SiT-XL, achieving the fastest sampling (38 ms/img) and the best quality (2.12 FID) under different configurations.
@inproceedings{zhao2024flowturbo,
  title={FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner},
  author={Zhao, Wenliang and Shi, Minglei and Yu, Xumin and Zhou, Jie and Lu, Jiwen},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024}
}