Removing an unwanted object from an image now takes just a few seconds on a smartphone.
Doing the same on a video can require tens of minutes of computation on a modern GPU.
This accessibility gap is no coincidence: video inpainting relies predominantly on diffusion models,
whose computational cost grows linearly with the temporal dimension.
We propose FM²FVI, a frugal approach to video inpainting based on
Flow Matching, a recent generative modeling framework
closely related to diffusion models. Flow Matching offers several advantages:
mathematical simplicity, the ability to handle non-Gaussian source distributions,
connections to optimal transport, and, most importantly, fast sampling with few function evaluations.
Our first contribution is a complete and modular library for training Flow Matching models,
supporting all state-of-the-art design choices, including schedulers, sampling strategies,
loss functions, ODE solvers, and guidance modes.
We then develop two image inpainting methods that rely solely on image self-similarity,
avoiding priors learned from large datasets, whose use may raise ethical or legal concerns.
Finally, we extend our most promising approach to video inpainting, demonstrating that
a model with fewer than 500,000 parameters, trained on a single video,
can achieve visually satisfying results.
This result is a step toward democratizing video editing
on consumer hardware while reducing its energy consumption and environmental impact.
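
To make the Flow Matching ingredients named above more concrete, the following is a minimal NumPy sketch of a conditional Flow Matching loss on a straight-line path and of explicit Euler sampling. It is an illustration under stated assumptions, not the FM²FVI library: the function names `cfm_loss`, `euler_sample`, and `exact_velocity` are hypothetical, the linear path is only one possible scheduler, and the toy target (a single point `mu`) is chosen because its exact velocity field is known in closed form.

```python
import numpy as np


def cfm_loss(velocity_fn, x0, x1, t):
    """Conditional Flow Matching loss for the straight-line path.

    x0: source samples, x1: target samples, both of shape (batch, dim).
    t:  times in [0, 1), shape (batch,).
    """
    t = t.reshape(-1, 1)                  # broadcast over the feature axis
    x_t = (1.0 - t) * x0 + t * x1         # point on the line from x0 to x1
    target = x1 - x0                      # constant velocity along that line
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with explicit Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t)    # one function evaluation per step
    return x


# Toy setting (an assumption for illustration): the target distribution is a
# single point mu, so the exact velocity field of the straight-line path is
# v(x, t) = (mu - x) / (1 - t).
mu = np.array([2.0, -1.0])


def exact_velocity(x, t):
    return (mu - x) / (1.0 - t)


rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 2))        # Gaussian source samples
x1 = np.broadcast_to(mu, x0.shape)        # every target sample equals mu
t = rng.uniform(0.0, 1.0, size=256)       # times drawn from [0, 1)

loss = cfm_loss(exact_velocity, x0, x1, t)                 # close to zero
samples = euler_sample(exact_velocity, x0, num_steps=1)    # lands on mu
```

In this toy case the trajectories are straight lines, so a single Euler step already reaches the target; this is the idealized situation behind sampling with few function evaluations. A trained network would replace `exact_velocity`, and curved trajectories would generally need more steps or a higher-order ODE solver.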