VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
[Paper] | [Project Page] | [GitHub]
[🤗 Online Demo] | [🤗 Dataset Card]
If you find VisualCloze helpful, please consider starring ⭐ the GitHub repo. Thanks!
Note: The weights in this repository are intended for our training and testing code available on GitHub. For usage with Diffusers, please refer to the links above.
📰 News
- [2025-5-15] 🤗🤗🤗 VisualCloze has been merged into the official pipelines of diffusers. For usage guidance, please refer to the Model Card.
🌟 Key Features
An in-context learning based universal image generation framework.
- Support various in-domain tasks.
- Generalize to unseen tasks through in-context learning.
- Unify multiple tasks into one step, generating both the target image and intermediate results.
- Support reverse-engineering a set of conditions from a target image.
🔥 Examples are shown on the project page.
📦 Model
We release visualcloze-384-lora.pth and visualcloze-512-lora.pth, which are trained with grid resolutions of 384 and 512, respectively. The grid resolution means that each image is resized so that its area equals the square of this value before the images are concatenated into a grid layout. We use SDEdit to upsample the generated images.
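The area-based resize described above can be sketched as follows. This is a minimal illustration, assuming an aspect-preserving resize so that the image area approximately equals `resolution**2`; the exact preprocessing in the official repo may differ:

```python
from PIL import Image


def resize_to_area(img: Image.Image, resolution: int = 384) -> Image.Image:
    """Resize img, keeping its aspect ratio, so its area is ~ resolution**2."""
    target_area = resolution * resolution
    w, h = img.size
    scale = (target_area / (w * h)) ** 0.5
    new_w = max(1, round(w * scale))
    new_h = max(1, round(h * scale))
    return img.resize((new_w, new_h), Image.LANCZOS)
```

For example, an 800x600 input resized with `resolution=384` keeps its 4:3 aspect ratio while its pixel count shrinks to roughly 384².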
🔧 Installation
git clone https://github.com/lzyhha/VisualCloze
Please configure the environment via the installation instructions.
🚀 Web Demo (Gradio)
To host a local Gradio demo for interactive inference, run the following command:
# By default, we use the model trained under the grid resolution of 384.
python demo.py --model_path "path to downloaded visualcloze-384-lora.pth" --resolution 384
# To use the model with the grid resolution of 512, you should set the resolution parameter to 512.
python demo.py --model_path "path to downloaded visualcloze-512-lora.pth" --resolution 512
💻 Custom Sampling
Note: 🎉🎉🎉 We have implemented a diffusers-compatible version that makes the model easier to use through diffusers pipelines. For usage guidance, please refer to the Model Card.
We have implemented a VisualCloze pipeline in our official code, which can easily be used for custom inference. On GitHub, we show a usage example for virtual try-on.
from visualcloze import VisualClozeModel
model = VisualClozeModel(
    model_path="the path of model weights",
    resolution=384,  # or 512, matching the downloaded weights
    lora_rank=256,
)
'''
grid_h:
    The number of in-context examples + 1.
    When there is no in-context example, it should be set to 1.
grid_w:
    The number of images involved in a task.
    In the Depth-to-Image task, it is 2.
    In the Virtual Try-On task, it is 3.
'''
model.set_grid_size(grid_h, grid_w)
'''
images:
    List[List[PIL.Image.Image]]. A grid-layout image collection;
    each row represents an in-context example or the current query,
    and the current query should be placed in the last row.
    The target image can be None in the input.
    All other images must be PIL images (Image.Image).
prompts:
    List[str]. Three prompts, representing the layout prompt, task prompt,
    and content prompt, respectively.
'''
result = model.process_images(
    images,
    prompts,
)[-1]  # returns PIL.Image.Image
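As a concrete illustration of the expected `images` layout for virtual try-on (one in-context example, so `grid_h=2` and `grid_w=3`), the nested list can be built and sanity-checked as below. The `build_grid` helper is a hypothetical validator written from the documented contract, not part of the official API:

```python
from PIL import Image


def build_grid(rows, grid_h, grid_w):
    """Validate a grid-layout image collection for VisualCloze.

    rows: List[List[PIL.Image.Image | None]]; the query row comes last,
    and only its target (last) entry may be None.
    """
    assert len(rows) == grid_h, "grid_h must equal the number of rows"
    for i, row in enumerate(rows):
        assert len(row) == grid_w, "grid_w must equal images per row"
        for j, img in enumerate(row):
            is_target = (i == grid_h - 1 and j == grid_w - 1)
            assert is_target or isinstance(img, Image.Image), \
                "only the query's target entry may be None"
    return rows


# Virtual try-on with one in-context example: grid_h=2, grid_w=3.
# Placeholder images stand in for person / garment / try-on result.
example = [Image.new("RGB", (64, 64)) for _ in range(3)]
query = [Image.new("RGB", (64, 64)), Image.new("RGB", (64, 64)), None]
images = build_grid([example, query], grid_h=2, grid_w=3)
```

The resulting `images` list can then be passed to `model.process_images` together with the three prompts.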
Execute the usage example and view the output in example.jpg:
python inference.py --model_path "path to downloaded visualcloze-384-lora.pth" --resolution 384
python inference.py --model_path "path to downloaded visualcloze-512-lora.pth" --resolution 512
Citation
If you find VisualCloze useful for your research and applications, please cite using this BibTeX:
@InProceedings{Li_2025_ICCV,
    author    = {Li, Zhong-Yu and Du, Ruoyi and Yan, Juncheng and Zhuo, Le and Li, Zhen and Gao, Peng and Ma, Zhanyu and Cheng, Ming-Ming},
    title     = {VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {18969-18979}
}
Base model: black-forest-labs/FLUX.1-Fill-dev