toread - a CCMat Collection
toread updated Apr 29, 2025
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with
Fine-Grained Chinese Understanding Paper
• 2405.08748
• Published May 14, 2024 • 23
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection Paper
• 2405.10300
• Published May 16, 2024 • 31
Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper
• 2405.09818
• Published May 16, 2024 • 134
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper
• 2405.11143
• Published May 20, 2024 • 41
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning Paper
• 2405.12130
• Published May 20, 2024 • 50
FIFO-Diffusion: Generating Infinite Videos from Text without Training Paper
• 2405.11473
• Published May 19, 2024 • 56
Your Transformer is Secretly Linear Paper
• 2405.12250
• Published May 19, 2024 • 157
Matryoshka Multimodal Models Paper
• 2405.17430
• Published May 27, 2024 • 34
An Introduction to Vision-Language Modeling Paper
• 2405.17247
• Published May 27, 2024 • 90
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal
Models Paper
• 2405.15738
• Published May 24, 2024 • 46
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis Paper
• 2403.03206
• Published Mar 5, 2024 • 71
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model Paper
• 2406.04333
• Published Jun 6, 2024 • 38
ShareGPT4Video: Improving Video Understanding and Generation with Better
Captions Paper
• 2406.04325
• Published Jun 6, 2024 • 74
Block Transformer: Global-to-Local Language Modeling for Fast Inference Paper
• 2406.02657
• Published Jun 4, 2024 • 41
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and
Resolution Paper
• 2307.06304
• Published Jul 12, 2023 • 35
OpenELM: An Efficient Language Model Family with Open-source Training
and Inference Framework Paper
• 2404.14619
• Published Apr 22, 2024 • 126
Multi-Head Mixture-of-Experts Paper
• 2404.15045
• Published Apr 23, 2024 • 60
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision
Models Paper
• 2405.15574
• Published May 24, 2024 • 55
Paper
• 2405.18407
• Published May 28, 2024 • 48
Transformers are SSMs: Generalized Models and Efficient Algorithms
Through Structured State Space Duality Paper
• 2405.21060
• Published May 31, 2024 • 68
CRAG -- Comprehensive RAG Benchmark Paper
• 2406.04744
• Published Jun 7, 2024 • 46
DiTFastAttn: Attention Compression for Diffusion Transformer Models Paper
• 2406.08552
• Published Jun 12, 2024 • 25
Paper
• 2406.09414
• Published Jun 13, 2024 • 103
An Image is Worth More Than 16x16 Patches: Exploring Transformers on
Individual Pixels Paper
• 2406.09415
• Published Jun 13, 2024 • 51
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN
Inversion and High Quality Image Editing Paper
• 2406.10601
• Published Jun 15, 2024 • 70
Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective
Distillation and Unlabeled Data Augmentation Paper
• 2406.12849
• Published Jun 18, 2024 • 50
Adam-mini: Use Fewer Learning Rates To Gain More Paper
• 2406.16793
• Published Jun 24, 2024 • 69
DreamBench++: A Human-Aligned Benchmark for Personalized Image
Generation Paper
• 2406.16855
• Published Jun 24, 2024 • 57
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion Paper
• 2407.01392
• Published Jul 1, 2024 • 44
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
Supporting Long-Contextual Input and Output Paper
• 2407.03320
• Published Jul 3, 2024 • 94
Video Diffusion Alignment via Reward Gradients Paper
• 2407.08737
• Published Jul 11, 2024 • 49
Paper
• 2407.10671
• Published Jul 15, 2024 • 171
Theia: Distilling Diverse Vision Foundation Models for Robot Learning Paper
• 2407.20179
• Published Jul 29, 2024 • 47
Gemma 2: Improving Open Language Models at a Practical Size Paper
• 2408.00118
• Published Jul 31, 2024 • 78
The Llama 3 Herd of Models Paper
• 2407.21783
• Published Jul 31, 2024 • 118
SAM 2: Segment Anything in Images and Videos Paper
• 2408.00714
• Published Aug 1, 2024 • 122
MiniCPM-V: A GPT-4V Level MLLM on Your Phone Paper
• 2408.01800
• Published Aug 3, 2024 • 94
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation
with Multimodal Generative Pretraining Paper
• 2408.02657
• Published Aug 5, 2024 • 35
MMIU: Multimodal Multi-image Understanding for Evaluating Large
Vision-Language Models Paper
• 2408.02718
• Published Aug 5, 2024 • 62
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards
General Medical AI Paper
• 2408.03361
• Published Aug 6, 2024 • 85
An Object is Worth 64x64 Pixels: Generating 3D Object via Image
Diffusion Paper
• 2408.03178
• Published Aug 6, 2024 • 40
LLaVA-OneVision: Easy Visual Task Transfer Paper
• 2408.03326
• Published Aug 6, 2024 • 61
Transformer Explainer: Interactive Learning of Text-Generative Models Paper
• 2408.04619
• Published Aug 8, 2024 • 175
ControlNeXt: Powerful and Efficient Control for Image and Video
Generation Paper
• 2408.06070
• Published Aug 12, 2024 • 55
Qwen2-Audio Technical Report Paper
• 2407.10759
• Published Jul 15, 2024 • 64
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill
and Extreme KV-Cache Compression Paper
• 2407.12077
• Published Jul 16, 2024 • 57
Compact Language Models via Pruning and Knowledge Distillation Paper
• 2407.14679
• Published Jul 19, 2024 • 39
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language
Models Paper
• 2407.15841
• Published Jul 22, 2024 • 39
KAN or MLP: A Fairer Comparison Paper
• 2407.16674
• Published Jul 23, 2024 • 43
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence Paper
• 2407.16655
• Published Jul 23, 2024 • 30
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any
Person Paper
• 2407.16224
• Published Jul 23, 2024 • 29
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh
Tokenization Paper
• 2408.02555
• Published Aug 5, 2024 • 31
Mixture of Nested Experts: Adaptive Processing of Visual Tokens Paper
• 2407.19985
• Published Jul 29, 2024 • 37
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model Paper
• 2407.16982
• Published Jul 24, 2024 • 42
VILA^2: VILA Augmented VILA Paper
• 2407.17453
• Published Jul 24, 2024 • 41
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer Paper
• 2408.06072
• Published Aug 12, 2024 • 38
Paper
• 2408.07009
• Published Aug 13, 2024 • 62
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper
• 2408.08872
• Published Aug 16, 2024 • 101
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction
Model Paper
• 2408.10198
• Published Aug 19, 2024 • 35
Transfusion: Predict the Next Token and Diffuse Images with One
Multi-Modal Model Paper
• 2408.11039
• Published Aug 20, 2024 • 63
Sapiens: Foundation for Human Vision Models Paper
• 2408.12569
• Published Aug 22, 2024 • 94
DreamCinema: Cinematic Transfer with Free Camera and 3D Character Paper
• 2408.12601
• Published Aug 22, 2024 • 32
Building and better understanding vision-language models: insights and
future directions Paper
• 2408.12637
• Published Aug 22, 2024 • 133
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation Paper
• 2408.13252
• Published Aug 23, 2024 • 26
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its
Teacher Paper
• 2408.14176
• Published Aug 26, 2024 • 62
Foundation Models for Music: A Survey Paper
• 2408.14340
• Published Aug 26, 2024 • 44
Diffusion Models Are Real-Time Game Engines Paper
• 2408.14837
• Published Aug 27, 2024 • 126
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of
Encoders Paper
• 2408.15998
• Published Aug 28, 2024 • 86
CogVLM2: Visual Language Models for Image and Video Understanding Paper
• 2408.16500
• Published Aug 29, 2024 • 57
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio
Language Modeling Paper
• 2408.16532
• Published Aug 29, 2024 • 50
LinFusion: 1 GPU, 1 Minute, 16K Image Paper
• 2409.02097
• Published Sep 3, 2024 • 34
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion
Dependency Paper
• 2409.02634
• Published Sep 4, 2024 • 97
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free
Real Image Editing Paper
• 2409.01322
• Published Sep 2, 2024 • 96
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with
Image-Based Surface Representation Paper
• 2409.03718
• Published Sep 5, 2024 • 27
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language
Models Paper
• 2404.12387
• Published Apr 18, 2024 • 40
Dynamic Typography: Bringing Words to Life Paper
• 2404.11614
• Published Apr 17, 2024 • 46
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your
Phone Paper
• 2404.14219
• Published Apr 22, 2024 • 259
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding Paper
• 2404.16710
• Published Apr 25, 2024 • 80
Iterative Reasoning Preference Optimization Paper
• 2404.19733
• Published Apr 30, 2024 • 50
KAN: Kolmogorov-Arnold Networks Paper
• 2404.19756
• Published Apr 30, 2024 • 116
OmniGen: Unified Image Generation Paper
• 2409.11340
• Published Sep 17, 2024 • 115
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video
Diffusion Models Paper
• 2409.07452
• Published Sep 11, 2024 • 21
Towards a Unified View of Preference Learning for Large Language Models:
A Survey Paper
• 2409.02795
• Published Sep 4, 2024 • 72
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think Paper
• 2409.11355
• Published Sep 17, 2024 • 30
Qwen2.5-Coder Technical Report Paper
• 2409.12186
• Published Sep 18, 2024 • 154
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at
Any Resolution Paper
• 2409.12191
• Published Sep 18, 2024 • 79
VideoPoet: A Large Language Model for Zero-Shot Video Generation Paper
• 2312.14125
• Published Dec 21, 2023 • 47
Training Language Models to Self-Correct via Reinforcement Learning Paper
• 2409.12917
• Published Sep 19, 2024 • 140
Imagine yourself: Tuning-Free Personalized Image Generation Paper
• 2409.13346
• Published Sep 20, 2024 • 69
Colorful Diffuse Intrinsic Image Decomposition in the Wild Paper
• 2409.13690
• Published Sep 20, 2024 • 13
Emu3: Next-Token Prediction is All You Need Paper
• 2409.18869
• Published Sep 27, 2024 • 98
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper
• 2409.20566
• Published Sep 30, 2024 • 54
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal
Foundation Models Paper
• 2410.02740
• Published Oct 3, 2024 • 53
Loong: Generating Minute-level Long Videos with Autoregressive Language
Models Paper
• 2410.02757
• Published Oct 3, 2024 • 36
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second Paper
• 2410.02073
• Published Oct 2, 2024 • 43
Baichuan-Omni Technical Report Paper
• 2410.08565
• Published Oct 11, 2024 • 87
Animate-X: Universal Character Image Animation with Enhanced Motion
Representation Paper
• 2410.10306
• Published Oct 14, 2024 • 57
Efficient Diffusion Models: A Comprehensive Survey from Principles to
Practices Paper
• 2410.11795
• Published Oct 15, 2024 • 18
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a
Training-Free Memory Tree Paper
• 2410.16268
• Published Oct 21, 2024 • 70
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes Paper
• 2410.17249
• Published Oct 22, 2024 • 44
Movie Gen: A Cast of Media Foundation Models Paper
• 2410.13720
• Published Oct 17, 2024 • 100
Fluid: Scaling Autoregressive Text-to-image Generative Models with
Continuous Tokens Paper
• 2410.13863
• Published Oct 17, 2024 • 37
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without
Learned Priors Paper
• 2410.16271
• Published Oct 21, 2024 • 84
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation Paper
• 2410.13861
• Published Oct 17, 2024 • 56
Unbounded: A Generative Infinite Game of Character Life Simulation Paper
• 2410.18975
• Published Oct 24, 2024 • 37
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for
Contrastive Loss Paper
• 2410.17243
• Published Oct 22, 2024 • 92
Representation Alignment for Generation: Training Diffusion Transformers
Is Easier Than You Think Paper
• 2410.06940
• Published Oct 9, 2024 • 12
Addition is All You Need for Energy-efficient Language Models Paper
• 2410.00907
• Published Oct 1, 2024 • 151
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
and Generation Paper
• 2410.13848
• Published Oct 17, 2024 • 36
Semantic Image Inversion and Editing using Rectified Stochastic
Differential Equations Paper
• 2410.10792
• Published Oct 14, 2024 • 31
CLEAR: Character Unlearning in Textual and Visual Modalities Paper
• 2410.18057
• Published Oct 23, 2024 • 209
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation Paper
• 2411.04997
• Published Nov 7, 2024 • 39
Add-it: Training-Free Object Insertion in Images With Pretrained
Diffusion Models Paper
• 2411.07232
• Published Nov 11, 2024 • 68
OmniEdit: Building Image Editing Generalist Models Through Specialist
Supervision Paper
• 2411.07199
• Published Nov 11, 2024 • 50
Large Language Models Can Self-Improve in Long-context Reasoning Paper
• 2411.08147
• Published Nov 12, 2024 • 65
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video
Generation Paper
• 2411.08380
• Published Nov 13, 2024 • 25
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models Paper
• 2411.09595
• Published Nov 14, 2024 • 77
MagicQuill: An Intelligent Interactive Image Editing System Paper
• 2411.09703
• Published Nov 14, 2024 • 80
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper
• 2411.10440
• Published Nov 15, 2024 • 129
Region-Aware Text-to-Image Generation via Hard Binding and Soft
Refinement Paper
• 2411.06558
• Published Nov 10, 2024 • 36
AnimateAnything: Consistent and Controllable Animation for Video
Generation Paper
• 2411.10836
• Published Nov 16, 2024 • 24
RedPajama: an Open Dataset for Training Large Language Models Paper
• 2411.12372
• Published Nov 19, 2024 • 58
SageAttention2 Technical Report: Accurate 4 Bit Attention for
Plug-and-play Inference Acceleration Paper
• 2411.10958
• Published Nov 17, 2024 • 57
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking
with Motion-Aware Memory Paper
• 2411.11922
• Published Nov 18, 2024 • 19
Stable Flow: Vital Layers for Training-Free Image Editing Paper
• 2411.14430
• Published Nov 21, 2024 • 22
Style-Friendly SNR Sampler for Style-Driven Generation Paper
• 2411.14793
• Published Nov 22, 2024 • 39
Star Attention: Efficient LLM Inference over Long Sequences Paper
• 2411.17116
• Published Nov 26, 2024 • 53
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot
Subject-Driven Image Generator Paper
• 2411.15466
• Published Nov 23, 2024 • 39
Material Anything: Generating Materials for Any 3D Object via Diffusion Paper
• 2411.15138
• Published Nov 22, 2024 • 50
OminiControl: Minimal and Universal Control for Diffusion Transformer Paper
• 2411.15098
• Published Nov 22, 2024 • 61
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent
Video Diffusion Model Paper
• 2411.17459
• Published Nov 26, 2024 • 12
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video
Generation Paper
• 2412.02259
• Published Dec 3, 2024 • 60
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's
Reasoning Capability Paper
• 2411.19943
• Published Nov 29, 2024 • 63
PaliGemma 2: A Family of Versatile VLMs for Transfer Paper
• 2412.03555
• Published Dec 4, 2024 • 135
SNOOPI: Supercharged One-step Diffusion Distillation with Proper
Guidance Paper
• 2412.02687
• Published Dec 3, 2024 • 113
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and
Generation Paper
• 2412.03069
• Published Dec 4, 2024 • 34
Imagine360: Immersive 360 Video Generation from Perspective Anchor Paper
• 2412.03552
• Published Dec 4, 2024 • 29
Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion Paper
• 2412.03515
• Published Dec 4, 2024 • 27
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking
Portrait Paper
• 2412.01064
• Published Dec 2, 2024 • 47
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding
by Video Spatiotemporal Augmentation Paper
• 2412.00927
• Published Dec 1, 2024 • 29
Open-Sora Plan: Open-Source Large Video Generation Model Paper
• 2412.00131
• Published Nov 28, 2024 • 33
SpotLight: Shadow-Guided Object Relighting via Diffusion Paper
• 2411.18665
• Published Nov 27, 2024 • 3
Video Depth without Video Models Paper
• 2411.19189
• Published Nov 28, 2024 • 39
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction
using Diffusion Models Paper
• 2411.18350
• Published Nov 27, 2024 • 28
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models Paper
• 2411.18613
• Published Nov 27, 2024 • 59
Pathways on the Image Manifold: Image Editing via Video Generation Paper
• 2411.16819
• Published Nov 25, 2024 • 37
Identity-Preserving Text-to-Video Generation by Frequency Decomposition Paper
• 2411.17440
• Published Nov 26, 2024 • 38
ROICtrl: Boosting Instance Control for Visual Generation Paper
• 2411.17949
• Published Nov 27, 2024 • 87
LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene
Relighting Paper
• 2412.00177
• Published Nov 29, 2024 • 8
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper
• 2412.04467
• Published Dec 5, 2024 • 118
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion Paper
• 2412.04424
• Published Dec 5, 2024 • 62
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction Paper
• 2412.04454
• Published Dec 5, 2024 • 71
Structured 3D Latents for Scalable and Versatile 3D Generation Paper
• 2412.01506
• Published Dec 2, 2024 • 88
A Noise is Worth Diffusion Guidance Paper
• 2412.03895
• Published Dec 5, 2024 • 29
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent
Diffusion Models Paper
• 2412.04146
• Published Dec 5, 2024 • 23
Expanding Performance Boundaries of Open-Source Multimodal Models with
Model, Data, and Test-Time Scaling Paper
• 2412.05271
• Published Dec 6, 2024 • 160
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step
Diffusion Paper
• 2412.04301
• Published Dec 5, 2024 • 40
APOLLO: SGD-like Memory, AdamW-level Performance Paper
• 2412.05270
• Published Dec 6, 2024 • 37
STIV: Scalable Text and Image Conditioned Video Generation Paper
• 2412.07730
• Published Dec 10, 2024 • 74
UniReal: Universal Image Generation and Editing via Learning Real-world
Dynamics Paper
• 2412.07774
• Published Dec 10, 2024 • 30
Video Motion Transfer with Diffusion Transformers Paper
• 2412.07776
• Published Dec 10, 2024 • 17
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse
Viewpoints Paper
• 2412.07760
• Published Dec 10, 2024 • 55
StyleMaster: Stylize Your Video with Artistic Generation and Translation Paper
• 2412.07744
• Published Dec 10, 2024 • 20
Track4Gen: Teaching Video Diffusion Models to Track Points Improves
Video Generation Paper
• 2412.06016
• Published Dec 8, 2024 • 20
Learning Flow Fields in Attention for Controllable Person Image
Generation Paper
• 2412.08486
• Published Dec 11, 2024 • 36
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
Long-term Streaming Video and Audio Interactions Paper
• 2412.09596
• Published Dec 12, 2024 • 97
Paper
• 2412.08905
• Published Dec 12, 2024 • 123
Neural LightRig: Unlocking Accurate Object Normal and Material
Estimation with Multi-Light Diffusion Paper
• 2412.09593
• Published Dec 12, 2024 • 18
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution Paper
• 2412.15213
• Published Dec 19, 2024 • 28
Parallelized Autoregressive Visual Generation Paper
• 2412.15119
• Published Dec 19, 2024 • 53
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation Paper
• 2412.13649
• Published Dec 18, 2024 • 21
B-STaR: Monitoring and Balancing Exploration and Exploitation in
Self-Taught Reasoners Paper
• 2412.17256
• Published Dec 23, 2024 • 47
Paper
• 2412.15115
• Published Dec 19, 2024 • 377
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper
• 2412.10360
• Published Dec 13, 2024 • 147
GenEx: Generating an Explorable World Paper
• 2412.09624
• Published Dec 12, 2024 • 98
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding Paper
• 2412.09604
• Published Dec 12, 2024 • 38
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free
Scale Fusion Paper
• 2412.09626
• Published Dec 12, 2024 • 21
InstanceCap: Improving Text-to-Video Generation via Instance-aware
Structured Caption Paper
• 2412.09283
• Published Dec 12, 2024 • 19
Byte Latent Transformer: Patches Scale Better Than Tokens Paper
• 2412.09871
• Published Dec 13, 2024 • 108
BrushEdit: All-In-One Image Inpainting and Editing Paper
• 2412.10316
• Published Dec 13, 2024 • 36
ColorFlow: Retrieval-Augmented Image Sequence Colorization Paper
• 2412.11815
• Published Dec 16, 2024 • 26
Thinking in Space: How Multimodal Large Language Models See, Remember,
and Recall Spaces Paper
• 2412.14171
• Published Dec 18, 2024 • 25
Diffusion360: Seamless 360 Degree Panoramic Image Generation based on
Diffusion Models Paper
• 2311.13141
• Published Nov 22, 2023 • 16
2.5 Years in Class: A Multimodal Textbook for Vision-Language
Pretraining Paper
• 2501.00958
• Published Jan 1, 2025 • 110
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion
Control Paper
• 2501.01427
• Published Jan 2, 2025 • 53
LTX-Video: Realtime Video Latent Diffusion Paper
• 2501.00103
• Published Dec 30, 2024 • 50
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent
Diffusion Models Paper
• 2501.01423
• Published Jan 2, 2025 • 44
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse
Task Synthesis Paper
• 2412.19723
• Published Dec 27, 2024 • 87
Paper
• 2412.18653
• Published Dec 24, 2024 • 86
Orient Anything: Learning Robust Object Orientation Estimation from
Rendering 3D Models Paper
• 2412.18605
• Published Dec 24, 2024 • 21
DepthLab: From Partial to Complete Paper
• 2412.18153
• Published Dec 24, 2024 • 36
Fourier Position Embedding: Enhancing Attention's Periodic Extension for
Length Generalization Paper
• 2412.17739
• Published Dec 23, 2024 • 41
DynamicScaler: Seamless and Scalable Video Generation for Panoramic
Scenes Paper
• 2412.11100
• Published Dec 15, 2024 • 7
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices
with Efficient Architectures and Training Paper
• 2412.09619
• Published Dec 12, 2024 • 31
PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh
Representations Paper
• 2412.05994
• Published Dec 8, 2024 • 19
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex
Image-Text Models with Structural Annotations Paper
• 2412.08580
• Published Dec 11, 2024 • 45
StreamChat: Chatting with Streaming Video Paper
• 2412.08646
• Published Dec 11, 2024 • 18
Generative Densification: Learning to Densify Gaussians for
High-Fidelity Generalizable 3D Reconstruction Paper
• 2412.06234
• Published Dec 9, 2024 • 19
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion
Transformer Paper
• 2412.07720
• Published Dec 10, 2024 • 31
Around the World in 80 Timesteps: A Generative Approach to Global Visual
Geolocation Paper
• 2412.06781
• Published Dec 9, 2024 • 23
3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes Paper
• 2411.14974
• Published Nov 22, 2024 • 15
TEXGen: a Generative Diffusion Model for Mesh Textures Paper
• 2411.14740
• Published Nov 22, 2024 • 17
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting
Synthesis Paper
• 2411.16443
• Published Nov 25, 2024 • 11
Mixture-of-Transformers: A Sparse and Scalable Architecture for
Multi-Modal Foundation Models Paper
• 2411.04996
• Published Nov 7, 2024 • 51
DimensionX: Create Any 3D and 4D Scenes from a Single Image with
Controllable Video Diffusion Paper
• 2411.04928
• Published Nov 7, 2024 • 56
ReCapture: Generative Video Camera Controls for User-Provided Videos
using Masked Video Fine-Tuning Paper
• 2411.05003
• Published Nov 7, 2024 • 71
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM
Quantization Paper
• 2411.02355
• Published Nov 4, 2024 • 52
How Far is Video Generation from World Model: A Physical Law Perspective Paper
• 2411.02385
• Published Nov 4, 2024 • 34
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated
Parameters by Tencent Paper
• 2411.02265
• Published Nov 4, 2024 • 25
Adaptive Caching for Faster Video Generation with Diffusion Transformers Paper
• 2411.02397
• Published Nov 4, 2024 • 23
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D Paper
• 2411.02336
• Published Nov 4, 2024 • 24
AutoVFX: Physically Realistic Video Editing from Natural Language
Instructions Paper
• 2411.02394
• Published Nov 4, 2024 • 16
GenXD: Generating Any 3D and 4D Scenes Paper
• 2411.02319
• Published Nov 4, 2024 • 20
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse
Autoencoders Paper
• 2410.22366
• Published Oct 28, 2024 • 84
One Shot, One Talk: Whole-body Talking Avatar from a Single Image Paper
• 2412.01106
• Published Dec 2, 2024 • 24
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling Paper
• 2411.18664
• Published Nov 27, 2024 • 24
FAM Diffusion: Frequency and Attention Modulation for High-Resolution
Image Generation with Stable Diffusion Paper
• 2411.18552
• Published Nov 27, 2024 • 18
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based
Image Editing Paper
• 2412.04280
• Published Dec 5, 2024 • 13
MV-Adapter: Multi-view Consistent Image Generation Made Easy Paper
• 2412.03632
• Published Dec 4, 2024 • 24
PanoDreamer: 3D Panorama Synthesis from a Single Image Paper
• 2412.04827
• Published Dec 6, 2024 • 10
GenMAC: Compositional Text-to-Video Generation with Multi-Agent
Collaboration Paper
• 2412.04440
• Published Dec 5, 2024 • 22
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of
Images and Videos Paper
• 2501.04001
• Published Jan 7, 2025 • 47
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models Paper
• 2501.02955
• Published Jan 6, 2025 • 44
Cosmos World Foundation Model Platform for Physical AI Paper
• 2501.03575
• Published Jan 7, 2025 • 82
Dispider: Enabling Video LLMs with Active Real-Time Interaction via
Disentangled Perception, Decision, and Reaction Paper
• 2501.03218
• Published Jan 6, 2025 • 35
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for
Real-World Video Super-Resolution Paper
• 2501.02976
• Published Jan 6, 2025 • 56
An Empirical Study of Autoregressive Pre-training from Videos Paper
• 2501.05453
• Published Jan 9, 2025 • 41
OmniManip: Towards General Robotic Manipulation via Object-Centric
Interaction Primitives as Spatial Constraints Paper
• 2501.03841
• Published Jan 7, 2025 • 56
VideoRAG: Retrieval-Augmented Generation over Video Corpus Paper
• 2501.05874
• Published Jan 10, 2025 • 75
GameFactory: Creating New Games with Generative Interactive Videos Paper
• 2501.08325
• Published Jan 14, 2025 • 67
CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation Paper
• 2501.09433
• Published Jan 16, 2025 • 18
Do generative video models learn physical principles from watching
videos? Paper
• 2501.09038
• Published Jan 14, 2025 • 34
OmniThink: Expanding Knowledge Boundaries in Machine Writing through
Thinking Paper
• 2501.09751
• Published Jan 16, 2025 • 46
Diffusion Adversarial Post-Training for One-Step Video Generation Paper
• 2501.08316
• Published Jan 14, 2025 • 36
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token
Marks Paper
• 2501.08326
• Published Jan 14, 2025 • 34
MangaNinja: Line Art Colorization with Precise Reference Following Paper
• 2501.08332
• Published Jan 14, 2025 • 62
VideoAuteur: Towards Long Narrative Video Generation Paper
• 2501.06173
• Published Jan 10, 2025 • 31
Tensor Product Attention Is All You Need Paper
• 2501.06425
• Published Jan 11, 2025 • 90
Evolving Deeper LLM Thinking Paper
• 2501.09891
• Published Jan 17, 2025 • 115
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in
Virtual 3D Spaces Paper
• 2501.12909
• Published Jan 22, 2025 • 74
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding Paper
• 2501.13106
• Published Jan 22, 2025 • 91
The Lessons of Developing Process Reward Models in Mathematical
Reasoning Paper
• 2501.07301
• Published Jan 13, 2025 • 100
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow
Models Paper
• 2412.08629
• Published Dec 11, 2024 • 13
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding Paper
• 2501.13200
• Published Jan 22, 2025 • 70
Florence-2: Advancing a Unified Representation for a Variety of Vision
Tasks Paper
• 2311.06242
• Published Nov 10, 2023 • 95
Elucidating the Design Space of Diffusion-Based Generative Models Paper
• 2206.00364
• Published Jun 1, 2022 • 18
Improving Video Generation with Human Feedback Paper
• 2501.13918
• Published Jan 23, 2025 • 53
Can We Generate Images with CoT? Let's Verify and Reinforce Image
Generation Step by Step Paper
• 2501.13926
• Published Jan 23, 2025 • 43
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model
Post-training Paper
• 2501.17161
• Published Jan 28, 2025 • 125
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian
Splat Generation Paper
• 2501.16764
• Published Jan 28, 2025 • 22
MatAnyone: Stable Video Matting with Consistent Memory Propagation Paper
• 2501.14677
• Published Jan 24, 2025 • 34
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot
Planning Paper
• 2411.04983
• Published Nov 7, 2024 • 13
Medusa: Simple LLM Inference Acceleration Framework with Multiple
Decoding Heads Paper
• 2401.10774
• Published Jan 19, 2024 • 60
SAMPart3D: Segment Any Part in 3D Objects Paper
• 2411.07184
• Published Nov 11, 2024 • 29
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models Paper
• 2502.01639
• Published Feb 3, 2025 • 26