toread - a CCMat Collection
toread updated Apr 29, 2025
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with
Fine-Grained Chinese Understanding Paper
• 2405.08748
• Published May 14, 2024 • 23
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection Paper
• 2405.10300
• Published May 16, 2024 • 31
Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper
• 2405.09818
• Published May 16, 2024 • 134
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper
• 2405.11143
• Published May 20, 2024 • 41
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning Paper
• 2405.12130
• Published May 20, 2024 • 50
FIFO-Diffusion: Generating Infinite Videos from Text without Training Paper
• 2405.11473
• Published May 19, 2024 • 56
Your Transformer is Secretly Linear Paper
• 2405.12250
• Published May 19, 2024 • 157
Matryoshka Multimodal Models Paper
• 2405.17430
• Published May 27, 2024 • 34
An Introduction to Vision-Language Modeling Paper
• 2405.17247
• Published May 27, 2024 • 90
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal
Models Paper
• 2405.15738
• Published May 24, 2024 • 46
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis Paper
• 2403.03206
• Published Mar 5, 2024 • 71
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model Paper
• 2406.04333
• Published Jun 6, 2024 • 38
ShareGPT4Video: Improving Video Understanding and Generation with Better
Captions Paper
• 2406.04325
• Published Jun 6, 2024 • 74
Block Transformer: Global-to-Local Language Modeling for Fast Inference Paper
• 2406.02657
• Published Jun 4, 2024 • 41
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and
Resolution Paper
• 2307.06304
• Published Jul 12, 2023 • 35
OpenELM: An Efficient Language Model Family with Open-source Training
and Inference Framework Paper
• 2404.14619
• Published Apr 22, 2024 • 126
Multi-Head Mixture-of-Experts Paper
• 2404.15045
• Published Apr 23, 2024 • 60
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision
Models Paper
• 2405.15574
• Published May 24, 2024 • 55
Paper
• 2405.18407
• Published May 28, 2024 • 48
Transformers are SSMs: Generalized Models and Efficient Algorithms
Through Structured State Space Duality Paper
• 2405.21060
• Published May 31, 2024 • 68
CRAG -- Comprehensive RAG Benchmark Paper
• 2406.04744
• Published Jun 7, 2024 • 46
DiTFastAttn: Attention Compression for Diffusion Transformer Models Paper
• 2406.08552
• Published Jun 12, 2024 • 25
Paper
• 2406.09414
• Published Jun 13, 2024 • 103
An Image is Worth More Than 16x16 Patches: Exploring Transformers on
Individual Pixels Paper
• 2406.09415
• Published Jun 13, 2024 • 51
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN
Inversion and High Quality Image Editing Paper
• 2406.10601
• Published Jun 15, 2024 • 70
Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective
Distillation and Unlabeled Data Augmentation Paper
• 2406.12849
• Published Jun 18, 2024 • 50
Adam-mini: Use Fewer Learning Rates To Gain More Paper
• 2406.16793
• Published Jun 24, 2024 • 69
DreamBench++: A Human-Aligned Benchmark for Personalized Image
Generation Paper
• 2406.16855
• Published Jun 24, 2024 • 57
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion Paper
• 2407.01392
• Published Jul 1, 2024 • 44
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
Supporting Long-Contextual Input and Output Paper
• 2407.03320
• Published Jul 3, 2024 • 94
Video Diffusion Alignment via Reward Gradients Paper
• 2407.08737
• Published Jul 11, 2024 • 49
Paper
• 2407.10671
• Published Jul 15, 2024 • 171
Theia: Distilling Diverse Vision Foundation Models for Robot Learning Paper
• 2407.20179
• Published Jul 29, 2024 • 47
Gemma 2: Improving Open Language Models at a Practical Size Paper
• 2408.00118
• Published Jul 31, 2024 • 78
The Llama 3 Herd of Models Paper
• 2407.21783
• Published Jul 31, 2024 • 118
SAM 2: Segment Anything in Images and Videos Paper
• 2408.00714
• Published Aug 1, 2024 • 122
MiniCPM-V: A GPT-4V Level MLLM on Your Phone Paper
• 2408.01800
• Published Aug 3, 2024 • 94
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation
with Multimodal Generative Pretraining Paper
• 2408.02657
• Published Aug 5, 2024 • 35
MMIU: Multimodal Multi-image Understanding for Evaluating Large
Vision-Language Models Paper
• 2408.02718
• Published Aug 5, 2024 • 62
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards
General Medical AI Paper
• 2408.03361
• Published Aug 6, 2024 • 85
An Object is Worth 64x64 Pixels: Generating 3D Object via Image
Diffusion Paper
• 2408.03178
• Published Aug 6, 2024 • 40
LLaVA-OneVision: Easy Visual Task Transfer Paper
• 2408.03326
• Published Aug 6, 2024 • 61
Transformer Explainer: Interactive Learning of Text-Generative Models Paper
• 2408.04619
• Published Aug 8, 2024 • 175
ControlNeXt: Powerful and Efficient Control for Image and Video
Generation Paper
• 2408.06070
• Published Aug 12, 2024 • 55
Qwen2-Audio Technical Report Paper
• 2407.10759
• Published Jul 15, 2024 • 64
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill
and Extreme KV-Cache Compression Paper
• 2407.12077
• Published Jul 16, 2024 • 57
Compact Language Models via Pruning and Knowledge Distillation Paper
• 2407.14679
• Published Jul 19, 2024 • 39
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language
Models Paper
• 2407.15841
• Published Jul 22, 2024 • 39
KAN or MLP: A Fairer Comparison Paper
• 2407.16674
• Published Jul 23, 2024 • 43
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence Paper
• 2407.16655
• Published Jul 23, 2024 • 30
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any
Person Paper
• 2407.16224
• Published Jul 23, 2024 • 29
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh
Tokenization Paper
• 2408.02555
• Published Aug 5, 2024 • 31
Mixture of Nested Experts: Adaptive Processing of Visual Tokens Paper
• 2407.19985
• Published Jul 29, 2024 • 37
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model Paper
• 2407.16982
• Published Jul 24, 2024 • 42
VILA^2: VILA Augmented VILA Paper
• 2407.17453
• Published Jul 24, 2024 • 41
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer Paper
• 2408.06072
• Published Aug 12, 2024 • 38
Paper
• 2408.07009
• Published Aug 13, 2024 • 62
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper
• 2408.08872
• Published Aug 16, 2024 • 101
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction
Model Paper
• 2408.10198
• Published Aug 19, 2024 • 35
Transfusion: Predict the Next Token and Diffuse Images with One
Multi-Modal Model Paper
• 2408.11039
• Published Aug 20, 2024 • 63
Sapiens: Foundation for Human Vision Models Paper
• 2408.12569
• Published Aug 22, 2024 • 94
DreamCinema: Cinematic Transfer with Free Camera and 3D Character Paper
• 2408.12601
• Published Aug 22, 2024 • 32
Building and better understanding vision-language models: insights and
future directions Paper
• 2408.12637
• Published Aug 22, 2024 • 133
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation Paper
• 2408.13252
• Published Aug 23, 2024 • 26
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its
Teacher Paper
• 2408.14176
• Published Aug 26, 2024 • 62
Foundation Models for Music: A Survey Paper
• 2408.14340
• Published Aug 26, 2024 • 44
Diffusion Models Are Real-Time Game Engines Paper
• 2408.14837
• Published Aug 27, 2024 • 126
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of
Encoders Paper
• 2408.15998
• Published Aug 28, 2024 • 86
CogVLM2: Visual Language Models for Image and Video Understanding Paper
• 2408.16500
• Published Aug 29, 2024 • 57
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio
Language Modeling Paper
• 2408.16532
• Published Aug 29, 2024 • 50
LinFusion: 1 GPU, 1 Minute, 16K Image Paper
• 2409.02097
• Published Sep 3, 2024 • 34
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion
Dependency Paper
• 2409.02634
• Published Sep 4, 2024 • 97
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free
Real Image Editing Paper
• 2409.01322
• Published Sep 2, 2024 • 96
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with
Image-Based Surface Representation Paper
• 2409.03718
• Published Sep 5, 2024 • 27
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language
Models Paper
• 2404.12387
• Published Apr 18, 2024 • 40
Dynamic Typography: Bringing Words to Life Paper
• 2404.11614
• Published Apr 17, 2024 • 46
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your
Phone Paper
• 2404.14219
• Published Apr 22, 2024 • 259
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding Paper
• 2404.16710
• Published Apr 25, 2024 • 80
Iterative Reasoning Preference Optimization Paper
• 2404.19733
• Published Apr 30, 2024 • 50
KAN: Kolmogorov-Arnold Networks Paper
• 2404.19756
• Published Apr 30, 2024 • 116
OmniGen: Unified Image Generation Paper
• 2409.11340
• Published Sep 17, 2024 • 115
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video
Diffusion Models Paper
• 2409.07452
• Published Sep 11, 2024 • 21
Towards a Unified View of Preference Learning for Large Language Models:
A Survey Paper
• 2409.02795
• Published Sep 4, 2024 • 72
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think Paper
• 2409.11355
• Published Sep 17, 2024 • 30
Qwen2.5-Coder Technical Report Paper
• 2409.12186
• Published Sep 18, 2024 • 154
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at
Any Resolution Paper
• 2409.12191
• Published Sep 18, 2024 • 79
VideoPoet: A Large Language Model for Zero-Shot Video Generation Paper
• 2312.14125
• Published Dec 21, 2023 • 47
Training Language Models to Self-Correct via Reinforcement Learning Paper
• 2409.12917
• Published Sep 19, 2024 • 140
Imagine yourself: Tuning-Free Personalized Image Generation Paper
• 2409.13346
• Published Sep 20, 2024 • 69
Colorful Diffuse Intrinsic Image Decomposition in the Wild Paper
• 2409.13690
• Published Sep 20, 2024 • 13
Emu3: Next-Token Prediction is All You Need Paper
• 2409.18869
• Published Sep 27, 2024 • 98
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper
• 2409.20566
• Published Sep 30, 2024 • 54
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal
Foundation Models Paper
• 2410.02740
• Published Oct 3, 2024 • 53
Loong: Generating Minute-level Long Videos with Autoregressive Language
Models Paper
• 2410.02757
• Published Oct 3, 2024 • 36
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second Paper
• 2410.02073
• Published Oct 2, 2024 • 43
Baichuan-Omni Technical Report Paper
• 2410.08565
• Published Oct 11, 2024 • 87
Animate-X: Universal Character Image Animation with Enhanced Motion
Representation Paper
• 2410.10306
• Published Oct 14, 2024 • 57
Efficient Diffusion Models: A Comprehensive Survey from Principles to
Practices Paper
• 2410.11795
• Published Oct 15, 2024 • 18
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a
Training-Free Memory Tree Paper
• 2410.16268
• Published Oct 21, 2024 • 70
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes Paper
• 2410.17249
• Published Oct 22, 2024 • 44
Movie Gen: A Cast of Media Foundation Models Paper
• 2410.13720
• Published Oct 17, 2024 • 100
Fluid: Scaling Autoregressive Text-to-image Generative Models with
Continuous Tokens Paper
• 2410.13863
• Published Oct 17, 2024 • 37
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without
Learned Priors Paper
• 2410.16271
• Published Oct 21, 2024 • 84
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation Paper
• 2410.13861
• Published Oct 17, 2024 • 56
Unbounded: A Generative Infinite Game of Character Life Simulation Paper
• 2410.18975
• Published Oct 24, 2024 • 37
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for
Contrastive Loss Paper
• 2410.17243
• Published Oct 22, 2024 • 92
Representation Alignment for Generation: Training Diffusion Transformers
Is Easier Than You Think Paper
• 2410.06940
• Published Oct 9, 2024 • 12
Addition is All You Need for Energy-efficient Language Models Paper
• 2410.00907
• Published Oct 1, 2024 • 151
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
and Generation Paper
• 2410.13848
• Published Oct 17, 2024 • 36
Semantic Image Inversion and Editing using Rectified Stochastic
Differential Equations Paper
• 2410.10792
• Published Oct 14, 2024 • 31
CLEAR: Character Unlearning in Textual and Visual Modalities Paper
• 2410.18057
• Published Oct 23, 2024 • 209
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation Paper
• 2411.04997
• Published Nov 7, 2024 • 39
Add-it: Training-Free Object Insertion in Images With Pretrained
Diffusion Models Paper
• 2411.07232
• Published Nov 11, 2024 • 68
OmniEdit: Building Image Editing Generalist Models Through Specialist
Supervision Paper
• 2411.07199
• Published Nov 11, 2024 • 50
Large Language Models Can Self-Improve in Long-context Reasoning Paper
• 2411.08147
• Published Nov 12, 2024 • 65
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video
Generation Paper
• 2411.08380
• Published Nov 13, 2024 • 25
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models Paper
• 2411.09595
• Published Nov 14, 2024 • 77
MagicQuill: An Intelligent Interactive Image Editing System Paper
• 2411.09703
• Published Nov 14, 2024 • 80
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper
• 2411.10440
• Published Nov 15, 2024 • 129
Region-Aware Text-to-Image Generation via Hard Binding and Soft
Refinement Paper
• 2411.06558
• Published Nov 10, 2024 • 36
AnimateAnything: Consistent and Controllable Animation for Video
Generation Paper
• 2411.10836
• Published Nov 16, 2024 • 24
RedPajama: an Open Dataset for Training Large Language Models Paper
• 2411.12372
• Published Nov 19, 2024 • 58
SageAttention2 Technical Report: Accurate 4 Bit Attention for
Plug-and-play Inference Acceleration Paper
• 2411.10958
• Published Nov 17, 2024 • 57
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking
with Motion-Aware Memory Paper
• 2411.11922
• Published Nov 18, 2024 • 19
Stable Flow: Vital Layers for Training-Free Image Editing Paper
• 2411.14430
• Published Nov 21, 2024 • 22
Style-Friendly SNR Sampler for Style-Driven Generation Paper
• 2411.14793
• Published Nov 22, 2024 • 39
Star Attention: Efficient LLM Inference over Long Sequences Paper
• 2411.17116
• Published Nov 26, 2024 • 53
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot
Subject-Driven Image Generator Paper
• 2411.15466
• Published Nov 23, 2024 • 39
Material Anything: Generating Materials for Any 3D Object via Diffusion Paper
• 2411.15138
• Published Nov 22, 2024 • 50
OminiControl: Minimal and Universal Control for Diffusion Transformer Paper
• 2411.15098
• Published Nov 22, 2024 • 61
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent
Video Diffusion Model Paper
• 2411.17459
• Published Nov 26, 2024 • 12
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video
Generation Paper
• 2412.02259
• Published Dec 3, 2024 • 60
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's
Reasoning Capability Paper
• 2411.19943
• Published Nov 29, 2024 • 63
PaliGemma 2: A Family of Versatile VLMs for Transfer Paper
• 2412.03555
• Published Dec 4, 2024 • 135
SNOOPI: Supercharged One-step Diffusion Distillation with Proper
Guidance Paper
• 2412.02687
• Published Dec 3, 2024 • 113
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and
Generation Paper
• 2412.03069
• Published Dec 4, 2024 • 34
Imagine360: Immersive 360 Video Generation from Perspective Anchor Paper
• 2412.03552
• Published Dec 4, 2024 • 29
Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion Paper
• 2412.03515
• Published Dec 4, 2024 • 27
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking
Portrait Paper
• 2412.01064
• Published Dec 2, 2024 • 47
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding
by Video Spatiotemporal Augmentation Paper
• 2412.00927
• Published Dec 1, 2024 • 29
Open-Sora Plan: Open-Source Large Video Generation Model Paper
• 2412.00131
• Published Nov 28, 2024 • 33
SpotLight: Shadow-Guided Object Relighting via Diffusion Paper
• 2411.18665
• Published Nov 27, 2024 • 3
Video Depth without Video Models Paper
• 2411.19189
• Published Nov 28, 2024 • 39
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction
using Diffusion Models Paper
• 2411.18350
• Published Nov 27, 2024 • 28
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models Paper
• 2411.18613
• Published Nov 27, 2024 • 59
Pathways on the Image Manifold: Image Editing via Video Generation Paper
• 2411.16819
• Published Nov 25, 2024 • 37
Identity-Preserving Text-to-Video Generation by Frequency Decomposition Paper
• 2411.17440
• Published Nov 26, 2024 • 38
ROICtrl: Boosting Instance Control for Visual Generation Paper
• 2411.17949
• Published Nov 27, 2024 • 87
LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene
Relighting Paper
• 2412.00177
• Published Nov 29, 2024 • 8
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper
• 2412.04467
• Published Dec 5, 2024 • 118
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion Paper
• 2412.04424
• Published Dec 5, 2024 • 62
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction Paper
• 2412.04454
• Published Dec 5, 2024 • 71
Structured 3D Latents for Scalable and Versatile 3D Generation Paper
• 2412.01506
• Published Dec 2, 2024 • 88
A Noise is Worth Diffusion Guidance Paper
• 2412.03895
• Published Dec 5, 2024 • 29
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent
Diffusion Models Paper
• 2412.04146
• Published Dec 5, 2024 • 23
Expanding Performance Boundaries of Open-Source Multimodal Models with
Model, Data, and Test-Time Scaling Paper
• 2412.05271
• Published Dec 6, 2024 • 160
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step
Diffusion Paper
• 2412.04301
• Published Dec 5, 2024 • 40
APOLLO: SGD-like Memory, AdamW-level Performance Paper
• 2412.05270
• Published Dec 6, 2024 • 37
STIV: Scalable Text and Image Conditioned Video Generation Paper
• 2412.07730
• Published Dec 10, 2024 • 74
UniReal: Universal Image Generation and Editing via Learning Real-world
Dynamics Paper
• 2412.07774
• Published Dec 10, 2024 • 30
Video Motion Transfer with Diffusion Transformers Paper
• 2412.07776
• Published Dec 10, 2024 • 17
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse
Viewpoints Paper
• 2412.07760
• Published Dec 10, 2024 • 55
StyleMaster: Stylize Your Video with Artistic Generation and Translation Paper
• 2412.07744
• Published Dec 10, 2024 • 20
Track4Gen: Teaching Video Diffusion Models to Track Points Improves
Video Generation Paper
• 2412.06016
• Published Dec 8, 2024 • 20
Learning Flow Fields in Attention for Controllable Person Image
Generation Paper
• 2412.08486
• Published Dec 11, 2024 • 36
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
Long-term Streaming Video and Audio Interactions Paper
• 2412.09596
• Published Dec 12, 2024 • 97
Paper
• 2412.08905
• Published Dec 12, 2024 • 123
Neural LightRig: Unlocking Accurate Object Normal and Material
Estimation with Multi-Light Diffusion Paper
• 2412.09593
• Published Dec 12, 2024 • 18
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution Paper
• 2412.15213
• Published Dec 19, 2024 • 28
Parallelized Autoregressive Visual Generation Paper
• 2412.15119
• Published Dec 19, 2024 • 53
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation Paper
• 2412.13649
• Published Dec 18, 2024 • 21
B-STaR: Monitoring and Balancing Exploration and Exploitation in
Self-Taught Reasoners Paper
• 2412.17256
• Published Dec 23, 2024 • 47
Paper
• 2412.15115
• Published Dec 19, 2024 • 377
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper
• 2412.10360
• Published Dec 13, 2024 • 147
GenEx: Generating an Explorable World Paper
• 2412.09624
• Published Dec 12, 2024 • 98
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding Paper
• 2412.09604
• Published Dec 12, 2024 • 38
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free
Scale Fusion Paper
• 2412.09626
• Published Dec 12, 2024 • 21
InstanceCap: Improving Text-to-Video Generation via Instance-aware
Structured Caption Paper
• 2412.09283
• Published Dec 12, 2024 • 19
Byte Latent Transformer: Patches Scale Better Than Tokens Paper
• 2412.09871
• Published Dec 13, 2024 • 108
BrushEdit: All-In-One Image Inpainting and Editing Paper
• 2412.10316
• Published Dec 13, 2024 • 36
ColorFlow: Retrieval-Augmented Image Sequence Colorization Paper
• 2412.11815
• Published Dec 16, 2024 • 26
Thinking in Space: How Multimodal Large Language Models See, Remember,
and Recall Spaces Paper
• 2412.14171
• Published Dec 18, 2024 • 25
Diffusion360: Seamless 360 Degree Panoramic Image Generation based on
Diffusion Models Paper
• 2311.13141
• Published Nov 22, 2023 • 16
2.5 Years in Class: A Multimodal Textbook for Vision-Language
Pretraining Paper
• 2501.00958
• Published Jan 1, 2025 • 110
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion
Control Paper
• 2501.01427
• Published Jan 2, 2025 • 53
LTX-Video: Realtime Video Latent Diffusion Paper
• 2501.00103
• Published Dec 30, 2024 • 50
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent
Diffusion Models Paper
• 2501.01423
• Published Jan 2, 2025 • 44
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse
Task Synthesis Paper
• 2412.19723
• Published Dec 27, 2024 • 87
Paper
• 2412.18653
• Published Dec 24, 2024 • 86
Orient Anything: Learning Robust Object Orientation Estimation from
Rendering 3D Models Paper
• 2412.18605
• Published Dec 24, 2024 • 21
DepthLab: From Partial to Complete Paper
• 2412.18153
• Published Dec 24, 2024 • 36
Fourier Position Embedding: Enhancing Attention's Periodic Extension for
Length Generalization Paper
• 2412.17739
• Published Dec 23, 2024 • 41
DynamicScaler: Seamless and Scalable Video Generation for Panoramic
Scenes Paper
• 2412.11100
• Published Dec 15, 2024 • 7
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices
with Efficient Architectures and Training Paper
• 2412.09619
• Published Dec 12, 2024 • 31
PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh
Representations Paper
• 2412.05994
• Published Dec 8, 2024 • 19
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex
Image-Text Models with Structural Annotations Paper
• 2412.08580
• Published Dec 11, 2024 • 45
StreamChat: Chatting with Streaming Video Paper
• 2412.08646
• Published Dec 11, 2024 • 18
Generative Densification: Learning to Densify Gaussians for
High-Fidelity Generalizable 3D Reconstruction Paper
• 2412.06234
• Published Dec 9, 2024 • 19
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion
Transformer Paper
• 2412.07720
• Published Dec 10, 2024 • 31
Around the World in 80 Timesteps: A Generative Approach to Global Visual
Geolocation Paper
• 2412.06781
• Published Dec 9, 2024 • 23
3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes Paper
• 2411.14974
• Published Nov 22, 2024 • 15
TEXGen: a Generative Diffusion Model for Mesh Textures Paper
• 2411.14740
• Published Nov 22, 2024 • 17
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting
Synthesis Paper
• 2411.16443
• Published Nov 25, 2024 • 11
Mixture-of-Transformers: A Sparse and Scalable Architecture for
Multi-Modal Foundation Models Paper
• 2411.04996
• Published Nov 7, 2024 • 51
DimensionX: Create Any 3D and 4D Scenes from a Single Image with
Controllable Video Diffusion Paper
• 2411.04928
• Published Nov 7, 2024 • 56
ReCapture: Generative Video Camera Controls for User-Provided Videos
using Masked Video Fine-Tuning Paper
• 2411.05003
• Published Nov 7, 2024 • 71
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM
Quantization Paper
• 2411.02355
• Published Nov 4, 2024 • 52
How Far is Video Generation from World Model: A Physical Law Perspective Paper
• 2411.02385
• Published Nov 4, 2024 • 34
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated
Parameters by Tencent Paper
• 2411.02265
• Published Nov 4, 2024 • 25
Adaptive Caching for Faster Video Generation with Diffusion Transformers Paper
• 2411.02397
• Published Nov 4, 2024 • 23
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D Paper
• 2411.02336
• Published Nov 4, 2024 • 24
AutoVFX: Physically Realistic Video Editing from Natural Language
Instructions Paper
• 2411.02394
• Published Nov 4, 2024 • 16
GenXD: Generating Any 3D and 4D Scenes Paper
• 2411.02319
• Published Nov 4, 2024 • 20
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse
Autoencoders Paper
• 2410.22366
• Published Oct 28, 2024 • 84
One Shot, One Talk: Whole-body Talking Avatar from a Single Image Paper
• 2412.01106
• Published Dec 2, 2024 • 24
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling Paper
• 2411.18664
• Published Nov 27, 2024 • 24
FAM Diffusion: Frequency and Attention Modulation for High-Resolution
Image Generation with Stable Diffusion Paper
• 2411.18552
• Published Nov 27, 2024 • 18
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based
Image Editing Paper
• 2412.04280
• Published Dec 5, 2024 • 13
MV-Adapter: Multi-view Consistent Image Generation Made Easy Paper
• 2412.03632
• Published Dec 4, 2024 • 24
PanoDreamer: 3D Panorama Synthesis from a Single Image Paper
• 2412.04827
• Published Dec 6, 2024 • 10
GenMAC: Compositional Text-to-Video Generation with Multi-Agent
Collaboration Paper
• 2412.04440
• Published Dec 5, 2024 • 22
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of
Images and Videos Paper
• 2501.04001
• Published Jan 7, 2025 • 47
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models Paper
• 2501.02955
• Published Jan 6, 2025 • 44
Cosmos World Foundation Model Platform for Physical AI Paper
• 2501.03575
• Published Jan 7, 2025 • 82
Dispider: Enabling Video LLMs with Active Real-Time Interaction via
Disentangled Perception, Decision, and Reaction Paper
• 2501.03218
• Published Jan 6, 2025 • 35
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for
Real-World Video Super-Resolution Paper
• 2501.02976
• Published Jan 6, 2025 • 56
An Empirical Study of Autoregressive Pre-training from Videos Paper
• 2501.05453
• Published Jan 9, 2025 • 41
OmniManip: Towards General Robotic Manipulation via Object-Centric
Interaction Primitives as Spatial Constraints Paper
• 2501.03841
• Published Jan 7, 2025 • 56
VideoRAG: Retrieval-Augmented Generation over Video Corpus Paper
• 2501.05874
• Published Jan 10, 2025 • 75
GameFactory: Creating New Games with Generative Interactive Videos Paper
• 2501.08325
• Published Jan 14, 2025 • 67
CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation Paper
• 2501.09433
• Published Jan 16, 2025 • 18
Do generative video models learn physical principles from watching
videos? Paper
• 2501.09038
• Published Jan 14, 2025 • 34
OmniThink: Expanding Knowledge Boundaries in Machine Writing through
Thinking Paper
• 2501.09751
• Published Jan 16, 2025 • 46
Diffusion Adversarial Post-Training for One-Step Video Generation Paper
• 2501.08316
• Published Jan 14, 2025 • 36
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token
Marks Paper
• 2501.08326
• Published Jan 14, 2025 • 34
MangaNinja: Line Art Colorization with Precise Reference Following Paper
• 2501.08332
• Published Jan 14, 2025 • 62
VideoAuteur: Towards Long Narrative Video Generation Paper
• 2501.06173
• Published Jan 10, 2025 • 31
Tensor Product Attention Is All You Need Paper
• 2501.06425
• Published Jan 11, 2025 • 90
Evolving Deeper LLM Thinking Paper
• 2501.09891
• Published Jan 17, 2025 • 115
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in
Virtual 3D Spaces Paper
• 2501.12909
• Published Jan 22, 2025 • 74
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding Paper
• 2501.13106
• Published Jan 22, 2025 • 91
The Lessons of Developing Process Reward Models in Mathematical
Reasoning Paper
• 2501.07301
• Published Jan 13, 2025 • 100
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow
Models Paper
• 2412.08629
• Published Dec 11, 2024 • 13
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding Paper
• 2501.13200
• Published Jan 22, 2025 • 70
Florence-2: Advancing a Unified Representation for a Variety of Vision
Tasks Paper
• 2311.06242
• Published Nov 10, 2023 • 95
Elucidating the Design Space of Diffusion-Based Generative Models Paper
• 2206.00364
• Published Jun 1, 2022 • 18
Improving Video Generation with Human Feedback Paper
• 2501.13918
• Published Jan 23, 2025 • 53
Can We Generate Images with CoT? Let's Verify and Reinforce Image
Generation Step by Step Paper
• 2501.13926
• Published Jan 23, 2025 • 43
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model
Post-training Paper
• 2501.17161
• Published Jan 28, 2025 • 125
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian
Splat Generation Paper
• 2501.16764
• Published Jan 28, 2025 • 22
MatAnyone: Stable Video Matting with Consistent Memory Propagation Paper
• 2501.14677
• Published Jan 24, 2025 • 34
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot
Planning Paper
• 2411.04983
• Published Nov 7, 2024 • 13
Medusa: Simple LLM Inference Acceleration Framework with Multiple
Decoding Heads Paper
• 2401.10774
• Published Jan 19, 2024 • 60
SAMPart3D: Segment Any Part in 3D Objects Paper
• 2411.07184
• Published Nov 11, 2024 • 29
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models Paper
• 2502.01639
• Published Feb 3, 2025 • 26