Paper page - SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Papers
arxiv:2509.24006

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Published on Sep 28, 2025 · Submitted by Jintao Zhang on Sep 30, 2025 · #1 Paper of the day

Abstract

SLA, a trainable attention method combining sparse and linear attention, accelerates Diffusion Transformer models for video generation with minimal quality loss.

In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
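The three-way split described in the abstract can be illustrated with a small NumPy reference. This is only a sketch under stated assumptions, not the authors' kernel: the function names (classify_blocks, sla_sketch), the mean-pooled block scoring, the fixed per-row fractions, the softmax feature map, and the plain summation of the two branches are all hypothetical choices made for illustration. The sparse branch is also computed densely and masked here for clarity, so the sketch shows the arithmetic, not the speedup.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def classify_blocks(q, k, block, critical_frac, negligible_frac):
    """Label each (query block, key block) pair:
    1 = critical, 0 = marginal, -1 = negligible (hypothetical scheme)."""
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of block"
    nb = n // block
    # Cheap block-level estimate of attention weights from mean-pooled q and k.
    qp = q.reshape(nb, block, d).mean(axis=1)
    kp = k.reshape(nb, block, d).mean(axis=1)
    p = softmax(qp @ kp.T / np.sqrt(d))
    n_crit = max(1, int(round(critical_frac * nb)))
    n_neg = int(round(negligible_frac * nb))
    assert n_crit + n_neg <= nb
    order = np.argsort(-p, axis=1)          # key blocks sorted by weight, descending
    rows = np.arange(nb)[:, None]
    labels = np.zeros((nb, nb), dtype=int)  # default: marginal
    labels[rows, order[:, :n_crit]] = 1     # largest weights: critical
    if n_neg > 0:
        labels[rows, order[:, nb - n_neg:]] = -1  # smallest weights: skipped
    return labels

def sla_sketch(q, k, v, block=4, critical_frac=0.25, negligible_frac=0.5):
    n, d = q.shape
    nb = n // block
    labels = classify_blocks(q, k, block, critical_frac, negligible_frac)

    # Sparse branch: exact softmax attention restricted to critical blocks.
    token_labels = np.repeat(np.repeat(labels, block, axis=0), block, axis=1)
    scores = np.where(token_labels == 1, q @ k.T / np.sqrt(d), -np.inf)
    o_sparse = softmax(scores) @ v

    # Linear branch: per-key-block summaries, so cost grows linearly in n.
    phi = softmax                                # feature map (assumption)
    kf = phi(k).reshape(nb, block, d)
    vb = v.reshape(nb, block, d)
    h = np.einsum('jbd,jbe->jde', kf, vb)        # phi(K_j)^T V_j per key block
    z = kf.sum(axis=1)                           # sum of phi(K_j) rows per key block
    marg = (labels == 0).astype(q.dtype)         # which key blocks are marginal
    h_i = np.einsum('ij,jde->ide', marg, h)      # summed over marginal blocks
    z_i = marg @ z
    qf = phi(q).reshape(nb, block, d)
    num = np.einsum('ibd,ide->ibe', qf, h_i)
    den = np.einsum('ibd,id->ib', qf, z_i)[..., None]
    o_lin = (num / np.maximum(den, 1e-12)).reshape(n, d)  # zero if no marginal block

    # Negligible blocks contribute nothing. Summing the branches is an assumption.
    return o_sparse + o_lin

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out = sla_sketch(q, k, v)
```

Setting critical_frac=1.0 and negligible_frac=0.0 in this sketch marks every block as critical, so the output reduces to ordinary softmax attention. That makes a convenient sanity check when experimenting with the fractions.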

Community

Paper author Paper submitter

SLA (Sparse-Linear Attention) is a trainable attention method that fuses sparse and linear attention to accelerate diffusion models.

[Figure: SLA motivation]

With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation (that is, a 95% reduction) without degrading end-to-end generation quality. The GPU kernel for SLA yields a 13.7x speedup in attention. The code will be available at https://github.com/thu-ml/SLA.

Paper author Paper submitter
edited Oct 9, 2025

[Figure: SLA effectiveness]

[Figure: SLA efficiency]


