Echo-LongVideo
Pushing the Frontier of Long Video Generation
Official model weights for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.
Paper | Inference Code | Model | Usage | Results | Citation
Model Summary
Echo-LongVideo (a.k.a. JoyAI-Echo) is a long-form, multi-shot, audio-video generation model. A cross-modal audio-visual memory bank keeps character appearance and voice timbre consistent across videos of up to five minutes, and a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a 7.5× inference speedup without sacrificing quality.
In human evaluation, Echo-LongVideo decisively outperforms HappyOyster (directing mode) on long-form generation and surpasses the short-video specialist Wan 2.6 on human-centric tasks.
This repository hosts the released checkpoint. Inference code is released separately; see the Usage section.
Model Details
- Developed by: Echo Team @ Joy Future Academy, JD
- Model type: Text-to-(Audio+Video) diffusion transformer, DMD 8-step
- Modality: Text → synchronized video + audio
- Backbone: Built on top of LTX-Video
- Text encoder: google/gemma-3-12b-it (downloaded separately)
- Resolution / length (by default): 1280 × 736, 241 frames @ 25 fps per shot
- Max story length: up to 5 minutes (multi-shot)
- License: LTX-2 Community License Agreement
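The default shot length and the story cap above imply a rough shot budget. The following is a minimal sketch of that arithmetic, assuming shots are concatenated back to back with no overlap or transition frames (an assumption; the model card does not say how shots are joined). The function names are illustrative only.

```python
# Defaults copied from the Model Details list above.
FRAMES_PER_SHOT = 241
FPS = 25
MAX_STORY_SECONDS = 5 * 60


def shot_seconds(frames: int = FRAMES_PER_SHOT, fps: int = FPS) -> float:
    """Duration of one shot in seconds."""
    return frames / fps


def max_shots(story_seconds: float = MAX_STORY_SECONDS) -> int:
    """Whole shots that fit in the story, assuming plain concatenation."""
    return int(story_seconds // shot_seconds())


if __name__ == "__main__":
    print(f"{shot_seconds():.2f} s per shot")      # 241 / 25
    print(f"{max_shots()} shots within 5 minutes")
```

Under these assumptions a default shot lasts 9.64 s, so a five-minute story holds roughly 31 shots.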
Highlights
- Minute-level multi-shot stories from a single prompt JSON.
- DMD-distilled few-step inference, ~7.5× faster than the original pipeline.
- Joint audio-video generation in a single pipeline.
- Paired cross-modal memory bank for story-level identity and voice consistency.
Usage
Inference is run with the standalone Echo-LongVideo inference repository.
1. Download the checkpoint
huggingface-cli download <org>/Echo-LongVideo \
--local-dir checkpoints
Also download the Gemma text encoder:
huggingface-cli download google/gemma-3-12b-it \
--local-dir checkpoints/gemma-3-12b
Expected layout:
checkpoints/
├── echo-longvideo-release.safetensors
└── gemma-3-12b/
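Before running inference it can help to confirm that both downloads landed where the layout above expects them. The helper below is a hypothetical convenience script, not part of the inference repository; it only checks that the two entries exist, not that their contents are complete.

```python
from pathlib import Path

# Names copied from the expected layout above.
EXPECTED_FILE = "echo-longvideo-release.safetensors"
EXPECTED_DIR = "gemma-3-12b"


def missing_checkpoint_items(root: str = "checkpoints") -> list[str]:
    """Return the expected entries that are absent under `root`."""
    base = Path(root)
    missing = []
    if not (base / EXPECTED_FILE).is_file():
        missing.append(EXPECTED_FILE)
    if not (base / EXPECTED_DIR).is_dir():
        missing.append(EXPECTED_DIR + "/")
    return missing


if __name__ == "__main__":
    problems = missing_checkpoint_items()
    print("layout OK" if not problems else f"missing: {problems}")
```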
2. Get the inference code
git clone https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo.git
cd JoyAI-Echo
Environment: Python 3.11 + PyTorch 2.8 + CUDA 12.8 (see the inference repo's environment.yml / requirements.txt).
3. Write a story prompt
Enhance your prompt first. We provide prompt enhancers, which are system prompts that expand a short story or idea into well-formed shot prompts: prompts/long_story_writer_system_prompt.md for long, multi-shot video, and prompts/short_story_writer_system_prompt.md for single-shot short video. We strongly recommend running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.
Create a JSON file under prompts/. Each file is a single object with a prompts list, where every string is one complete shot. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.
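A quick structural check can catch malformed prompt files before a long run. The sketch below validates only what this section states (one JSON object with a `prompts` list of strings); the function name is hypothetical, and whether the inference code tolerates extra keys is not covered here.

```python
import json


def load_shot_prompts(path: str) -> list[str]:
    """Load a prompt file and return its shots, one string per shot."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # The file must be a single object holding a "prompts" list.
    if not isinstance(data, dict) or not isinstance(data.get("prompts"), list):
        raise ValueError("expected a JSON object with a 'prompts' list")
    shots = data["prompts"]
    # Every entry is one complete shot, so empty or non-string items are rejected.
    if not shots or not all(isinstance(s, str) and s.strip() for s in shots):
        raise ValueError("'prompts' must hold at least one non-empty string")
    return shots
```

A file with one string yields a single shot; a file with several strings yields a multi-shot story in list order.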
Inside each string, write these parts in order:
| Part | What to describe |
|---|---|
| Roles & Subjects | Describe the appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable. |
| Action & Dialogue | What the subject does and says. |
| Style | The overall visual and emotional aesthetic, e.g. realistic motorsport film language, cool daylight, restrained cinematic tension. |
| Camera Movement | The shot type and framing or movement, e.g. a stable close-up on the face, or a medium shot from the waist up. |
| Background | The setting and scene details behind the subject. |
| Sound Effects & BGM | The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue, or no background music. |
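The ordering in the table can be enforced in code when prompts are assembled programmatically. This is a sketch under assumptions: the `Shot` class and its field names are hypothetical, and joining the parts with single spaces into one paragraph is a guess at a reasonable format, since only the order of the parts is specified above.

```python
import json
from dataclasses import astuple, dataclass


@dataclass
class Shot:
    # Field order mirrors the table above, top to bottom.
    roles_and_subjects: str
    action_and_dialogue: str
    style: str
    camera_movement: str
    background: str
    sound_and_bgm: str

    def to_prompt(self) -> str:
        """Join the six parts, in table order, into one shot string."""
        return " ".join(part.strip() for part in astuple(self))


def story_json(shots: list[Shot]) -> str:
    """Serialize shots as a single object with a 'prompts' list."""
    prompts = [shot.to_prompt() for shot in shots]
    return json.dumps({"prompts": prompts}, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Placeholder text for illustration only.
    demo = Shot(
        "A woman in her thirties with short dark hair and a calm, low voice.",
        'She looks up and says, "We start at dawn."',
        "Cool daylight, restrained cinematic tension.",
        "A stable close-up on the face.",
        "An empty garage with tools on the wall.",
        "Room tone and distant wind, no background music.",
    )
    print(story_json([demo]))
```

The printed JSON can be saved under prompts/ and passed through the matching prompt enhancer workflow described earlier.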
A more convenient prompt-writing workflow will be released as a director agent for everyone to use.
4. Run
python inference.py
Outputs land in inference_result/outputs/<prompt-name>/inference_<timestamp>/.
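Because each run writes to a new timestamped directory, a small helper can locate the most recent one. The sketch below is hypothetical and not part of the inference repository; it sorts by modification time rather than by parsing the directory name, since the timestamp format is not specified here, and it makes no assumption about the files inside.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(prompt_name: str,
                   root: str = "inference_result/outputs") -> Optional[Path]:
    """Return the most recently modified inference_* directory, or None."""
    base = Path(root) / prompt_name
    if not base.is_dir():
        return None
    runs = [p for p in base.glob("inference_*") if p.is_dir()]
    # Modification time stands in for the timestamp embedded in the name.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```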
Hardware
Peak GPU memory is ~46–50 GB at the default 1280 × 736 × 241 frame setting, so a single H100/A100 (80 GB) or 48 GB GPU is sufficient. For smaller GPUs, lower resolution or frame count:
python inference.py --num-frames 121 --video-height 480 --video-width 832
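To get a feel for how much the reduced setting shrinks the workload, one can compare the number of generated pixels per shot. This is only a rough proxy: peak memory also includes model weights and the text encoder, which do not shrink with resolution, so the ratio below should not be read as a memory prediction.

```python
def pixel_frames(width: int, height: int, frames: int) -> int:
    """Total pixels generated across all frames of one shot."""
    return width * height * frames


DEFAULT = pixel_frames(1280, 736, 241)  # default setting
REDUCED = pixel_frames(832, 480, 121)   # flags from the command above

if __name__ == "__main__":
    print(f"default: {DEFAULT:,} pixel-frames")
    print(f"reduced: {REDUCED:,} pixel-frames")
    print(f"reduced / default = {REDUCED / DEFAULT:.3f}")
```

The reduced setting generates about 21% of the pixels of the default one.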
Results
Reported Scale
| Item | Value |
|---|---|
| Long-form coherent story length | 5 min |
| Speedup over the original multi-step pipeline | 7.5× |
| Benchmark stories | 100 |
| Generated evaluation shots | 3,000 |
| Frames per shot | 241 @ 25 fps |
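The figures in this table are mutually consistent with the five-minute story length, as a short calculation shows. It assumes the 3,000 shots are spread evenly over the 100 stories and that every shot uses the default 241 frames at 25 fps; neither is stated explicitly, so the result is an average-case estimate.

```python
# Values copied from the Reported Scale table.
STORIES = 100
SHOTS = 3_000
FRAMES_PER_SHOT = 241
FPS = 25

# Average only: the per-story shot count may vary in practice.
shots_per_story = SHOTS / STORIES
seconds_per_story = shots_per_story * FRAMES_PER_SHOT / FPS

if __name__ == "__main__":
    print(f"{shots_per_story:.0f} shots per story")
    print(f"{seconds_per_story:.1f} s = {seconds_per_story / 60:.2f} min per story")
```

That works out to 30 shots and about 4.8 minutes per story on average.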
Human Evaluation
GSB user study. Values are the percentage of user preferences.
| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
|---|---|---|---|
| Visual aesthetics | 63.6% | 8.8% | 27.6% |
| Audio quality | 81.7% | 6.5% | 11.8% |
| Prompt following | 80.6% | 13.5% | 5.9% |
| IP consistency | 59.4% | 12.9% | 27.7% |

| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
|---|---|---|---|
| Visual aesthetics | 58.8% | 14.7% | 26.5% |
| Audio quality | 32.3% | 30.9% | 36.8% |
| Prompt following | 33.8% | 36.8% | 29.4% |
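A common way to condense a GSB study is the net preference: the share of votes for one system minus the share for the other, ignoring ties. The sketch below reads GSB as Good/Same/Bad from JoyAI-Echo's point of view, which is an interpretation of the tables rather than a definition given here; the numbers are copied from the tables and the function name is illustrative.

```python
# (good, same, bad) percentages from JoyAI-Echo's point of view.
LONG_VS_HAPPYOYSTER = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VS_WAN = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(good: float, same: float, bad: float) -> float:
    """Good minus bad, in percentage points; ties are ignored."""
    return round(good - bad, 1)


if __name__ == "__main__":
    for name, table in [("long", LONG_VS_HAPPYOYSTER), ("short", SHORT_VS_WAN)]:
        for aspect, gsb in table.items():
            print(f"{name:5s} {aspect:18s} {net_preference(*gsb):+.1f}")
```

On this measure the long-video margins range from +31.7 to +74.7 points, while the short-video comparison is mixed: +32.3 on visual aesthetics, +4.4 on prompt following, and -4.5 on audio quality.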
Acknowledgements
We gratefully acknowledge LTX-Video for the base video generator and Gemma for the text encoder, along with the broader open-source community.
Citation
If Echo-LongVideo helps your research or products, please cite:
@techreport{echo2026longvideo,
title = {Echo-LongVideo: Pushing the Frontier of Long Video Generation},
author = {{Echo Team @ Joy Future Academy, JD}},
institution = {Joy Future Academy, JD},
year = {2026},
month = {June},
url = {https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo}
}
License
Released under the LTX-2 Community License Agreement. By downloading or using these weights, you agree to its terms. The Gemma text encoder, which is downloaded separately, is governed by Google's separate Gemma license.