Echo-LongVideo

🎬 Pushing the Frontier of Long Video Generation

Official model weights for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.

Text-to-Video · Audio + Video · 5-minute long video · Model Weights

Model Summary

Echo-LongVideo (a.k.a. JoyAI-Echo) is a long-form, multi-shot, audio-video generation model. A cross-modal audio-visual memory bank keeps character appearance and voice timbre consistent across videos up to five minutes long, and a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a 7.5× inference speedup without sacrificing quality.

In human evaluation, Echo-LongVideo decisively outperforms HappyOyster (directing mode) on long-form generation and surpasses the short-video specialist Wan 2.6 on human-centric tasks.

This repository hosts the released checkpoint. Inference code is released separately — see the Usage section.

Model Details

Developed by: Echo Team @ Joy Future Academy, JD
Model type: Text-to-(Audio+Video) diffusion transformer, DMD 8-step
Modality: Text → synchronized video + audio
Backbone: Built on top of LTX-Video
Text encoder: google/gemma-3-12b-it (downloaded separately)
Resolution / length (by default): 1280 × 736, 241 frames @ 25 fps per shot
Max story length: up to 5 minutes (multi-shot)
License: LTX-2 Community License Agreement

Highlights

🎞️ Minute-level multi-shot stories from a single prompt JSON.
⚡ DMD-distilled few-step inference, ~7.5× faster than the original pipeline.
🔊 Joint audio-video generation in a single pipeline.
🧠 Paired cross-modal memory bank for story-level identity and voice consistency.

Usage

Inference is run with the standalone Echo-LongVideo inference repository.

1. Download the checkpoint

huggingface-cli download <org>/Echo-LongVideo \
  --local-dir checkpoints

Also download the Gemma text encoder:

huggingface-cli download google/gemma-3-12b-it \
  --local-dir checkpoints/gemma-3-12b

Expected layout:

checkpoints/
├── echo-longvideo-release.safetensors
└── gemma-3-12b/

2. Get the inference code

git clone https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo.git
cd JoyAI-Echo

Environment: Python 3.11 + PyTorch 2.8 + CUDA 12.8 (see the inference repo's environment.yml / requirements.txt).

3. Write a story prompt

Enhance your prompt first. We provide prompt enhancers — system prompts that expand a short story or idea into well-formed shot prompts: prompts/long_story_writer_system_prompt.md for long, multi-shot video, and prompts/short_story_writer_system_prompt.md for single-shot short video. We strongly recommend running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.

Create a JSON file under prompts/. Each file is a single object with a prompts list, where every string is one complete shot. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.

Inside each string, write these parts in order:

Part	What to describe
Roles & Subjects	Describe the appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable.
Action & Dialogue	What the subject does and says.
Style	The overall visual and emotional aesthetic — e.g. realistic motorsport film language, cool daylight, restrained cinematic tension.
Camera Movement	The shot type and framing or movement — e.g. a stable close-up on the face, or a medium shot from the waist up.
Background	The setting and scene details behind the subject.
Sound Effects & BGM	The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue, or no background music.
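
The file format and part order described above can be sketched in Python. This is an illustration under assumptions: the `prompts` key comes from the description above, but `build_shot`, `write_story`, the choice to join parts with single spaces, the sample wording, and the file name `my_story.json` are all hypothetical. The sample prompt files in the inference repo are the authority on the phrasing the model expects.

```python
import json
from pathlib import Path

# Part order from the table above.
PART_ORDER = [
    "roles_and_subjects",
    "action_and_dialogue",
    "style",
    "camera_movement",
    "background",
    "sound_effects_and_bgm",
]


def build_shot(parts):
    """Join the six parts into one shot string, in the documented order."""
    missing = [name for name in PART_ORDER if not parts.get(name)]
    if missing:
        raise ValueError(f"missing parts: {missing}")
    # Joining with single spaces is an assumption, not a documented rule.
    return " ".join(parts[name].strip() for name in PART_ORDER)


def write_story(shots, path):
    """Write one JSON object whose `prompts` list holds one string per shot."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"prompts": list(shots)}
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


# Hypothetical sample shot, written only to show the structure.
shot = build_shot({
    "roles_and_subjects": "A woman in her early thirties with a slim build, "
                          "shoulder-length black hair, and a grey wool coat; "
                          "she speaks in a calm, low voice.",
    "action_and_dialogue": 'She looks up from a letter and says, "I did not expect you so soon."',
    "style": "Realistic film language, cool daylight, restrained cinematic tension.",
    "camera_movement": "A stable close-up on the face.",
    "background": "A quiet train platform with blurred commuters behind her.",
    "sound_effects_and_bgm": "Room tone, wind, footsteps and fabric, "
                             "with a soft low music bed under the dialogue.",
})
print(json.dumps({"prompts": [shot]}, ensure_ascii=False, indent=2))
```

Calling `write_story([shot], "prompts/my_story.json")` would save a one-shot story; passing several strings in the list would describe a multi-shot story.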

A director agent that streamlines prompt writing is planned for a future release.

4. Run

python inference.py

Outputs land in inference_result/outputs/<prompt-name>/inference_<timestamp>/.
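
To pick up the newest run programmatically, something like the following could work. It is a sketch: `latest_run` is a hypothetical helper, and because the timestamp format is not stated here, it orders runs by directory modification time rather than by parsing the directory name.

```python
from pathlib import Path


def latest_run(prompt_name, root="inference_result/outputs"):
    """Return the most recently modified inference_* directory for a prompt, or None."""
    prompt_dir = Path(root) / prompt_name
    runs = [p for p in prompt_dir.glob("inference_*") if p.is_dir()]
    # max() with default=None handles the case where no run exists yet.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```

For a prompt file named `my_story.json` (a hypothetical name), `latest_run("my_story")` would return the newest matching directory, or `None` if inference has not been run for it.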

Hardware

Peak GPU memory is ~46–50 GB at the default 1280 × 736 × 241-frame setting, so a single H100/A100 (80 GB) or a 48 GB GPU is sufficient. For smaller GPUs, lower the resolution or frame count:

python inference.py --num-frames 121 --video-height 480 --video-width 832
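
For a rough sense of how much the reduced setting shrinks the workload, compare the raw pixel-frame counts of the two configurations. This is back-of-the-envelope arithmetic only: peak memory also includes fixed costs such as the model weights and the text encoder, so it will not fall in the same proportion, and the real figure should be measured on the target GPU.

```python
def pixel_frames(width, height, frames):
    """Total number of pixels across all frames of one shot."""
    return width * height * frames


default = pixel_frames(1280, 736, 241)  # default setting
reduced = pixel_frames(832, 480, 121)   # values from the command above

print(default)                     # 227041280
print(reduced)                     # 48322560
print(f"{reduced / default:.1%}")  # 21.3%
```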

Results

Reported Scale

Item	Value
🎬 Long-form coherent story length	5 min
⚡ Speedup over the original multi-step pipeline	7.5×
📚 Benchmark stories	100
🎞️ Generated evaluation shots	3,000
🕒 Frames per shot	241 @ 25 fps
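
The figures in this table can be cross-checked with simple arithmetic. The sketch below assumes that every evaluation shot uses the default length and that shots are spread evenly across the benchmark stories; neither is stated explicitly, so the result is an estimate.

```python
FRAMES_PER_SHOT = 241
FPS = 25
STORIES = 100
SHOTS = 3000

shot_seconds = FRAMES_PER_SHOT / FPS                 # duration of one shot
shots_per_story = SHOTS / STORIES                    # average, assuming an even split
story_minutes = shots_per_story * shot_seconds / 60  # average story duration

print(f"{shot_seconds:.2f} s per shot")                      # 9.64 s per shot
print(f"{shots_per_story:.0f} shots per story on average")   # 30 shots per story on average
print(f"{story_minutes:.2f} min per story")                  # 4.82 min per story
```

Under these assumptions, 30 shots of 9.64 s each give roughly 4.8 minutes per story, which is in line with the reported 5-minute story length.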

Human Evaluation

GSB (Good/Same/Bad) user study. Each value is the percentage of user votes that preferred the option in that column.

Aspect (Long Video)	JoyAI-Echo	Tie	HappyOyster (Directing)
Visual aesthetics	63.6%	8.8%	27.6%
Audio quality	81.7%	6.5%	11.8%
Prompt following	80.6%	13.5%	5.9%
IP consistency	59.4%	12.9%	27.7%

Aspect (Short Video)	JoyAI-Echo	Tie	Wan 2.6
Visual aesthetics	58.8%	14.7%	26.5%
Audio quality	32.3%	30.9%	36.8%
Prompt following	33.8%	36.8%	29.4%
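
One common way to condense GSB results into a single number per aspect is the Good-minus-Bad margin, which ignores ties. The sketch below applies it to the two tables above. The margin is an illustrative summary, not a metric reported by the authors, and the names `gsb_margin`, `LONG_VS_HAPPYOYSTER`, and `SHORT_VS_WAN` are hypothetical.

```python
# (JoyAI-Echo preferred %, tie %, baseline preferred %), copied from the tables above.
LONG_VS_HAPPYOYSTER = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VS_WAN = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def gsb_margin(good, same, bad):
    """Good-minus-Bad margin in percentage points; positive favours JoyAI-Echo."""
    return round(good - bad, 1)


for title, table in (("Long video", LONG_VS_HAPPYOYSTER), ("Short video", SHORT_VS_WAN)):
    print(title)
    for aspect, row in table.items():
        print(f"  {aspect}: {gsb_margin(*row):+.1f}")
```

Read this way, every long-video aspect favours JoyAI-Echo, while on short video the audio-quality margin is negative (32.3% against 36.8%), meaning Wan 2.6 was preferred slightly more often on that aspect.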

Acknowledgements

We gratefully acknowledge LTX-Video for the base video generator and Gemma for the text encoder, along with the broader open-source community.

Citation

If Echo-LongVideo helps your research or products, please cite:

@techreport{echo2026longvideo,
  title        = {Echo-LongVideo: Pushing the Frontier of Long Video Generation},
  author       = {{Echo Team @ Joy Future Academy, JD}},
  institution  = {Joy Future Academy, JD},
  year         = {2026},
  month        = {June},
  url          = {https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo}
}

License

Released under the LTX-2 Community License Agreement. By downloading or using these weights, you agree to its terms. The Gemma text encoder is downloaded separately and is governed by Google's own Gemma license.