jdopensource/JoyAI-Echo · Hugging Face

Echo-LongVideo generated video gallery

Echo-LongVideo

🎬 Pushing the Frontier of Long Video Generation

Official model weights for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.

📄 Paper | 💻 Inference Code | 🧬 Model | 🚀 Usage | 📊 Results | 📝 Citation

Text-to-Video | Audio + Video | 5 minute long video | Model Weights

Model Summary

Echo-LongVideo (a.k.a. JoyAI-Echo) is a long-form, multi-shot, audio-video generation model. A cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently across up to five-minute videos, and a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a 7.5× inference speedup without sacrificing quality.

In human evaluation, Echo-LongVideo decisively outperforms HappyOyster (directing mode) on long-form generation and surpasses the short-video specialist Wan 2.6 on human-centric tasks.

This repository hosts the released checkpoint. Inference code is released separately; see the Usage section.

Model Details

  • Developed by: Echo Team @ Joy Future Academy, JD
  • Model type: Text-to-(Audio+Video) diffusion transformer, DMD 8-step
  • Modality: Text → synchronized video + audio
  • Backbone: Built on top of LTX-Video
  • Text encoder: google/gemma-3-12b-it (downloaded separately)
  • Resolution / length (by default): 1280 × 736, 241 frames @ 25 fps per shot
  • Max story length: up to 5 minutes (multi-shot)
  • License: LTX-2 Community License Agreement

Highlights

  • 🎞️ Minute-level multi-shot stories from a single prompt JSON.
  • ⚡ DMD-distilled few-step inference, ~7.5× faster than the original pipeline.
  • 🔊 Joint audio-video generation in a single pipeline.
  • 🧠 Paired cross-modal memory bank for story-level identity and voice consistency.

Usage

Inference is run with the standalone Echo-LongVideo inference repository.

1. Download the checkpoint

huggingface-cli download <org>/Echo-LongVideo \
  --local-dir checkpoints

Also download the Gemma text encoder:

huggingface-cli download google/gemma-3-12b-it \
  --local-dir checkpoints/gemma-3-12b

Expected layout:

checkpoints/
├── echo-longvideo-release.safetensors
└── gemma-3-12b/
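
Before running inference, it can help to confirm that both downloads landed where the layout above expects them. The following is a minimal sketch, not part of the released code: `missing_checkpoint_items` is a hypothetical helper, and the only names it relies on are the file and directory names shown in the listing.

```python
from pathlib import Path

# Names taken from the "Expected layout" listing above.
WEIGHTS_FILE = "echo-longvideo-release.safetensors"
GEMMA_DIR = "gemma-3-12b"


def missing_checkpoint_items(root: str = "checkpoints") -> list[str]:
    """Return the expected entries that are absent under `root` (hypothetical helper)."""
    base = Path(root)
    missing = []
    if not (base / WEIGHTS_FILE).is_file():
        missing.append(WEIGHTS_FILE)
    gemma = base / GEMMA_DIR
    # The encoder directory should exist and contain at least one entry.
    if not gemma.is_dir() or not any(gemma.iterdir()):
        missing.append(GEMMA_DIR + "/")
    return missing


if __name__ == "__main__":
    problems = missing_checkpoint_items()
    print("layout OK" if not problems else f"missing: {problems}")
```

The check only looks at names and presence; it does not validate file contents or sizes.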

2. Get the inference code

git clone https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo.git
cd JoyAI-Echo

Environment: Python 3.11 + PyTorch 2.8 + CUDA 12.8 (see the inference repo's environment.yml / requirements.txt).

3. Write a story prompt

Enhance your prompt first. We provide prompt enhancers: system prompts that expand a short story or idea into well-formed shot prompts. Use prompts/long_story_writer_system_prompt.md for long, multi-shot video and prompts/short_story_writer_system_prompt.md for single-shot short video. We strongly recommend running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.

Create a JSON file under prompts/. Each file is a single object with a prompts list, where every string is one complete shot. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.

Inside each string, write these parts in order:

| Part | What to describe |
| --- | --- |
| Roles & Subjects | The appearance of all visible people, including age, build, hair, face, and wardrobe, plus speaking voice timbre when applicable. |
| Action & Dialogue | What the subject does and says. |
| Style | The overall visual and emotional aesthetic, e.g. realistic motorsport film language, cool daylight, restrained cinematic tension. |
| Camera Movement | The shot type and framing or movement, e.g. a stable close-up on the face, or a medium shot from the waist up. |
| Background | The setting and scene details behind the subject. |
| Sound Effects & BGM | The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue, or no background music. |

A more convenient prompt-writing workflow will be released as a director agent for everyone to use.
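
As an illustration of the file format described in this step, the sketch below assembles each shot string from the six parts in the documented order and writes a single object with a `prompts` list. It is an assumption-laden example rather than the repository's own tooling: the dictionary keys in `PART_ORDER`, the helpers `build_shot` and `write_story`, joining parts with a space, and the sample shot text are all hypothetical. Only the top-level `prompts` key and the part order come from the description above.

```python
import json
from pathlib import Path

# Part order follows the "Part / What to describe" table; key names are hypothetical.
PART_ORDER = (
    "roles_and_subjects",
    "action_and_dialogue",
    "style",
    "camera_movement",
    "background",
    "sound_effects_and_bgm",
)


def build_shot(parts: dict[str, str]) -> str:
    """Join the six parts, in the documented order, into one shot string."""
    missing = [name for name in PART_ORDER if not parts.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing parts: {missing}")
    # Joining with a single space is an assumption, not a documented rule.
    return " ".join(parts[name].strip() for name in PART_ORDER)


def write_story(path: str, shots: list[dict[str, str]]) -> dict:
    """Write {"prompts": [...]} with one string per shot and return the object."""
    story = {"prompts": [build_shot(shot) for shot in shots]}
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(story, ensure_ascii=False, indent=2), encoding="utf-8")
    return story


if __name__ == "__main__":
    # Illustrative shot text only; run it through the prompt enhancer in practice.
    example = {
        "roles_and_subjects": "A woman in her thirties with short black hair and a gray coat; calm, low voice.",
        "action_and_dialogue": 'She looks up and says, "We leave at dawn."',
        "style": "Realistic film language, cool daylight.",
        "camera_movement": "A stable close-up on the face.",
        "background": "An empty train platform.",
        "sound_effects_and_bgm": "Room tone and distant wind, no background music.",
    }
    print(json.dumps({"prompts": [build_shot(example)]}, indent=2))
```

Calling `write_story("prompts/my_story.json", [shot_1, shot_2])` with two such dictionaries would produce a two-shot story file; the file name is a placeholder.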

4. Run

python inference.py

Outputs land in inference_result/outputs/<prompt-name>/inference_<timestamp>/.
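
Because every run creates a new timestamped directory, a small helper can locate the most recent one for a given prompt. This is a sketch under stated assumptions: `latest_run_dir` is a hypothetical function, `<prompt-name>` is passed in by the caller, and runs are ordered by modification time because the timestamp format is not specified here.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(prompt_name: str,
                   root: str = "inference_result/outputs") -> Optional[Path]:
    """Return the most recently modified inference_* directory for a prompt, or None."""
    base = Path(root) / prompt_name
    if not base.is_dir():
        return None
    runs = [p for p in base.glob("inference_*") if p.is_dir()]
    # Order by modification time; the timestamp format in the name is not documented.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    # "my_story" is a placeholder prompt name.
    print(latest_run_dir("my_story"))
```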

Hardware

Peak GPU memory is ~46–50 GB at the default 1280 × 736 × 241 frame setting; a single H100/A100 (80 GB) or 48 GB GPU is sufficient. For smaller GPUs, lower resolution or frame count:

python inference.py --num-frames 121 --video-height 480 --video-width 832
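
When picking other reduced settings, a quick sanity check can be useful. Both documented settings share a pattern: width and height are multiples of 32 and the frame count has the form 8k + 1. The sketch below checks for that pattern and compares raw width × height × frames volume against the default. Treat both as assumptions: whether inference.py enforces these constraints is not stated here, and the volume ratio is not a memory prediction, since model weights take a fixed share of GPU memory. The helper names are hypothetical.

```python
# The two settings documented above.
DEFAULT = {"width": 1280, "height": 736, "frames": 241}
REDUCED = {"width": 832, "height": 480, "frames": 121}


def follows_observed_pattern(width: int, height: int, frames: int) -> bool:
    """True if the shape matches the pattern shared by both documented settings."""
    return width % 32 == 0 and height % 32 == 0 and frames % 8 == 1


def relative_volume(setting: dict, reference: dict = DEFAULT) -> float:
    """Ratio of width * height * frames against the reference setting."""
    def volume(s: dict) -> int:
        return s["width"] * s["height"] * s["frames"]
    return volume(setting) / volume(reference)


if __name__ == "__main__":
    print(follows_observed_pattern(**DEFAULT), follows_observed_pattern(**REDUCED))
    print(f"reduced / default volume: {relative_volume(REDUCED):.3f}")  # 0.213
```

The reduced command processes roughly a fifth of the pixel-frame volume of the default, which gives a feel for how much the activation-side workload shrinks.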

Results

Reported Scale

| Item | Value |
| --- | --- |
| 🎬 Long-form coherent story length | 5 min |
| ⚡ Speedup over the original multi-step pipeline | 7.5× |
| 📚 Benchmark stories | 100 |
| 🎞️ Generated evaluation shots | 3,000 |
| 🕒 Frames per shot | 241 @ 25 fps |
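
The reported numbers can be related with a little arithmetic. The sketch below assumes that shots are simply concatenated with no overlap and that the 3,000 evaluation shots are split evenly across the 100 benchmark stories; neither assumption is stated in the table.

```python
# Figures copied from the Reported Scale table.
FRAMES_PER_SHOT = 241
FPS = 25
STORIES = 100
SHOTS = 3_000


def shot_seconds(frames: int = FRAMES_PER_SHOT, fps: int = FPS) -> float:
    """Duration of one shot in seconds."""
    return frames / fps


def story_seconds(num_shots: int) -> float:
    """Total length if shots are concatenated without overlap (an assumption)."""
    return num_shots * shot_seconds()


# 30 shots per story, assuming an even split across stories.
shots_per_story = SHOTS // STORIES

if __name__ == "__main__":
    print(f"one shot: {shot_seconds():.2f} s")                       # 9.64 s
    print(f"{shots_per_story} shots: {story_seconds(shots_per_story):.1f} s")  # 289.2 s
```

Under these assumptions a 30-shot story runs about 4.8 minutes, which is consistent with the stated maximum of up to 5 minutes.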

Human Evaluation

GSB (Good/Same/Bad) user study. Values are the percentage of user preferences.

| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
| --- | --- | --- | --- |
| Visual aesthetics | 63.6% | 8.8% | 27.6% |
| Audio quality | 81.7% | 6.5% | 11.8% |
| Prompt following | 80.6% | 13.5% | 5.9% |
| IP consistency | 59.4% | 12.9% | 27.7% |

| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
| --- | --- | --- | --- |
| Visual aesthetics | 58.8% | 14.7% | 26.5% |
| Audio quality | 32.3% | 30.9% | 36.8% |
| Prompt following | 33.8% | 36.8% | 29.4% |
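
One common way to summarize a preference study like this is the net preference, the share preferring JoyAI-Echo minus the share preferring the baseline, alongside the win rate with ties excluded. Neither summary is reported by the authors; the sketch below simply applies them to the percentages in the tables, and the function names are hypothetical.

```python
# (JoyAI-Echo preferred, tie, baseline preferred) percentages from the tables above.
LONG_VS_HAPPYOYSTER = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VS_WAN = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(good: float, same: float, bad: float) -> float:
    """Percentage points by which JoyAI-Echo leads (negative if it trails)."""
    return good - bad


def win_rate_excluding_ties(good: float, same: float, bad: float) -> float:
    """Share of decided comparisons won by JoyAI-Echo."""
    return good / (good + bad)


if __name__ == "__main__":
    for name, table in (("long", LONG_VS_HAPPYOYSTER), ("short", SHORT_VS_WAN)):
        for aspect, row in table.items():
            print(f"{name:5s} {aspect:18s} net {net_preference(*row):+6.1f} pts, "
                  f"win rate {win_rate_excluding_ties(*row):.2f}")
```

On these figures the long-video comparison is positive on every aspect, while the short-video comparison is positive on visual aesthetics, slightly positive on prompt following, and slightly negative on audio quality.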

Acknowledgements

We gratefully acknowledge LTX-Video for the base video generator and Gemma for the text encoder, along with the broader open-source community.

Citation

If Echo-LongVideo helps your research or products, please cite:

@techreport{echo2026longvideo,
  title        = {Echo-LongVideo: Pushing the Frontier of Long Video Generation},
  author       = {{Echo Team @ Joy Future Academy, JD}},
  institution  = {Joy Future Academy, JD},
  year         = {2026},
  month        = {June},
  url          = {https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo}
}

License

Released under the LTX-2 Community License Agreement. By downloading or using these weights, you agree to its terms. The Gemma text encoder (downloaded separately) is governed by Google's separate Gemma license.
