Hey friends! 💛 I put together a dead-simple InfiniteTalk + WAN I2V setup so you can get talking-head video generation working in ComfyUI without guesswork. You’ll install ComfyUI, a few essential nodes, and grab the exact models this workflow expects. Everything’s below—just match the GGUF to your GPU VRAM and you’re golden.
One-click installer for my Patreon supporters 💓:
https://www.patreon.com/posts/comfyui-infinite-139122339
💻 Software
Get Git
If you haven’t already, install Git from here:
https://git-scm.com/
ComfyUI — Generate video, images, audio, and more with a node graph.
https://github.com/comfyanonymous/ComfyUI
Download and install.
Or, if you have a 50XX-series GPU, go here:
https://github.com/comfyanonymous/ComfyUI
Download and unzip.
ComfyUI-Manager
Open a command prompt in your ComfyUI folder under ComfyUI\custom_nodes and run:
git clone https://github.com/Comfy-Org/ComfyUI-Manager.git
🧩 Must-have Custom Nodes
Install these from inside ComfyUI after installing the Manager:
KJNodes
https://github.com/kijai/ComfyUI-KJNodes
rgthree-comfy
https://github.com/rgthree/rgthree-comfy
VideoHelperSuite
https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite
MelBandRoFormer (node)
https://github.com/kijai/ComfyUI-MelBandRoFormer
WanVideoWrapper
https://github.com/kijai/ComfyUI-WanVideoWrapper
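If you'd rather script the node installs than click through the Manager, here's a rough sketch that builds the git clone commands for the five repos above. The "ComfyUI" root path and the dry_run flag are my own placeholders, not part of the workflow. Some of these nodes may also need their requirements.txt installed afterwards, which this sketch doesn't handle.

```python
import subprocess
from pathlib import Path

# The five custom-node repos listed above.
NODE_REPOS = [
    "https://github.com/kijai/ComfyUI-KJNodes",
    "https://github.com/rgthree/rgthree-comfy",
    "https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite",
    "https://github.com/kijai/ComfyUI-MelBandRoFormer",
    "https://github.com/kijai/ComfyUI-WanVideoWrapper",
]


def clone_commands(comfy_root, repos=NODE_REPOS):
    """Return (command, target_dir) pairs for repos that are not cloned yet."""
    custom_nodes = Path(comfy_root) / "custom_nodes"
    plan = []
    for url in repos:
        # Folder name = last part of the repo URL, same as plain `git clone`.
        target = custom_nodes / url.rstrip("/").split("/")[-1]
        if not target.exists():
            plan.append((["git", "clone", url, str(target)], target))
    return plan


def install_nodes(comfy_root, dry_run=True):
    """Print each clone command; only run git when dry_run is False."""
    for cmd, _target in clone_commands(comfy_root):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point this at your own install folder.
    install_nodes("ComfyUI", dry_run=True)
```

Run it once with dry_run=True to see what it would do, then flip it to False when the paths look right. Restart ComfyUI afterwards so the new nodes get picked up.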
🧠 Pick the right GGUF for your VRAM
8 GB → Q4_K_M → fallback Q3_K_M
12 GB → Q5_K_M → fallback Q4_K_M
16 GB → Q6_K → fallback Q5_K_M
24 GB → Q6_K (with MelBandRoFormer fp32)
32 GB+ → Q8_0
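The table above boils down to a simple lookup. Here's a minimal sketch of it. The function name pick_quant is made up for illustration, and rounding in-between cards (say 10 GB) down to the nearest tier is my assumption, since the table only lists exact sizes.

```python
# (minimum VRAM in GB, first choice, fallback or None), highest tier first.
# Values are copied from the table above.
QUANT_TABLE = [
    (32, "Q8_0", None),
    (24, "Q6_K", None),
    (16, "Q6_K", "Q5_K_M"),
    (12, "Q5_K_M", "Q4_K_M"),
    (8, "Q4_K_M", "Q3_K_M"),
]


def pick_quant(vram_gb):
    """Return (first_choice, fallback) for a given amount of VRAM in GB."""
    for min_vram, first, fallback in QUANT_TABLE:
        # Assumption: a card between two tiers uses the lower tier.
        if vram_gb >= min_vram:
            return first, fallback
    # The table does not cover cards below 8 GB, so don't guess.
    raise ValueError(f"No recommendation for {vram_gb} GB of VRAM")


if __name__ == "__main__":
    first, fallback = pick_quant(12)
    print(f"Try {first} first, then {fallback} if you run out of memory")
```

If you're not sure how much VRAM you have, nvidia-smi will show it on NVIDIA cards.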
📦 Models (what each is for)
😊 InfiniteTalk (GGUF) — Talking-head driver
This is the core GGUF model that powers lip-sync and head motion generation from audio input. It listens to the audio and turns it into realistic mouth movements, facial animation, and subtle head nods. ➡️ Store in: models/diffusion_models
https://huggingface.co/Kijai/WanVideo_comfy_GGUF/tree/main/InfiniteTalk
📹WAN 2.1 I2V 14B 480p (GGUF) — Image-to-Video backbone
Provides the motion engine that takes a still input frame and animates it into smooth video at 480p resolution. While InfiniteTalk handles facial sync, WAN I2V ensures natural motion and temporal coherence. ➡️ Store in: models/diffusion_models
https://huggingface.co/city96/Wan2.1-I2V-14B-480P-gguf/tree/main
⚡Lightx2v LoRA (Lightning 4-step) — Speed booster for I2V
A distilled low-step LoRA that reduces the number of diffusion steps (down to ~4) while keeping quality. It makes the whole pipeline faster and more efficient, especially for long sequences. ➡️ Store in: models/loras
https://huggingface.co/Kijai/WanVideo_comfy/resolve/main/Lightx2v/lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors?download=true
🧰 MelBandRoFormer (fp16 / fp32) — Vocal Separation Model
Separates raw audio into vocals and instruments, ensuring a clean speech track for lip-sync. InfiniteTalk relies on this isolated voice to avoid background noise interference.
➡️ Store in: models/diffusion_models
fp16: optimized for GPUs up to ~24 GB VRAM.
https://huggingface.co/Kijai/MelBandRoFormer_comfy/resolve/main/MelBandRoformer_fp16.safetensors?download=true
fp32: more stable on larger setups with high VRAM.
https://huggingface.co/Kijai/MelBandRoFormer_comfy/resolve/main/MelBandRoformer_fp32.safetensors?download=true
📜 UMT5-XXL (Text Encoder) — Prompt interpreter
A massive text encoder (based on mT5 XXL) that converts written prompts into semantic embeddings. This lets the model understand and follow style, context, and conditioning beyond audio. ➡️ Store in: models/text_encoders
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp16.safetensors?download=true
🖼️ CLIP-Vision H — Visual encoder
Processes input frames or reference images to ensure the animated video remains faithful to the original identity and composition. ➡️ Store in: models/clip_vision
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors
🛠️WAN 2.1 VAE — Latent encoder/decoder
The VAE compresses frames into latents for efficient processing and reconstructs them into visuals. Using the repackaged WAN VAE ensures maximum compatibility. ➡️ Store in: models/vae
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors?download=true
🧰 Tencent Wav2Vec2 (Base) — Speech Embedding Model
Extracts speech embeddings from the clean vocal track. These embeddings capture phoneme- and prosody-like features, which InfiniteTalk uses to generate accurate mouth shapes and timing.
No download link needed. This model is downloaded automatically at runtime.
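To keep the "which file goes where" part straight, here's a small sketch that maps the direct-download links above to their target folders. It only covers the links that point at a single file. The two GGUF repos are folder listings, so you still pick the quant that fits your VRAM by hand. The names destination and fetch_all and the "ComfyUI" root are placeholders of mine, and the list uses the fp16 MelBandRoFormer, so swap in the fp32 link if that's the one you want.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve

HF = "https://huggingface.co"
REPACK = f"{HF}/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files"

# (direct-download URL, subfolder under ComfyUI/models), from the list above.
DIRECT_DOWNLOADS = [
    (f"{HF}/Kijai/WanVideo_comfy/resolve/main/Lightx2v/"
     "lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors?download=true",
     "loras"),
    (f"{HF}/Kijai/MelBandRoFormer_comfy/resolve/main/"
     "MelBandRoformer_fp16.safetensors?download=true",
     "diffusion_models"),
    (f"{REPACK}/text_encoders/umt5_xxl_fp16.safetensors?download=true",
     "text_encoders"),
    (f"{REPACK}/clip_vision/clip_vision_h.safetensors", "clip_vision"),
    (f"{REPACK}/vae/wan_2.1_vae.safetensors?download=true", "vae"),
]


def destination(comfy_root, url, subfolder):
    """Build the local path for a URL, dropping any ?download=true query."""
    filename = PurePosixPath(urlparse(url).path).name
    return Path(comfy_root) / "models" / subfolder / filename


def fetch_all(comfy_root, dry_run=True):
    """Print the download plan; only fetch files when dry_run is False."""
    for url, subfolder in DIRECT_DOWNLOADS:
        dest = destination(comfy_root, url, subfolder)
        if dest.exists():
            continue  # already downloaded
        print(f"{url}\n  -> {dest}")
        if not dry_run:
            dest.parent.mkdir(parents=True, exist_ok=True)
            urlretrieve(url, dest)


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point this at your own install folder.
    fetch_all("ComfyUI", dry_run=True)
```

Some of these files are several gigabytes, so a browser or a download manager with resume support may be the more comfortable option. The sketch is mostly there to show the folder layout.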
👉 TL;DR flow
1. Install ComfyUI + ComfyUI-Manager
2. Add the listed custom nodes
3. Download the models above
4. Load the InfiniteTalk workflow in ComfyUI, feed text/audio, and render
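Before you hit render, a quick sanity check can save a failed run. This is a coarse sketch that only looks for at least one .gguf or .safetensors file in each models subfolder named in this guide. It does not verify that you grabbed the right file, and the name missing_folders plus the "ComfyUI" root are placeholders of mine.

```python
from pathlib import Path

# Subfolder under ComfyUI/models -> what this guide puts there.
EXPECTED = {
    "diffusion_models": "InfiniteTalk GGUF, WAN 2.1 I2V GGUF, MelBandRoFormer",
    "loras": "Lightx2v LoRA",
    "text_encoders": "UMT5-XXL",
    "clip_vision": "CLIP-Vision H",
    "vae": "WAN 2.1 VAE",
}
MODEL_SUFFIXES = {".gguf", ".safetensors"}


def missing_folders(comfy_root):
    """Return the models subfolders that hold no .gguf/.safetensors file."""
    models = Path(comfy_root) / "models"
    missing = []
    for sub in EXPECTED:
        folder = models / sub
        has_model = folder.is_dir() and any(
            p.suffix.lower() in MODEL_SUFFIXES for p in folder.iterdir()
        )
        if not has_model:
            missing.append(sub)
    return missing


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point this at your own install folder.
    for sub in missing_folders("ComfyUI"):
        print(f"models/{sub} looks empty, expected: {EXPECTED[sub]}")
```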