XaiJu
AI Knowledge Central

Your Talking Avatar - 🔥NSFW🔥 Audio

Hey friends! 💛 I put together a dead-simple InfiniteTalk + WAN I2V setup so you can get talking-head video generation working in ComfyUI without guesswork. You’ll install ComfyUI, a few essential nodes, and grab the exact models this workflow expects. Everything’s below—just match the GGUF to your GPU VRAM and you’re golden.

One-click installer for my Patreon supporters 💓:
https://www.patreon.com/posts/comfyui-infinite-139122339

💻 Software

  1. Get Git

    If you haven’t already, install Git from here:
    https://git-scm.com/

  2. ComfyUI — Generate video, images, audio, and more with a node graph.
    https://github.com/comfyanonymous/ComfyUI

    Download and install.

    Or, if you are on a 50XX-series GPU, go here:
    https://github.com/comfyanonymous/ComfyUI

    Download and Unzip.

  3. ComfyUI-Manager
    Open a command prompt in your ComfyUI folder under ComfyUI\custom_nodes and run:
    git clone https://github.com/Comfy-Org/ComfyUI-Manager.git

🧩 Must-have Custom Nodes

Install these from inside ComfyUI after installing the Manager:

KJNodes
https://github.com/kijai/ComfyUI-KJNodes

rgthree-comfy
https://github.com/rgthree/rgthree-comfy

VideoHelperSuite
https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite

MelBandRoFormer (node)
https://github.com/kijai/ComfyUI-MelBandRoFormer

WanVideoWrapper
https://github.com/kijai/ComfyUI-WanVideoWrapper
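
If the Manager route doesn't work for you, the same repos can be cloned by hand into ComfyUI\custom_nodes. Here's a minimal Python sketch of that, assuming Git is on your PATH. The function names `clone_commands` and `install_missing` are made up for this example and are not part of ComfyUI. It only clones; it does not install any Python requirements a node may ship, which the Manager would normally take care of.

```python
import shutil
import subprocess
from pathlib import Path

# Repositories from the list above.
CUSTOM_NODE_REPOS = [
    "https://github.com/kijai/ComfyUI-KJNodes",
    "https://github.com/rgthree/rgthree-comfy",
    "https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite",
    "https://github.com/kijai/ComfyUI-MelBandRoFormer",
    "https://github.com/kijai/ComfyUI-WanVideoWrapper",
]


def clone_commands(custom_nodes_dir, repos=CUSTOM_NODE_REPOS):
    """Return one `git clone` argument list per repo that is not cloned yet."""
    base = Path(custom_nodes_dir)
    commands = []
    for url in repos:
        # The folder name is the last path segment of the repo URL.
        target = base / url.rstrip("/").split("/")[-1]
        if not target.exists():
            commands.append(["git", "clone", url, str(target)])
    return commands


def install_missing(custom_nodes_dir):
    """Run the clone commands (hypothetical helper; needs Git on PATH)."""
    if shutil.which("git") is None:
        raise RuntimeError("Git was not found on PATH; install it first.")
    for cmd in clone_commands(custom_nodes_dir):
        subprocess.run(cmd, check=True)
```

Call `install_missing(r"C:\path\to\ComfyUI\custom_nodes")` with your own path, then restart ComfyUI so the new nodes get picked up.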

🧠 Pick the right GGUF for your VRAM
8 GB → Q4_K_M → fallback Q3_K_M
12 GB → Q5_K_M → fallback Q4_K_M
16 GB → Q6_K → fallback Q5_K_M
24 GB → Q6_K (with MelBandRoFormer fp32)
32 GB+ → Q8_0
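
If you'd rather script the lookup, here's a small Python sketch of the table above. The function name `pick_quant` is made up for this example. It assumes that a card sitting between two rows (say 10 GB) should use the lower row, and it gives no answer below 8 GB because the table doesn't list one.

```python
# (minimum VRAM in GB, preferred quant, fallback quant or None), copied from the table.
QUANT_TABLE = [
    (32, "Q8_0", None),
    (24, "Q6_K", None),
    (16, "Q6_K", "Q5_K_M"),
    (12, "Q5_K_M", "Q4_K_M"),
    (8, "Q4_K_M", "Q3_K_M"),
]


def pick_quant(vram_gb):
    """Return (preferred, fallback) for the highest table row that fits vram_gb."""
    for min_gb, preferred, fallback in QUANT_TABLE:
        if vram_gb >= min_gb:
            return preferred, fallback
    # Assumption: the table has no recommendation below 8 GB, so refuse to guess.
    raise ValueError(f"No recommendation in the table for {vram_gb} GB of VRAM")


if __name__ == "__main__":
    for gb in (8, 12, 16, 24, 32):
        print(gb, "GB ->", pick_quant(gb))
```

If PyTorch is installed, `torch.cuda.get_device_properties(0).total_memory / 1024**3` reports your card's memory in GiB. That number usually lands slightly under the advertised size, so round it before passing it in.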

📦 Models (what each is for)

😊 InfiniteTalk (GGUF) — Talking-head driver
This is the core GGUF model that powers lip-sync and head motion generation from audio input. It listens to the audio and turns it into realistic mouth movements, facial animation, and subtle head nods.
➡️ Store in: models/diffusion_models
https://huggingface.co/Kijai/WanVideo_comfy_GGUF/tree/main/InfiniteTalk

📹WAN 2.1 I2V 14B 480p (GGUF) — Image-to-Video backbone
Provides the motion engine that takes a still input frame and animates it into smooth video at 480p resolution. While InfiniteTalk handles facial sync, WAN I2V ensures natural motion and temporal coherence.
➡️ Store in: models/diffusion_models
https://huggingface.co/city96/Wan2.1-I2V-14B-480P-gguf/tree/main

⚡Lightx2v LoRA (Lightning 4-step) — Speed booster for I2V
A distilled low-step LoRA that reduces the number of diffusion steps (down to ~4) while keeping quality. It makes the whole pipeline faster and more efficient, especially for long sequences.
➡️ Store in: models/loras
https://huggingface.co/Kijai/WanVideo_comfy/resolve/main/Lightx2v/lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors?download=true

🧰 MelBandRoFormer (fp16 / fp32) — Vocal Separation Model
Separates raw audio into vocals and instruments, ensuring a clean speech track for lip-sync. InfiniteTalk relies on this isolated voice to avoid background noise interference.
➡️ Store in: models/diffusion_models
fp16: optimized for GPUs up to ~24 GB VRAM.
https://huggingface.co/Kijai/MelBandRoFormer_comfy/resolve/main/MelBandRoformer_fp16.safetensors?download=true
fp32: more stable on larger setups with high VRAM.
https://huggingface.co/Kijai/MelBandRoFormer_comfy/resolve/main/MelBandRoformer_fp32.safetensors?download=true

📜 UMT5-XXL (Text Encoder) — Prompt interpreter
A massive text encoder (based on mT5 XXL) that converts written prompts into semantic embeddings. This lets the model understand and follow style, context, and conditioning beyond audio.
➡️ Store in: models/text_encoders
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp16.safetensors?download=true

🖼️ CLIP-Vision H — Visual encoder
Processes input frames or reference images to ensure the animated video remains faithful to the original identity and composition.
➡️ Store in: models/clip_vision
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors

🛠️WAN 2.1 VAE — Latent encoder/decoder
The VAE compresses frames into latents for efficient processing and reconstructs them into visuals. Using the repackaged WAN VAE ensures maximum compatibility.
➡️ Store in: models/vae
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors?download=true

🧰 Tencent Wav2Vec2 (Base) — Speech Embedding Model
Extracts speech embeddings from the clean vocal track. These embeddings capture phoneme- and prosody-like features, which InfiniteTalk uses to generate accurate mouth shapes and timing.
No download link. This model is downloaded automatically at runtime.
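
To double-check that everything landed in the right folder, here's a minimal Python sketch that maps the direct-download links above to their "Store in" folders and lists what's still missing. The names `MODEL_FILES`, `destination` and `missing_models` are made up for this example. It only covers the files with a direct link: the two GGUF models are left out because their file name depends on the quant you picked, and Wav2Vec2 is fetched at runtime. Swap the fp16 MelBandRoFormer link for the fp32 one if that's the version you use.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# Direct-download URL -> folder inside the ComfyUI directory (from the list above).
MODEL_FILES = {
    "https://huggingface.co/Kijai/WanVideo_comfy/resolve/main/Lightx2v/lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors?download=true": "models/loras",
    "https://huggingface.co/Kijai/MelBandRoFormer_comfy/resolve/main/MelBandRoformer_fp16.safetensors?download=true": "models/diffusion_models",
    "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp16.safetensors?download=true": "models/text_encoders",
    "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors": "models/clip_vision",
    "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors?download=true": "models/vae",
}


def destination(comfy_root, url, folder):
    """Where the file behind `url` should live under the ComfyUI folder."""
    # Take the file name from the URL path, ignoring the ?download=true query.
    filename = PurePosixPath(urlparse(url).path).name
    return Path(comfy_root) / folder / filename


def missing_models(comfy_root, files=MODEL_FILES):
    """Return the expected paths that do not exist yet."""
    missing = []
    for url, folder in files.items():
        path = destination(comfy_root, url, folder)
        if not path.exists():
            missing.append(path)
    return missing
```

Run `missing_models(r"C:\path\to\ComfyUI")` with your own path. An empty list means every direct-download file is where the workflow expects it.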

👉 TL;DR flow

1. Install ComfyUI + ComfyUI-Manager
2. Add the listed custom nodes
3. Download the models above
4. Load the InfiniteTalk workflow, feed it your image, text prompt, and audio, and render


Comments

Hi, I haven't found a way to port it to WAN 2.2 yet. Regarding the Spanish audio, that is a tough one. I found this, but it seems it needs to be compiled somehow: https://huggingface.co/flax-community/wav2vec2-spanish/tree/main To be honest, that is beyond my knowledge, sorry.

Chris Wenzl

Hi Chris, any chance for this workflow to: 1. move to WAN 2.2 (better LoRA support, higher quality) 2. include Spanish-language lip sync? Currently it doesn't work well with Spanish audio (I assume due to the Chinese wav2vec2 model).

Niko Louvranos

Hi, could you try lowering the output resolution or the length of the audio file so that you get a "quick" test? DM me if the problem persists.

Chris Wenzl

Hi Chris, I can't seem to get this past WanVideoSampler. It sticks at that, which is 75% through the whole process but 0% through WanVideoSampler. I've tried changing the blocks to swap value to 40, but it didn't make a difference. Same with lowering the resolution of the source image. I've also tried Q5_K_M and Q4_K_M, to no avail. Any help would be appreciated.

Iain Forbes

