
Video: https://youtu.be/3j9c_-mRKfg
In this video, you'll learn how to run Qwen3-VL, a vision-language model, directly inside ComfyUI to generate detailed text prompts from images or videos, and then use those prompts to create new AI-generated content. We walk through both an image workflow and a video workflow, showing how Qwen3-VL analyzes visual input and produces rich, time-coded descriptions that feed into diffusion models such as WAN 2.2 or SDXL. Whether you're refining images with multi-stage sampling, applying LoRAs for style control, or generating synchronized video narratives, this tutorial gives you a practical, local, and customizable pipeline.
This content is aimed at AI artists, ComfyUI users, and creators who want to move beyond basic prompting and explore vision-driven generation. It connects multimodal AI with everyday creative workflows, with no cloud APIs and no subscriptions: everything runs locally, under your control.
Resources:
Qwen3-VL-4B-Instruct
https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct
Qwen3-VL-4B-Instruct-FP8
https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct-FP8
Qwen3-VL-8B-Instruct
https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
ComfyUI-QwenVL
https://github.com/1038lab/ComfyUI-QwenVL
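If you prefer to fetch one of the checkpoints listed above ahead of time, a minimal Python sketch could look like the following. It assumes the `huggingface_hub` package is installed, and the target folder `ComfyUI/models/LLM/Qwen-VL` is only an assumption about where the custom node looks for models; check the ComfyUI-QwenVL README for the actual location. The helper names `local_model_dir` and `download_model` are hypothetical and not part of any library.

```python
from pathlib import Path

# Repo ids taken from the resource list above.
QWEN_VL_REPOS = [
    "Qwen/Qwen3-VL-4B-Instruct",
    "Qwen/Qwen3-VL-4B-Instruct-FP8",
    "Qwen/Qwen3-VL-8B-Instruct",
]


def local_model_dir(repo_id: str, models_root: Path) -> Path:
    """Map 'Qwen/Qwen3-VL-4B-Instruct' to '<models_root>/Qwen3-VL-4B-Instruct'."""
    return models_root / repo_id.split("/", 1)[1]


def download_model(repo_id: str, models_root: Path) -> Path:
    """Download a full model repo into its own folder under models_root."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, models_root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Example usage (assumed folder layout, adjust to your install):
# download_model(QWEN_VL_REPOS[0], Path("ComfyUI/models/LLM/Qwen-VL"))
```

The 4B models are the smaller downloads; the FP8 variant trades some precision for lower memory use, so it may be the easier starting point on limited VRAM.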
Tutorial Example Workflows: