hey, guys and gals, I’ve got some kick-ass news:
I’m gearing up to integrate local models for image generation. with a few caveats, it’ll finally deliver what many of you have been wanting and what your “humble” servant meant to roll out from the start (just getting to it now): automatic visuals. in other words, the near-full combo - text, images, and speech - will be stitched together.
for immersion, those three pieces already work, in my view, brilliantly; the only thing missing is video. :) now, straight to the caveats:
on a laptop RTX 4070, a 1152 × 640 image with 27 steps takes ~27–35 seconds to render - probably the main downside. on a desktop GPU it should be ~14–16 seconds.
on pure CPU it’s grim - think “a couple of minutes or more”.
of course you can drop the resolution and tweak the step count to squeeze out more speed without a big quality hit.
Claude has censorship issues when it comes to automatically generating image prompts, but I’m hopeful that I can eventually work around them.
the model weighs 6.7 GB, so the win-local build will swell to about 17.5 GB (not a huge deal: you download it once, and later updates can be installed without the bundled local models).
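to get a feel for the resolution/steps trade-off above, here’s a rough back-of-the-envelope sketch. it takes the laptop RTX 4070 numbers from this post as the reference point and ASSUMES render time scales linearly with pixel count and step count. that’s a simplification, not something I’ve measured, and `estimate_seconds` is just a made-up helper name:

```python
# Reference measurement from the post: laptop RTX 4070, 1152 x 640, 27 steps.
REF_WIDTH, REF_HEIGHT, REF_STEPS = 1152, 640, 27
REF_SECONDS = (27.0, 35.0)  # observed low/high render time in seconds


def estimate_seconds(width, height, steps):
    """Return a rough (low, high) render-time estimate in seconds.

    ASSUMPTION: time grows linearly with pixel count and with step count.
    Real timings also depend on the sampler, VRAM pressure, and drivers.
    """
    pixel_ratio = (width * height) / (REF_WIDTH * REF_HEIGHT)
    step_ratio = steps / REF_STEPS
    scale = pixel_ratio * step_ratio
    return tuple(round(seconds * scale, 1) for seconds in REF_SECONDS)


if __name__ == "__main__":
    print("reference settings:", estimate_seconds(1152, 640, 27))
    print("smaller and fewer steps:", estimate_seconds(896, 512, 20))
```

under that assumption, dropping to 896 × 512 at 20 steps would cut the wait to a bit under half of the reference time. treat it as a ballpark only.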
I chose the iLustMix model - an SDXL flavor of Stable Diffusion. I could have wired it up through the familiar KoboldCpp, but I ditched that path because text-model requests would compete with the image model, and going with Stable Diffusion’s own tooling just feels cleaner.
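for anyone curious what driving an SDXL checkpoint directly can look like, here’s a minimal sketch. it ASSUMES the Hugging Face `diffusers` library, a CUDA GPU, and a local `.safetensors` checkpoint. that may well not match the tooling that ends up in the build, and `build_generation_kwargs`, `render`, and the checkpoint path are placeholder names of mine:

```python
def build_generation_kwargs(prompt, width=1152, height=640, steps=27):
    """Validate settings and return keyword arguments for the pipeline call.

    Defaults mirror the settings quoted in the post (1152 x 640, 27 steps).
    """
    if width % 8 or height % 8:
        raise ValueError("width and height must be multiples of 8 for SDXL")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": steps,
    }


def render(checkpoint_path, prompt, **overrides):
    """Load an SDXL checkpoint and render one image (returns a PIL image).

    ASSUMPTION: diffusers + torch are installed and a CUDA GPU is available.
    Imports are kept local so the helper above works without them.
    """
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_single_file(
        checkpoint_path, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")
    return pipe(**build_generation_kwargs(prompt, **overrides)).images[0]


if __name__ == "__main__":
    # Only prints the settings; call render() yourself with a real checkpoint.
    print(build_generation_kwargs("a castle on a cliff at sunset"))
```

a real call would look like `render("path/to/checkpoint.safetensors", "a castle on a cliff at sunset", steps=20).save("out.png")`, with the path swapped for wherever your checkpoint lives.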
most of the groundwork is done, but this time I want a rock-solid release with zero bugs, so everything works out of the box. expect it by next Friday.
if you’ve already cut your teeth on SDXL models, send me your recommendations in DMs or the comments.
oh, and here are some proof-of-concept samples of what it’s churning out so far - straight from the model, no post-processing whatsoever:







Youz
2025-06-20 21:38:38 +0000 UTC