Initial Local AI Tinkering

After setting up my new graphics card, I began to make use of the 32GB of VRAM to test out some larger models. I started with a large image model. I already had ComfyUI set up from my previous testing, so I just swapped in a bigger model. The results looked significantly better than what I was able to achieve before, at about the same processing speed. The detail of the prompt definitely still matters though, and I don’t quite have the knack for it myself. But I found that asking Grok to expand what I write into a prompt worked pretty well. Here’s the same prompt of wizard bartenders with v1-5-pruned-emaonly (4GB) and Jib Mix Realistic XL (13GB).

image alt text
You can kinda tell what's going on. It's only 500x500 pixels.

image alt text
Now we're talking

I spent a lot more time fiddling with voice models. The big models like VibeVoice produce good results, but aren’t quite real time. What I really want is to be able to seamlessly have a conversation with a model that has a cloned voice. If I clone my own voice and train an LLM on my way of speaking, I could theoretically make a somewhat convincing voice chatbot of myself. Basically the pipeline is Speech To Text (STT) ➝ LLM ➝ Text To Speech (TTS). I don’t think it would particularly fool anyone, but I do wonder how close I can get.
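To make the STT ➝ LLM ➝ TTS idea concrete, here is a minimal sketch of the pipeline as three swappable stages. The `VoicePipeline` class and the `fake_*` functions are hypothetical placeholders made up for illustration, not real model calls; in practice each one would be replaced by whatever STT, LLM, and TTS backends are in use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains three stages: audio in -> text -> reply text -> audio out."""
    stt: Callable[[bytes], str]   # speech to text
    llm: Callable[[str], str]     # text to reply text
    tts: Callable[[str], bytes]   # reply text to speech

    def respond(self, audio_in: bytes) -> bytes:
        text_in = self.stt(audio_in)
        reply = self.llm(text_in)
        return self.tts(reply)


# Hypothetical placeholder stages (assumptions, not real models).
def fake_stt(audio: bytes) -> str:
    # Pretend the "audio" is just UTF-8 text.
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    # Pretend the "speech" is just the encoded text.
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
print(pipeline.respond(b"hello"))  # b'You said: hello'
```

Keeping the stages as plain callables means any one of them can be swapped out without touching the other two.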

I ran all of this on a Windows machine running WSL with Ubuntu 20.04, and this setup was simply too old. I constantly ran into Python package compatibility problems, either with my graphics card or with other packages. I eventually gave up and ran a cobbled-together Python script on Windows that moved the data between the components. It used Whisper for STT, an 8B Qwen model for the LLM, and Kokoro with KokoClone for TTS. The voice quality was pretty mid, but more importantly the delay was way too long. Even with some optimization and tweaking, there was still a 3-5 second delay in the processing. I used Ollama for all this, but I think my next attempt at improving performance will be to create a fresh WSL image, compile whisper.cpp and llama.cpp directly, and maybe rewrite the script in C.
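Before rewriting anything, it may help to know which stage is actually eating those seconds. Below is a rough sketch of per-stage timing using only the Python standard library. The `timed` helper and the `stub_*` functions are hypothetical stand-ins invented for this example, and the `time.sleep` calls only simulate work; the real Whisper, LLM, and Kokoro calls would go in their place.

```python
import time
from typing import Callable, Dict


def timed(name: str, fn: Callable, timings: Dict[str, float]) -> Callable:
    """Wrap fn so the wall-clock time of each call is added to timings[name]."""
    def wrapper(arg):
        start = time.perf_counter()
        result = fn(arg)
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)
        return result
    return wrapper


# Hypothetical stand-ins for the real stages; the sleeps only simulate work.
def stub_stt(audio: bytes) -> str:
    time.sleep(0.01)
    return "hello"


def stub_llm(prompt: str) -> str:
    time.sleep(0.05)
    return "hi there"


def stub_tts(text: str) -> bytes:
    time.sleep(0.01)
    return text.encode("utf-8")


timings: Dict[str, float] = {}
stt = timed("stt", stub_stt, timings)
llm = timed("llm", stub_llm, timings)
tts = timed("tts", stub_tts, timings)

# One full pass through the pipeline.
audio_out = tts(llm(stt(b"fake audio")))

slowest = max(timings, key=timings.get)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
print("slowest stage:", slowest)
```

If one stage clearly dominates, that is probably the one worth optimizing first, whether that means a smaller model, streaming output, or a native build.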