Week 42 - Real Time Speech to Image
This week I created a real-time speech-to-image generator.
Note: This is my first prototype - there's a 10-20 second lag between voice and image, so be patient!
Tools I used:
TouchDesigner - to connect the input/output nodes
Whisper - speech to text
ChatGPT plugin for TouchDesigner
Stable Diffusion in TouchDesigner
The tutorials I followed for these components were created by Torin Blanksmith - thanks Torin!
computerender.com - image generation API
Context
Last Saturday I went to the 19th edition of Mexico City’s Mutek Festival in a huge film studio warehouse. It was great! Among many interesting and inspiring audiovisual productions was a projection project called ‘Dream Machine’ by Huemin:
Essentially, they had a microphone set up in front of a projection screen, open to anyone in the crowd. The images on the screen were constantly shifting, and if you spoke into the microphone, your words were translated into an image on the screen within a few seconds.
Process
There are a lot of moving parts to make this work.
After a few false starts with different approaches, I found some Touchdesigner components that gave me enough to sink my teeth into.
At a high level, here’s how it all connects:
Everything runs as nodes within TouchDesigner - a fun way to visually connect various inputs and manage the output. I can throw in Python code if there isn't a native operator for a particular action
Voice input is run through the Whisper v1 API for speech-to-text conversion
I then do a few things with that text; one of them is running it through an LLM (the GPT-3.5 Turbo API) with a prompt to generate descriptive keywords that later feed into the image model
Next I send a few different text inputs to several text-to-image nodes to generate a base image and 4 secondary image variants - here I'm using Stable Diffusion via the computerender.com API
Finally I have 4 output images that I cycle and blend between, displaying one at a time on the output (a rough sketch of the whole flow is below)
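Stripped of the node graph, one iteration of the whole flow looks roughly like the sketch below. Every function here is a trivial stub I made up to stand in for the real Whisper, GPT-3.5 Turbo, and computerender calls, so read it as a map of the data flow rather than working glue code.

```python
# One iteration of the pipeline, with trivial stubs in place of the real
# Whisper, GPT-3.5 Turbo and computerender calls described above.
def transcribe(audio: bytes) -> str:                 # stub: Whisper speech-to-text
    return "a blue frog twirling fire"

def keywords_from_llm(history: list[str]) -> str:    # stub: GPT-3.5 Turbo prompt step
    return ", ".join(history)

def text_to_image(prompt: str) -> str:               # stub: Stable Diffusion txt2img
    return f"<base image for: {prompt}>"

def image_to_image(base: str, prompt: str) -> str:   # stub: Stable Diffusion img2img
    return f"<variant of {base} prompted with: {prompt}>"

def run_iteration(audio: bytes, history: list[str]) -> list[str]:
    text = transcribe(audio)                  # speech -> text
    history.append(text)                      # rolling window of recent speech
    prompt = keywords_from_llm(history)       # text -> descriptive image keywords
    base = text_to_image(prompt)              # base image from the merged prompt
    return [image_to_image(base, text) for _ in range(4)]  # 4 output variants to blend
```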
It looks like this in TouchDesigner:
There were many tweaks and small decisions I had to make in the process:
Speech to text
Problem: Initially I had just one speech-to-text node, which exists in 3 states: recording, processing, and text output. Within a single node there had to be a delay between finishing recording the audio, processing the transcription, and outputting the text
Solution: Two parallel node paths running in opposite phases that alternate between recording and processing, so one path is always recording while the other is transcribing
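Outside of TouchDesigner, the same trick can be sketched as a producer/consumer pair, so the microphone is never idle while a chunk is off being transcribed. `record_chunk` and `transcribe` below are stand-ins I invented for this sketch, not the actual operators in my network.

```python
# Sketch: keep recording while the previous chunk is being transcribed.
# record_chunk and transcribe are fake stand-ins for audio capture and Whisper.
import queue
import threading
import time

CHUNK_SECONDS = 5  # assumed length of each recorded chunk

def record_chunk(seconds: float) -> bytes:
    time.sleep(seconds)              # stand-in for capturing microphone audio
    return b"raw audio chunk"

def transcribe(audio: bytes) -> str:
    time.sleep(2)                    # stand-in for the Whisper round trip
    return "transcribed text"

audio_queue: "queue.Queue[bytes]" = queue.Queue()

def recorder():
    # Records back-to-back chunks; while the transcriber is busy with one
    # chunk, the next is already being captured (the "opposite phase").
    while True:
        audio_queue.put(record_chunk(CHUNK_SECONDS))

def transcriber():
    while True:
        print(transcribe(audio_queue.get()))

threading.Thread(target=recorder, daemon=True).start()
threading.Thread(target=transcriber, daemon=True).start()
time.sleep(30)  # let the demo run for a while
```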
Generate rich text descriptions for image model
Problem: A lot of what gets spoken doesn't translate well into an image. For example, if someone says "Ohh umm I'm not sure what to say, haha this is hard. Hmm maybe a frog? Like, a blue frog twirling fire?" -> all we really care about is the 'blue frog' part
Solution: Feed the transcribed text into an LLM with a prompt to generate text descriptions designed specifically for an image model. This has the added benefit that I can include extra instructions, e.g. to output in a certain style, exclude negative keywords, etc.
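For reference, the LLM step is roughly the following call. The system prompt here is illustrative rather than my exact wording, and it assumes an OpenAI key in the OPENAI_API_KEY environment variable.

```python
# Turn rambling speech into compact visual keywords via GPT-3.5 Turbo.
# The system prompt is an illustrative example, not my exact wording.
import os
import requests

SYSTEM_PROMPT = (
    "You turn rambling spoken text into a short, comma-separated list of "
    "visual keywords for an image generation model. Ignore filler words and "
    "keep only concrete subjects, styles, and moods."
)

def speech_to_image_prompt(transcript: str) -> str:
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": transcript},
            ],
            "temperature": 0.7,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# "Ohh umm... maybe a frog? Like, a blue frog twirling fire?"
# -> something like "blue frog, twirling fire"
```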
Balance reactivity with continuity
Problem: I could just feed the transcribed words directly into the image model. However, this makes the image swing wildly between inputs, e.g. if I say "a blue frog twirling fire" and then in the next sentence "a UN conference hosted in space", the context shifts dramatically. I'd prefer a smoother, more gradual transition between the two
Solution: Merge the transcribed text every few seconds, then add it to a queue containing the last X lines of text. This holds the history of, say, the last 20 seconds of transcribed speech; each time a new line is appended at the bottom, the oldest line gets dropped. In the example above, we'd then have both "a blue frog twirling fire" and "a UN conference hosted in space" in the list, and I send the whole list to the image generation model
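A minimal sketch of that rolling window, assuming one merged line of transcription arrives every few seconds (the 4-line window is just an example size):

```python
# Rolling window of recent transcriptions; the oldest line drops off
# automatically as new lines are appended.
from collections import deque

WINDOW_LINES = 4  # e.g. ~20 seconds of history at ~5 s per merged line (assumed)

history = deque(maxlen=WINDOW_LINES)

def on_new_transcript(line: str) -> str:
    history.append(line)
    # The whole window becomes the prompt, so consecutive prompts overlap
    # and the imagery drifts between topics instead of jumping.
    return ", ".join(history)

on_new_transcript("a blue frog twirling fire")
prompt = on_new_transcript("a UN conference hosted in space")
# prompt now contains both phrases, smoothing the transition between them
```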
Image generation
Problem: I need to generate many images in parallel, and quickly enough that there isn't too much lag between what is spoken and what shows up on the screen. There is a trade-off between cost, latency, and image quality. I tried newer models (like Stable Diffusion XL via Replicate), but they were too expensive/slow for this process.
Solution: Using computerender.com is fast and cheap ($0.001-$0.0025 per image) - the image quality isn’t great though.
Solution: Two parallel image generation paths
Base image: generated from the LLM description of the last 4 iterations of transcribed audio
Secondary layers: img2img based on the base image, using only the most recent audio transcription
This results in 4 images at a time, and I cycle and blend through these to make it feel like things are constantly generating
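The cycling/blending itself boils down to a small crossfade schedule over the 4 images. This is a sketch with assumed hold and fade times, not the actual TouchDesigner blend setup.

```python
# Crossfade schedule: hold each image for a while, then blend into the next.
HOLD_SECONDS = 3.0   # assumed time each image is shown on its own
FADE_SECONDS = 1.0   # assumed duration of the cross-blend

def blend_state(t: float, num_images: int = 4):
    """Return (index_a, index_b, mix) for the two images to blend at time t."""
    cycle = HOLD_SECONDS + FADE_SECONDS
    slot = int(t // cycle) % num_images
    phase = t % cycle
    if phase < HOLD_SECONDS:
        return slot, slot, 0.0                       # fully on the current image
    mix = (phase - HOLD_SECONDS) / FADE_SECONDS      # 0 -> 1 over the fade
    return slot, (slot + 1) % num_images, mix

# output_pixels = (1 - mix) * images[index_a] + mix * images[index_b]
```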
Learnings
I saw this installation Saturday night, and (despite little sleep after the festival) had my first working prototype by Monday afternoon. I was in flow state for those two days figuring out how to hack it together. What a ride! That feeling is one of the reasons I love 52weeks
Seeing an end goal and recreating it is way easier than finding a new path in the dark. I can draw an interesting parallel to the way I learn best: if I approach a problem with a 'blank canvas' and try to learn principles without a goal, I quickly lose motivation and drop it. However, if I have a concrete, well-scoped goal for what I want to achieve, I can learn exactly what I need and stay highly motivated
There are many ways to make something work. Building this inside TouchDesigner probably isn't the best approach, but it was the first tool I was (vaguely) familiar with and could make progress in, so I went with it
Using a hybrid visual/coding environment like TouchDesigner is super fun
Next steps
Reduce the latency between speaking and images appearing on the screen (there's definitely some low-hanging fruit in my current implementation)
Tweak the image model settings to increase quality and consistency
See if there’s a bigger image model API that’s still affordable
Explore running local versions of all the models - I already got a local Whisper working in Week 41, next would be a local LLM and a local Stable Diffusion
Ditch TouchDesigner and rewrite this with pure code