Building an AI-Powered Audio Sample Generator: From Weekend Experiment to Full-Fledged VST/AU Plugin
Few moments in music production are more painful than scrolling through endless samples, searching for that one perfect kick drum or atmospheric pad—only to give up and settle for something that's "close enough." As a musician and developer, I always wondered: Isn't there a more direct way to go from idea to sound? Over the past year, I explored exactly that question by building a prototype of an AI-driven audio sample generator. What started as a weekend hack quickly snowballed into a cross-platform VST/AU plugin and accompanying cloud service. In this post, I'll share my journey, highlight the weird challenges of shipping AI inside a DAW environment, and give a snapshot of how I monetized it without imposing heavy subscription fees.
Backstory: Random Hacks and Frequent Frustration
Like many side projects, this started with an annoyance: I'm a hobbyist producer in my off-hours, and I'm prone to the dreaded "sample hunt." I'd have a groove going, then realize I needed a snare with just the right snap—or a weird ambi-texture that felt both "noir" and "metallic." That meant rummaging through external libraries and marketplace sites. By the time I found something that worked, my creative spark was flattened.
I'd been tinkering with generative models for images, messing with Stable Diffusion, DALL·E, and others. So I asked myself: "Why not apply the same concept to audio?" That was enough to keep me busy one weekend. I forked a stable audio diffusion repo, wrote some hacky Python scripts, and had it spit out short WAV files based on text prompts. Results were unpolished, often glitchy, and the latency was awful. But I felt a certain magic in typing "8-bit glitchy loop with a console-game vibe" and hearing that materialize.
The Big Leap: Embedding AI Into a DAW
Phase One was a local proof-of-concept with raw Python scripts. While novel, that approach was useless for practical music production. An "official" plugin is the gold standard: it integrates cleanly with your existing DAW (Ableton, Logic, FL Studio, etc.) and appears as an instrument or effect unit. Enter JUCE, a C++ framework for audio applications. JUCE abstracts away a lot of the platform-specific headaches, supporting VST3, AU, and even AAX under one codebase.
Why JUCE?
- Cross-Platform: By handling Mac, Windows, and Linux (to an extent), JUCE ensures that you don't have to rewrite huge chunks of code for each OS.
- Audio Routing and Buffers: It gives you consistent ways to deal with incoming/outgoing audio buffers, letting me focus on the AI logic instead of plumbing details.
- Community: There's a massive JUCE community well-versed in the numerous quirks of each DAW. That's crucial when you inevitably stumble into corner cases.
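To make the "one codebase, many formats" point concrete, here's roughly the shape of the processor class the plugin is built around. It's a stripped-down sketch with illustrative names, not the actual product code, but it shows how a single JUCE AudioProcessor subclass gets wrapped as VST3, AU, or AAX depending on the build target:

```cpp
// One AudioProcessor subclass; JUCE's wrappers turn it into VST3, AU, or AAX
// depending on the build configuration. Names here are illustrative.
#include <juce_audio_processors/juce_audio_processors.h>

class SampleGenProcessor : public juce::AudioProcessor
{
public:
    const juce::String getName() const override { return "TextToSample"; }

    void prepareToPlay (double, int) override {}
    void releaseResources() override {}

    // Almost no DSP lives here: the plugin just plays back whatever clip
    // came back from the cloud, so the audio callback stays cheap.
    void processBlock (juce::AudioBuffer<float>& buffer, juce::MidiBuffer&) override
    {
        buffer.clear(); // replaced by playback of the generated sample
    }

    // Remaining AudioProcessor boilerplate.
    bool acceptsMidi() const override                    { return true; }
    bool producesMidi() const override                   { return false; }
    double getTailLengthSeconds() const override         { return 0.0; }
    juce::AudioProcessorEditor* createEditor() override  { return nullptr; }
    bool hasEditor() const override                      { return false; }
    int getNumPrograms() override                        { return 1; }
    int getCurrentProgram() override                     { return 0; }
    void setCurrentProgram (int) override                {}
    const juce::String getProgramName (int) override     { return {}; }
    void changeProgramName (int, const juce::String&) override {}
    void getStateInformation (juce::MemoryBlock&) override {}
    void setStateInformation (const void*, int) override {}
};

// JUCE's format wrappers call this factory for every target format.
juce::AudioProcessor* JUCE_CALLTYPE createPluginFilter()
{
    return new SampleGenProcessor();
}
```

Almost all of the interesting work lives outside processBlock, which is exactly what you want when the heavy lifting happens in the cloud.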
Real-Time-ish Generation
AI-driven audio generation—especially with diffusion models—takes time, even on a beefy GPU. The nuance is that in a DAW, you generally expect near-zero latency. If you hit a note on your MIDI keyboard, you want instant sound. That's not how diffusion models work right now. On my first attempt, if the user typed in "dark ambient pad, 10-second loop," the plugin basically froze for up to 5 minutes while generating. That user experience was a nonstarter.
I solved this by splitting the workflow:
- DAW side: The plugin is basically a front-end UI with minimal DSP code. It collects the text prompt, triggers background generation, and waits. Meanwhile, the DAW remains responsive.
- Cloud side: The actual AI generation pipeline runs on GPU servers (think AWS, Azure, or a specialized GPU host). Once the audio is generated, it streams back to the plugin automatically.
A progress bar in the plugin keeps the producer in the loop, showing how close the sample is to completion. When it's done, there's a button to drop that generated clip directly onto the track.
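In code, the split looks something like the sketch below. The helper names (submitPrompt, pollStatus, downloadResult) are placeholders for the real client calls, but the pattern is the point: a juce::Thread owns the slow network round trip, and the UI just polls an atomic progress value from a timer.

```cpp
// Sketch of the "DAW stays responsive" split: generation runs on a
// background thread while the editor polls progress from a juce::Timer.
#include <juce_core/juce_core.h>
#include <atomic>

class GenerationJob : public juce::Thread
{
public:
    explicit GenerationJob (juce::String promptText)
        : juce::Thread ("sample-generation"), prompt (std::move (promptText)) {}

    void run() override
    {
        // submitPrompt / pollStatus / downloadResult stand in for whatever
        // HTTP client code actually talks to the cloud service.
        auto jobId = submitPrompt (prompt);

        while (! threadShouldExit())
        {
            auto status = pollStatus (jobId);
            progress.store (status.progress);

            if (status.done)
            {
                resultFile = downloadResult (jobId);   // temp WAV on disk
                finished.store (true);
                return;
            }
            wait (500);                                 // don't hammer the API
        }
    }

    std::atomic<float> progress { 0.0f };   // 0..1, read by the UI timer
    std::atomic<bool>  finished { false };
    juce::File resultFile;

private:
    struct Status { float progress = 0.0f; bool done = false; };

    juce::String prompt;

    // Hypothetical network helpers -- stubs here so the sketch stands alone.
    juce::String submitPrompt (const juce::String&)   { return "job-id"; }
    Status       pollStatus   (const juce::String&)   { return { 1.0f, true }; }
    juce::File   downloadResult (const juce::String&) { return {}; }
};
```

The editor spins one of these up when you hit Generate, feeds `progress` into the progress bar, and enables the "drop onto track" button once `finished` flips.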
Architectural Nuts & Bolts
- Next.js Web Application: Provides a user dashboard for signing up, buying credits, and browsing generated samples.
- Serverless (or Server-Full) Backend: Houses user data, handles auth tokens, routes AI generation requests to GPU containers, and streams the results back.
- GPU Containers: Because audio diffusion is compute-hungry, I rely on GPU instances to do the heavy lifting.
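The contract between plugin and backend is intentionally small. From the plugin's point of view it boils down to a couple of structs like these; the endpoints and field names are illustrative, not the documented API:

```cpp
// Plugin-side view of the backend contract (illustrative, not the real API).
//
//   POST /generate   -> { "jobId": "...", "estimatedSeconds": 12 }
//   GET  /jobs/{id}  -> { "status": "queued|running|done|failed",
//                         "progress": 0.37, "resultUrl": "..." }
//   GET  resultUrl   -> the finished WAV, streamed back to the plugin
#include <juce_core/juce_core.h>

struct GenerationRequest
{
    juce::String prompt;          // e.g. "dark ambient pad, 10-second loop"
    int durationSeconds = 10;     // also drives the credit charge
    juce::String authToken;       // issued when you sign in via the dashboard
};

enum class JobStatus { queued, running, done, failed };

struct JobState
{
    juce::String jobId;
    JobStatus status = JobStatus::queued;
    float progress   = 0.0f;      // 0..1, feeds the plugin's progress bar
    juce::String resultUrl;       // populated once status == done
};
```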
Handling Authentication and Credits
Hacker News folks know that monetizing a side project can be trickier than the original coding. I didn't want to commit to a monthly subscription, because a lot of producers might only generate a few samples here and there. So I chose a pay-per-second model:
- 1 credit = 1 second of generated audio, which costs $0.01.
- If you only need a 10-second loop, that's 10 credits (10 cents).
- The plugin is free to download and tries to be transparent with credit usage.
If your account runs out of credits, the plugin politely tells you to purchase more or bails out gracefully.
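The cost math the plugin runs before submitting a job is about as simple as it sounds. For the sake of the example, assume partial seconds round up; the exact rounding rule is my simplification here, not a statement of the billing fine print:

```cpp
// Pay-per-second credit maths: 1 credit = 1 second of audio = $0.01.
#include <cmath>

constexpr double kDollarsPerCredit = 0.01;

// Example assumption: partial seconds round up, so 9.2 s is billed as 10 credits.
inline int creditsForDuration (double seconds)
{
    return static_cast<int> (std::ceil (seconds));
}

inline double dollarsForDuration (double seconds)
{
    return creditsForDuration (seconds) * kDollarsPerCredit;
}

inline bool canAfford (int accountCredits, double requestedSeconds)
{
    return accountCredits >= creditsForDuration (requestedSeconds);
}

// A 10-second loop -> 10 credits -> 10 * $0.01 = $0.10.
```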
Roadblocks, Hacks, and Workarounds
1. DAW Quirks
Different DAWs have their own "interpretations" of plugin specs. Some treat my plugin like an instrument, others like an effect. I had to tweak descriptors in the JUCE project to ensure it was recognized as an audio generator. Also, offline rendering in some DAWs can call the plugin in a peculiar non-real-time way. The solution? Extra conditional logic in the plugin to handle requests from the DAW that might happen "faster than real time."
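Concretely, JUCE exposes isNonRealtime() on the processor, so the offline branch ends up looking something like the sketch below. clipIsReady(), renderCachedClip(), and stillWaitingFlag are stand-ins for the real playback code, and this is a fuller take on processBlock from the earlier skeleton:

```cpp
// The guard that keeps offline ("faster than real time") rendering from
// ever touching the network.
void SampleGenProcessor::processBlock (juce::AudioBuffer<float>& buffer,
                                       juce::MidiBuffer& /*midi*/)
{
    buffer.clear();

    // Never block the audio callback on the cloud. That rule matters twice
    // over during an offline bounce, where the host calls this method as
    // fast as it can and any stall multiplies across the whole export.
    const bool offlineRender = isNonRealtime();

    if (clipIsReady())
        renderCachedClip (buffer);          // play back the generated sample
    else if (! offlineRender)
        stillWaitingFlag.store (true);      // real-time only: let the UI show a hint

    // During an offline render, an unfinished clip simply bounces as silence.
}
```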
2. Model Consistency vs. Creativity
Audio diffusion models can be unpredictable. Sometimes you want that unpredictability for creative spark, but it can also yield results that vary drastically—especially if you typed something ambiguous like "spacey pads."
3. Infrastructure Cost
Running GPUs in the cloud gets expensive. I rely on a usage-based credit system to keep servers chugging. There's also an auto-scaling component: if requests spike, more GPU containers spin up (and you pay some serious bills). If traffic is idle, containers spin down. That elasticity is essential for controlling overhead while providing a decent user experience.
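Stripped of the cloud-provider specifics, the scaling rule is conceptually just a clamp on "how many containers does the current queue need". The thresholds below are made-up numbers for illustration, not the production config:

```cpp
// Toy version of the scale-up / scale-down decision, driven by queue depth.
#include <algorithm>

struct ScalingPolicy
{
    int minContainers    = 0;   // scale to zero when idle to stop the bill
    int maxContainers    = 8;   // hard cap so a traffic spike can't bankrupt you
    int jobsPerContainer = 2;   // rough throughput of one GPU container
};

inline int desiredContainerCount (int queuedJobs, const ScalingPolicy& p)
{
    // Ceiling division: 5 queued jobs at 2 per container -> 3 containers.
    int needed = (queuedJobs + p.jobsPerContainer - 1) / p.jobsPerContainer;
    return std::clamp (needed, p.minContainers, p.maxContainers);
}
```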
The Future of AI Audio
Generative AI for audio is still in its infancy. Compared to text or image generation, audio has tougher constraints: we're dealing with waveforms that need consistent timing, pitch, and clarity. But new techniques—like neural vocoders, improved diffusion approaches, or specialized real-time systems—are emerging at breakneck speed. We're also on the cusp of integrating user-provided melodic or harmonic data, letting you feed a rough vocal hum to the plugin and have it generate a refined instrument track.
Interestingly, some big tech players have teased even more advanced generative music models. It's possible we'll see entire song structures with chord progressions and orchestration appear from a single prompt. At that point, the conversation about AI's role in creativity heats up: Are we augmenting the creative process or displacing it? I like to think we're lowering barriers for artists to iterate and experiment freely.
Try It Out and Get Involved
If any of this sounds intriguing, the beta is open to the public:
- Visit text-to-sample.com
- Download and install the plugin for your DAW
- Start generating with the free test credits that are added to new accounts automatically
I'm especially curious about how people use it for:
- Sound design
- Game audio
- Experimental live coding setups
As an open-minded dev, I love hearing new angles I never would've considered.
Your feedback also helps shape the roadmap. I'm already working on the ability to lock a clip to a key signature, better handling of multilingual prompts, and shorter waiting times via more streamlined diffusion techniques. If you have suggestions—like a feature to sing into your microphone and have it morph your voice into a string ensemble—don't be shy. I can't promise everything, but I'll certainly prototype interesting ideas.
Overall, this project proved to me that even though generative AI for audio is complicated, bridging it into standard workflow tools opens a world of creative possibilities. For those who have half a mind to build a custom plugin or web service around AI, I hope this story shows that a scrappy weekend experiment can indeed evolve into a real product—complete with a monetization model, a hosting infrastructure, and a small but enthusiastic user base.
Happy generating, and I look forward to hearing what you create!