Why FLAC is the ideal audio format for AI speech applications
I recently built Prosody Coach, a Python CLI tool that helps non-native English speakers improve their pronunciation. The app records your speech, analyzes it with acoustic analysis algorithms from Praat, and provides AI-powered coaching via Google's Gemini API.
One decision I had to make early on was which audio format to use. I'd worked with MP3 and WAV before, but neither felt right for this use case. After some research, I landed on FLAC, and it turned out to be the perfect fit.
The problem with common audio formats
Let me walk you through why the usual suspects didn't work.
MP3 is everywhere. It's small, compatible with everything, and fine for listening to music in your car. But MP3 is a lossy format: it throws away audio data to achieve compression. For speech analysis, this is a problem. When you're measuring fine pitch variations in Hz or analyzing syllable timing patterns, you want the signal intact, not altered by compression artifacts.
WAV is the opposite extreme. It's completely uncompressed, which means perfect quality. But the files are large. A 30-second recording at 16 kHz, 16-bit mono is around 1 MB in WAV format. That adds up quickly when you're storing multiple recordings for progress tracking, and it's inefficient for sending to cloud APIs.
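The 1 MB figure follows from simple arithmetic. Here is a quick sketch, assuming 16-bit samples and the canonical 44-byte WAV header; the function name is only for illustration:

```python
def wav_size_bytes(duration_s, sample_rate=16_000, channels=1, bits_per_sample=16):
    """Approximate size of an uncompressed PCM WAV file in bytes."""
    header_bytes = 44  # canonical RIFF/WAVE header for plain PCM (assumed)
    bytes_per_sample = bits_per_sample // 8
    data_bytes = int(duration_s * sample_rate) * channels * bytes_per_sample
    return header_bytes + data_bytes


size = wav_size_bytes(30)
print(f"{size} bytes (~{size / 1_000_000:.2f} MB)")
# prints: 960044 bytes (~0.96 MB)
```

Doubling the sample rate or switching to stereo doubles the size, which is why uncompressed storage gets expensive quickly.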
Why FLAC works for speech applications
FLAC (Free Lossless Audio Codec) hits the sweet spot. It's lossless like WAV, meaning you get perfect audio fidelity. But it uses compression algorithms that typically reduce file sizes by 50-60% without throwing away any data.
Here's how it works: FLAC uses predictive coding. It analyzes patterns in the audio waveform and stores the differences from predicted values rather than the raw samples. When you decompress a FLAC file, you get back exactly the original audio. Bit for bit.
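To make the idea concrete, here is a toy sketch of predictive coding in plain Python. It is not FLAC's actual codec (real FLAC chooses among several predictors per block and entropy-codes the residuals); it uses a single fixed predictor, and the function names are made up for illustration.

```python
def _predict(samples, i):
    """Linear extrapolation from the previous two samples."""
    if i == 0:
        return 0
    if i == 1:
        return samples[0]
    return 2 * samples[i - 1] - samples[i - 2]


def encode_residuals(samples):
    """Keep only the prediction error (residual) for each sample."""
    return [s - _predict(samples, i) for i, s in enumerate(samples)]


def decode_residuals(residuals):
    """Rebuild the original samples by running the same predictor."""
    samples = []
    for i, r in enumerate(residuals):
        samples.append(_predict(samples, i) + r)
    return samples


# A smooth, made-up integer waveform
original = [100, 104, 109, 115, 120, 124, 127, 129]
residuals = encode_residuals(original)
assert decode_residuals(residuals) == original  # exact round trip
print(residuals)
# prints: [100, 4, 1, 1, -1, -1, -1, -1]
```

After the first couple of samples the residuals are tiny, so they can be stored in far fewer bits than the raw values, and nothing is lost because the decoder repeats the same predictions.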
For Prosody Coach, this means:
- Accurate pitch analysis: The Praat algorithms measuring pitch (75-500 Hz range) get clean data with no lossy compression artifacts
- Precise timing measurements: Syllable detection and rhythm analysis (nPVI calculations) depend on exact waveform data
- Smaller storage: User recordings take up roughly half the space compared to WAV
- API compatibility: Google's Gemini API accepts FLAC natively with the audio/flac MIME type
Implementation in Prosody Coach
The audio pipeline in Prosody Coach works like this:
- Record from the microphone at a 16 kHz sample rate (a standard rate for speech, capturing frequencies up to 8 kHz)
- Capture in float32 format for processing flexibility
- Trim silence from the beginning and end
- Save as FLAC for storage
- For AI analysis, convert to base64-encoded FLAC and send to Gemini
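The post doesn't show how the silence-trimming step works, so the following is only one possible approach: a plain amplitude threshold. The function name and the 0.01 threshold are assumptions, not Prosody Coach's actual code.

```python
import numpy as np


def trim_silence(audio, threshold=0.01):
    """Drop leading and trailing samples whose absolute amplitude is below threshold.

    Assumes mono float32 audio in [-1.0, 1.0]; the threshold is a hypothetical default.
    """
    loud = np.flatnonzero(np.abs(audio) >= threshold)
    if loud.size == 0:
        return audio[:0]  # nothing above the threshold: return an empty array
    return audio[loud[0]:loud[-1] + 1]


# Half a second of silence on each side of a short made-up tone at 16 kHz
sr = 16_000
t = np.arange(sr // 4) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)
silence = np.zeros(sr // 2, dtype=np.float32)
padded = np.concatenate([silence, tone, silence])
trimmed = trim_silence(padded)
print(len(padded), "->", len(trimmed))
```

A per-sample threshold is sensitive to clicks and background noise; thresholding on short-frame energy is a common refinement.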
The code for saving a recording looks something like this:
    import soundfile as sf

    def save_recording(audio_data, sample_rate, filepath):
        # soundfile encodes the samples as FLAC when writing the file
        sf.write(filepath, audio_data, sample_rate, format='FLAC')
And for sending to the Gemini API:
    import base64
    import io

    import soundfile as sf

    def prepare_audio_for_api(audio_data, sample_rate):
        # Encode to FLAC in memory, then base64-encode the bytes for the request
        buffer = io.BytesIO()
        sf.write(buffer, audio_data, sample_rate, format='FLAC')
        buffer.seek(0)
        return base64.b64encode(buffer.read()).decode('utf-8')
The soundfile library handles the FLAC encoding, and the resulting base64 string goes directly to Gemini with the audio/flac MIME type.
Real-time streaming: a different story
One interesting exception: for real-time feedback sessions using Gemini's Live API, I use raw PCM instead of FLAC. Real-time streaming needs minimal latency, and encoding and decoding FLAC for every 100 ms audio chunk would add unnecessary overhead.
For streaming, the audio gets converted to 16-bit PCM:
    import numpy as np

    def convert_to_pcm(audio_float32):
        # Clip to [-1.0, 1.0] so out-of-range samples don't wrap around,
        # then scale to int16 [-32767, 32767]
        clipped = np.clip(audio_float32, -1.0, 1.0)
        return (clipped * 32767).astype(np.int16).tobytes()
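For a sense of scale, a 100 ms chunk at 16 kHz, 16-bit mono is 1,600 samples, or 3,200 bytes. Below is a minimal sketch of slicing a PCM byte string into such chunks. The helper name is hypothetical, and how the chunks are actually sent to the Live API is not shown here.

```python
def iter_pcm_chunks(pcm_bytes, sample_rate=16_000, chunk_ms=100, bytes_per_sample=2):
    """Yield consecutive chunks of mono PCM; the last chunk may be shorter."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm_bytes), chunk_bytes):
        yield pcm_bytes[start:start + chunk_bytes]


# One second of silent 16-bit audio splits into ten 3,200-byte chunks
one_second = bytes(16_000 * 2)
chunks = list(iter_pcm_chunks(one_second))
print(len(chunks), len(chunks[0]))
# prints: 10 3200
```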
So the format choice depends on context: FLAC for storage and batch analysis, PCM for real-time streaming.
When to use FLAC
Based on my experience building Prosody Coach, FLAC makes sense when:
- You need to preserve audio quality for analysis or processing
- Storage space matters but you can't sacrifice fidelity
- You're sending audio to APIs that support FLAC (Google's AI services do)
- You want an open, patent-free format
FLAC might not be the best choice if file size is the primary concern (use a lossy format) or if you need real-time encoding with minimal latency (use PCM).