About This Architecture
This is a speech-to-speech LLM pipeline that uses EnCodec audio tokenization and a 300M-parameter GPT-style transformer decoder to generate natural voice responses directly from audio input, with no intermediate text. User speech is encoded into 4-codebook audio tokens (a 1024-entry vocabulary per codebook), processed through a 12-layer transformer with causal self-attention and cross-attention, then decoded back to a raw waveform via EnCodec running on CPU while the transformer runs on GPU. This pure audio-to-audio design eliminates the speech-recognition and text-generation bottlenecks of cascaded systems, enabling low-latency conversational AI.

Fork and customize this diagram on Diagrams.so to adapt the tokenization strategy, adjust transformer depth, or integrate your own audio codec. The roadmap shows the progression from 10M-parameter prototypes through multi-codebook training on AMD MI300X toward real-time duplex streaming and accent-specific fine-tuning.
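One common way to feed 4-codebook EnCodec tokens to a single decoder-only transformer is to flatten each frame's codebook ids into one sequence, giving each codebook its own id range via an offset. The sketch below illustrates that scheme under stated assumptions (the interleaving order, function names, and offset trick are illustrative, not necessarily what this diagram's implementation uses):

```python
# Hedged sketch: flattening multi-codebook audio tokens into one stream
# for a decoder-only transformer. The interleaving-with-offset scheme is
# an assumption for illustration, not the diagram's confirmed design.

NUM_CODEBOOKS = 4      # 4 residual codebooks, per the architecture
VOCAB_PER_BOOK = 1024  # 1024-entry vocabulary per codebook

def flatten_tokens(frames):
    """Interleave per-frame codebook ids into a single token sequence.

    frames: list of [c0, c1, c2, c3] ids, one entry per audio frame.
    Each codebook gets its own id range via an offset, so the combined
    vocabulary the transformer sees is NUM_CODEBOOKS * VOCAB_PER_BOOK
    = 4096 ids.
    """
    flat = []
    for frame in frames:
        assert len(frame) == NUM_CODEBOOKS
        for book, token in enumerate(frame):
            assert 0 <= token < VOCAB_PER_BOOK
            flat.append(book * VOCAB_PER_BOOK + token)
    return flat

def unflatten_tokens(flat):
    """Invert flatten_tokens back to per-frame codebook ids."""
    frames = []
    for i in range(0, len(flat), NUM_CODEBOOKS):
        frames.append([t % VOCAB_PER_BOOK for t in flat[i:i + NUM_CODEBOOKS]])
    return frames
```

Because the transform is invertible, generated tokens can be unflattened back into per-codebook streams before EnCodec decodes them to a waveform; the 4x longer sequence is the usual trade-off of this flattening approach.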