
Google Unveils Advanced Audio Features for Gemini 2.5 AI Model

4 June 2025

Google announced Wednesday that its Gemini 2.5 artificial intelligence model now includes native audio capabilities that enable real-time conversations and sophisticated speech generation, marking a significant advancement in AI-powered audio technology.

The new features allow users to engage in natural conversations with the AI system, which can understand tone, accent and non-verbal sounds like laughter while responding with appropriate emotional expression and timing, according to a company blog post.

“Human conversation is rich and nuanced, with meaning conveyed not just by what is said, but how it’s spoken,” Google said in announcing the capabilities.

The company believes conversation will become a primary method for interacting with AI systems.

Real-Time Dialog Capabilities

The Gemini 2.5 Flash preview model processes and generates speech directly in audio format rather than converting between text and speech, enabling more fluid interactions with very low latency.

Key features include the ability to control speaking style through natural language prompts, allowing users to request specific accents, tones or even whispering. The system can also integrate tools and function calling during conversations, incorporating real-time information from sources like Google Search.

The AI demonstrates contextual awareness by distinguishing relevant speech from background noise and judging when it is appropriate to respond. It supports more than 24 languages and can mix languages within the same conversation.

The system also includes audio-video understanding capabilities, allowing it to discuss visual content from video feeds or screen sharing while maintaining conversation flow.

Advanced Text-to-Speech Technology

Google’s enhanced text-to-speech capabilities offer unprecedented control over generated audio content. Users can create content ranging from short snippets to long-form narratives with precise control over style, tone and emotional expression.

The technology can generate expressive readings for various content types, from poetry to newscasts, and can produce multi-speaker dialogue similar to the conversational format used in NotebookLM’s Audio Overviews feature.

Enhanced pace and pronunciation controls allow for more accurate delivery, while multilingual support enables content creation across Google’s supported language portfolio.

Developer Access and Safety Measures

Developers can access these features through Google AI Studio and Vertex AI platforms. The company offers two model options: Gemini 2.5 Pro Preview for complex applications and Gemini 2.5 Flash Preview for cost-efficient everyday use.
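For developers experimenting with these previews, the prompt-based style control described above amounts to composing a request whose instructions describe the desired delivery in plain language. The sketch below builds such a request body as a plain dictionary; the model identifier mirrors the preview naming in the announcement, but the JSON field names here are illustrative assumptions, not Google's documented API schema.

```python
import json


def build_tts_request(text: str, style_instruction: str,
                      model: str = "gemini-2.5-flash-preview-tts") -> dict:
    """Compose an illustrative speech-generation request body.

    The style instruction is ordinary natural language (for example,
    "read this in a gentle whisper"), matching the prompt-based control
    the announcement describes. Field names are assumptions for
    illustration, not the documented schema.
    """
    return {
        "model": model,
        "contents": [
            {"parts": [{"text": f"{style_instruction}: {text}"}]}
        ],
        # Requesting audio output rather than text.
        "generationConfig": {"responseModalities": ["AUDIO"]},
    }


request = build_tts_request(
    "Good evening, and welcome to the nine o'clock news.",
    "Read this in a calm, measured newscast tone",
)
print(json.dumps(request, indent=2))
```

The same shape of request could, in principle, carry any of the style directives the article mentions, such as a specific accent or a whisper, simply by changing the instruction string.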

Google said it has implemented comprehensive safety measures throughout the development process, including internal and external evaluations and red team testing. All audio outputs include SynthID watermarking technology to identify AI-generated content.

The technology is already deployed in existing Google products, including NotebookLM’s Audio Overviews and Project Astra, demonstrating real-world applications of the native audio capabilities.

The announcement comes as tech companies increasingly focus on making AI interactions more natural and conversational, with audio representing a key frontier in human-computer interaction.
