Speech X: Combining audio and language the innovative way

NAVER’s flagship multimodal LLM
Conversation is more than the sum of its sounds: our words reveal emotions, intentions, and even a glimpse of our cultural background. Wouldn’t it be exciting if a computer could understand and reproduce these same complexities? Speech X, NAVER’s multimodal large language model (LLM), is our answer to this question. If Speech X and other AI models like it can understand human language at a deeper level and speak in a natural voice, they may help guide us into the future.

Multimodal LLMs build on recent language models and explore new possibilities by blurring the line between text and audio data. This May, OpenAI released its latest model, GPT-4o (“o” stands for “omni”), which can process text, image, and audio inputs together. We expect this combination of audio, vision, and text to make its way across diverse domains, allowing for a fuller interaction between AI and humans.

In this post, we’ll explore how Speech X, our flagship multimodal LLM, has the potential to transform the way we communicate.

Technology underpinning Speech X
Speech X is a state-of-the-art technology that combines text and audio data to generate natural-sounding voice outputs. You can converse with Speech X much as you would with a real person because it’s designed to recognize sophisticated speech patterns and understand human language in context. We also trained the model on wide-ranging audio data so it could learn language structure and improve its pronunciation. With a grasp of the subtle nuances and emotional undertones of human language, Speech X can generate audio outputs that resemble human speech.

Speech X is built on HyperCLOVA X, NAVER’s latest LLM, and leverages the universal speech dialog model (USDM) to add audio capabilities. It is a single model that handles tasks that previously required several separate models. It achieves this through a strong understanding of context and user instructions, accurate speech recognition, and natural-sounding audio generation.
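To make the "single model instead of several" idea more concrete, here is a minimal Python sketch that contrasts a conventional cascaded pipeline (speech recognition, then a text LLM, then speech synthesis) with a unified speech-in, speech-out model. It is an illustration only, not Speech X's actual interface: every name in it (SpokenReply, cascaded_reply, unified_reply, and the fake_* stand-ins) is hypothetical, and the stand-ins simply treat the audio bytes as UTF-8 text so the example can run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpokenReply:
    """A reply delivered as both text and audio (all names here are hypothetical)."""
    text: str
    audio: bytes  # placeholder for a synthesized waveform


def cascaded_reply(
    audio_in: bytes,
    recognize: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> SpokenReply:
    """Chain three separate models; only plain text is passed between stages."""
    transcript = recognize(audio_in)
    reply_text = respond(transcript)
    return SpokenReply(reply_text, synthesize(reply_text))


def unified_reply(
    audio_in: bytes,
    model: Callable[[bytes], SpokenReply],
) -> SpokenReply:
    """Let one model map input speech directly to a text-and-audio reply."""
    return model(audio_in)


# Stand-in components: assumptions for illustration, not real models.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_speech_model(audio: bytes) -> SpokenReply:
    text = f"You said: {audio.decode('utf-8')}"
    return SpokenReply(text, text.encode("utf-8"))


question = "Do you have anything special going on this weekend?".encode("utf-8")
cascaded = cascaded_reply(question, fake_asr, fake_llm, fake_tts)
unified = unified_reply(question, fake_speech_model)
print(cascaded.text)
print(unified.text)
```

With these toy stand-ins both paths return the same reply. The structural difference is what matters: the cascaded version needs three components and hands only a transcript from one stage to the next, while the unified version exposes a single call that receives the original audio.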

Processing audio requires complex algorithms and large-scale data training, so the range and quality of data and the model’s speed matter. Speech X is trained on massive amounts of high-quality audio data and applies cutting-edge deep learning technology to generate the most natural voices possible.

Speech X can perform a variety of tasks. See what the model can do in the following areas:

• Speech recognition: Speech X excels at grasping what speakers are saying, nuances included. Instead of simply taking in words, Speech X interprets them in their context and background to output relevant responses. It can recognize that the same sentence means different things in different situations.
• Conversation: You can converse with Speech X on just about anything and get natural-sounding answers to your questions and instructions. Whether you’re reaching out for customer support, asking for help from virtual assistants, or playing interactive games, Speech X makes you feel like you’re talking to a human.

※ Speech recognition and conversation sample

No.	Classification	Description	Sound
1	Q	[user’s voice] Can you help me prepare materials for the meeting?
1	A	[Speech X text and audio response] Oh, the meeting materials? Sorry, I can’t help you right now. I’m out for a meeting.
2	Q	[user’s voice] I’m really worried because a heavy rain warning has been issued.
2	A	[Speech X text and audio response] I hear it’s causing flood damage throughout Seoul, and a section of Subway Line 1 is currently suspended. Dangerous road conditions are also causing traffic jams, so you should drive carefully today.
3	Q	[user’s voice] Do you have anything special going on this weekend?
3	A	[Speech X text and audio response] Yeah, I’ve got plans with friends in the evening.

• Speech synthesis: Speech X converts text into natural, human-like speech. It can analyze the tone of a text and find the right emotion that goes with it when reading the text out loud. For example, a sad story would be read in a sad voice, while delightful news would be delivered in a bright and lively voice.

※ Speech synthesis sample

No.	Classification	Description	Sound
1	Q	[user’s instruction] A woman in her 40s, flustered and babbling. [user’s text] One day, I heard a sound from a forest near the village. The village people were afraid to go into the forest, but the girl gathered her courage and went into it. Deep in the forest, she found a small bird crying.	–
1	A	[Speech X audio response] One day, I heard a sound from a forest near the village. The village people were afraid to go into the forest, but the girl gathered her courage and went into it. Deep in the forest, she found a small bird crying.

What’s next for Speech X?
At the pace Speech X is advancing, we’re hopeful the technology will transform audio services across fields. Speech X will evolve into an ever more sophisticated platform that supports audio tasks ranging from personalization to real-time translation, with applications in education and professional sectors. Each feature is tailored to specific use cases to deliver the best user experience.

• Personalized audio services: Speech X delivers a customized experience according to your preferences. It can pick up your speaking habits, such as the way you stress or pronounce certain words, and notice which words or sentences you use most often, so that the audio it outputs sounds like you.
• Speech-to-speech translation in real time: Transcend language barriers to facilitate communication across borders. Speech X will translate what you say on the fly, effectively acting as the go-between. Whether you’re attending business meetings, traveling around, or simply talking to people from a different culture and background, don’t let language get in the way of your communication.
• Synthetic audio infused with emotion: Speech X can pick out the mood of a text, including emotion that is only implied, and express it in life-like audio. Emotional AI can transform industries like customer support, therapy, and entertainment.
• Conversational AI: Speech X responds to your questions and instructions just as a human would. It understands your intent precisely and responds fluently. This generative AI can be especially helpful for building smart-home devices, virtual assistants, and customer support systems.
• AI-enabled education: Speech X delivers content with natural-sounding articulation. You can generate educational materials in different languages, and when learning a new language, you can practice with accurately pronounced content to help you learn more quickly.
• Application in specialized areas: You can rely on our automated audio services when you need the help of specialists. Speech X will draw on knowledge of law, medicine, and other professional areas to give you the information you’re looking for with high accuracy. In the health sector, for example, Speech X could listen to a patient’s symptoms to help diagnose disease, and in the field of law, it could offer legal advice.

Social and ethical considerations
We recognize the need to address privacy and security concerns inherent to AI-enabled audio services and are working to overcome ethical challenges in a transparent manner. This will require establishing explicit guidelines and policies regarding how to deploy synthetic voices responsibly. NAVER puts safety above everything else and will continue to work toward creating safe AI services.

Looking ahead
At the core of NAVER’s AI audio capabilities is the ability to understand how people interact through language and to generate audio accordingly, opening new possibilities for different modes of communication. Speech X is a leap forward with the potential to transform our daily lives and a broad range of professional areas. We hope you will continue to follow AI research and development with keen interest and be part of the innovations happening around us.

*Sources
Universal Speech Dialog Model (USDM)
HyperCLOVA X Technical Report