Speech Integrations
Audio Format
Rasa uses a common intermediate audio format called RasaAudioBytes, which acts as the standard data format between channels, ASR engines, and TTS engines, so that each integration only has to convert to and from a single format. Currently, this corresponds to:
- Raw wave format
- 8kHz sample rate
- 8-bit depth
- Mono channel
- μ-law (mulaw) encoding
These parameters are not configurable. Rasa uses the library audioop-lts for
conversion between audio encodings (functions like ulaw2lin() or lin2ulaw()).
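To illustrate, a conversion between μ-law and linear PCM can be done with the audioop module (provided by the audioop-lts package on newer Python versions). The helper functions below are only a sketch and not part of Rasa's API:

import audioop  # provided by the audioop-lts package on recent Python versions

def ulaw_to_pcm16(ulaw_bytes: bytes) -> bytes:
    # Expand 8-bit mu-law samples (as in RasaAudioBytes) to 16-bit linear PCM.
    # The second argument is the sample width in bytes of the linear audio.
    return audioop.ulaw2lin(ulaw_bytes, 2)

def pcm16_to_ulaw(pcm_bytes: bytes) -> bytes:
    # Compress 16-bit linear PCM back down to 8-bit mu-law.
    return audioop.lin2ulaw(pcm_bytes, 2)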
Automatic Speech Recognition (ASR)
This section describes the supported integrations with Automatic Speech Recognition (ASR) or Speech To Text (STT) services.
Deepgram
Set the Deepgram API key with the environment variable DEEPGRAM_API_KEY. You can
request a key from Deepgram. Deepgram can be configured in a Voice Stream channel
as follows:
browser_audio:
  # ... other configuration
  asr:
    name: deepgram
Deepgram uses two mechanisms to detect when a speaker has finished talking:
- Endpointing: Uses Voice Activity Detection (VAD) to detect silence after speech
- UtteranceEnd: Looks at word timings to detect gaps between words
The configuration parameters endpointing and utterance_end_ms below control these features respectively. For noisy environments, utterance_end_ms may be more reliable because it ignores non-speech audio. Read more in the Deepgram documentation.
Configuration parameters
- endpoint: Optional, defaults to api.deepgram.com - The endpoint URL for the Deepgram API.
- endpointing: Optional, defaults to 400 - Number of milliseconds of silence used to determine the end of speech.
- language: Optional, defaults to en - The language code for the speech recognition.
- model: Optional, defaults to nova-2-general - The model to be used for speech recognition.
- smart_format: Optional, defaults to true - Boolean value to enable or disable Deepgram's smart formatting.
- utterance_end_ms: Optional, defaults to 1000 - Time in milliseconds to wait before considering an utterance complete.
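For reference, a browser_audio entry that sets these parameters explicitly could look like the snippet below; the values shown are simply the documented defaults, so adjust them to your needs:

browser_audio:
  # ... other configuration
  asr:
    name: deepgram
    language: en
    model: nova-2-general
    smart_format: true
    endpointing: 400
    utterance_end_ms: 1000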
Azure
Requires the Python library azure-cognitiveservices-speech. The API key can be set with the environment variable AZURE_SPEECH_API_KEY.
A sample configuration looks as follows:
browser_audio:
  # ... other configuration
  asr:
    name: azure
Configuration parameters
- language: Required. The language code for the speech recognition. (See the Azure documentation for a list of languages.)
- speech_region: Optional, defaults to None - The region identifier for the Azure Speech service, such as westus. Ensure that the region matches the region of your subscription.
- speech_endpoint: Optional, defaults to None - The service endpoint to connect to. You can use it when you have the Azure Speech service behind a reverse proxy.
- speech_host: Optional, defaults to None - The service host to connect to. The standard resource path will be assumed. The format is "protocol://host:port", where ":port" is optional.
While speech_region, speech_endpoint, and speech_host are optional parameters, they cannot all be empty at the same time; if they are, speech_region is set to eastus. When connecting to the Azure cloud, the speech_region parameter is enough. Here is an example config:
browser_audio:
  server_url: localhost
  asr:
    name: azure
    language: de-DE
    speech_region: germanywestcentral
  tts:
    name: azure
    language: de-DE
    voice: de-DE-KatjaNeural
    speech_region: germanywestcentral
Others
Looking for integration with a different ASR service? You can create your own custom ASR component.
Text To Speech (TTS)
This section describes the supported integrations with Text To Speech (TTS) services.
Azure TTS
The API key can be set with the environment variable AZURE_SPEECH_API_KEY. A sample configuration looks as follows:
browser_audio:
  # ... other configuration
  tts:
    name: azure
Configuration parameters
- language: Optional, defaults to en-US - The language code for the text-to-speech conversion. (See the Azure documentation for a list of languages and voices.)
- voice: Optional, defaults to en-US-JennyNeural - The voice to be used for the text-to-speech conversion. The voice defines the specific characteristics of the speech, such as the speaker's gender, age, and speaking style.
- timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
- speech_region: Optional, defaults to None - The region identifier for the Azure Speech service. Ensure that the region matches the region of your subscription.
- endpoint: Optional, defaults to None - The service endpoint for the Azure Speech service.
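For reference, a tts entry that sets these parameters explicitly might look like the snippet below; the values are the documented defaults, and the speech_region shown is only a placeholder for your own subscription's region:

browser_audio:
  # ... other configuration
  tts:
    name: azure
    language: en-US
    voice: en-US-JennyNeural
    timeout: 10
    speech_region: eastus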
Cartesia TTS
Set the Cartesia API key with the environment variable CARTESIA_API_KEY. The API key
requires a Cartesia account. Cartesia can be configured in a Voice Stream channel as follows:
browser_audio:
  # ... other configuration
  tts:
    name: cartesia
Configuration parameters
- language: Optional, defaults to en - The language code for the text-to-speech conversion.
- voice: Optional, defaults to 248be419-c632-4f23-adf1-5324ed7dbf1d - The id of the voice to use for text-to-speech conversion. The parameter will be passed to the Cartesia API as "voice": {"mode": "id", "id": "VALUE"}.
- timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
- model_id: Optional, defaults to sonic-english - The model ID to be used for the text-to-speech conversion.
- version: Optional, defaults to 2024-06-10 - The version of the model to be used for the text-to-speech conversion.
- endpoint: Optional, defaults to https://api.cartesia.ai/tts/sse - The endpoint URL for the Cartesia API.
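For reference, a fuller Cartesia configuration might look like the snippet below; the values shown are the documented defaults, and in practice you would replace the voice id with one from your Cartesia account:

browser_audio:
  # ... other configuration
  tts:
    name: cartesia
    language: en
    voice: 248be419-c632-4f23-adf1-5324ed7dbf1d
    model_id: sonic-english
    timeout: 10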
Deepgram TTS
Set the Deepgram API key with the environment variable DEEPGRAM_API_KEY. You can
request a key from Deepgram. Deepgram TTS can be configured in a Voice Stream channel
as follows:
browser_audio:
  # ... other configuration
  tts:
    name: deepgram
Configuration parameters
Deepgram does not use the parent class parameters language or voice, because each
model is uniquely identified using the format [modelname]-[voicename]-[language].
- model_id: Optional, defaults to aura-2-andromeda-en - The list of available options can be found in the Deepgram documentation.
- endpoint: Optional, defaults to wss://api.deepgram.com/v1/speak - The endpoint URL for the Deepgram API.
- timeout: Optional, defaults to 30 - The timeout duration in seconds for the text-to-speech request.
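For reference, selecting a specific Aura voice amounts to setting model_id; the value shown below is just the documented default, so consult the Deepgram documentation for other options:

browser_audio:
  # ... other configuration
  tts:
    name: deepgram
    model_id: aura-2-andromeda-en
    timeout: 30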
Others
Looking for integration with a different TTS service? You can create your own custom TTS component.
Custom ASR
You can implement your own custom ASR component as a Python class to
integrate with any third-party speech recognition service. A custom ASR
component must subclass the ASREngine class from rasa.core.channels.voice_stream.asr.asr_engine.
Your custom ASR component will receive audio in the RasaAudioBytes format and may need to convert it to your service's expected format.
Required Methods
Your custom ASR component must implement the following methods:
- open_websocket_connection(): Establish a websocket connection to your ASR service
- from_config_dict(config: Dict): Class method to create an instance from a configuration dictionary
- signal_audio_done(): Signal to the ASR service that audio input has ended
- rasa_audio_bytes_to_engine_bytes(chunk: RasaAudioBytes): Convert Rasa audio format to your engine's expected format
- engine_event_to_asr_event(event: Any): Convert your engine's events to Rasa's ASREvent format
- get_default_config(): Static method that returns the default configuration for your component
Optional Methods
You may also override these methods as needed:
- send_keep_alive(): Send keep-alive messages to maintain the connection. The default implementation is only a pass statement.
- close_connection(): Custom cleanup when closing the connection. The default implementation is as follows:
async def close_connection(self) -> None:
  if self.asr_socket:
    await self.asr_socket.close()
ASR Events
Your engine_event_to_asr_event method should return appropriate ASREvent objects:
- UserIsSpeaking(transcript): For interim/partial transcripts while the user is speaking
- NewTranscript(transcript): For final transcripts when the user has finished speaking
See Configuration for details on how to configure your custom ASR component.
Example Implementation
Here's an example based on the Deepgram implementation structure:
import json
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional
from urllib.parse import urlencode
import websockets
from websockets.legacy.client import WebSocketClientProtocol
from rasa.core.channels.voice_stream.asr.asr_engine import ASREngine, ASREngineConfig
from rasa.core.channels.voice_stream.asr.asr_event import (
    ASREvent,
    NewTranscript,
    UserIsSpeaking,
)
from rasa.core.channels.voice_stream.audio_bytes import RasaAudioBytes
@dataclass
class MyASRConfig(ASREngineConfig):
    api_key: str = ""
    endpoint: str = "wss://api.example.com/v1/speech"
    language: str = "en-US"
class MyASR(ASREngine[MyASRConfig]):
    required_env_vars = ("MY_ASR_API_KEY",)  # Optional: required environment variables
    required_packages = ("my_asr_package",)  # Optional: required Python packages
    def __init__(self, config: Optional[MyASRConfig] = None):
        super().__init__(config)
        self.accumulated_transcript = ""
    async def open_websocket_connection(self) -> WebSocketClientProtocol:
        """Connect to the ASR system."""
        api_key = os.environ["MY_ASR_API_KEY"]
        headers = {"Authorization": f"Bearer {api_key}"}
        return await websockets.connect(
            self._get_api_url_with_params(),
            extra_headers=headers
        )
    def _get_api_url_with_params(self) -> str:
        """Build API URL with query parameters."""
        query_params = {
            "language": self.config.language,
            "encoding": "mulaw",
            "sample_rate": "8000",
            "interim_results": "true"
        }
        return f"{self.config.endpoint}?{urlencode(query_params)}"
    @classmethod
    def from_config_dict(cls, config: Dict) -> "MyASR":
        """Create instance from configuration dictionary."""
        asr_config = MyASRConfig.from_dict(config)
        return cls(asr_config)
    async def signal_audio_done(self) -> None:
        """Signal to the ASR service that audio input has ended."""
        await self.asr_socket.send(json.dumps({"type": "stop_audio"}))
    def rasa_audio_bytes_to_engine_bytes(self, chunk: RasaAudioBytes) -> bytes:
        """Convert Rasa audio format to engine format."""
        # For most services, you can return the chunk directly
        # since it's already in mulaw format
        return chunk
    def engine_event_to_asr_event(self, event: Any) -> Optional[ASREvent]:
        """Convert engine response to ASREvent."""
        data = json.loads(event)
        
        if data.get("type") == "transcript":
            transcript = data.get("text", "")
            
            if data.get("is_final"):
                # Final transcript - user finished speaking
                full_transcript = self.accumulated_transcript + " " + transcript
                self.accumulated_transcript = ""
                return NewTranscript(full_transcript.strip())
            elif transcript:
                # Interim transcript - user is still speaking
                return UserIsSpeaking(transcript)
                
        return None
    @staticmethod
    def get_default_config() -> MyASRConfig:
        """Get default configuration."""
        return MyASRConfig(
            endpoint="wss://api.example.com/v1/speech",
            language="en-US"
        )
    async def send_keep_alive(self) -> None:
        """Send keep-alive message if supported by your service."""
        if self.asr_socket is not None:
            await self.asr_socket.send(json.dumps({"type": "keep_alive"}))
This structure allows you to integrate any speech recognition service with Rasa's voice capabilities while maintaining compatibility with the existing voice stream infrastructure.
Custom TTS
You can implement your own custom TTS component as a Python class to
integrate with any third-party text-to-speech service. A custom TTS component
must subclass the TTSEngine class from
rasa.core.channels.voice_stream.tts.tts_engine.
Your custom TTS component must output audio in the RasaAudioBytes format
and convert it using the engine_bytes_to_rasa_audio_bytes method.
Required Methods
Your custom TTS component must implement the following methods:
- synthesize(text: str, config: Optional[T]): Generate speech from text, returning an async iterator of RasaAudioBytes chunks
- engine_bytes_to_rasa_audio_bytes(chunk: bytes): Convert your engine's audio format to Rasa's audio format
- from_config_dict(config: Dict): Class method to create an instance from a configuration dictionary
- get_default_config(): Static method that returns the default configuration for your component
Optional Methods
You may also override these methods as needed:
- close_connection(): Custom cleanup when closing connections (e.g., closing websockets or HTTP sessions). The default implementation does not do anything.
See Configuration for details on how to configure your custom TTS component.
Example Implementation
Here's an example based on the Deepgram TTS implementation structure:
import os
from dataclasses import dataclass
from typing import AsyncIterator, Dict, Optional
from urllib.parse import urlencode
import aiohttp
from aiohttp import ClientTimeout, WSMsgType
from rasa.core.channels.voice_stream.audio_bytes import RasaAudioBytes
from rasa.core.channels.voice_stream.tts.tts_engine import (
    TTSEngine,
    TTSEngineConfig,
    TTSError,
)
@dataclass
class MyTTSConfig(TTSEngineConfig):
    api_key: str = ""
    endpoint: str = "wss://api.example.com/v1/speak"
    model_id: str = "en-US-standard"
class MyTTS(TTSEngine[MyTTSConfig]):
    session: Optional[aiohttp.ClientSession] = None
    required_env_vars = ("MY_TTS_API_KEY",)  # Optional: required environment variables
    required_packages = ("aiohttp",)  # Optional: required Python packages
    ws: Optional[aiohttp.ClientWebSocketResponse] = None
    def __init__(self, config: Optional[MyTTSConfig] = None):
        super().__init__(config)
        # Reuse a single client session for all synthesis requests made by this engine.
        timeout = ClientTimeout(total=self.config.timeout)
        self.session = aiohttp.ClientSession(timeout=timeout)
    @staticmethod
    def get_request_headers(config: MyTTSConfig) -> dict[str, str]:
        """Build request headers with authentication."""
        api_key = os.environ["MY_TTS_API_KEY"]
        return {
            "Authorization": f"Bearer {api_key}",
        }
    async def close_connection(self) -> None:
        """Close WebSocket connection if it exists."""
        if self.ws and not self.ws.closed:
            await self.ws.close()
            self.ws = None
    def get_websocket_url(self, config: MyTTSConfig) -> str:
        """Build WebSocket URL with query parameters."""
        base_url = config.endpoint
        query_params = {
            "model": config.model_id,
            "language": config.language,
            "voice": config.voice,
            "encoding": "mulaw",
            "sample_rate": "8000",
        }
        return f"{base_url}?{urlencode(query_params)}"
    async def synthesize(
        self, text: str, config: Optional[MyTTSConfig] = None
    ) -> AsyncIterator[RasaAudioBytes]:
        """Generate speech from text using WebSocket TTS API."""
        config = self.config.merge(config)
        headers = self.get_request_headers(config)
        ws_url = self.get_websocket_url(config)
        try:
            self.ws = await self.session.ws_connect(
                ws_url,
                headers=headers,
                timeout=float(self.config.timeout),
            )
            
            # Send text to synthesize
            await self.ws.send_json({
                "type": "Speak",
                "text": text,
            })
            
            # Signal that we're done sending text
            await self.ws.send_json({"type": "Flush"})
            
            # Stream audio chunks
            async for msg in self.ws:
                if msg.type == WSMsgType.BINARY:
                    # Binary data is the raw audio
                    yield self.engine_bytes_to_rasa_audio_bytes(msg.data)
                elif msg.type == WSMsgType.TEXT:
                    # Check if stream is complete
                    if "Close" in msg.data or "Flushed" in msg.data:
                        break
                elif msg.type in (WSMsgType.CLOSED, WSMsgType.ERROR):
                    break
            # Send close message
            if self.ws and not self.ws.closed:
                await self.ws.send_json({"type": "Close"})
        except Exception as e:
            raise TTSError(f"Error during TTS synthesis: {e}")
        finally:
            # Ensure connection is closed
            await self.close_connection()
    def engine_bytes_to_rasa_audio_bytes(self, chunk: bytes) -> RasaAudioBytes:
        """Convert the generated TTS audio bytes into Rasa audio bytes."""
        # If your service returns audio in a different format (e.g., linear PCM),
        # convert it first. For example, to convert from 16-bit linear PCM:
        # import audioop
        # chunk = audioop.lin2ulaw(chunk, 2)  # 2 = 16-bit samples
        #
        # If your service already returns mulaw audio, return it directly:
        return RasaAudioBytes(chunk)
    @staticmethod
    def get_default_config() -> MyTTSConfig:
        """Get default configuration."""
        return MyTTSConfig(
            endpoint="wss://api.example.com/v1/speak",
            model_id="en-US-standard",
            language="en-US",
            voice="female-1",
            timeout=30,
        )
    @classmethod
    def from_config_dict(cls, config: Dict) -> "MyTTS":
        """Create instance from configuration dictionary."""
        return cls(MyTTSConfig.from_dict(config))
Configuration for Custom Components
To use a custom ASR or TTS component, you need to supply credentials for it in your credentials.yml file. The configuration should contain the module path of your custom class and any required configuration parameters.
The module path follows the format path.to.module.ClassName. For example:
- A class MyASR in file addons/custom_asr.py has module path addons.custom_asr.MyASR
- A class MyTTS in file addons/custom_tts.py has module path addons.custom_tts.MyTTS
Custom ASR Configuration Example
browser_audio:
  # ... other configuration
  asr:
    name: addons.custom_asr.MyASR
    api_key: "your_api_key"
    endpoint: "wss://api.example.com/v1/speech"
    language: "en-US"
    # any other custom parameters your ASR needs
Custom TTS Configuration Example
browser_audio:
  # ... other configuration
  tts:
    name: addons.custom_tts.MyTTS
    api_key: "your_api_key"
    endpoint: "wss://api.example.com/v1/speak"
    language: "en-US"
    voice: "en-US-JennyNeural"
    timeout: 30
    # any other custom parameters your TTS needs
Any custom parameters you define in your configuration class (e.g., MyASRConfig or MyTTSConfig) can be passed through the credentials file and will be available in your component via self.config.
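As an illustration, assuming the hypothetical MyTTSConfig class from the example above, the values from credentials.yml are parsed into the config dataclass and the custom fields become plain attributes on it; inside the running component the same object is available as self.config:

# Sketch only: this dict mirrors the keys you would put under `tts:` in credentials.yml
config_dict = {
    "api_key": "your_api_key",
    "endpoint": "wss://api.example.com/v1/speak",
    "language": "en-US",
}
config = MyTTSConfig.from_dict(config_dict)
# Custom fields declared on MyTTSConfig are now regular attributes.
assert config.api_key == "your_api_key"
assert config.endpoint == "wss://api.example.com/v1/speak"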