Speech Integrations
Audio Format
Rasa uses a common intermediate audio format called RasaAudioBytes, which acts as the standard data format between channels, ASR engines, and TTS engines, so that each integration only has to convert to and from a single format. Currently, this corresponds to:
- Raw wave format
- 8kHz sample rate
- 8-bit depth
- Mono channel
- μ-law (mulaw) encoding
These parameters are not configurable. Rasa uses the library audioop-lts for
conversion between audio encodings (functions like ulaw2lin() or lin2ulaw()).
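To illustrate, a conversion between μ-law and linear PCM can be done with the audioop module (provided by the audioop-lts package on newer Python versions). The helper functions below are only a sketch and not part of Rasa's API:

import audioop  # provided by the audioop-lts package on recent Python versions

def ulaw_to_pcm16(ulaw_bytes: bytes) -> bytes:
    # Expand 8-bit mu-law samples (as in RasaAudioBytes) to 16-bit linear PCM.
    # The second argument is the sample width in bytes of the linear audio.
    return audioop.ulaw2lin(ulaw_bytes, 2)

def pcm16_to_ulaw(pcm_bytes: bytes) -> bytes:
    # Compress 16-bit linear PCM back down to 8-bit mu-law.
    return audioop.lin2ulaw(pcm_bytes, 2)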
Automatic Speech Recognition (ASR)
This section describes the supported integrations with Automatic Speech Recognition (ASR) or Speech To Text (STT) services.
Deepgram
Set the Deepgram API key with the environment variable DEEPGRAM_API_KEY. You can
request a key from Deepgram. Deepgram can be configured in a Voice Stream channel
as follows:
browser_audio:
  # ... other configuration
  asr:
    name: deepgram
Deepgram uses two mechanisms to detect when a speaker has finished talking:
- Endpointing: Uses Voice Activity Detection (VAD) to detect silence after speech
- UtteranceEnd: Looks at word timings to detect gaps between words
The configuration parameters endpointing and utterance_end_ms below control these features respectively. For noisy environments, utterance_end_ms may be more reliable because it ignores non-speech audio. Read more in the Deepgram documentation.
Configuration parameters
- endpoint: Optional, defaults to api.deepgram.com - The endpoint URL for the Deepgram API.
- endpointing: Optional, defaults to 400 - Number of milliseconds of silence used to determine the end of speech.
- language: Optional, defaults to en - The language code for the speech recognition.
- model: Optional, defaults to nova-2-general - The model to be used for speech recognition.
- smart_format: Optional, defaults to true - Boolean value to enable or disable Deepgram's smart formatting.
- utterance_end_ms: Optional, defaults to 1000 - Time in milliseconds to wait before considering an utterance complete.
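For reference, a browser_audio entry that sets these parameters explicitly could look like the snippet below; the values shown are simply the documented defaults, so adjust them to your needs:

browser_audio:
  # ... other configuration
  asr:
    name: deepgram
    language: en
    model: nova-2-general
    smart_format: true
    endpointing: 400
    utterance_end_ms: 1000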
Azure
Requires the Python library azure-cognitiveservices-speech. The API key can be set with the environment variable AZURE_SPEECH_API_KEY.
A sample configuration looks as follows:
browser_audio:
  # ... other configuration
  asr:
    name: azure
Configuration parameters
- language: Required. The language code for the speech recognition. (See the Azure documentation for a list of languages.)
- speech_region: Optional, defaults to None - The region identifier for the Azure Speech service, such as westus. Ensure that the region matches the region of your subscription.
- speech_endpoint: Optional, defaults to None - The service endpoint to connect to. You can use it when you have the Azure Speech service behind a reverse proxy.
- speech_host: Optional, defaults to None - The service host to connect to. The standard resource path will be assumed. The format is "protocol://host:port", where ":port" is optional.
While speech_region, speech_endpoint, and speech_host are optional parameters, they cannot all be empty at the same time; if they are, speech_region is set to eastus. When connecting to the Azure cloud, the speech_region parameter is enough. Here is an example config:
browser_audio:
  server_url: localhost
  asr:
    name: azure
    language: de-DE
    speech_region: germanywestcentral
  tts:
    name: azure
    language: de-DE
    voice: de-DE-KatjaNeural
    speech_region: germanywestcentral
Others
Looking for integration with a different ASR service? You can create your own custom ASR component.
Text To Speech (TTS)
This section describes the supported integrations with Text To Speech (TTS) services.
Azure TTS
The API key can be set with the environment variable AZURE_SPEECH_API_KEY. A sample configuration looks as follows:
browser_audio:
  # ... other configuration
  tts:
    name: azure
Configuration parameters
- language: Optional, defaults to en-US - The language code for the text-to-speech conversion. (See the Azure documentation for a list of languages and voices.)
- voice: Optional, defaults to en-US-JennyNeural - The voice to be used for the text-to-speech conversion. The voice defines the specific characteristics of the speech, such as the speaker's gender, age, and speaking style.
- timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
- speech_region: Optional, defaults to None - The region identifier for the Azure Speech service. Ensure that the region matches the region of your subscription.
- endpoint: Optional, defaults to None - The service endpoint for the Azure Speech service.
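For reference, a tts entry that sets these parameters explicitly might look like the snippet below; the values are the documented defaults, and the speech_region shown is only a placeholder for your own subscription's region:

browser_audio:
  # ... other configuration
  tts:
    name: azure
    language: en-US
    voice: en-US-JennyNeural
    timeout: 10
    speech_region: eastus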
Cartesia TTS
Set the Cartesia API key with the environment variable CARTESIA_API_KEY. The API key
requires a Cartesia account. Cartesia can be configured in a Voice Stream channel as follows:
browser_audio:
  # ... other configuration
  tts:
    name: cartesia
Configuration parameters
- language: Optional, defaults to en - The language code for the text-to-speech conversion.
- voice: Optional, defaults to 248be419-c632-4f23-adf1-5324ed7dbf1d - The id of the voice to use for text-to-speech conversion. The parameter will be passed to the Cartesia API as "voice": {"mode": "id", "id": "VALUE"}.
- timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
- model_id: Optional, defaults to sonic-english - The model ID to be used for the text-to-speech conversion.
- version: Optional, defaults to 2024-06-10 - The version of the model to be used for the text-to-speech conversion.
- endpoint: Optional, defaults to https://api.cartesia.ai/tts/sse - The endpoint URL for the Cartesia API.
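For reference, a fuller Cartesia configuration might look like the snippet below; the values shown are the documented defaults, and in practice you would replace the voice id with one from your Cartesia account:

browser_audio:
  # ... other configuration
  tts:
    name: cartesia
    language: en
    voice: 248be419-c632-4f23-adf1-5324ed7dbf1d
    model_id: sonic-english
    timeout: 10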
Deepgram TTS
Set the Deepgram API key with the environment variable DEEPGRAM_API_KEY. You can
request a key from Deepgram. Deepgram TTS can be configured in a Voice Stream channel
as follows:
browser_audio:
  # ... other configuration
  tts:
    name: deepgram
Configuration parameters
Deepgram does not use the parent class parameters language or voice, because each
model is uniquely identified using the format [modelname]-[voicename]-[language].
- model_id: Optional, defaults to aura-2-andromeda-en - The list of available options can be found in the Deepgram documentation.
- endpoint: Optional, defaults to wss://api.deepgram.com/v1/speak - The endpoint URL for the Deepgram API.
- timeout: Optional, defaults to 30 - The timeout duration in seconds for the text-to-speech request.
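For reference, selecting a specific Aura voice amounts to setting model_id; the value shown below is just the documented default, so consult the Deepgram documentation for other options:

browser_audio:
  # ... other configuration
  tts:
    name: deepgram
    model_id: aura-2-andromeda-en
    timeout: 30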
Others
Looking for integration with a different TTS service? You can create your own custom TTS component.
Custom ASR
You can implement your own custom ASR component as a Python class to
integrate with any third-party speech recognition service. A custom ASR
component must subclass the ASREngine class from rasa.core.channels.voice_stream.asr.asr_engine.
Your custom ASR component will receive audio in the RasaAudioBytes format and may need to convert it to your service's expected format.
Required Methods
Your custom ASR component must implement the following methods:
- open_websocket_connection(): Establish a websocket connection to your ASR service
- from_config_dict(config: Dict): Class method to create an instance from a configuration dictionary
- signal_audio_done(): Signal to the ASR service that audio input has ended
- rasa_audio_bytes_to_engine_bytes(chunk: RasaAudioBytes): Convert Rasa audio format to your engine's expected format
- engine_event_to_asr_event(event: Any): Convert your engine's events to Rasa's ASREvent format
- get_default_config(): Static method that returns the default configuration for your component
Optional Methods
You may also override these methods as needed:
- send_keep_alive(): Send keep-alive messages to maintain the connection. The default implementation is only a pass statement.
- close_connection(): Custom cleanup when closing the connection. The default implementation is as follows:
async def close_connection(self) -> None:
  if self.asr_socket:
    await self.asr_socket.close()
ASR Events
Your engine_event_to_asr_event method should return appropriate ASREvent objects:
- UserIsSpeaking(transcript): For interim/partial transcripts while the user is speaking
- NewTranscript(transcript): For final transcripts when the user has finished speaking
See Configuration for details on how to configure your custom ASR component.
Example Implementation
Here's an example based on the Deepgram implementation structure:
import json
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional
from urllib.parse import urlencode
import websockets
from websockets.legacy.client import WebSocketClientProtocol
from rasa.core.channels.voice_stream.asr.asr_engine import ASREngine, ASREngineConfig
from rasa.core.channels.voice_stream.asr.asr_event import (
    ASREvent,
    NewTranscript,
    UserIsSpeaking,
)
from rasa.core.channels.voice_stream.audio_bytes import RasaAudioBytes
@dataclass
class MyASRConfig(ASREngineConfig):
    api_key: str = ""
    endpoint: str = "wss://api.example.com/v1/speech"
    language: str = "en-US"
class MyASR(ASREngine[MyASRConfig]):
    required_env_vars = ("MY_ASR_API_KEY",)  # Optional: required environment variables
    required_packages = ("my_asr_package",)  # Optional: required Python packages
    def __init__(self, config: Optional[MyASRConfig] = None):
        super().__init__(config)
        self.accumulated_transcript = ""
    async def open_websocket_connection(self) -> WebSocketClientProtocol:
        """Connect to the ASR system."""
        api_key = os.environ["MY_ASR_API_KEY"]
        headers = {"Authorization": f"Bearer {api_key}"}
        return await websockets.connect(
            self._get_api_url_with_params(),
            extra_headers=headers
        )
    def _get_api_url_with_params(self) -> str:
        """Build API URL with query parameters."""
        query_params = {
            "language": self.config.language,
            "encoding": "mulaw",
            "sample_rate": "8000",
            "interim_results": "true"
        }
        return f"{self.config.endpoint}?{urlencode(query_params)}"
    @classmethod
    def from_config_dict(cls, config: Dict) -> "MyASR":
        """Create instance from configuration dictionary."""
        asr_config = MyASRConfig.from_dict(config)
        return cls(asr_config)
    async def signal_audio_done(self) -> None:
        """Signal to the ASR service that audio input has ended."""
        await self.asr_socket.send(json.dumps({"type": "stop_audio"}))
    def rasa_audio_bytes_to_engine_bytes(self, chunk: RasaAudioBytes) -> bytes:
        """Convert Rasa audio format to engine format."""
        # For most services, you can return the chunk directly
        # since it's already in mulaw format
        return chunk
    def engine_event_to_asr_event(self, event: Any) -> Optional[ASREvent]:
        """Convert engine response to ASREvent."""
        data = json.loads(event)
        
        if data.get("type") == "transcript":
            transcript = data.get("text", "")
            
            if data.get("is_final"):
                # Final transcript - user finished speaking
                full_transcript = self.accumulated_transcript + " " + transcript
                self.accumulated_transcript = ""
                return NewTranscript(full_transcript.strip())
            elif transcript:
                # Interim transcript - user is still speaking
                return UserIsSpeaking(transcript)
                
        return None
    @staticmethod
    def get_default_config() -> MyASRConfig:
        """Get default configuration."""
        return MyASRConfig(
            endpoint="wss://api.example.com/v1/speech",
            language="en-US"
        )
    async def send_keep_alive(self) -> None:
        """Send keep-alive message if supported by your service."""
        if self.asr_socket is not None:
            await self.asr_socket.send(json.dumps({"type": "keep_alive"}))
This structure allows you to integrate any speech recognition service with Rasa's voice capabilities while maintaining compatibility with the existing voice stream infrastructure.
Custom TTS
You can implement your own custom TTS component as a Python class to
integrate with any third-party text-to-speech service. A custom TTS component
must subclass the TTSEngine class from
rasa.core.channels.voice_stream.tts.tts_engine.
Your custom TTS component must output audio in the RasaAudioBytes format
and convert it using the engine_bytes_to_rasa_audio_bytes method.
Required Methods
Your custom TTS component must implement the following methods:
- synthesize(text: str, config: Optional[T]): Generate speech from text, returning an async iterator of RasaAudioBytes chunks
- engine_bytes_to_rasa_audio_bytes(chunk: bytes): Convert your engine's audio format to Rasa's audio format
- from_config_dict(config: Dict): Class method to create an instance from a configuration dictionary
- get_default_config(): Static method that returns the default configuration for your component
Optional Methods
You may also override these methods as needed:
- close_connection(): Custom cleanup when closing connections (e.g., closing websockets or HTTP sessions). The default implementation does not do anything.
See Configuration for details on how to configure your custom TTS component.
Example Implementation
Here's an example based on the Deepgram TTS implementation structure:
import os
from dataclasses import dataclass
from typing import AsyncIterator, Dict, Optional
from urllib.parse import urlencode
import aiohttp
from aiohttp import ClientTimeout, WSMsgType
from rasa.core.channels.voice_stream.audio_bytes import RasaAudioBytes
from rasa.core.channels.voice_stream.tts.tts_engine import (
    TTSEngine,
    TTSEngineConfig,
    TTSError,
)
@dataclass
class MyTTSConfig(TTSEngineConfig):
    api_key: str = ""
    endpoint: str = "wss://api.example.com/v1/speak"
    model_id: str = "en-US-standard"
class MyTTS(TTSEngine[MyTTSConfig]):
    session: Optional[aiohttp.ClientSession] = None
    required_env_vars = ("MY_TTS_API_KEY",)  # Optional: required environment variables
    required_packages = ("aiohttp",)  # Optional: required Python packages
    ws: Optional[aiohttp.ClientWebSocketResponse] = None
    def __init__(self, config: Optional[MyTTSConfig] = None):
        super().__init__(config)
        # Reuse a single client session for all synthesis requests made by this engine.
        timeout = ClientTimeout(total=self.config.timeout)
        self.session = aiohttp.ClientSession(timeout=timeout)
    @staticmethod
    def get_request_headers(config: MyTTSConfig) -> dict[str, str]:
        """Build request headers with authentication."""
        api_key = os.environ["MY_TTS_API_KEY"]
        return {
            "Authorization": f"Bearer {api_key}",
        }
    async def close_connection(self) -> None:
        """Close WebSocket connection if it exists."""
        if self.ws and not self.ws.closed:
            await self.ws.close()
            self.ws = None
    def get_websocket_url(self, config: MyTTSConfig) -> str:
        """Build WebSocket URL with query parameters."""
        base_url = config.endpoint
        query_params = {
            "model": config.model_id,
            "language": config.language,
            "voice": config.voice,
            "encoding": "mulaw",
            "sample_rate": "8000",
        }
        return f"{base_url}?{urlencode(query_params)}"
    async def synthesize(
        self, text: str, config: Optional[MyTTSConfig] = None
    ) -> AsyncIterator[RasaAudioBytes]:
        """Generate speech from text using WebSocket TTS API."""
        config = self.config.merge(config)
        headers = self.get_request_headers(config)
        ws_url = self.get_websocket_url(config)
        try:
            self.ws = await self.session.ws_connect(
                ws_url,
                headers=headers,
                timeout=float(self.config.timeout),
            )
            
            # Send text to synthesize
            await self.ws.send_json({
                "type": "Speak",
                "text": text,
            })
            
            # Signal that we're done sending text
            await self.ws.send_json({"type": "Flush"})
            
            # Stream audio chunks
            async for msg in self.ws:
                if msg.type == WSMsgType.BINARY:
                    # Binary data is the raw audio
                    yield self.engine_bytes_to_rasa_audio_bytes(msg.data)
                elif msg.type == WSMsgType.TEXT:
                    # Check if stream is complete
                    if "Close" in msg.data or "Flushed" in msg.data:
                        break
                elif msg.type in (WSMsgType.CLOSED, WSMsgType.ERROR):
                    break
            # Send close message
            if self.ws and not self.ws.closed:
                await self.ws.send_json({"type": "Close"})
        except Exception as e:
            raise TTSError(f"Error during TTS synthesis: {e}")
        finally:
            # Ensure connection is closed
            await self.close_connection()
    def engine_bytes_to_rasa_audio_bytes(self, chunk: bytes) -> RasaAudioBytes:
        """Convert the generated TTS audio bytes into Rasa audio bytes."""
        # If your service returns audio in a different format (e.g., linear PCM),
        # convert it first. For example, to convert from 16-bit linear PCM:
        # import audioop
        # chunk = audioop.lin2ulaw(chunk, 2)  # 2 = 16-bit samples
        #
        # If your service already returns mulaw audio, return it directly:
        return RasaAudioBytes(chunk)
    @staticmethod
    def get_default_config() -> MyTTSConfig:
        """Get default configuration."""
        return MyTTSConfig(
            endpoint="wss://api.example.com/v1/speak",
            model_id="en-US-standard",
            language="en-US",
            voice="female-1",
            timeout=30,
        )
    @classmethod
    def from_config_dict(cls, config: Dict) -> "MyTTS":
        """Create instance from configuration dictionary."""
        return cls(MyTTSConfig.from_dict(config))
Configuration for Custom Components
To use a custom ASR or TTS component, you need to supply credentials for it in your credentials.yml file. The configuration should contain the module path of your custom class and any required configuration parameters.
The module path follows the format path.to.module.ClassName. For example:
- A class MyASR in file addons/custom_asr.py has module path addons.custom_asr.MyASR
- A class MyTTS in file addons/custom_tts.py has module path addons.custom_tts.MyTTS
Custom ASR Configuration Example
browser_audio:
  # ... other configuration
  asr:
    name: addons.custom_asr.MyASR
    api_key: "your_api_key"
    endpoint: "wss://api.example.com/v1/speech"
    language: "en-US"
    # any other custom parameters your ASR needs
Custom TTS Configuration Example
browser_audio:
  # ... other configuration
  tts:
    name: addons.custom_tts.MyTTS
    api_key: "your_api_key"
    endpoint: "wss://api.example.com/v1/speak"
    language: "en-US"
    voice: "en-US-JennyNeural"
    timeout: 30
    # any other custom parameters your TTS needs
Any custom parameters you define in your configuration class (e.g., MyASRConfig or MyTTSConfig) can be passed through the credentials file and will be available in your component via self.config.
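As an illustration, assuming the hypothetical MyTTSConfig class from the example above, the values from credentials.yml are parsed into the config dataclass and the custom fields become plain attributes on it; inside the running component the same object is available as self.config:

# Sketch only: this dict mirrors the keys you would put under `tts:` in credentials.yml
config_dict = {
    "api_key": "your_api_key",
    "endpoint": "wss://api.example.com/v1/speak",
    "language": "en-US",
}
config = MyTTSConfig.from_dict(config_dict)
# Custom fields declared on MyTTSConfig are now regular attributes.
assert config.api_key == "your_api_key"
assert config.endpoint == "wss://api.example.com/v1/speak"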