Hi,
I’m currently experiencing issues with the microphone input and speech-to-text (STT) behavior while using Convai in Unreal Engine 5 (both in Editor and runtime).
I have implemented Push-To-Talk (PTT) using both a UI widget button and the default [T] key. In both cases, the system correctly detects when I’m speaking, but the transcription does not behave as expected.
The main issue is that I speak in Spanish, but the STT transcription is always processed in English. This leads to inaccurate results and loss of context.
Here are the specific problems I’m encountering:
- Spanish speech is transcribed as if it were English, which produces incorrect text.
- While I’m speaking, the transcription continuously updates and replaces previous words instead of appending them.
- As a result, parts of the sentence are lost. By the time I finish speaking, only a few words remain (sometimes just 1–3 words).
- The system generates responses based on this incomplete input, which leads to incorrect or irrelevant replies.
This behavior occurs both when using a UI button for PTT and when using the [T] key directly.
It seems like the STT is running in a streaming/partial transcription mode, but without proper language detection or finalization handling.
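To make the append-versus-replace behavior described above concrete, here is a minimal sketch of what I would expect the client side to do with streaming results. It is plain C++ and not Convai's API: `TranscriptEvent`, `is_final`, and `TranscriptAccumulator` are hypothetical names I made up for illustration, and I am assuming the service marks each result as either interim or final, which I have not confirmed.

```cpp
#include <string>
#include <vector>

// Hypothetical event shape (an assumption, not a Convai type).
struct TranscriptEvent {
    std::string text;  // text for the segment currently being recognized
    bool is_final;     // assumed flag: true once the recognizer commits the segment
};

class TranscriptAccumulator {
public:
    // Interim results overwrite only the pending segment;
    // final results are appended to the committed text.
    void on_event(const TranscriptEvent& ev) {
        if (ev.is_final) {
            if (!ev.text.empty()) {
                committed_.push_back(ev.text);
            }
            pending_.clear();
        } else {
            pending_ = ev.text;
        }
    }

    // Text to show while the talk key is held: committed segments plus the pending one.
    std::string display_text() const {
        std::string out;
        for (const std::string& segment : committed_) {
            append_with_space(out, segment);
        }
        append_with_space(out, pending_);
        return out;
    }

    // Call when push-to-talk is released: returns the whole utterance and resets state.
    std::string finish() {
        std::string result = display_text();
        committed_.clear();
        pending_.clear();
        return result;
    }

private:
    static void append_with_space(std::string& out, const std::string& piece) {
        if (piece.empty()) {
            return;
        }
        if (!out.empty()) {
            out += ' ';
        }
        out += piece;
    }

    std::vector<std::string> committed_;  // segments the recognizer has finalized
    std::string pending_;                 // latest interim text, replaced on each update
};
```

With this approach, the text sent to the character would be whatever `finish()` returns when the PTT button or [T] key is released, so earlier finalized segments are never overwritten by later interim updates. What I see instead looks as if each update replaces the entire transcript.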
I would like to better understand how the STT system is expected to behave in multilingual scenarios.
Questions:
- Does the STT support automatic language detection or multilingual input?
- If so, how can it be properly configured?
- Is the current behavior expected when using streaming transcription?
- Can the transcription mode be configured to finalize input only after the user finishes speaking, instead of continuously replacing it?
I would prefer not to force a single language, but rather ensure proper multilingual or auto-detection support.
This issue makes it very difficult to maintain correct conversational context.
Thanks in advance.