I’m integrating Convai with Unity and using a VAD-based audio pipeline that works fine on its own.
The issue appears when I add an audio prefix:
- At the start of the utterance, I push a short PCM clip that says "translate".
- After that, VAD continues and streams the mic audio as usual.
Without the prefix → everything works (full sentences are transcribed).
With the prefix + VAD → the server seems to finalize right after the prefix, and in recent memory I only see "Translate.".
Since the pipeline is a black box and I can't see exactly what the server receives internally, I'd like to understand why this cutoff happens and what the recommended way is to prepend an audio prefix without losing the rest of the utterance.
I checked my code and the sample rate is 16000 Hz for both the mic audio and the prefix clip.
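For reference, here is roughly the send order I use. This is a simplified sketch, not the actual Convai SDK API: the client, VAD, and mic-capture types below are placeholders for whatever my wrappers expose.

```csharp
using System.Collections;
using UnityEngine;

// Placeholder interfaces: stand-ins for my streaming client, VAD, and mic capture
// (not real Convai SDK types).
public interface IAudioStreamClient { void SendAudioChunk(byte[] pcm); void FinishAudioStream(); }
public interface IVad { bool IsSpeechActive { get; } }
public interface IMicCapture { byte[] ReadPcmChunk(); }

public class PrefixedUtteranceStreamer : MonoBehaviour
{
    public IAudioStreamClient client;
    public IVad vad;
    public IMicCapture mic;

    // Send order: prefix clip first, then VAD-gated mic chunks, then end-of-utterance.
    public IEnumerator StreamUtterance(byte[] prefixPcm)
    {
        // 1) Push the pre-recorded "translate" clip (16-bit PCM, 16000 Hz, mono).
        client.SendAudioChunk(prefixPcm);

        // 2) Keep streaming mic audio while VAD detects speech (same format as the prefix).
        while (vad.IsSpeechActive)
        {
            client.SendAudioChunk(mic.ReadPcmChunk());
            yield return null;   // one chunk per frame
        }

        // 3) Close the utterance once VAD reports silence.
        client.FinishAudioStream();
    }
}
```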
________________
My goal is to have a character that translates between two languages.
If I rely only on the section's objective, the model sometimes misunderstands: for example, if the input is a question, it may try to answer it itself instead of translating it.
When I prepend the word “translate” at the beginning of the prompt, it works reliably.
With text input, this works fine.
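For example, on the text path I do roughly this (a minimal sketch; the send call is just a placeholder for whatever text-send method the client exposes):

```csharp
using UnityEngine;

public class TextTranslatePrefix : MonoBehaviour
{
    // Placeholder: stands in for whatever text-send call the Convai client exposes.
    public System.Action<string> sendText;

    // Prepending "translate" steers the model toward translating instead of answering.
    public void SendForTranslation(string userText)
    {
        string prompt = "translate: " + userText;   // e.g. "translate: <user sentence>"
        sendText?.Invoke(prompt);
    }
}
```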
But to reduce latency, I need to do this through voice, and since I can't send text + voice in one stream, I tried using an audio prefix that says "translate" before the mic audio (how I build that prefix clip is sketched below).
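In case it matters, this is roughly how I convert the prefix recording to PCM bytes before pushing it into the stream (assuming the AudioClip is imported as 16 kHz mono):

```csharp
using UnityEngine;

public class PrefixClipEncoder : MonoBehaviour
{
    public AudioClip translateClip;   // short recording of the word "translate", imported at 16000 Hz, mono

    // Convert the clip's float samples to 16-bit little-endian PCM so it matches the mic format.
    public byte[] ToPcm16(AudioClip clip)
    {
        float[] samples = new float[clip.samples * clip.channels];
        clip.GetData(samples, 0);

        byte[] pcm = new byte[samples.Length * 2];
        for (int i = 0; i < samples.Length; i++)
        {
            short s = (short)Mathf.Clamp(samples[i] * short.MaxValue, short.MinValue, short.MaxValue);
            pcm[i * 2]     = (byte)(s & 0xFF);
            pcm[i * 2 + 1] = (byte)((s >> 8) & 0xFF);
        }
        return pcm;
    }
}
```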