Problem when prepending audio prefix before VAD mic stream

I’m integrating Convai with Unity and using a VAD-based audio pipeline that works fine on its own.

The issue appears when I add an audio prefix:

  • At the start of the utterance, I push a short PCM clip that says “translate”.

  • After that, VAD continues and streams mic audio as usual.

Without the prefix → everything works (full sentences are transcribed).
With the prefix + VAD → the server seems to finalize right after the prefix, and in recent memory I only see "Translate.".

Since the server is a black box and I can’t see exactly what it receives internally, I’d like to understand why this cutoff happens and what the recommended way is to prepend a prefix without losing the rest of the utterance.
I checked my code and the sample rate is 16000 Hz for both the mic and the prefix.
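
To make the setup concrete, here is a minimal, language-agnostic sketch of the stream described above: the prefix clip goes out first, then the mic chunks follow in the same stream. It is not Convai's API or the actual Unity code. The function names, the 16-bit mono format, and the 100 ms chunk size are assumptions for illustration; only the 16000 Hz sample rate comes from the setup above.

```python
from typing import Iterable, Iterator, List

SAMPLE_RATE_HZ = 16000   # same rate for prefix and mic, as stated above
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM
CHUNK_MS = 100           # assumption: arbitrary chunk duration


def chunk_pcm(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Split raw PCM into fixed-duration chunks (the last one may be shorter)."""
    step = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


def prefixed_stream(prefix_pcm: bytes, mic_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the prefix clip first, then the mic chunks, as one continuous stream."""
    yield from chunk_pcm(prefix_pcm)
    yield from mic_chunks


def duration_seconds(chunks: List[bytes]) -> float:
    """Total audio duration represented by a list of PCM chunks."""
    total_bytes = sum(len(c) for c in chunks)
    return total_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


if __name__ == "__main__":
    prefix = bytes(16000)                  # 0.5 s of placeholder audio
    mic = [bytes(3200), bytes(3200)]       # two 100 ms placeholder mic chunks
    sent = list(prefixed_stream(prefix, mic))
    print(len(sent), "chunks,", duration_seconds(sent), "seconds")
```

In this sketch the client would send its end-of-stream signal only after the mic chunks are exhausted, so anything that ends the utterance earlier would have to come from the server side.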

________________
My goal is to have a character that translates between two languages.
If I rely only on the objective of the section, the model sometimes misunderstands — for example, if the input is a question, it may try to answer it itself instead of translating.

When I prepend the word “translate” at the beginning of the prompt, it works reliably.
With text input, this works fine.
But to reduce latency, I need to do this through voice — and since I can’t send text + voice in one stream, I tried using an audio prefix that says “translate” before the mic audio.

Hello @Fahimeh_TL,

Welcome to the Convai Developer Forum!

Unfortunately, external setups like the one you described fall outside our support scope.

Thanks.
Could you clarify the server-side rules for detecting the end of an utterance during audio streaming?
Specifically:
1) Does it expect a final end-of-stream signal from the client (CompleteAsync()), or can it terminate automatically for other reasons?
2) Are there size/time limits for a single utterance (e.g., max seconds of audio, max bytes)?
