I’m using Convai Unity SDK 3.3.0 on Unity 6.
Free plan; LLM set to Claude 3.5 / Claude 3.7 / GPT-4.1-mini.
Observed behaviour
Average latency ≈ 20 s before the NPC starts speaking.
~1 in 10 replies stalls after “Por supuesto…” (“Of course…”) and never completes.
Questions
Which LLM do you recommend for real-time NPCs that balances safety and faster starts?
Do paid plans (Business / Enterprise) lower model latency or reduce these cut-off errors? Any concrete numbers?
Work-arounds you’ve found effective—streaming partial text, shorter prompts, fallback models, etc.—to mask or mitigate the delay and incomplete answers?
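For context, the kind of masking I mean by “streaming partial text” would look roughly like this: flush each complete sentence to display/TTS as soon as it arrives instead of waiting for the full reply. A minimal Python sketch (the token stream here is hypothetical, not Convai’s API):

```python
import re

# Sentence-final punctuation followed by whitespace marks a flush point.
SENTENCE_END = re.compile(r'([.!?…])\s')

def flush_sentences(token_stream):
    """Yield complete sentences as soon as they arrive, so TTS/display
    can start before the full reply has been generated."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():          # flush any trailing fragment
        yield buffer.strip()

# Example: tokens arriving from a (mock) streaming LLM response.
tokens = ["Of course! ", "Day one: vis", "it the old town. ", "Day two: beaches."]
print(list(flush_sentences(tokens)))
# -> ['Of course!', 'Day one: visit the old town.', 'Day two: beaches.']
```

The first sentence reaches the user as soon as it is complete, which hides most of the model’s remaining generation time.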
20 second response time and the audio cutting off are definitely not expected behavior. If you could share your character ID, we’d be happy to investigate the issue further for you.
Hi K3, sorry for the delay. I’ve kept testing with Convai Unity SDK 3.3.0 on Unity 6, single text emission with a streaming (gRPC) response, Free plan.
Empirical findings:
• Llama-3-70B starts faster; often under 5 s depending on prompt length and Knowledge Bank size.
• GPT-4o (and 4.1-mini) follows instructions better but sometimes takes 10–12 s to start. Is that region/route latency, or are there recommended settings to reduce TTFB in this flow?
• The typical “Of course…” truncation appears when the response would be long and the generated audio exceeds ≈6 MB: I receive the first chunk and then the stream closes with an error. My understanding is that the LLM cannot know the final audio size, so prompting alone cannot reliably force fragmentation.
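If the limit really is about final audio size, the only mitigation I can see is client-side pre-chunking of the text before TTS. A rough Python sketch; the bytes-per-character figure is my own estimate, not an official number:

```python
import re

# Client-side pre-chunking sketch: split a long reply on sentence
# boundaries so each chunk's *estimated* audio stays under the ~6 MB
# limit. BYTES_PER_CHAR is an assumption (16-bit mono 24 kHz PCM is
# ~48 kB/s; at ~15 spoken chars/s that is ~3.2 kB per character).
AUDIO_LIMIT_BYTES = 6 * 1024 * 1024
BYTES_PER_CHAR = 3200          # assumption; calibrate against real TTS output
SAFETY = 0.8                   # keep 20% headroom under the limit

MAX_CHARS = int(AUDIO_LIMIT_BYTES * SAFETY / BYTES_PER_CHAR)

def chunk_reply(text, max_chars=MAX_CHARS):
    """Pack whole sentences into chunks whose estimated audio size
    stays below the limit. A single sentence longer than max_chars
    still goes through as its own (oversized) chunk."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then go out as its own TTS request. The 3.2 kB/char figure needs calibrating against whatever codec and sample rate Convai actually streams.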
Following our progress, new questions have arisen about concurrency, the TTS limit (~6 MB), and latency across LLMs.
Product context
We are an AI chat with an avatar that recommends itineraries, among other things. Many answers are long, so we need fast starts, stable text+audio delivery, and a reliable way to chunk long outputs without TTS cutoffs.
Setup
• Unity 6 (URP) + Convai Unity SDK 3.3.0 + Meta Quest 3.
• Free plan.
• Flow: a single text emission from the client; Convai returns text + audio via streaming (gRPC).
Concurrency test (theoretical = 1)
• 6 simultaneous sessions with the same characterID and API key: 2 via Studio (owner), 2 via shared link, 2 via Quest 3 APK.
• All 6 received correct audio. Target: peaks of ~520 concurrent clients.
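To distinguish queuing from rejection ourselves, we plan to probe with simultaneous sessions and compare elapsed times: if the server queues, the extra sessions stagger in steps; if it rejects, they fail immediately. A simplified Python harness (the backend here is a mock with 2 slots and FIFO queuing, not the real Convai endpoint):

```python
import asyncio
import time

async def one_session(i, start_session):
    """Run one conversation attempt and record client-side timing.
    start_session is a placeholder for the real streaming call."""
    t0 = time.monotonic()
    try:
        await start_session(i)
        return {"id": i, "ok": True, "elapsed": time.monotonic() - t0}
    except Exception as exc:
        return {"id": i, "ok": False, "error": str(exc)}

async def probe(n, start_session):
    """Fire n sessions at once and collect per-session results."""
    return await asyncio.gather(*(one_session(i, start_session) for i in range(n)))

async def main():
    slots = asyncio.Semaphore(2)            # mock server: 2 slots, FIFO queuing
    async def mock_backend(i):
        async with slots:
            await asyncio.sleep(0.2)        # slot held until "audio" ends
    return await probe(6, mock_backend)

results = asyncio.run(main())
waited = sum(1 for r in results if r["elapsed"] > 0.3)
print(f"{waited} of {len(results)} sessions waited for a free slot")
```

With 6 sessions against 2 mock slots, 4 sessions finish late, which is the signature of queuing; against a rejecting server the same harness would instead show 4 immediate errors.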
Observed latency
• GPT-4o shows 5–15 s latency with correct responses.
• Llama-3-70B shows 5–10 s latency with less accurate responses.
Questions on concurrency (priority)
Exactly when is a slot consumed and released: on call acceptance, on first chunk, or at audio EOR? Does TTS keep the slot busy until audio ends?
Scope of the limit: per API key/account or per characterID? Do Studio sessions share the pool with Unity sessions?
Is there queuing or immediate rejection? If queuing: is it FIFO, and what are the queue size and timeout? If rejection: what server status/trailers are typical when the limit is exceeded?
Telemetry: are there headers/trailers or an endpoint to see server timestamps (accepted/open/first-byte/EOR) and active sessions?
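Absent server-side telemetry, we currently wrap the response stream client-side to capture start / first-byte / EOR marks. A minimal Python sketch (the chunk iterator stands in for the real gRPC stream):

```python
import time

def timed_stream(response_iter):
    """Wrap a streaming response and record client-side marks:
    request start, first byte (TTFB), and end of response (EOR).
    response_iter stands in for the real gRPC response stream."""
    marks = {"start": time.monotonic(), "first_byte": None, "eor": None}
    def gen():
        for chunk in response_iter:
            if marks["first_byte"] is None:
                marks["first_byte"] = time.monotonic()
            yield chunk
        marks["eor"] = time.monotonic()
    return gen(), marks

# Mock stream standing in for Convai's text+audio chunks.
stream, marks = timed_stream(iter(["text-chunk", "audio-chunk"]))
received = list(stream)
ttfb = marks["first_byte"] - marks["start"]
total = marks["eor"] - marks["start"]
print(f"TTFB={ttfb * 1000:.1f} ms, total={total * 1000:.1f} ms")
```

This only gives client-side numbers, though; server timestamps (accepted/open/first-byte) are what we would need to tell network latency apart from model latency.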
Questions on TTS, latency, and LLMs
Official TTS limits: we see truncation when the generated audio is long, with an error for exceeding ≈6 MB. Can you confirm the official limits (size/duration) and recommended strategies to chunk long responses server-side or client-side? We understand the LLM does not know the final audio size; if there is any way to control it, we need to know.
TTFB: what configuration do you recommend to minimize it for any LLM and TTS (region, max tokens, temperature, minimal system prompt, keepalive/deadline, audio parameters)?
Format adherence: do you support structured output / JSON schema / function-calling / guardrails to enforce formats with Llama3-70B and GPT-4o? Any recommended system/prompt templates?
EU region: if an EU region exists, what is the typical latency impact?
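On the format-adherence question above: our current fallback is client-side validation plus retry. A minimal sketch of the contract check (the field names are illustrative, not a Convai schema):

```python
import json

# Illustrative contract for an itinerary reply; these field names are
# ours, not part of any Convai or model-provider schema.
REQUIRED_FIELDS = {"itinerary": list, "summary": str}

def parse_or_reject(raw):
    """Return (data, None) if the reply matches the expected shape,
    else (None, reason) so the caller can retry with a stricter
    prompt or fall back to another model."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not JSON: {exc}"
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None, f"missing or wrong-typed field: {field}"
    return data, None

good = '{"itinerary": ["Day 1: old town"], "summary": "One-day plan"}'
bad = "Of course! Here is your plan..."
print(parse_or_reject(good)[1])   # -> None (valid)
print(parse_or_reject(bad)[1])    # rejection reason string
```

Validate-and-retry works, but it adds a full round trip on every failure, which is why native structured output or function-calling support would matter a lot for us.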
Response times depend on multiple factors, including the chosen LLM, voice provider, character backstory, Narrative Design, and Knowledge Bank configuration. We’re working on a new plugin with lower latency and features like hands-free conversation; we recommend trying it once it’s released.
For concurrency limits and how they’re applied, please refer to the Pricing page FAQ for the details: https://convai.com/pricing