I’m using Convai Unity SDK 3.3.0 on Unity 6.
Free plan; LLM set to Claude 3.5 / Claude 3.7 / GPT-4.1-mini.
Observed behaviour
Average latency ≈ 20 s before the NPC starts speaking.
~1 in 10 replies stalls after “Por supuesto…” (“Of course…”) and never completes.
Questions
Which LLM do you recommend for real-time NPCs that balances safety and faster starts?
Do paid plans (Business / Enterprise) lower model latency or reduce these cut-off errors? Any concrete numbers?
Work-arounds you’ve found effective—streaming partial text, shorter prompts, fallback models, etc.—to mask or mitigate the delay and incomplete answers?
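For context, the kind of masking I mean by “streaming partial text” would look roughly like this: flush each complete sentence to display/TTS as soon as it arrives instead of waiting for the full reply. A minimal Python sketch (the token stream here is hypothetical, not Convai’s API):

```python
import re

# Sentence-final punctuation followed by whitespace marks a flush point.
SENTENCE_END = re.compile(r'([.!?…])\s')

def flush_sentences(token_stream):
    """Yield complete sentences as soon as they arrive, so TTS/display
    can start before the full reply has been generated."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():          # flush any trailing fragment
        yield buffer.strip()

# Example: tokens arriving from a (mock) streaming LLM response.
tokens = ["Of course! ", "Day one: vis", "it the old town. ", "Day two: beaches."]
print(list(flush_sentences(tokens)))
# -> ['Of course!', 'Day one: visit the old town.', 'Day two: beaches.']
```

The first sentence reaches the user as soon as it is complete, which hides most of the model’s remaining generation time.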
20 second response time and the audio cutting off are definitely not expected behavior. If you could share your character ID, we’d be happy to investigate the issue further for you.
Hi K3, sorry for the delay. I’ve kept testing with Convai Unity SDK 3.3.0 on Unity 6, single text emission with a streaming (gRPC) response, Free plan.
Empirical findings:
• Llama-3-70B starts faster; often under 5 s depending on prompt length and Knowledge Bank size.
• GPT-4o (and 4.1-mini) follows instructions better but sometimes takes 10–12 s to start. Is that region/route latency, or are there recommended settings to reduce TTFB in this flow?
• The typical “Of course…” truncation appears when the response would be long and the generated audio exceeds ≈6 MB: I receive the first chunk and then the stream closes with an error. My understanding is that the LLM cannot know the final audio size, so prompting alone cannot reliably force fragmentation.
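If the limit really is about final audio size, the only mitigation I can see is client-side pre-chunking of the text before TTS. A rough Python sketch; the bytes-per-character figure is my own estimate, not an official number:

```python
import re

# Client-side pre-chunking sketch: split a long reply on sentence
# boundaries so each chunk's *estimated* audio stays under the ~6 MB
# limit. BYTES_PER_CHAR is an assumption (16-bit mono 24 kHz PCM is
# ~48 kB/s; at ~15 spoken chars/s that is ~3.2 kB per character).
AUDIO_LIMIT_BYTES = 6 * 1024 * 1024
BYTES_PER_CHAR = 3200          # assumption; calibrate against real TTS output
SAFETY = 0.8                   # keep 20% headroom under the limit

MAX_CHARS = int(AUDIO_LIMIT_BYTES * SAFETY / BYTES_PER_CHAR)

def chunk_reply(text, max_chars=MAX_CHARS):
    """Pack whole sentences into chunks whose estimated audio size
    stays below the limit. A single sentence longer than max_chars
    still goes through as its own (oversized) chunk."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then go out as its own TTS request. The 3.2 kB/char figure needs calibrating against whatever codec and sample rate Convai actually streams.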
Following our progress, new questions have arisen about concurrency, the TTS limit (~6 MB), and latency across LLMs.
Product context
We are an AI chat with an avatar that recommends itineraries, among other things. Many answers are long, so we need fast starts, stable text+audio delivery, and a reliable way to chunk long outputs without TTS cutoffs.
Setup
• Unity 6 (URP) + Convai Unity SDK 3.3.0 + Meta Quest 3.
• Free plan.
• Flow: a single text emission from the client; Convai returns text + audio via streaming (gRPC).
Concurrency test (theoretical = 1)
• 6 simultaneous sessions with the same characterID and API key: 2 via Studio (owner), 2 via shared link, 2 via Quest 3 APK.
• All 6 received correct audio. Target: peaks of ~520 concurrent clients.
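To distinguish queuing from rejection ourselves, we plan to probe with simultaneous sessions and compare elapsed times: if the server queues, the extra sessions stagger in steps; if it rejects, they fail immediately. A simplified Python harness (the backend here is a mock with 2 slots and FIFO queuing, not the real Convai endpoint):

```python
import asyncio
import time

async def one_session(i, start_session):
    """Run one conversation attempt and record client-side timing.
    start_session is a placeholder for the real streaming call."""
    t0 = time.monotonic()
    try:
        await start_session(i)
        return {"id": i, "ok": True, "elapsed": time.monotonic() - t0}
    except Exception as exc:
        return {"id": i, "ok": False, "error": str(exc)}

async def probe(n, start_session):
    """Fire n sessions at once and collect per-session results."""
    return await asyncio.gather(*(one_session(i, start_session) for i in range(n)))

async def main():
    slots = asyncio.Semaphore(2)            # mock server: 2 slots, FIFO queuing
    async def mock_backend(i):
        async with slots:
            await asyncio.sleep(0.2)        # slot held until "audio" ends
    return await probe(6, mock_backend)

results = asyncio.run(main())
waited = sum(1 for r in results if r["elapsed"] > 0.3)
print(f"{waited} of {len(results)} sessions waited for a free slot")
```

With 6 sessions against 2 mock slots, 4 sessions finish late, which is the signature of queuing; against a rejecting server the same harness would instead show 4 immediate errors.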
Observed latency
• GPT-4o shows 5–15 s latency with correct responses.
• Llama-3-70B shows 5–10 s latency with less accurate responses.
Questions on concurrency (priority)
Exactly when is a slot consumed and released: on call acceptance, on first chunk, or at audio EOR? Does TTS keep the slot busy until audio ends?
Scope of the limit: per API key/account or per characterID? Do Studio sessions share the pool with Unity sessions?
Is there queuing or immediate rejection? If queuing: is it FIFO, and what are the queue size and timeout? If rejection: what server status/trailers are typical when the limit is exceeded?
Telemetry: are there headers/trailers or an endpoint to see server timestamps (accepted/open/first-byte/EOR) and active sessions?
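Absent server-side telemetry, we currently wrap the response stream client-side to capture start / first-byte / EOR marks. A minimal Python sketch (the chunk iterator stands in for the real gRPC stream):

```python
import time

def timed_stream(response_iter):
    """Wrap a streaming response and record client-side marks:
    request start, first byte (TTFB), and end of response (EOR).
    response_iter stands in for the real gRPC response stream."""
    marks = {"start": time.monotonic(), "first_byte": None, "eor": None}
    def gen():
        for chunk in response_iter:
            if marks["first_byte"] is None:
                marks["first_byte"] = time.monotonic()
            yield chunk
        marks["eor"] = time.monotonic()
    return gen(), marks

# Mock stream standing in for Convai's text+audio chunks.
stream, marks = timed_stream(iter(["text-chunk", "audio-chunk"]))
received = list(stream)
ttfb = marks["first_byte"] - marks["start"]
total = marks["eor"] - marks["start"]
print(f"TTFB={ttfb * 1000:.1f} ms, total={total * 1000:.1f} ms")
```

This only gives client-side numbers, though; server timestamps (accepted/open/first-byte) are what we would need to tell network latency apart from model latency.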
Questions on TTS, latency, and LLMs
Official TTS limits: we see truncation when the generated audio is long, with an error for exceeding ≈6 MB. Can you confirm the official limits (size/duration) and recommended strategies to chunk long responses server-side or client-side? We understand the LLM does not know the final audio size; if there is any way to control it, we need to know.
TTFB: what configuration do you recommend to minimize it for any LLM and TTS (region, max tokens, temperature, minimal system prompt, keepalive/deadline, audio parameters)?
Format adherence: do you support structured output / JSON schema / function-calling / guardrails to enforce formats with Llama3-70B and GPT-4o? Any recommended system/prompt templates?
EU region: if an EU region exists, what is the typical latency impact?
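On the format-adherence question above: our current fallback is client-side validation plus retry. A minimal sketch of the contract check (the field names are illustrative, not a Convai schema):

```python
import json

# Illustrative contract for an itinerary reply; these field names are
# ours, not part of any Convai or model-provider schema.
REQUIRED_FIELDS = {"itinerary": list, "summary": str}

def parse_or_reject(raw):
    """Return (data, None) if the reply matches the expected shape,
    else (None, reason) so the caller can retry with a stricter
    prompt or fall back to another model."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not JSON: {exc}"
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None, f"missing or wrong-typed field: {field}"
    return data, None

good = '{"itinerary": ["Day 1: old town"], "summary": "One-day plan"}'
bad = "Of course! Here is your plan..."
print(parse_or_reject(good)[1])   # -> None (valid)
print(parse_or_reject(bad)[1])    # rejection reason string
```

Validate-and-retry works, but it adds a full round trip on every failure, which is why native structured output or function-calling support would matter a lot for us.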
Response times depend on multiple factors, including the chosen LLM, voice provider, character backstory, Narrative Design, and Knowledge Bank configuration. We’re working on a new plugin with lower latency and features like hands-free conversation; we recommend trying it once it’s released.
For concurrency limits and how they’re applied, please refer to the Pricing page FAQ for the details: https://convai.com/pricing