Hi Convai team and community,
I’m working on a Unity-based immersive installation using Convai characters,
and I’d like some clarification on the recommended workflow for more complex,
real-world setups.
Context:
- The experience is fully voice-driven (no buttons, no UI interaction).
- One central display hosts the Convai avatar and voice interaction.
- Two additional displays (left/right) show video content triggered by the
user’s spoken choice.
- The avatar acts as an introduction and guide, then remains silent while
videos are playing.
Current challenges:
- Above all: once the user has responded, how do I connect the avatar’s answer
to Unity so that the corresponding video starts playing?
- It’s not entirely clear which parts of the Convai SDK are intended for
custom interaction logic (e.g. triggering Unity events such as VideoPlayers).
- The Actions system appears only partially documented and, depending on the
SDK version, some components (e.g. Action Settings) are not exposed or usable.
- In practice, this makes it difficult to understand whether:
  - Convai Actions are meant for custom Unity logic, or
  - they are limited to predefined demo behaviors (movement, gestures, etc.).
At the moment, the most stable approach seems to be:
- Letting Convai handle intent recognition and dialogue flow.
- Emitting simple, deterministic output (tags or keywords).
- Letting Unity handle all application logic (video playback, display
switching, audio routing).
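To make the tag-based approach concrete, here is a rough Unity C# sketch of what I have in mind. This is only an illustration of the pattern, not working Convai integration: the `OnCharacterResponse` hook and the `[PLAY_VIDEO:…]` tags are my own placeholders, and the actual response event exposed by the Convai SDK may be named and wired differently.

```csharp
using UnityEngine;
using UnityEngine.Video;

// Sketch only: the character's backend prompt is instructed to emit a
// deterministic tag (e.g. "[PLAY_VIDEO:left]") in its reply, and Unity
// owns all playback logic. OnCharacterResponse is a placeholder — it
// would be wired to whichever response callback the SDK version exposes.
public class VideoTriggerRouter : MonoBehaviour
{
    [SerializeField] private VideoPlayer leftDisplayPlayer;
    [SerializeField] private VideoPlayer rightDisplayPlayer;

    // Called with the character's full reply text.
    public void OnCharacterResponse(string responseText)
    {
        if (responseText.Contains("[PLAY_VIDEO:left]"))
            Play(leftDisplayPlayer);
        else if (responseText.Contains("[PLAY_VIDEO:right]"))
            Play(rightDisplayPlayer);
        // No tag: the avatar keeps talking, Unity does nothing.
    }

    private void Play(VideoPlayer player)
    {
        // Mute or pause the avatar here before playback begins.
        player.Play();
    }
}
```

Is something along these lines the intended integration point, or is there a supported mechanism (e.g. via Actions) that does this more directly?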
Questions:
- For voice-only installations, is this “Convai for intent + Unity for logic”
approach the recommended one?
- Are Convai Actions intended to support arbitrary Unity events, or are they
currently limited to specific built-in behaviors?
- Is there an official reference architecture for multi-display or
installation-style projects (non-game, non-UI-driven)?
Any guidance on best practices or intended workflows would be very helpful,
especially for exhibition and museum-style applications.
Thanks in advance!