I’m new to Convai and am interested in deploying it in an upcoming VR training project.
I would like to know whether it supports native audio processing for speech, i.e., speech that is handled directly by the engine rather than being converted to text, used to generate a textual response, and then converted back to audio (much like ChatGPT-4o’s Advanced Voice Mode).
Also, does the vision capability include the ability to visually understand the virtual scene itself (not just a video feed)? (Somewhat analogous to ChatGPT’s ability to understand what it sees via a live video feed.)
I’m not entirely sure I fully understood your first question, but if you are referring to a more natural, continuous “hands-free” style of conversation, then yes. We have a HandsFree conversation mode available in beta for supported SDKs. You can find the relevant beta docs here:
For Vision: you are free to choose the input source. It can be a webcam feed, a rendered texture or camera view from inside your game or VR scene, or any other image/video stream you provide. From Convai’s perspective, it just receives the frames you send it, so using a virtual in-scene camera is absolutely possible.
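To make that concrete, here is a minimal sketch of feeding frames from a virtual in-scene camera. It assumes your scene is rendered to an HTML canvas (e.g. a WebGL/Three.js view) and uses a hypothetical sendVisionFrame() helper in place of the actual SDK call for your platform; only the canvas APIs shown are standard.

```typescript
// Minimal sketch: sample the virtual camera's canvas and forward encoded
// frames as the vision input. sendVisionFrame() is a hypothetical wrapper --
// substitute the real client call from the SDK you are using.

const canvas = document.querySelector<HTMLCanvasElement>("#scene-canvas")!;

// Hypothetical helper: forwards one encoded frame to the vision input.
declare function sendVisionFrame(frame: Blob): Promise<void>;

function captureFrame(): Promise<Blob> {
  return new Promise((resolve, reject) => {
    // toBlob encodes the canvas's current contents; JPEG keeps payloads small.
    canvas.toBlob(
      (blob) => (blob ? resolve(blob) : reject(new Error("encode failed"))),
      "image/jpeg",
      0.7
    );
  });
}

// Sample a few times per second rather than every rendered frame; the model
// does not need 60 fps to describe what it sees in the scene.
setInterval(async () => {
  const frame = await captureFrame();
  await sendVisionFrame(frame);
}, 500);
```

The same idea applies in Unity or Unreal: render the in-scene camera to a texture, encode it, and pass it in as the image stream.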
Thanks for getting back!
My first question might have been poorly phrased. What I meant to ask was whether the NPC’s speech can be interrupted mid-sentence (much like ChatGPT-4o) and then continue seamlessly, taking the altered context into account. In the same vein, does the response take the user’s emotional tone into account (which would imply native audio processing rather than the purely textual processing of multimodal ChatGPT)?
Regarding the second question, I take it that the vision capability can indeed recognize objects in the virtual world? Say, in a VR setup where I’m represented as a full body-tracked 3D avatar and am conversing with an NPC that supposedly sees me, can it actually recognize my body language or facial expressions in real time? Or see me pointing at a chair while asking about its color, and then respond with something like “it’s blue”? (Again, this kind of ability is already demonstrated by chatbots like ChatGPT and Gemini via a live video feed.)
On your first question:
Yes, with HandsFree you can interrupt the NPC while it is speaking and have it respond again based on the updated context, much as you describe. However, the system currently does not reason over the user’s emotional tone from the raw audio; it mainly works from the transcribed text rather than prosody or sentiment taken directly from the waveform.
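As a rough illustration of the barge-in flow, the sketch below shows one way to wire it up. Every name here (ConvaiClient, interrupt, startListening, onUserSpeechStart) is hypothetical and stands in for whatever interruption events and methods your SDK exposes; only the audio-element calls are standard.

```typescript
// Sketch of barge-in handling: when the user starts speaking, cut off the
// NPC's current reply and hand the turn back. All client members below are
// hypothetical placeholders for the real SDK's events and methods.

interface ConvaiClient {
  interrupt(): void;                        // hypothetical: cancel the in-flight NPC response
  startListening(): void;                   // hypothetical: begin streaming the user's mic audio
  onUserSpeechStart(cb: () => void): void;  // hypothetical voice-activity event
}

declare const client: ConvaiClient;
declare const npcAudio: HTMLAudioElement;   // element playing the NPC's current reply

client.onUserSpeechStart(() => {
  // Stop NPC playback immediately so the interruption feels responsive.
  npcAudio.pause();
  npcAudio.currentTime = 0;

  // Cancel the pending response and let the next reply be generated
  // from the updated conversation context.
  client.interrupt();
  client.startListening();
});
```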
On Vision:
Yes, Vision can understand what is visible in the frames you send it.
For experimenting with this kind of setup, I recommend trying our beta package.
Glad to know that Convai can indeed demonstrate visual understanding of the virtual scene.
I remember reading somewhere on your website about Convai detecting emotions. I presume this is derived from textual cues in the user’s speech, or perhaps visual cues in the user’s body language?
Regarding the speech capability, are there plans to support native audio processing, particularly to enable faster interruption response times and, potentially, real-time analysis of the user’s emotional tone from the audio stream?