I selected Gemini 2.5 Flash Live as my avatar’s foundation model (which supposedly processes audio natively instead of converting between text and speech on every turn, and can additionally analyze the emotional cadence of the user’s speech).
However, the speech generation is still laggy, and it clearly lacks the ability to recognize the emotional tone of my speech. Am I missing something?
I’m using a sample scene that came with the Convai package, running in Unity 6 on Windows 11.
I created a character in Playground and selected Gemini 2.5 Flash Live as the foundation model, expecting the speech latency to improve and the avatar to gain some ability to recognize the user’s emotional tone. However, the latency was just as bad, and there was no sign of any emotional-tone recognition.
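For reference, this is the kind of minimal probe I’d use to put a number on the lag (a sketch only — OnUserSpeechEnded and OnNpcAudioStarted are hypothetical hooks, not real Convai callbacks; wire them to whatever events your integration actually raises):

```csharp
using UnityEngine;

// Minimal sketch for measuring response latency.
// OnUserSpeechEnded / OnNpcAudioStarted are hypothetical hooks:
// connect them to whatever events your Convai integration exposes.
public class LatencyProbe : MonoBehaviour
{
    float speechEndTime = -1f;

    // Call this when the user stops speaking (e.g. on mic release).
    public void OnUserSpeechEnded()
    {
        speechEndTime = Time.realtimeSinceStartup;
    }

    // Call this when the avatar's response audio starts playing.
    public void OnNpcAudioStarted()
    {
        if (speechEndTime < 0f) return;
        float latency = Time.realtimeSinceStartup - speechEndTime;
        Debug.Log($"Response latency: {latency:F2}s");
        speechEndTime = -1f;
    }
}
```

Hooking the two methods up via UnityEvents, or direct calls from wherever mic input and playback are handled, is enough to log the gap per turn.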
FYI, I’m currently testing out Convai in Unity using the free plan.
My character ID: 58698aac-125c-11f1-9b66-42010a7be02c
I’d appreciate it if you could shed some light on this.
I’ve updated the Convai Unity plugin to v3.3.3, selected Gemini 2.5 Flash Live as my avatar’s foundation model, and tested it, but my avatar still can’t discern sounds made by the user (which suggests the processing is still text-based rather than native audio processing?)
I tested the same thing with my Gemini chatbot, and it does demonstrate that ability.
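For anyone trying to reproduce this: to rule out the capture side, I’d first confirm Unity is picking up the mic at all, independent of Convai. A minimal loopback sketch using Unity’s built-in Microphone API (nothing here is Convai-specific):

```csharp
using UnityEngine;

// Sanity check: confirm Unity is capturing mic input at all,
// independent of Convai. Loops the default microphone back
// through an AudioSource so you can hear the raw capture.
[RequireComponent(typeof(AudioSource))]
public class MicLoopback : MonoBehaviour
{
    void Start()
    {
        if (Microphone.devices.Length == 0)
        {
            Debug.LogWarning("No microphone detected.");
            return;
        }

        var source = GetComponent<AudioSource>();
        // null device name = system default mic; 10 s looping buffer at 16 kHz.
        source.clip = Microphone.Start(null, true, 10, 16000);
        source.loop = true;

        // Wait until the mic actually starts recording before playback.
        while (Microphone.GetPosition(null) <= 0) { }
        source.Play();
    }
}
```

If you can hear yourself through the AudioSource, the capture side is fine and any limitation is upstream of Unity.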
Apologies, I somehow missed the bit about the beta package.
Yes, I finally got it to work. Visual perception now works alongside audio perception as well, and latency is better too.
Really looking forward to the upcoming lip-sync feature, as I’m planning to build an embodied AI avatar that speaks with lip-synced animation and can look around and understand the virtual scene.
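In the meantime, a crude stopgap for mouth movement is to drive a blendshape from the loudness of the response audio. A minimal sketch, assuming your rig has a jaw-open blendshape (index 0 here) and the NPC’s voice plays through an AudioSource on the same GameObject — neither of which is Convai-specific:

```csharp
using UnityEngine;

// Crude amplitude-driven mouth movement while waiting for real lip-sync.
// Assumes a SkinnedMeshRenderer with a jaw-open blendshape at index 0
// and the NPC's voice playing through an AudioSource on this GameObject.
[RequireComponent(typeof(AudioSource))]
public class AmplitudeMouth : MonoBehaviour
{
    public SkinnedMeshRenderer face;
    public int jawBlendShapeIndex = 0;
    public float gain = 400f;

    readonly float[] samples = new float[256];
    AudioSource voice;

    void Start()
    {
        voice = GetComponent<AudioSource>();
    }

    void Update()
    {
        // Read the most recent output samples and compute a rough loudness.
        voice.GetOutputData(samples, 0);
        float sum = 0f;
        foreach (float s in samples) sum += s * s;
        float rms = Mathf.Sqrt(sum / samples.Length);

        // Map loudness to blendshape weight (0-100 in Unity).
        float weight = Mathf.Clamp(rms * gain, 0f, 100f);
        face.SetBlendShapeWeight(jawBlendShapeIndex, weight);
    }
}
```

It won’t produce real visemes, but it keeps the mouth from staying frozen until the proper lip-sync feature lands.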