Does Convai support native audio processing with Gemini 2.5 Flash Live?

Hi there,

I selected Gemini 2.5 Flash Live as my avatar’s foundational model (which supposedly processes audio natively instead of converting between text and speech every time, and can additionally analyze the emotional cadence of the user’s speech).

However, the speech generation is still laggy, and it clearly lacks the ability to recognize the emotional tone of my speech. Am I missing something?

Any tips and advice would be greatly appreciated! 🙂

Regards,

Melvin Eng.

Hello,

Where are you testing it?

Hi there,

I’m using a sample scene that came with the Convai package, running in Unity 6 on Windows 11.

I created a character in Playground and selected Gemini 2.5 Flash Live as the foundational model, expecting the speech latency to improve and the avatar to gain some ability to recognize the user’s emotional tone. However, the latency was just as bad, and no ability to distinguish the user’s emotional tone was evident.

FYI, I’m currently testing out Convai in Unity using the free plan.

My character ID: 58698aac-125c-11f1-9b66-42010a7be02c

Would appreciate if you could shed some light on this?

Sincere thanks!

Melvin Eng.

Which Convai package are you using?

Hi,

As mentioned above, I’m currently using the Free Plan to test out Convai’s capabilities first.

Regards,

Melvin.

What version of the Convai plugin are you using?

Hi,

Oh, I’m using Convai for Unity v3.3.0.

Do let me know if there is any other info you need.

Cheers,

Melvin.

You are using the old plugin. Please try the new beta package instead. It includes an MR sample scene you can use as a reference.

Hi again,

Thanks for the speedy response.

Right, I’ll update my installed package ASAP.

Btw, does the updated package take advantage of native audio processing as afforded by models like Gemini 2.5 Flash Live?

Regards,

Melvin.

Yes, it does. Just make sure to create a new project.

Hi again,

I’ve updated the Convai Unity plugin to v3.3.3, selected Gemini 2.5 Flash Live as my avatar’s foundation model, and tested it, but my avatar still lacks the ability to discern sounds made by the user (which suggests the processing is still text-based rather than native audio processing?).

I tested the same thing with my Gemini chatbot, and it does demonstrate this ability.

Specifically, I’m wondering if Convai v3.3.3 actually supports Gemini 2.5 Flash Live via Gemini Live API? (see https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-live-api )

Or am I missing something here?

Cheers,

Melvin.

Hi again Kaan,

Apologies, I somehow missed the bit about the beta package.

Yes, I finally got it to work. Visual perception now works alongside audio perception, and latency is better too.

Really looking forward to the upcoming lip-sync feature, as I’m planning to implement an embodied AI avatar that speaks with lip-synced animation and can look around and understand the virtual scene.

Cheers!

Melvin Eng.


Hi Kaan,

So any timeline for lip-sync support?

Cheers,

Melvin Eng.

I can’t share an exact timeline, but lip sync support is coming very soon. It’s currently in testing.

Hi again,

Great! Very glad to hear that it’s coming soon 😀

Cheers,

Melvin Eng.
