Multimodal data for Knowledge Bank beyond just text?

Hi there,

I’m wondering if there are plans to widen the scope of the data the Knowledge Bank can ingest beyond text, to include multimodal data such as images, video, and audio?

I’m currently experimenting with the live audio beta package (using Gemini 2.5 Flash Live as the foundation model), which processes live audio/video streams, and I feel the Knowledge Bank ought to be able to accept visual and audio data as well. Besides, much of the information in the real world is simply too complex or subtle to be meaningfully reduced to text.

So any thoughts on this?

Cheers,

Melvin Eng.

Hi again,

It would be great if Convai could share some thoughts on this :slightly_smiling_face:

Additionally, I read somewhere on the forum that PDF uploads are accepted, which raises the question of whether Convai will actually understand the images contained in them.

Cheers,

Melvin Eng.

At the moment, we do not have plans for that.

Hi Kaan,

Thanks for sharing :grinning_face:

What about PDF uploads? I understand that PDF uploads to the Knowledge Bank are available for enterprise users, and I also read somewhere on the forum that this feature will be rolled out soon to users on other tiers (though it’s still unavailable now).

So I’m most curious whether images in the PDFs will be processed within the same context as the text that references them, forming a coherent multimodal image-text memory where the two are correlated.

Regarding the rollout of PDF support in the Knowledge Bank to users on other tiers, may I know if this will be coming soon?

Cheers!

Melvin Eng.