Multimodal Search

Written by Luke Marinovic, Founder of UnderCurrent Automations · Melbourne

Published 25 May 2026 · Updated 25 May 2026

Multimodal search accepts images, voice, video, files and browser tabs as input, not just text. At I/O 2026, Google called it the biggest Search Box upgrade in 25 years, and most Australian small and medium business (SMB) photo libraries hold substantial content equity that is currently invisible to it.

The framing matters. Google's I/O 2026 Search announcement bundled image input, voice input, file drops and Chrome tabs into one Gemini-backed query layer. The Search Box is no longer a text field. It is a request channel that takes whatever the user has in front of them and asks the AI to make sense of it.

Picture a Footscray panel beater with 600 before-and-after photos sitting in a WordPress gallery. Zero alt text. Filenames like IMG_4471.jpg. EXIF stripped on upload. No ImageObject schema. A driver in Yarraville opens Google Lens, points the camera at the rear quarter panel of their 2018 Mazda 3, and asks "who fixes this near me". The panel beater has the exact work in their gallery. They are invisible to the query because the image layer carries no machine-readable signal.
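A gallery like this can be checked before any rewriting starts. The sketch below is one possible audit, not a standard tool: it scans a page's HTML for img tags with no alt text or with camera-default filenames such as IMG_4471.jpg. The class name ImageAudit, the filename pattern and the sample markup are assumptions made for illustration. Empty alt text is legitimate for purely decorative images, so treat the output as a review list rather than a list of errors.

```python
import re
from html.parser import HTMLParser

# Assumed pattern for camera-default names such as IMG_4471.jpg or DSC01234.png.
CAMERA_DEFAULT = re.compile(r"^(img|dsc|dscn)[_-]?\d+\.(jpe?g|png|webp)$", re.IGNORECASE)


class ImageAudit(HTMLParser):
    """Collect <img> tags that carry little machine-readable signal."""

    def __init__(self):
        super().__init__()
        self.findings = []  # list of (src, [problem, ...]) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        alt = (attributes.get("alt") or "").strip()
        filename = src.rsplit("/", 1)[-1]

        problems = []
        if not alt:
            problems.append("missing alt text")
        if CAMERA_DEFAULT.match(filename):
            problems.append("camera-default filename")
        if problems:
            self.findings.append((src, problems))


# Hypothetical gallery markup for illustration only.
sample_html = """
<img src="/gallery/IMG_4471.jpg">
<img src="/gallery/mazda-3-rear-quarter-panel-repair-footscray.jpg"
     alt="Repaired rear quarter panel on a 2018 Mazda 3, Footscray">
"""

audit = ImageAudit()
audit.feed(sample_html)
for src, problems in audit.findings:
    print(src, "->", ", ".join(problems))
```

Only the first image is flagged. The second carries both a semantic filename and descriptive alt text.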

The fix is concrete. Semantic filenames (mazda-3-rear-quarter-panel-repair-footscray.jpg). Descriptive alt text that names the vehicle, panel, suburb and outcome. ImageObject JSON-LD with caption, contentLocation, creator and license. Retain EXIF where geographic and temporal data adds trust. For video, ship VideoObject schema with a transcript and uploadDate. Google Search Central's image SEO docs and the image license metadata guidance cover the property list. The generic advice of "add alt text" stops short of what a multimodal query actually reads.
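As a minimal sketch of the filename and ImageObject steps above, the Python below builds a semantic filename and a JSON-LD dictionary with caption, contentLocation, creator and license. The function names, the example.com URLs and the business name are hypothetical placeholders. contentUrl is added because Google's image license metadata guidance expects it. The printed JSON is meant to sit inside a script tag of type application/ld+json on the page that displays the image.

```python
import json
import re


def semantic_filename(*parts: str, ext: str = "jpg") -> str:
    """Join descriptive parts into a lowercase, hyphen-separated filename."""
    slug = "-".join(parts).lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug).strip("-")
    return f"{slug}.{ext}"


def image_object_jsonld(content_url, caption, suburb, creator_name, license_url):
    """Build a schema.org ImageObject as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "caption": caption,
        "contentLocation": {"@type": "Place", "name": suburb},
        "creator": {"@type": "Organization", "name": creator_name},
        "license": license_url,
    }


filename = semantic_filename("Mazda 3", "rear quarter panel repair", "Footscray")
print(filename)  # mazda-3-rear-quarter-panel-repair-footscray.jpg

# All URLs and names below are placeholders, not real assets.
markup = image_object_jsonld(
    content_url=f"https://example.com/gallery/{filename}",
    caption="Repaired rear quarter panel on a 2018 Mazda 3",
    suburb="Footscray",
    creator_name="Example Panel Beaters",
    license_url="https://example.com/image-license",
)
print(json.dumps(markup, indent=2))
```

Generating the markup from the same fields that produce the filename and alt text keeps the three signals consistent across a large gallery.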

This is why image SEO and video SEO just moved from nice-to-have to load-bearing. Tradies with before-and-after libraries, beauty portfolios, real estate listings, gym facility shots, allied health treatment rooms: every photo-heavy SMB vertical has the raw assets. They just have not exposed them in a way Gemini can parse.

UnderCurrent Automations treats image and video markup as a first-class layer in SEO & AI visibility, alongside schema markup and the AI-search optimisation work that gets pages cited inside Google AI Mode. Photos that were dead weight last year are now query surface area.

Frequently asked questions

Is alt text still enough for multimodal search?

No. Alt text is the 2014 floor. A 2026 image needs ImageObject JSON-LD with caption, contentLocation and creator, a semantic filename, intact EXIF where it adds geographic or temporal signal, and a descriptive surrounding paragraph. Multimodal queries match on the full bundle, not the alt attribute alone.

Do I need to do anything different for video?

Yes. Ship VideoObject schema with a transcript, thumbnailUrl, uploadDate and duration. Multimodal search lifts video frames and spoken content into answers, so a clip without a transcript and without VideoObject markup is treated as opaque. Most service-business video galleries are currently invisible for this reason.
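A VideoObject block with those properties might look like the sketch below. The helper names, the example.com URLs and the sample transcript are hypothetical. A name field is included because Google's video structured data guidance lists it as required alongside thumbnailUrl and uploadDate. Duration and upload date are written in ISO 8601 form.

```python
import json
from datetime import datetime, timedelta, timezone


def iso8601_duration(total_seconds: int) -> str:
    """Convert a length in seconds to an ISO 8601 duration string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"PT{hours}H{minutes}M{seconds}S"


def video_object_jsonld(name, thumbnail_url, content_url, uploaded, length_seconds, transcript):
    """Build a schema.org VideoObject as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "thumbnailUrl": thumbnail_url,
        "contentUrl": content_url,
        "uploadDate": uploaded.isoformat(),  # timezone-aware datetime expected
        "duration": iso8601_duration(length_seconds),
        "transcript": transcript,
    }


# Placeholder values for illustration only.
melbourne_time = timezone(timedelta(hours=10))
video = video_object_jsonld(
    name="Rear quarter panel repair on a 2018 Mazda 3",
    thumbnail_url="https://example.com/video/mazda-3-repair-thumb.jpg",
    content_url="https://example.com/video/mazda-3-repair.mp4",
    uploaded=datetime(2026, 5, 25, 9, 0, tzinfo=melbourne_time),
    length_seconds=95,
    transcript="We start by pulling the dent, then fill, sand and respray the panel.",
)
print(json.dumps(video, indent=2))
```

As with images, the output belongs in a script tag of type application/ld+json on the page that hosts the clip.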

Will multimodal search replace text search?

No, it widens the input layer. People still type the majority of queries, but a growing share now point a camera, drag in a Chrome tab, or speak. The site that wins is the one that is machine-readable across all input modes, not the one that picks a favourite.
