Forget about written prompts. Artificial intelligence is now multimodal, which means users can interact with it through voice, like with any voice assistant, or through video, because the AI can recognize whatever is in front of the camera. Project Astra is Google's own take on multimodal AI. At Xataka On, we had the chance to try an experimental version at Google I/O 2024.
Project Astra will be available on smartphones and in the Gemini app by the end of the year, but Google's future multimodal AI assistant is already up and running. The multimodal version of Gemini 1.5 Pro is ready, although judging by the demo we saw, Google still needs to polish some details.
Project Astra: An Impressive Assistant With Improvements to Come Later This Year
Project Astra is like a supercharged version of Google Lens. When the camera is pointed at an object, the AI recognizes it instantly and answers in real time based on what it sees. In our test, the setup consisted of a room with several objects, a screen, and a ceiling-mounted camera aimed at the area just below the screen.
During the demonstration, we were able to select different stuffed animals and toys, including a dinosaur, a donut, a loaf of bread, and a musical instrument, and create stories with them. We could ask the AI questions about these objects, and it would provide answers. If we introduced a new object, the AI would immediately recognize it and provide information about it.
Project Astra answers in real time, which is truly impressive given that the possibilities are endless. The AI can recognize an object placed in its line of view and then detect when we set another one next to it. The Storyteller feature allows Astra to create narratives based on the objects it identifies.
For instance, we could ask it to identify the largest object on the table, share anecdotes about the items, describe their physical properties, or discuss their colors. There are as many ideas as there are possible prompts.
One of Project Astra’s features is its ability to “remember” information. In the official demo video, it’s impressive to see how the person can ask where they left their glasses and get an accurate response. We’ve also tested this feature and found that with Astra, we can show the AI an object, take it away, ask it other questions, and then prompt it to remember the initial object we showed it.
According to Google, the AI only “remembers” what it has seen for the duration of the open session. This raises a question about processing: the current Project Astra demos are designed for short sessions lasting a few minutes, and maintaining the AI's response speed becomes considerably harder the longer a session runs.
While working prototypes of Astra running on a Pixel 8 Pro already exist, it won't be integrated into the Gemini app until later this year. At that point, we'll see whether the experience slows down when a session runs too long.
Another enjoyable feature of Project Astra is that it lets you play Pictionary with the AI. We haven't tested whether it can recognize abstract concepts like dignity, but it accurately identified the movie Jaws when we drew a fin and got Titanic when we drew a ship and an iceberg.
It's interesting to see that Project Astra communicates and asks for clarification as you draw. However, latency could be better during this activity: although Astra speaks up when it recognizes a new relevant element on the screen, the drawing needs to be really obvious for it to get it.
Project Astra represents the evolution we've all been waiting for from Google Assistant. This AI can provide feedback when we engage with it on any topic, and we can use our phone's camera to educate it. This natural interaction is what gives it a sci-fi feel.
Google Lost Its Ability to Surprise Us a Long Time Ago
Unlike the voice in OpenAI's GPT-4o demos, Astra's default voice has a more instructive and less seductive tone, something I personally appreciate. It doesn't come close to the one in Spike Jonze's Her, but it's just as useful. Google's Astra demonstration is remarkable for the innovation it represents, but ultimately, it seems that soon we'll all have it on our phones and it'll become commonplace.
Compared to GPT-4o, Project Astra lacks the ability to surprise us. In practice, it's a multimodal AI that performs similarly, but the examples chosen for the demo and the response cadence don't have as much impact as what we saw with OpenAI's model. While OpenAI boasts an average latency of 320 milliseconds for GPT-4o, Google hasn't shared any figures. All in all, it wouldn't be surprising if the race for speed swings one way or the other depending on how much you're willing to pay.
The bottom line is that the format in which we were able to test Project Astra isn't the most suitable for exploring all its possibilities. At this year's Google I/O, company executives hinted at the arrival of glasses in the future. After seeing this demo, it's clear to me that this format is perfectly suited to multimodal AI, which is ready to surprise us the same way chatbots did less than two years ago.