This is Google’s response to OpenAI.
General-purpose AI, AI that can be used in everyday life: these days, a company would be embarrassed to hold a launch event with anything less.
In the early morning of May 15, the annual "tech industry party," the Google I/O developer conference, officially opened. How many times did the 110-minute main keynote mention artificial intelligence? Google counted for itself:
Yes, AI came up every single minute.
Competition in generative AI has recently reached a fever pitch, so this I/O naturally revolved around artificial intelligence.
“One year ago on this stage, we first shared plans for Gemini, a native multimodal large model. It marked a new generation of I/O,” said Google CEO Sundar Pichai. “Today, we hope everyone can benefit from Gemini’s technology. These breakthrough features will enter search, images, productivity tools, Android systems and more.”
Twenty-four hours earlier, OpenAI had deliberately pre-empted Google by releasing GPT-4o, stunning the world with real-time voice, video, and text interaction. Today, Google answered with Project Astra and Veo, aimed squarely at OpenAI's current leaders, GPT-4o and Sora.
We are watching the highest-stakes business war being fought in the plainest possible way.
The latest version of Gemini revolutionizes the Google ecosystem
At the I/O conference, Google showed off the search capabilities powered by the latest version of Gemini.
Twenty-five years ago, Google powered the first wave of the information age with its search engine. Now, as generative AI evolves, Search can answer questions better by taking fuller advantage of context, location awareness, and real-time information.
Powered by a customized version of the latest Gemini model, you can ask Search anything you can think of, or anything you need to get done, from research to planning to brainstorming, and Google handles the rest.
Sometimes you want answers quickly but don't have time to piece all the information together yourself. That is where AI Overviews come in: the AI automatically consults a large number of websites and assembles an answer to a complex question for you.
With the customized Gemini model's multi-step reasoning, AI Overviews can tackle increasingly complex problems. You no longer need to break a question into multiple searches; you can ask your most complex question in one go, with all the nuances and caveats you have in mind.
In addition to finding the right answers or information for complex questions, search engines can work with you to create a plan step by step.
At I/O, Google highlighted the multimodal and long-text capabilities of large models. Advances in technology are making productivity tools like Google Workspace smarter.
For example, we can now ask Gemini to summarize all recent emails from the school. It will identify relevant emails in the background and even analyze attachments such as PDFs. You’ll then get a summary of key points and action items.
Or suppose you are traveling and miss a project meeting whose recording runs close to an hour. If the meeting was held on Google Meet, you can ask Gemini for the key points. And if a group in the thread is looking for volunteers and you happen to be free that day, Gemini can draft the email offering to help.
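The Workspace integrations run inside Gmail and Meet themselves, but the same summarization pattern can be sketched against the public Gemini API; the emails and prompt below are made up purely for illustration.

```python
# A rough sketch of the summarization pattern via the public Gemini API
# (the actual Workspace feature is built into Gmail/Meet, not called like this).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")

# Hypothetical school emails pulled from a mailbox by whatever means you already have.
school_emails = [
    "Subject: PTA meeting. Reminder: the PTA meets Thursday at 6 pm in the gym.",
    "Subject: Field trip forms. Permission slips and the $12 fee are due Friday.",
]

prompt = (
    "Summarize the key points and action items from these school emails:\n\n"
    + "\n---\n".join(school_emails)
)
response = model.generate_content(prompt)
print(response.text)  # bullet-point summary with action items
```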
Going a step further, Google sees greater opportunity in large-model agents: intelligent systems with reasoning, planning, and memory. Agent-based applications can "think" several steps ahead and work across software and systems to complete tasks for you. The idea is already reflected in products like Search, where the improvement in AI capability is plain to see.
At least in terms of its full suite of applications, Google is ahead of OpenAI.
A major Gemini family update: Project Astra debuts
Google has inherent advantages in its ecosystem, but the underlying models matter just as much, which is why it has pooled the strength of its in-house teams and DeepMind. Today, Demis Hassabis took the I/O stage for the first time to introduce the new models in person.
In December, Google launched Gemini 1.0, its first native multi-modal model, in three sizes: Ultra, Pro, and Nano. Just a few months later, Google released a new version, 1.5 Pro, with improved performance and a context window that exceeded 1 million tokens.
Now, Google has announced a slew of updates to its Gemini line of models, including the new Gemini 1.5 Flash, Google's lightweight model for speed and efficiency, and Project Astra, Google's vision for the future of AI assistants.
Currently, both 1.5 Pro and 1.5 Flash are available in public preview with a 1 million token context window in Google AI Studio and Vertex AI. 1.5 Pro also offers a 2 million token context window to developers and Google Cloud customers using the API, via a waitlist.
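As a quick sketch of what using that long context window looks like through the Gemini API (the document, model name, and prompt here are illustrative, and API access is assumed):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # or "gemini-1.5-flash"

with open("annual_report.txt") as f:             # hypothetical long document
    long_document = f.read()

# Check the document fits the preview's 1M-token window (2M for 1.5 Pro is waitlisted).
print(model.count_tokens(long_document).total_tokens)

response = model.generate_content(
    [long_document, "List every risk factor this report mentions."]
)
print(response.text)
```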
In addition, Gemini Nano has expanded beyond plain text to image input. Later this year, starting with Pixel, Google will launch a multimodal Gemini Nano. This means phones will be able to handle not just text input but also richer context such as sight, sound, and spoken language.
The Gemini family welcomes a new member: Gemini 1.5 Flash
The new 1.5 Flash is optimized for speed and efficiency.
1.5 Flash is the newest member of the Gemini model family and the fastest Gemini model in the API. It is optimized for large-scale, high-volume, high-frequency tasks, with more cost-effective services and a breakthrough long context window (1 million tokens).
Gemini 1.5 Flash features strong multi-modal reasoning capabilities with groundbreaking long context windows.
1.5 Flash excels at summarization, chat applications, image and video captioning, extracting data from long documents and tables, and more. That is because it is trained from 1.5 Pro through a process called "distillation," which transfers the most essential knowledge and skills from the larger model into a smaller, more efficient one.
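Google does not detail how 1.5 Flash was distilled from 1.5 Pro, but the generic idea is simple: train a small student model to match a larger, frozen teacher's output distribution. A minimal PyTorch sketch with toy stand-in models:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins; the real teacher/student are of course huge language models.
teacher = torch.nn.Linear(128, 1000)   # frozen "large" model
student = torch.nn.Linear(128, 1000)   # small model being trained
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
T = 2.0  # softmax temperature for softened distributions

for _ in range(100):                      # toy training loop
    x = torch.randn(32, 128)              # a batch of inputs
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    # KL divergence between the softened teacher and student distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The real process surely differs in scale and loss design; the point is only that the student learns from the teacher's soft outputs rather than from raw labels alone.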
Gemini 1.5 Flash performance. Source https://deepmind.google/technologies/gemini/#introduction
Improved Gemini 1.5 Pro: Context window expanded to 2 million tokens
Google noted that more than 1.5 million developers now use Gemini models, and Gemini-powered features reach some 2 billion users across its products.
Over the past few months, in addition to expanding Gemini 1.5 Pro's context window to 2 million tokens, Google has used data and algorithmic improvements to strengthen its code generation, logical reasoning and planning, multi-turn conversation, and audio and image understanding.
1.5 Pro can now follow increasingly complex and detailed instructions, including instructions that specify product-level behavior involving role, format, and style. Google also now lets users steer model behavior with system instructions.
Google has also added audio understanding to the Gemini API and Google AI Studio, so 1.5 Pro can now reason over images, video, and audio uploaded in Google AI Studio. Additionally, Google is integrating 1.5 Pro into its own products, including Gemini Advanced and Workspace apps.
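As a rough sketch of both features through the public Gemini API (the file name and instruction text below are made up; API access is assumed), a call combining a system instruction with an uploaded audio file might look like this:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# A system instruction pins down role, format, and style for every response.
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    system_instruction="You are a meeting assistant. Answer tersely, in bullet points.",
)

# Upload a (hypothetical) recording, then ask the model to reason over it.
audio = genai.upload_file(path="standup_recording.mp3")
response = model.generate_content([audio, "What decisions were made in this meeting?"])
print(response.text)
```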
Gemini 1.5 Pro is priced at $3.50 per 1 million tokens.
In fact, one of the most exciting transformations Gemini brings is to Google Search.
Over the past year, Google Search has answered billions of queries as part of the Search Generative Experience. People can now search in new ways, ask new kinds of questions, make longer and more complex queries, even search with photos, and still get the best of what the web has to offer.
Google is also about to launch an Ask Photos feature. Google Photos itself launched about nine years ago, and today users upload more than 6 billion photos and videos to it every day. People love searching their lives through their photos, and Gemini makes that easier.
Let's say you're paying at a parking lot and can't remember your license plate number. Before, you would have searched for keywords and then scrolled through years of photos hunting for the plate. Now, you simply ask Photos.
Or say you're reminiscing about your daughter Lucia's early years. You can ask Photos: when did Lucia learn to swim? You can even follow up with something more open-ended: show me how Lucia's swimming has progressed.
Here Gemini goes beyond simple search, recognizing different contexts such as swimming pools and the ocean, and Photos gathers everything into one easy view. Ask Photos rolls out this summer, with more capabilities to come.
A new generation of open source models: Gemma 2
Today, Google also announced a series of updates to its open source Gemma models: Gemma 2 is here.
Gemma 2 adopts a new architecture aimed at breakthrough performance and efficiency, and the newly announced open model has 27B parameters.
Additionally, the Gemma family is expanding with PaliGemma, Google's first vision-language model, inspired by PaLI-3.
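How the Gemma 2 weights will be distributed is not spelled out here. Assuming they appear on Hugging Face like the first-generation Gemma checkpoints, loading them with transformers would look roughly like this (the repository name "google/gemma-2-27b-it" is an assumption):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Explain what a context window is.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```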
General AI Agent Project Astra
Agents have always been a key research direction of Google DeepMind.
Yesterday, we took a look at OpenAI’s GPT-4o and were shocked by its powerful real-time voice and video interaction capabilities.
Today, Google DeepMind unveiled Project Astra, a general-purpose AI agent built around vision and voice interaction, and its vision of the AI assistant of the future.
To be truly effective, Google says, agents need to understand and respond to the complex, dynamic real world just like humans do. They also need to absorb and remember what they see and hear to understand context and take action. Additionally, the agent needs to be proactive, teachable, and personalized so that users can talk to it naturally, without lag or delay.
Over the past few years, Google has been working to improve the way its models perceive, reason, and talk to make the speed and quality of interactions more natural.
In today’s Keynote, Google DeepMind demonstrated the interactive capabilities of Project Astra.
Building on Gemini, Google developed an agent prototype that processes information faster by continuously encoding video frames, combining the video and speech input into a single timeline of events, and caching that information for efficient recall.
Using its speech models, Google also improved how the agents sound, giving them a wider range of intonation. The agents can better understand the context they are being used in and respond quickly in conversation.
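Google has not published Project Astra's implementation. Purely to illustrate the pattern described above (continuously encode frames, merge video and speech onto one event timeline, cache it for fast recall), here is a toy sketch in which encode_frame, transcribe, and answer are hypothetical stand-ins:

```python
# A toy illustration of the described pattern, not Astra's actual code.
import time
from collections import deque

timeline = deque(maxlen=1000)   # rolling cache of recent events

def on_video_frame(frame):
    embedding = encode_frame(frame)                    # hypothetical vision encoder
    timeline.append(("frame", time.time(), embedding))

def on_audio_chunk(chunk):
    text = transcribe(chunk)                           # hypothetical speech-to-text
    if text:
        timeline.append(("speech", time.time(), text))

def ask(question):
    # Answer from the cached timeline instead of re-processing past video.
    recent_events = list(timeline)
    return answer(question, recent_events)             # hypothetical multimodal model call
```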
A brief aside: to my eye, the Project Astra demo falls well short of OpenAI's real-time GPT-4o demonstration in interactive experience. In the length of its responses, the emotional richness of the voice, the ability to interrupt it, and so on, GPT-4o feels more natural. What do readers think?
A counterattack against Sora: the Veo video generation model
In terms of AI-generated videos, Google announced the launch of Veo, a video generation model. Veo is capable of producing high-quality 1080p resolution videos in a variety of styles and can be over a minute long.
With its in-depth understanding of natural language and visual semantics, Veo models have made breakthroughs in understanding video content, rendering high-definition images, and simulating physical principles. Videos generated by Veo accurately and meticulously express the user’s creative intent.
For example, enter the text prompt:
Many spotted jellyfish pulsating under water. Their bodies are transparent and glowing in deep ocean.
Another example generates a video of a person from the prompt:
A lone cowboy rides his horse across an open plain at beautiful sunset, soft light, warm colors.
For a close-up video of a person, enter the prompt:
A woman sitting alone in a dimly lit cafe, a half-finished novel open in front of her. Film noir aesthetic, mysterious atmosphere. Black and white.
Notably, Veo provides an unprecedented level of creative control and understands cinematic terms such as "time-lapse" and "aerial shot," keeping videos coherent and realistic.
For example, for a cinematic aerial shot of a coastline, enter the prompt:
Drone shot along the Hawaii jungle coastline, sunny day
Veo also supports using images and text together as prompts. Given a reference image and a text prompt, the generated video follows the image's style and the user's description.
Interestingly, one of the demos Google released is a Veo-generated video of an alpaca, which is hard not to read as a nod to Meta's open source Llama models.
For longer videos, Veo can produce clips of 60 seconds or more, either from a single prompt or from a series of prompts that together tell a story. This is critical for applying video generation models to film and television production.
Veo builds on Google’s visual content generation work, including Generative Query Networks (GQN), DVD-GAN, Imagen-Video, Phenaki, WALT, VideoPoet, Lumiere, and more.
Starting today, Google is making Veo available as a preview in VideoFX for some creators, who can join Google’s waitlist. Google will also bring some of Veo’s features to products like YouTube Shorts.
A new text-to-image model: Imagen 3
In text-to-image generation, Google has once again upgraded its model line with the release of Imagen 3.
Imagen 3 has been improved in detail, lighting, and the reduction of distracting artifacts, and its ability to understand prompts is significantly stronger.
To help Imagen 3 capture details from longer prompts, such as specific camera angles or compositions, Google added richer detail to the captions of each image in the training data.
For example, add phrases like "slightly out of focus in the foreground" or "warm light" to the prompt, and Imagen 3 generates images accordingly:
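Imagen 3 is only being previewed in ImageFX for now, so any API details are assumptions. If it is later exposed through Vertex AI's image generation SDK, a detailed-prompt call might look roughly like the sketch below (the model ID "imagen-3.0-generate-001" and project settings are placeholders):

```python
import vertexai
from vertexai.preview.vision_models import ImageGenerationModel

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholder project
model = ImageGenerationModel.from_pretrained("imagen-3.0-generate-001")  # assumed model ID

# A prompt carrying the kind of photographic detail the recaptioned training data teaches.
prompt = (
    "A weathered fisherman mending a net at dawn, slightly out of focus in the foreground, "
    "warm light, shallow depth of field"
)
images = model.generate_images(prompt=prompt, number_of_images=1)
images[0].save("fisherman.png")
```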
Google has also specifically targeted the long-standing problem of garbled text in generated images, improving rendering so that text in an image comes out clear and well styled.
To increase usability, Imagen 3 will be available in multiple editions, each optimized for different types of tasks.
Starting today, Google is offering a preview of Imagen 3 in ImageFX for some creators, and users can sign up to join the waitlist.
Sixth generation TPU chip Trillium
Generative AI is changing the way people interact with technology while creating huge efficiency opportunities for businesses. But these advances demand more compute, more memory, and more communication capacity to train and fine-tune the most capable models.
To this end, Google announced its sixth-generation TPU, Trillium, its most powerful and energy-efficient TPU to date, which will become available in late 2024.
Trillium is highly customized, AI-specific hardware. Many of the innovations announced at Google I/O, including new models such as Gemini 1.5 Flash, Imagen 3, and Gemma 2, are all trained and served on TPUs.
Compared with TPU v5e, Trillium delivers a 4.7x increase in peak compute per chip and doubles both high-bandwidth memory (HBM) and inter-chip interconnect (ICI) bandwidth. It also features a third-generation SparseCore, designed to handle the very large embeddings common in advanced ranking and recommendation workloads.
Google says Trillium can train the next generation of AI models faster while reducing latency and cost. It is also billed as Google's most sustainable TPU to date, with over 67% better energy efficiency than its predecessor.
A single high-bandwidth, low-latency pod can hold up to 256 Trillium chips. Beyond this pod-level scalability, multislice technology and Titanium Intelligence Processing Units (IPUs) let Trillium scale to hundreds of pods and connect thousands of chips, forming a supercomputer linked by a data center network carrying multiple petabits per second.
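For a feel of how such a pod is used in practice, here is a minimal JAX sketch that fans a workload out across whatever TPU chips are attached; it assumes you are on a Cloud TPU VM and uses nothing Trillium-specific:

```python
import jax
import jax.numpy as jnp

devices = jax.devices()                 # the TPU chips visible to this host
n = len(devices)
print(f"{n} TPU cores attached")

# One 1024x1024 matrix multiply per chip, executed in parallel across the slice.
a = jnp.ones((n, 1024, 1024))
b = jnp.ones((n, 1024, 1024))
c = jax.pmap(jnp.matmul)(a, b)
print(c.shape)                          # (n, 1024, 1024)
```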
Google began building its first TPU, v1, back in 2013 and followed with Cloud TPUs in 2017. These TPUs have powered services such as real-time voice search, photo object recognition, and language translation, and even supply technology to autonomous driving companies like Nuro.
Trillium is also part of Google’s AI Hypercomputer, a groundbreaking supercomputing architecture designed to handle cutting-edge AI workloads. Google is working with Hugging Face to optimize hardware for open source model training and serving.
Those are the highlights of today's Google I/O. It is clear that Google is going head to head with OpenAI on both large model technology and products. From OpenAI's and Google's releases over the past two days, it is also clear that the large model race has entered a new stage: multimodality and more natural interaction have become essential for large model technology and products to win over more users.
We look forward to the rest of 2024 and the further surprises that large model technology and product innovation will bring.
Reference content:
https://blog.google/inside-google/message-ceo/google-io-2024-keynote-sundar-pichai/#creating-the-future
https://blog.google/technology/ai/google-gemini-update-flash-ai-assistant-io-2024