
Google's new AI lets robots see, speak, and act

Also, Alibaba unveils emotion-reading AI model

In partnership with

Howdy again. It’s Barsee, and welcome back to AI Valley.

On this day: The initial coding of Twitter began in 2006.

Today’s climb through the Valley reveals:

  • Google's new AI lets robots see, speak, and act

  • Google’s first-ever omni image AI released to the public

  • Alibaba unveils emotion-reading AI model to challenge OpenAI

  • Plus trending AI tools, posts, and resources

Let’s dive into the Valley of AI…

PEAK OF THE DAY

Google's new AI lets robots see, speak, and act

Google DeepMind has introduced Gemini Robotics and Gemini Robotics-ER, two advanced AI models designed to make robots smarter and more capable in real-world environments.

Image Source: Google DeepMind

Here’s the breakdown:

  • Gemini Robotics is a vision-language-action (VLA) model that combines Gemini 2.0's multimodal reasoning with physical actions to help robots interpret what they see and follow verbal instructions to complete tasks.

  • The Gemini Robotics-ER variant focuses on Embodied Reasoning (ER), giving robots a humanlike ability to understand and respond to their surroundings through enhanced spatial awareness.

  • These models focus on three key areas that make robots more adaptable and effective:

    1. Generality: Gemini Robotics allows robots to adapt to new situations, objects, and instructions without specific training, letting them handle unfamiliar environments with ease.

    2. Interactivity: The models enable robots to understand natural language commands, monitor their surroundings, and adjust their actions dynamically as instructions or conditions change.

    3. Dexterity: The models excel at fine motor skills, allowing robots to perform detailed tasks like folding origami, packing bags, or handling small objects carefully.

  • Gemini Robotics significantly outperforms previous state-of-the-art VLA models across benchmarks such as instruction following (87% success rate), action generalization (52.8%), and long-horizon dexterity tasks (78.8% after fine-tuning).

  • Google DeepMind is partnering with Apptronik to develop the next generation of humanoid robots powered by Gemini. They’re also working with trusted testers, including Agile Robots, Agility Robotics, Boston Dynamics, and Enchanted Tools, to ensure its safe deployment.

Why it matters: 

These advancements bring robots closer to handling complex real-world tasks, making them more useful in industries like manufacturing, logistics, and even home assistance. By improving adaptability, communication, and fine motor skills, DeepMind is paving the way for robots that can better understand and respond to human needs.

New Sapience: The Last Step In AI

New Sapience is taking a different approach to Artificial General Intelligence (AGI), guided by Bryant Cruse, a former NASA space systems engineer.

Image Source: New Sapience

Their revolutionary Synthetic Intelligence (SI) isn’t just another AI model; it thinks, learns, reasons, builds knowledge, and adapts like a human in real time.

Here’s the breakdown:

  • Learns through understanding, not by crunching massive datasets.

  • Distinguishes between objective facts and subjective perspectives.

  • Adapts to new concepts and comprehends natural language.

  • From healthcare to finance to education, SI has the potential to transform industries, enhancing medical diagnoses, refining predictive analytics, and personalizing learning.

  • Ethics are a key focus, with transparency and accountability built into the system.

As AI continues to advance, New Sapience offers a glimpse into a future where machines don’t just analyze information, they understand it.

Thank you for supporting our sponsors!

Google’s first-ever omni image AI released to the public

Google has quietly rolled out multimodal image output capabilities for its Gemini 2.0 Flash model in AI Studio, making it the first major AI lab to offer such features ahead of competitors like OpenAI and xAI.

Gemini 2.0 Flash model in AI Studio

Here’s the breakdown:

  • Gemini 2.0 Flash now lets users create images directly alongside text, opening up new possibilities for creative and visual workflows.

  • It’s also the first model that lets you make targeted edits to existing images using plain-English instructions.

  • A new “Output Format” setting allows users to switch between text-only responses and text + image outputs, making the model adaptable across different tasks (a minimal API sketch follows this list).

  • All generated images come with SynthID watermarks to prevent misuse and ensure content authenticity.
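
For developers who want to try this outside the AI Studio UI, here is a minimal sketch of what requesting interleaved text-and-image output could look like. It assumes the google-genai Python SDK, the experimental gemini-2.0-flash-exp model ID, and an API key available in your environment; these names come from Google's published examples rather than from this announcement, so treat them as illustrative.

    from google import genai
    from google.genai import types

    # The client reads the API key from the environment (e.g. GOOGLE_API_KEY).
    client = genai.Client()

    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",  # experimental Flash model with image output
        contents="Write a two-step origami tutorial and illustrate each step.",
        config=types.GenerateContentConfig(
            # Equivalent of the "Output Format" toggle in AI Studio
            response_modalities=["TEXT", "IMAGE"],
        ),
    )

    # The response interleaves text parts and inline image parts.
    for i, part in enumerate(response.candidates[0].content.parts):
        if part.text:
            print(part.text)
        elif part.inline_data:
            with open(f"step_{i}.png", "wb") as f:
                f.write(part.inline_data.data)  # image bytes, SynthID-watermarked

In AI Studio itself, the same behavior is available through the “Output Format” dropdown, so no code is needed to experiment.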

About AI Studio:

Since the Gemini 2.0 model family debuted in December 2024, it has evolved rapidly. Gemini 2.0 Flash takes things up a notch with:

  • Improved ability to process and understand different data types like images, text, and spatial information.

  • Enhanced speed, making workflows smoother and more responsive.

  • Real-time applications like tool calling and dynamic content generation, making the model feel more proactive and adaptable.

These features make Gemini 2.0 Flash especially valuable for industries like design, marketing, and content creation, where visual storytelling plays a crucial role.

Why it matters:

By integrating these cutting-edge capabilities, Google is setting the stage for more dynamic human-AI collaboration. The ability to seamlessly blend text and images makes Gemini 2.0 Flash a game-changer, bringing us closer to AI-powered creative tools that feel intuitive and responsive to user needs.

VALLEY VIEW

Alibaba's Tongyi Lab has unveiled R1-Omni, a sophisticated AI model that can read human emotions from videos while describing people's clothing and surroundings in detail. Building on their previous HumanOmni model, R1-Omni features stronger reasoning abilities and better integration across different types of media. This positions it as a direct competitor to OpenAI's GPT-4.5, which can detect emotional cues in text. The key difference is that R1-Omni is open-source and free to access on platforms like Hugging Face, unlike its proprietary rival.

Court documents have revealed Google's substantial investment in AI company Anthropic now exceeds $3 billion, with plans to add another $750 million in convertible debt this year. Despite owning more than 14% of the company, Google maintains no voting rights or board presence. This investment strategy allows Google to support promising competitors while continuing to develop its own AI technologies.

Snapchat is launching innovative video generative AI Lenses powered by its in-house technology. These new Lenses transform regular Snaps into AI-generated video animations, automatically saving them to your Memories. The initial release features three creative options: Raccoon, Fox, and Spring Flowers, with weekly additions planned. These AI Video Lenses are currently exclusive to Snapchat Platinum subscribers, available for $15.99 monthly.

Browser Use, a technology that helps AI navigate websites and handle tasks like completing forms and saving files, has seen explosive growth after being featured in Manus, the viral AI platform. Daily downloads surged from 5,000 to 28,000 in just one week. The founder boldly predicts that AI agents will soon outnumber human users online, as the AI agent market continues its rapid expansion.

TRENDING TOOLS

  1. Flora > An interactive workspace that brings together top text, image, and video models to generate, refine, and collaborate on projects.

  2. Muse > An AI tool designed specifically for fiction writing.

  3. Mistral OCR > An API to extract highly accurate, structured text from images and multilingual documents.

  4. Dora > A no-code web design platform that enables users to create stunning 3D animated websites effortlessly.

  5. Same > Lets you clone any website with pixel-perfect accuracy.

VISUAL VALLEY

Soft Touch Images

Generated by: Olga Volosin on Midjourney

Prompt: [Image Detail], --chaos 10 --ar 2:3 --sref 694725456 --profile tckem44 --sw 500 --stylize 500

THINK PIECES / BRAIN BOOST

  1. Interesting in-depth breakdown of GPT-4.5 by Y Combinator.

  2. Will Manus AI replace your AI tech stack? (Full Demo)

  3. AI skills earn greater wage premiums than degrees, report finds.

  4. The hidden biases in your AI.

  5. What will humans be like generations from now in a world transformed by AI?

  6. I use Cursor daily - here's how I avoid the garbage parts.

VALLEY GEMS

1/ You can now clone websites with pixel-perfect accuracy.

2/ This is easily the best article I’ve read about using LLMs and prompting.

3/ Wait, is she stuck inside a Windows XP background? :)

4/ Learning should not be boring anymore. AI tutors are already making waves in 2025.

5/ Robots are becoming chefs now and learning new recipes from a single demonstration.

SUNSET IN THE VALLEY

Thank you for reading. That’s all for today’s issue.

💡 Help me get better and suggest new ideas at [email protected] or @heyBarsee

👍️ New reader? Subscribe here

Thanks for being here.

REACH 100K+ READERS

Acquire new customers and drive revenue by partnering with us

Sponsor AI Valley and reach 100,000+ entrepreneurs, founders, software engineers, and investors.

If you’re interested in sponsoring us, email [email protected] with the subject “AI Valley Ads”.