OpenAI introduces Operator & agents

PLUS: Humanity’s last exam: The one test AI couldn’t beat

Barsee
27 Jan

Howdy! It’s Barsee again.

Happy Monday, AI family, and welcome back to AI Valley.

In today’s edition:

OpenAI introduces Operator & agents
Humanity’s last exam: The one test AI couldn’t beat
Plus trending AI tools, posts, and resources

Ready, set, go…

OPENAI

OpenAI introduces Operator & agents

(Image Source: OpenAI)

OpenAI has introduced Operator, its first AI agent designed to automate tasks by independently navigating web browsers and interacting with websites, marking a significant step toward realizing CEO Sam Altman’s vision for AI tools in 2025.

Here's what you need to know:

Operator mimics how a person uses a web interface, filling in forms and clicking buttons, and requires user supervision for sensitive actions such as banking or entering credit card information.
Operator runs on a new AI model called Computer-Using Agent (CUA), which uses GPT-4o’s vision skills with advanced reasoning to "see" websites through screenshots and interact with them by clicking, scrolling, and tapping, all without needing special integration.
OpenAI demoed Operator during a livestream, showing it booking reservations, ordering groceries, and buying tickets to sporting events.
It scores 58.1% on the WebArena benchmark, which covers tasks like online shopping and content management on simulated websites, but performs better on real-world sites, hitting an 87% success rate on platforms like Amazon and Google Maps. On the more complex tasks in the OSWorld benchmark, such as combining PDFs from emails, its success rate drops to 38.1%.
It’s available as a research preview to Pro users in the U.S. at operator.chatgpt.com.
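
For readers curious what a screenshot-driven agent loop looks like, here is a minimal Python sketch. It is not OpenAI's implementation: the `Action` type, the `fake_model` policy, and `run_agent` are hypothetical names made up for illustration, and the "browser" is just a dictionary. It only shows the observe, decide, act cycle described above, with a confirmation gate before sensitive steps.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical action type for illustration; not part of any OpenAI API.
@dataclass
class Action:
    kind: str                 # "click", "type", or "done"
    target: str = ""          # element the action applies to
    text: str = ""            # text to type, if any
    sensitive: bool = False   # e.g. submitting payment details


def fake_model(screenshot: dict) -> Action:
    """Stand-in for a vision model: picks the next action from the observed state."""
    if not screenshot["query"]:
        return Action("type", target="search_box", text="tickets")
    if not screenshot["in_cart"]:
        return Action("click", target="add_to_cart")
    if not screenshot["paid"]:
        return Action("click", target="pay_button", sensitive=True)
    return Action("done")


def apply(page: dict, action: Action) -> None:
    """Update the toy page the way a real site would react to the action."""
    if action.kind == "type":
        page["query"] = action.text
    elif action.target == "add_to_cart":
        page["in_cart"] = True
    elif action.target == "pay_button":
        page["paid"] = True


def run_agent(page: dict, confirm: Callable[[Action], bool], max_steps: int = 10) -> list:
    """Observe -> decide -> act loop that pauses for the user on sensitive actions."""
    log = []
    for _ in range(max_steps):
        screenshot = dict(page)          # "screenshot": a snapshot of the page state
        action = fake_model(screenshot)  # decide using only the observation
        if action.kind == "done":
            break
        if action.sensitive and not confirm(action):
            log.append("stopped: user declined " + action.target)
            break
        apply(page, action)
        log.append(action.kind + ":" + action.target)
    return log


page = {"query": "", "in_cart": False, "paid": False}
log = run_agent(page, confirm=lambda a: False)  # the user declines the payment step
print(log)
```

In this toy run the agent searches and fills the cart on its own, then stops at the payment button because the user said no. Passing `confirm=lambda a: True` lets it finish the purchase.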

Why it matters:

It’s like having a super cheap virtual assistant that works without any training. I’m sure we’ll see AI agents built on this kind of setup by the end of this year. They’ll combine workflows to solve big problems, and good prompt engineering will be key to making them work well.

TOGETHER WITH NATURA

Natura introduces AI People—virtual assistants with memory, personality, and initiative

(Image Source: Natura)

Most AI tools today feel robotic—resetting after every chat, forgetting context, and never truly adapting.

NatureOS changes that.

Built by Natura Umana, it introduces AI People—virtual assistants with memory, personality, and initiative. Unlike chatbots, they remember past interactions, adjust their responses to fit your preferences, and collaborate to handle complex tasks. Over time, they evolve with you, becoming more intuitive, more personal, and more useful.

And with HumanPods, you don’t need to reach for a phone or stare at a screen. These open-ear AI earbuds let you interact seamlessly—just tap, talk, and get things done. Designed for comfort, awareness, and all-day wear, they keep you connected to AI without pulling you out of the real world.

This isn’t just AI. It’s the next step in human-machine interaction.

Join the waitlist now: 👉 join-natura.com/AIV

SIDE UPDATES

🍎 Here are Apple's two main priorities for AI this year, per leaked memo

A leaked memo reveals Apple’s top AI priorities for the year: transforming Siri into "LLM Siri" by spring 2026 and enhancing its AI models. The revamped Siri is expected to debut in iOS 19.4, marking a significant leap in its capabilities. Meanwhile, Apple’s AI models, criticized for inaccuracies, are under intense scrutiny. To address this, the company has temporarily paused summaries in iOS 18.3 for certain apps, focusing on improving the underlying technology to meet user expectations.

🤖 DeepSeek gets Silicon Valley talking

DeepSeek's latest reasoning model, R1, has caught the attention of the tech world. R1 rivals OpenAI’s o1 model on key benchmarks, yet it was trained for just $5.6 million (a fraction of the hundreds of millions spent by leading U.S. firms). What’s even more impressive? This achievement comes despite U.S. sanctions limiting Chinese companies’ access to advanced chips. DeepSeek’s AI assistant has already climbed to the top of the Apple App Store’s free apps chart, showcasing its widespread appeal and cutting-edge innovation.

💡 What did DeepSeek figure out about reasoning with DeepSeek-R1?

By relying on reinforcement learning rather than supervised fine-tuning, DeepSeek achieved strong quality at low cost. R1 excels in coding and mathematics, hinting at potential for broader applications. While it’s still unclear whether its strength in these areas can translate to others, R1’s innovations could reshape how AI reasoning models are trained.

📱 Hugging Face Unveils Compact AI Models for Everyday Devices

Hugging Face has launched SmolVLM-256M and SmolVLM-500M, compact visual AI models designed for devices with less than 1GB of RAM. These models excel at complex tasks across various media types, including diagram analysis and document comprehension. In benchmarks like AI2D, which tests grade-school science diagram understanding, they outperformed much larger models. This breakthrough brings high-quality AI performance to everyday devices, making advanced visual AI more accessible than ever before.

AI BENCHMARK

Humanity’s last exam: The one test AI couldn’t beat

An international research team has created a benchmark named "Humanity's Last Exam" to probe the limits of large language models (LLMs). Even the most advanced AI systems currently fail it more than 90% of the time.

Here's what you need to know:

The benchmark features 3,000 questions across 100+ specialized fields, with 42% of the questions focused on mathematics.
Nearly 1,000 experts from 500 institutions in 50 countries—including professors and PhD holders—collaborated to develop this rigorous assessment.
Beyond mathematics, the benchmark spans humanities, natural sciences, and more. To increase complexity, the questions incorporate diagrams, images, and multimedia elements, moving beyond traditional text-based challenges.

Initial results:

In early trials, top AI models like GPT-4, Claude 3.5, and DeepSeek scored below 10% on the benchmark.
A notable finding was the models' extreme overconfidence. Despite expressing high certainty in their answers, they were wrong over 80% of the time.
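
To make "overconfidence" concrete: one simple way to quantify it is to compare a model's average stated confidence with its actual accuracy. The Python sketch below uses made-up numbers, not results from the benchmark, and `overconfidence_gap` is a hypothetical helper. The benchmark's authors may well use a different calibration metric.

```python
def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus accuracy; a positive value means overconfident."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)  # True counts as 1, False as 0
    return mean_conf - accuracy


# Illustrative data only: ten answers, each with a stated confidence between 0 and 1.
confidences = [0.9, 0.95, 0.85, 0.9, 0.8, 0.9, 0.95, 0.9, 0.85, 0.9]
correct = [True, False, False, False, False, False, False, False, False, False]

gap = overconfidence_gap(confidences, correct)
print(f"mean confidence - accuracy = {gap:.2f}")
```

A well-calibrated model would show a gap near zero. In this invented example the model claims roughly 89% confidence while getting only 1 in 10 right.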

Why it matters:

"Humanity’s Last Exam" represents a major advancement in AI evaluation. Unlike previous benchmarks, it rigorously tests AI systems across a wide range of disciplines and formats, pushing them to their absolute limits. This provides a more thorough and nuanced understanding of their capabilities and shortcomings.

TRENDING TOOLS

BookRead > AI-powered E-Reader that makes reading effortless
Spell by Spline > A model to generate 3D worlds.
Liveblocks > Enable your users to collaborate with AI inside your product.
Telescope 2.0 > Find exactly who you’re looking for, fast.
Gemini 2.0 Flash Thinking > Enhanced reasoning model from Google.

THINK PIECES / RESOURCES

Building Towards Computer Use with Anthropic: new course on DeepLearning.AI by Colt Steele, Anthropic’s Head of Curriculum.
Control your computer with your face. Google is adding a face-controlled cursor feature to Chromebooks, letting users operate their devices with head movements.
What’s next for robots.
4 Charts that show why AI progress is unlikely to slow down.
Today’s AI models have a poor grasp of world history.
Meta’s Yann LeCun predicts ‘new paradigm of AI architectures’ within 5 years and ‘decade of robotics’.

CONTENT CORNER

Salesforce CEO on unlimited AI workforce. He stated at the World Economic Forum in Davos that today's CEOs are the last generation to oversee entirely human workforces, as companies increasingly integrate artificial intelligence.

Scale AI CEO Alexandr Wang on U.S.-China AI race: We need to unleash U.S. energy to enable AI boom.

Perplexity’s founder on DeepSeek: Necessity is the mother of invention.

What differentiates us from the machines? 5 perspectives from Penrose, Noble, Millar, Aaronson, and Bach.

BridgeDP Robotics, a Chinese startup specializing in bipedal control solutions, showcased their system on AGIBOT's A2 humanoid.
— The Humanoid Hub (@TheHumanoidHub)
11:30 PM • Jan 25, 2025

Dressing and dancing robots will speed adoption and accelerate widespread cultural acceptance. The humanoid robot tidal wave is coming. Units like the G1 are already shipping and only cost $16k.

THAT’S ALL FOR TODAY

That’s all for today’s issue, folks.

💡 Help me get better and suggest new ideas at [email protected] or @heyBarsee

👍️ Like what you see? Subscribe here

Thanks for being here.
