THIS WEEK IN AI — Week of 16th Feb 25
In this week’s AI newsletter, we’ll explore groundbreaking advancements like Google’s AI co-scientist, designed to revolutionize research, and xAI’s Grok 3 family, pushing the boundaries of language models. We’ll also examine how Replit’s new mobile AI agent is democratizing app development and OpenAI’s SWE-Lancer benchmark is testing AI’s real-world coding prowess.
Discover trending AI tools like Google’s Career Dreamer for career path exploration and Fiverr Go for freelancer business growth. Plus, we’ll delve into key research papers, including insights on critical thinking in the age of generative AI and how LLMs learn to reason from demonstrations, to spark your own ideas and keep you informed on the cutting edge.
Latest AI Developments
Google’s new AI co-scientist
- Google Introduces an AI Co-Scientist: Google has developed a new AI system designed to work alongside scientists, accelerating the pace of discovery in various fields. Think of it as a super-powered research assistant.
- AI Agents Working in Parallel: The system uses multiple AI agents that specialize in different tasks, from coming up with initial ideas to rigorously testing research proposals and providing final reviews. This parallel approach allows for faster and more efficient research.
- Impressive Early Results: In initial trials with researchers at Stanford and Imperial College London, the system identified potential new uses for existing drugs and predicted how genes transfer, all in a matter of days.
- Outperforming Experts: Early tests reportedly show accuracy above 80% on benchmarks designed for experts, ahead of both existing AI models and human experts. These are early results, but they suggest AI could meaningfully speed up scientific research.
- Source: https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/
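To make the "agents working in parallel" idea concrete, here is a minimal Python sketch of a generate, critique, and review loop. It is not Google's implementation: the agent functions, the `Proposal` type, and the toy scoring rule are all hypothetical stand-ins, and a real system would call language models where this sketch uses placeholder logic.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Proposal:
    idea: str
    score: float = 0.0
    review: str = ""


def generate(topic: str, n: int) -> list[Proposal]:
    """Hypothetical idea-generation agent: returns n placeholder hypotheses."""
    return [Proposal(idea=f"{topic}: hypothesis {i}") for i in range(1, n + 1)]


def critique(proposal: Proposal) -> Proposal:
    """Hypothetical testing agent: assigns a toy score in [0, 1)."""
    toy_score = (sum(map(ord, proposal.idea)) % 100) / 100
    return replace(proposal, score=toy_score)


def final_review(proposal: Proposal) -> Proposal:
    """Hypothetical review agent: attaches a short verdict."""
    verdict = "promising" if proposal.score >= 0.5 else "needs more evidence"
    return replace(proposal, review=verdict)


def run_pipeline(topic: str, n_ideas: int = 4, top_k: int = 2) -> list[Proposal]:
    proposals = generate(topic, n_ideas)
    # Critique every proposal concurrently; pool.map preserves input order.
    with ThreadPoolExecutor() as pool:
        scored = list(pool.map(critique, proposals))
    # Rank by score and send only the best candidates to final review.
    ranked = sorted(scored, key=lambda p: p.score, reverse=True)
    return [final_review(p) for p in ranked[:top_k]]


if __name__ == "__main__":
    for p in run_pipeline("drug repurposing"):
        print(f"{p.score:.2f}  {p.review:<20}  {p.idea}")
```

The point of the design is that critiques are independent of each other, so they can run side by side, while ranking and final review happen only once all scores are in.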
xAI Launches Grok 3
- A New Family of Powerful Language Models: Elon Musk’s xAI has unveiled Grok 3, a suite of language models designed to push the boundaries of AI capabilities. This includes both standard and reasoning-enhanced versions, as well as full and reduced-size models, offering a range of options for different needs.
- Massive Compute Power Fuels Grok 3’s Performance: Grok 3 was trained using a significantly larger amount of computing power than its predecessor, Grok 2, utilizing a cluster of 200,000 Nvidia H100 GPUs. This highlights the importance of computational resources in achieving state-of-the-art AI performance.
- Reasoning Abilities and Advanced Modes: Grok 3 incorporates reasoning capabilities through a chain-of-thought approach, particularly for math and coding tasks. It also features advanced modes like “Think,” “Big Brain,” and “DeepSearch” that allow it to leverage even more processing power for in-depth analysis and web-based research.
- Outperforming Leading Models: According to xAI’s published benchmarks, the Grok 3 family outperforms other leading models such as Google’s Gemini, Anthropic’s Claude, and OpenAI’s GPT-4o on math, science, and coding challenges. That is notable progress for a company as young as xAI.
- Source: https://x.ai/blog/grok-3
First Software Creation Agent on iOS and Android
- Replit Mobile Goes Pro-Code: Replit’s mobile app has evolved from simple Python scripts to generating complete iOS and Android applications, bringing full-fledged app development to your fingertips.
- AI Agent Automates App Creation: The update is powered by “Replit Agent,” an AI system leveraging models like Claude 3.5 Sonnet to handle coding, debugging, and deployment, streamlining the entire app building process.
- Mobile-First Cloud Deployment: Replit uniquely offers mobile app development with direct cloud deployment from your phone, a feature that sets it apart from desktop-centric AI coding tools and expands accessibility.
- Democratizing App Development & Sparking Industry Evolution: This advancement makes app creation faster and more accessible, raising important questions about the future of coding and the evolving relationship between developers and AI in software development.
- Source: https://blog.replit.com/try-agent
OpenAI launches SWE-Lancer benchmark
- OpenAI Launches Real-World Coding Challenge: OpenAI introduced SWE-Lancer, a benchmark built from actual freelance software jobs that together paid out about $1M, to test AI coding skills in realistic scenarios.
- Beyond Simple Code Generation: SWE-Lancer evaluates AI on tasks from bug fixes to feature development sourced from Upwork, judging both coding ability and crucial technical project management decisions.
- AI “Earnings” as Performance Metric: Success is measured by how much “money” an AI model could theoretically earn by completing tasks, providing a practical and relatable metric for coding proficiency.
- Current AI Shows Promise, Hints at Job Market Shift: The top-scoring model, Claude 3.5 Sonnet, earned a hypothetical $400k or so of the available $1M, yet frontier models still failed most of the tasks. The benchmark highlights both AI’s rapid progress and the potential for significant changes in software development roles.
- Source: https://openai.com/index/swe-lancer/
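The "earnings" metric described above is simple to compute: add up the payouts of the tasks a model solved. The Python sketch below illustrates the arithmetic with made-up tasks and payouts. The `Task` fields and the numbers are assumptions for illustration, not data or code from OpenAI's benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str       # hypothetical identifier
    payout_usd: float  # what the freelance job paid
    solved: bool       # whether the model's solution passed


def earnings(tasks: list[Task]) -> float:
    """Total payout of the tasks the model solved."""
    return sum(t.payout_usd for t in tasks if t.solved)


def earn_rate(tasks: list[Task]) -> float:
    """Share of the available money the model earned."""
    total = sum(t.payout_usd for t in tasks)
    return earnings(tasks) / total if total else 0.0


def pass_rate(tasks: list[Task]) -> float:
    """Share of tasks solved, ignoring how much each one paid."""
    return sum(t.solved for t in tasks) / len(tasks) if tasks else 0.0


# Made-up example: two cheaper tasks solved, one expensive task failed.
tasks = [
    Task("bug-fix", 250.0, True),
    Task("new-feature", 1000.0, False),
    Task("pick-best-proposal", 750.0, True),
]

if __name__ == "__main__":
    print(earnings(tasks))   # 1000.0
    print(earn_rate(tasks))  # 0.5
```

In this toy example the model solves two of three tasks but earns only half the money, which is why a dollar-weighted score can tell a different story than a plain pass rate.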
Prepare for your next AI role
Prepare for Large Language Model & GenAI interviews by learning real interview questions from FAANG and Fortune 500 companies.
Learn the answers through a structured framework designed and tested at companies like Google, Microsoft, Nvidia, and Apple.
LLM Interview Prep Course (LLM50 to get 50% off): https://www.masteringllm.com/course/llm-interview-questions-and-answers
Trending AI Tools
- Career Dreamer — Google’s AI experiment to discover career possibilities.
- Fiverr Go — Empowering freelancers to scale their business with AI.
- R1 1776 — DeepSeek’s R1 reasoning model post-trained by Perplexity AI to remove censorship.
Open Source AI Projects
- OmniParser V2: Turning Any LLM into a Computer Use Agent.
- SkyThought: Train your own o1-preview-style model for under $450.
- Qwen2.5-VL: The multimodal large language model series developed by the Qwen team at Alibaba Cloud.
AI Must Read Papers
- The Impact of Generative AI on Critical Thinking: https://www.microsoft.com/en-us/research/uploads/prod/2025/01/lee_2025_ai_critical_thinking_survey.pdf
- LLMs Can Easily Learn to Reason from Demonstrations: https://arxiv.org/pdf/2502.07374
- Qwen2.5-VL Technical Report: https://arxiv.org/pdf/2502.13923
- Cramming 1568 Tokens into a Single Vector and Back Again: https://arxiv.org/pdf/2502.13063
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/pdf/2502.14786
- Titans: Learning to Memorize at Test Time: https://arxiv.org/pdf/2501.00663
- BioEmu-1: https://www.microsoft.com/en-us/research/blog/exploring-the-structural-changes-driving-protein-function-with-bioemu-1/
Some Other Important News
- Microsoft’s Majorana 1 chip: https://news.microsoft.com/source/features/ai/microsofts-majorana-1-chip-carves-new-path-for-quantum-computing
- SigLIP 2: A better multilingual vision language encoder: https://huggingface.co/blog/siglip2
We truly value your input. Please share your thoughts in the comments to help us improve.