https://www.youtube.com/watch?v=D7BzTxVVMuw
Talk Title: Why Agent Engineering
Speaker(s): Swyx / Latent.Space
Summary:
- Main Topic/Thesis: Swyx argues that 2025 is the year of AI agents, explains why agent engineering is the focus of the summit, and defines what constitutes an AI agent. He also explores the current state and future potential of agent engineering.
- Key Points:
- AI Engineering is emerging as its own discipline, distinct from both Machine Learning Engineering and Software Engineering.
- The summit is focused on agent engineering because agents are showing real-world applicability and growth potential, especially in areas like coding and support.
- An "agent" can be defined in many ways, but key characteristics include goal-orientation, tool use, control flow, long-running processes, and delegated authority.
- Agents are becoming viable now due to increased model capabilities, model diversity, decreasing cost of intelligence, and advancements in RL fine-tuning.
- The growth of ChatGPT and other AI products is directly tied to the deployment of agentic models.
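To make the definitional checklist concrete, here is a minimal sketch of the loop those characteristics imply. The stub model, the tool table, and the step budget are illustrative assumptions, not code from the talk; a real agent would replace `call_model` with an actual LLM API call.

```python
from typing import Callable

def call_model(goal: str, history: list[dict]) -> dict:
    """Stub standing in for an LLM call. A real implementation would send the
    goal and history to a model and parse the reply into either a tool
    invocation or a final answer."""
    if not history:                                    # first step: gather information
        return {"tool": "search", "args": {"query": goal}}
    return {"final": f"answer to {goal!r} using {history[-1]['observation']}"}

# Delegated authority: the model, not the programmer, picks which tool runs.
TOOLS: dict[str, Callable[..., str]] = {
    "search": lambda query: f"results for {query!r}",  # stub tool
}

def run_agent(goal: str, max_steps: int = 20) -> str:
    """Goal-oriented loop: the model drives control flow; the loop enforces
    a step budget so the long-running process stays bounded."""
    history: list[dict] = []
    for _ in range(max_steps):
        action = call_model(goal, history)
        if "final" in action:                          # model decides the goal is met
            return action["final"]
        observation = TOOLS[action["tool"]](**action["args"])
        history.append({"action": action, "observation": observation})
    return "step budget exhausted"                     # fail closed, don't loop forever

print(run_agent("find the summit schedule"))
```

The design point the sketch captures is that control flow lives in the model's outputs rather than in hand-written branching, which is what separates an agent from an ordinary LLM pipeline.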
- Supporting Evidence:
- References to predictions and statements from industry leaders like Satya Nadella, Greg Brockman, and Sam Altman.
- Charts showing the decreasing cost of GPT-4 level intelligence and the growth of ChatGPT users.
- Examples of successful agent use cases (deep research, coding, support) and "anti-use cases" (flight booking, Reddit astroturfing).
- Reference to Simon Willison's crowdsourced definitions of AI agents.
- OpenAI's new definition of agents.
- Conclusions/Recommendations:
- 2025 is poised to be a significant year for AI agents.
- AI engineers should focus on building agents as a core skill.
- The field of agent engineering is still evolving, and there are many opportunities for innovation.
- Key Quotes:
- "2025 is the year of agents, right? If you say it often enough, it might be true."
- "The growth of ChatGPT and the growth of any AI product is going to be very, very tied to reasoning capabilities and the amount of agents that you can ship for your users."
- "The job of AI is now evolving towards building agents in the same way that MLEs build models, software engineers build software."
Talk Title: Building and Evaluating AI Agents That Matter
Speaker(s): Sayash Kapoor / AI Snake Oil
Summary:
- Main Topic/Thesis: Sayash Kapoor discusses the challenges of building and evaluating AI agents, highlighting the gap between theoretical capabilities and real-world reliability, and advocating for a reliability-focused mindset in AI engineering.
- Key Points:
- Evaluating agents is difficult due to their open-ended nature, interaction with environments, and the need to consider both cost and accuracy.
- Static benchmarks can be misleading for agent performance because they don't account for real-world interactions and cost constraints.
- The focus should be on reliability (consistent performance across runs) rather than just capability (best-case performance); a metric sketch follows this list.
- Verifiers (such as unit tests) can themselves be imperfect, passing incorrect solutions, so they do not guarantee reliability.
- AI engineering should be approached as a reliability engineering problem, focusing on building robust systems that work with inherently stochastic components.
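A hedged sketch of what measuring reliability rather than raw capability can look like. The `Result` shape, the flaky stub agent, and the any-trial vs. every-trial metrics are illustrative assumptions, not HAL's actual methodology, though reporting cost alongside accuracy echoes the multi-dimensional evaluation mentioned below.

```python
import random
from dataclasses import dataclass
from statistics import mean

@dataclass
class Result:
    passed: bool
    cost_usd: float

def evaluate(tasks, run_once, trials: int = 5):
    """Run each task several times; report capability (any trial passed),
    reliability (every trial passed), and average cost per run."""
    capability, reliability, costs = [], [], []
    for task in tasks:
        results = [run_once(task) for _ in range(trials)]  # stochastic runs
        successes = [r.passed for r in results]
        capability.append(any(successes))    # optimistic, benchmark-style number
        reliability.append(all(successes))   # closer to what users experience
        costs.append(mean(r.cost_usd for r in results))
    return {
        "capability": mean(capability),
        "reliability": mean(reliability),
        "avg_cost_usd": round(mean(costs), 4),
    }

# Stub agent that succeeds 90% of the time per run: capability looks high,
# but reliability over 5 trials is far lower (about 0.9**5 = 0.59).
flaky = lambda task: Result(passed=random.random() < 0.9, cost_usd=0.02)
print(evaluate([f"task-{i}" for i in range(20)], flaky))
```

The 90%-per-run stub collapsing to roughly 59% every-trial success is the arithmetic behind the "90% vs. 99.9%" gap quoted below.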
- Supporting Evidence:
- Examples of agent failures in real-world applications (DoNotPay, LexisNexis, Sakana AI).
- Presentation of the CoreBench benchmark and findings on the limitations of current agents in scientific research.
- Discussion of the Holistic Agent Leaderboard (HAL) and its approach to multi-dimensional evaluation.
- Analysis of the Devin agent's real-world performance.
- Reference to the "Who Validates the Validators?" framework from Berkeley.
- Examples of false positives in coding benchmark unit tests.
- Historical example of the ENIAC computer and its reliability challenges.
- Conclusions/Recommendations:
- AI engineers need to prioritize evaluating agents rigorously, considering cost and real-world interactions.
- Over-reliance on static benchmarks should be avoided.
- A shift towards a reliability engineering mindset is crucial for building successful AI agents.
- The focus should be on designing systems that work around the stochastic nature of LLMs (a retry-with-verifier sketch follows this list).
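One common pattern for working around a stochastic component is a verifier-guarded retry loop. The sketch below is an illustrative assumption, not Kapoor's prescription: `generate` and `verify` are placeholder interfaces, and the comments note why an imperfect verifier only bounds the error rate rather than eliminating it.

```python
import random

def reliable_call(generate, verify, max_attempts: int = 3):
    """Retry a stochastic generator until an independent verifier accepts the
    output or attempts run out. The verifier itself may be imperfect (the talk
    cites unit-test false positives), so this bounds errors; it does not
    eliminate them."""
    for _ in range(max_attempts):
        candidate = generate()            # inherently stochastic step
        if verify(candidate):             # imperfect check: false positives slip through
            return candidate
    # Fail loudly instead of silently returning unverified output.
    raise RuntimeError(f"no candidate passed verification in {max_attempts} attempts")

# Demo: a generator that is right 60% of the time plus a perfect verifier
# lifts per-call success from 0.60 to 1 - 0.4**3 = 0.936 over three attempts.
gen = lambda: "ok" if random.random() < 0.6 else "bad"
print(reliable_call(gen, verify=lambda c: c == "ok"))
```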
- Key Quotes:
- "Evaluating agents is genuinely a very hard problem."
- "Benchmark performance very rarely translates into the real world."
- "Closing this gap between the 90% and the 99.9% is the job of an AI engineer."
- "AI engineers need to be thinking about...fix[ing] the reliability issues that plague every single agent that uses inherently stochastic models as its basis."