LLM work isn’t just about hooking up to an API and calling it a day. You spend as much time wrangling evaluations, analyzing logs, and wrestling with “tool call” parameters as you do building the actual app. And you quickly realize there’s a deep tension between shipping new features and nailing down the business metrics that will justify your existence.
Below is my snapshot of Jason, Hamel, and Eugene’s latest discussion, their LLM Office Hour of 2025: all the highlights you need, plus a few quotes I’ve pulled out that really speak to the heart of what’s going on.
You’d think automated coding agents would be the magic bullet. Tools like Devin, Cursor, and o1 Pro promise developer superpowers. But here’s the catch:
Everyone in the LLM world is grappling with the same struggle: where exactly does a “coding agent” make sense, and how do you reliably measure its success?
If you’ve worked on function-calling LLMs, you’ve hit the “precision vs. recall” question: does the model pick the right tool? But once that’s nailed, you’re facing bigger headaches:
As Jason said, “Once you’re basically picking the right tool, the real trouble starts.” And that’s exactly it: good luck building robust metrics for every nuance of an API call.
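To make that concrete, here’s a minimal sketch, not anything shown in the session, using made-up log records and field names: score tool selection with per-tool precision and recall, then apply a blunt exact-match check on arguments once the tool itself is right.

```python
from collections import defaultdict

# Hypothetical log records: each pairs the tool the model actually called
# (with its arguments) against the tool call we expected for that turn.
logs = [
    {"expected_tool": "search_docs", "called_tool": "search_docs",
     "expected_args": {"query": "refund policy"}, "called_args": {"query": "refund policy"}},
    {"expected_tool": "search_docs", "called_tool": "get_order",
     "expected_args": {"query": "shipping times"}, "called_args": {"order_id": "123"}},
    {"expected_tool": "get_order", "called_tool": "get_order",
     "expected_args": {"order_id": "988"}, "called_args": {"order_id": "988", "verbose": True}},
]

# Per-tool precision/recall for "did the model pick the right tool?"
tp = defaultdict(int)  # expected tool X and called tool X
fp = defaultdict(int)  # called tool X when another tool was expected
fn = defaultdict(int)  # expected tool X but another tool was called

for rec in logs:
    if rec["called_tool"] == rec["expected_tool"]:
        tp[rec["expected_tool"]] += 1
    else:
        fp[rec["called_tool"]] += 1
        fn[rec["expected_tool"]] += 1

for tool in sorted(set(tp) | set(fp) | set(fn)):
    precision = tp[tool] / (tp[tool] + fp[tool]) if (tp[tool] + fp[tool]) else 0.0
    recall = tp[tool] / (tp[tool] + fn[tool]) if (tp[tool] + fn[tool]) else 0.0
    print(f"{tool:12s} precision={precision:.2f} recall={recall:.2f}")

# The harder part: even when the tool is right, are the arguments right?
# Exact match is a blunt first pass; real checks usually need per-field rules
# (normalized strings, numeric tolerances, optional fields).
arg_matches = [
    rec["called_args"] == rec["expected_args"]
    for rec in logs
    if rec["called_tool"] == rec["expected_tool"]
]
print(f"argument exact-match rate (when tool is right): "
      f"{sum(arg_matches) / len(arg_matches):.2f}")
```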
Listen to Hamel talk about data, and you’ll hear the same refrain he’s been preaching for ages: people don’t actually look at their logs. They might dump them somewhere, but they don’t segment, visualize, or test small slices the way they should. Without that, you can’t fix real problems or measure progress.
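As a rough illustration of what “actually looking at your logs” can mean in practice (the column names and labels here are hypothetical, not from the discussion), a few lines of pandas is often enough to slice failure rates by segment instead of staring at one global accuracy number:

```python
import pandas as pd

# Hypothetical trace export: one row per request, with whatever metadata
# your logging already captures (here: user intent, tool called, and a
# pass/fail label from an eval or a human review).
df = pd.DataFrame([
    {"intent": "refund",   "tool_called": "get_order",   "passed": True},
    {"intent": "refund",   "tool_called": "search_docs", "passed": False},
    {"intent": "shipping", "tool_called": "search_docs", "passed": True},
    {"intent": "shipping", "tool_called": "search_docs", "passed": False},
    {"intent": "shipping", "tool_called": "get_order",   "passed": False},
])

# Segment the logs: failure rate and volume per (intent, tool) slice,
# worst slices first.
slices = (
    df.groupby(["intent", "tool_called"])["passed"]
      .agg(n="count", fail_rate=lambda s: 1 - s.mean())
      .sort_values("fail_rate", ascending=False)
)
print(slices)

# A slice with a high failure rate and enough volume is where to dig next:
# pull those raw traces and read them before touching the prompt or the model.
worst = slices[(slices["fail_rate"] > 0.5) & (slices["n"] >= 2)]
print(worst)
```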