LLM work isn’t just about hooking up to an API and calling it a day. You spend as much time wrangling evaluations, analyzing logs, and wrestling with “tool call” parameters as you do building the actual app. And you quickly realize there’s a deep tension between shipping new features and nailing down the business metrics that will justify your existence.
Below is my snapshot of Jason, Hamel, and Eugene’s latest discussion, their LLM Office Hour of 2025: all the highlights you need, plus a few quotes I’ve pulled out that really speak to the heart of what’s going on.
You’d think automated coding agents would be the magic bullet. Tools like Devin, Cursor, and o1 Pro promise developer superpowers. But here’s the catch:
Everyone in the LLM world is grappling with the same struggle: where exactly does a “coding agent” make sense, and how do you reliably measure its success?
If you’ve worked on function-calling LLMs, you’ve hit the “precision vs. recall” question: does the model pick the right tool? But once that’s nailed, you’re facing bigger headaches:
As Jason said, “Once you’re basically picking the right tool, the real trouble starts.” And that’s exactly it: good luck building robust metrics for every nuance of an API call.
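To make that concrete, here’s a minimal sketch, not anything shown in the session, using made-up log records and field names: score tool selection with per-tool precision and recall, then apply a blunt exact-match check on arguments once the tool itself is right.

```python
from collections import defaultdict

# Hypothetical log records: each pairs the tool the model actually called
# (with its arguments) against the tool call we expected for that turn.
logs = [
    {"expected_tool": "search_docs", "called_tool": "search_docs",
     "expected_args": {"query": "refund policy"}, "called_args": {"query": "refund policy"}},
    {"expected_tool": "search_docs", "called_tool": "get_order",
     "expected_args": {"query": "shipping times"}, "called_args": {"order_id": "123"}},
    {"expected_tool": "get_order", "called_tool": "get_order",
     "expected_args": {"order_id": "988"}, "called_args": {"order_id": "988", "verbose": True}},
]

# Per-tool precision/recall for "did the model pick the right tool?"
tp = defaultdict(int)  # expected tool X and called tool X
fp = defaultdict(int)  # called tool X when another tool was expected
fn = defaultdict(int)  # expected tool X but another tool was called

for rec in logs:
    if rec["called_tool"] == rec["expected_tool"]:
        tp[rec["expected_tool"]] += 1
    else:
        fp[rec["called_tool"]] += 1
        fn[rec["expected_tool"]] += 1

for tool in sorted(set(tp) | set(fp) | set(fn)):
    precision = tp[tool] / (tp[tool] + fp[tool]) if (tp[tool] + fp[tool]) else 0.0
    recall = tp[tool] / (tp[tool] + fn[tool]) if (tp[tool] + fn[tool]) else 0.0
    print(f"{tool:12s} precision={precision:.2f} recall={recall:.2f}")

# The harder part: even when the tool is right, are the arguments right?
# Exact match is a blunt first pass; real checks usually need per-field rules
# (normalized strings, numeric tolerances, optional fields).
arg_matches = [
    rec["called_args"] == rec["expected_args"]
    for rec in logs
    if rec["called_tool"] == rec["expected_tool"]
]
print(f"argument exact-match rate (when tool is right): "
      f"{sum(arg_matches) / len(arg_matches):.2f}")
```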
Listen to Hamel talk about data, and you’ll hear the same refrain he’s been preaching for ages: people don’t actually look at their logs. They might dump them somewhere, but they don’t segment, visualize, or test small slices the way they should. Without that, you can’t fix real problems or measure progress.
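As a rough illustration of what “actually looking at your logs” can mean in practice (the column names and labels here are hypothetical, not from the discussion), a few lines of pandas is often enough to slice failure rates by segment instead of staring at one global accuracy number:

```python
import pandas as pd

# Hypothetical trace export: one row per request, with whatever metadata
# your logging already captures (here: user intent, tool called, and a
# pass/fail label from an eval or a human review).
df = pd.DataFrame([
    {"intent": "refund",   "tool_called": "get_order",   "passed": True},
    {"intent": "refund",   "tool_called": "search_docs", "passed": False},
    {"intent": "shipping", "tool_called": "search_docs", "passed": True},
    {"intent": "shipping", "tool_called": "search_docs", "passed": False},
    {"intent": "shipping", "tool_called": "get_order",   "passed": False},
])

# Segment the logs: failure rate and volume per (intent, tool) slice,
# worst slices first.
slices = (
    df.groupby(["intent", "tool_called"])["passed"]
      .agg(n="count", fail_rate=lambda s: 1 - s.mean())
      .sort_values("fail_rate", ascending=False)
)
print(slices)

# A slice with a high failure rate and enough volume is where to dig next:
# pull those raw traces and read them before touching the prompt or the model.
worst = slices[(slices["fail_rate"] > 0.5) & (slices["n"] >= 2)]
print(worst)
```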