https://www.linkedin.com/posts/shuvendu-lahiri-9a35151_intent-formalization-a-grand-challenge-for-share-7435485535663562753-UZYs?utm_source=share&utm_medium=member_android&rcm=ACoAACSY17kB8nKeUAs4qK-Xpt5-VQ8o_lo8Y4I

risemsr.github.io Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents


This is a timely piece by [Shuvendu Lahiri](<https://www.linkedin.com/in/shuvendu-lahiri-9a35151?trk=public_post-text>) from Microsoft Research as many of us lead our organizations through the shift toward agentic coding. If we think of software development as a pipeline with bottlenecks, historically the biggest constraint has been writing code. That’s where the time, cost, and effort concentrated. Generative AI is rapidly changing that. But there’s an emerging challenge: If we continue to manually review all generated code, we’ve simply moved the bottleneck from writing code to reviewing code. That doesn’t increase the overall velocity of software creation to the extent we want. Engineering teams are already exploring what it will take to build confidence in AI-generated code at scale. Today’s conversations include safeguards such as:

- ✅ Static analyzers
- ✅ Fuzz testing
- ✅ Code coverage tooling
- ✅ Behavior-injection / chaos test cases
- ✅ Barrages of LLM-generated unit and functional tests
- ✅ Independent model review (separate from the model that generated the code)

These are powerful tools, but they are still adaptations of our current paradigm. What’s especially intriguing are the emerging scientific directions, like Intent Formalization, discussed in Shuvendu’s paper. If successful, approaches like this could fundamentally reshape how we validate correctness in an AI-native software development process. Curious how others are thinking about this: ➡️ Where do you see the next bottleneck in AI-assisted software development?

**AI can now write code, but who checks that it does what you actually meant?** **Intent formalization**, turning informal human intent into precise, checkable *specifications* (starting with tests) that reflect what the user truly wants, may hold the key. But the central challenge remains: how do we validate these specifications to ensure they faithfully capture intent? This question is just as relevant now as when we first began exploring it at the advent of code-generation models. This blog outlines the problem, the spectrum of specifications, recent advances, and the outstanding challenges. It draws on contributions from several colleagues in the RiSE group at MSR. 🔗 [<https://lnkd.in/gB8pJiU7>](<https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Flnkd%2Ein%2FgB8pJiU7&urlhash=hxUu&trk=public_post_reshare-text>) Article: [<https://lnkd.in/gFZCcYU3>](<https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Flnkd%2Ein%2FgFZCcYU3&urlhash=_N9c&trk=public_post_reshare-text>)
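As a rough illustration of the gap between "passes a test" and "matches intent" described in the post above, here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the linked blog: the informal intent ("remove duplicates, keeping the first occurrence of each value in its original order"), the hypothetical AI-generated `dedupe_candidate`, and the postcondition `meets_intent` that stands in for a formalized specification.

```python
def dedupe_candidate(xs):
    """Hypothetical generated code: removes duplicates but also reorders."""
    return sorted(set(xs))


def meets_intent(xs, out):
    """Postcondition for the assumed intent (an illustrative spec, not an official one).

    out must contain each distinct value of xs exactly once,
    in the order in which those values first appear in xs.
    """
    no_repeats = len(out) == len(set(out))
    same_values = set(out) == set(xs)
    # Position of each output value's first occurrence in the input.
    first_positions = [xs.index(v) for v in out] if same_values else []
    keeps_order = first_positions == sorted(first_positions)
    return no_repeats and same_values and keeps_order


# A single example-based test is satisfied by the candidate...
weak_test_passes = dedupe_candidate([1, 1, 2]) == [1, 2]

# ...but the postcondition rejects it on an input the test never exercised.
spec_holds = meets_intent([3, 1, 3], dedupe_candidate([3, 1, 3]))

print("weak test passes:", weak_test_passes)
print("spec holds on [3, 1, 3]:", spec_holds)
```

The open problem the post raises still applies to this sketch: whether `meets_intent` itself captures what the user wanted (for example, that order matters at all) has to be confirmed with the user, because no check inside the code can establish it.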
1/ The research organization METR has analyzed the widely used coding benchmark SWE-bench Verified and concluded that it significantly overstates how well AI agents actually perform in real-world software development.

2/ Four experienced open-source developers reviewed 296 AI-generated code contributions and found that roughly half of the solutions that passed automated tests would still be rejected from actual software projects.

3/ Many of these rejections stem not from stylistic issues but from fundamental functional errors. The AI agents fail to fix the underlying problem, even when they manage to pass the automated test suite.

More: [<https://lnkd.in/dEfUFQ_z>](<https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Flnkd%2Ein%2FdEfUFQ_z&urlhash=uxQh&trk=public_post-text>)

[the-decoder.com: Half of AI-written code that passes industry test would get rejected by real developers, new study finds](<https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Fthe-decoder%2Ecom%2Fhalf-of-ai-written-code-that-passes-industry-test-would-get-rejected-by-real-developers-new-study-finds%2F&urlhash=T5ke&trk=public_post_feed-article-content>)

Engineers used to write code. Now we mostly read it. A founder went to his first coding interview in years. The problem? An agents algorithm, something he works with every day. But when it came time to type the solution… He blanked. Forgot basic JavaScript syntax. Paused on simple operations. Panicked on recursion. Not because he didn’t understand the problem. Because he hadn’t been typing code for years. He’d been designing systems. Reviewing code. Working with AI. This is the subtle shift happening in engineering. We’re moving from writing every line to understanding and orchestrating systems. The knowledge is still there. But the role is changing. Less typing. More thinking. And maybe that’s not a loss. It’s just the next layer of abstraction.

This article had me thinking: When AI can translate languages effectively, what drives language choice in software development? In my experience, most language choices are made to reduce development friction, not optimize performance. Will this change when any engineer can effectively write in any language? [<https://lnkd.in/eNxd_Wkm>](<https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Flnkd%2Ein%2FeNxd_Wkm&urlhash=MZ3Y&trk=public_post-text>)

[simonwillison.net: Ladybird adopts Rust, with help from AI](<https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Fsimonwillison%2Enet%2F2026%2FFeb%2F23%2Fladybird-adopts-rust%2F&urlhash=oC7c&trk=public_post_feed-article-content>)

I've been building a computer vision pipeline with Claude Code for a side project. Weeks into development, we hit a detection problem: false negatives that shouldn't have been there. Claude proposed an intersection-point approach. The math checked out. The code was clean. The logic was to find where the object's path crosses a boundary line, then check whether that intersection point falls inside the target zone. I stared at it for a while. Something felt off. "Wait, if the intersection point is where the path crosses the boundary, then by definition that point is ON the boundary line. Checking if it's inside the zone is meaningless. It will always be on the edge." Claude agreed immediately. The entire approach was fundamentally flawed: not a bug in the code, but a bug in the reasoning. We scrapped it and built something completely different. That new approach became the most reliable version yet. Here's the thing: I don't write Python. So you'd think I'd just trust the AI's reasoning and move on. But code fluency and logical reasoning are two different skills. Claude can generate flawless syntax in seconds. It can also build an elegant solution on top of a flawed premise without flinching. I caught the error not by reading the code, but by questioning the logic underneath it. AI pair programming isn't about letting the machine do the thinking. It's about combining two different kinds of intelligence. The AI is fast, tireless, and encyclopedic. The human is skeptical, and asks the questions that matter. The best debugging tool I had that day wasn't a breakpoint. It was intuition.
[#iOSDevelopment](<https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Ffeed%2Fhashtag%2Fiosdevelopment&trk=public_post-text>) [#Swift](<https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Ffeed%2Fhashtag%2Fswift&trk=public_post-text>) [#AIPairProgramming](<https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Ffeed%2Fhashtag%2Faipairprogramming&trk=public_post-text>) [#ClaudeCode](<https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Ffeed%2Fhashtag%2Fclaudecode&trk=public_post-text>) [#BuildInPublic](<https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Ffeed%2Fhashtag%2Fbuildinpublic&trk=public_post-text>)
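The flaw described in the post above can be reproduced in a few lines. The following is only a sketch under assumptions: the post does not show its code, so the helper names `cross` and `line_intersection`, the coordinates, and the convention that the zone lies on the positive side of the boundary are all hypothetical.

```python
def cross(o, a, b):
    """2D cross product of o->a and o->b.

    The sign says which side of the line through o and a the point b is on;
    zero means b lies on that line.
    """
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def line_intersection(p0, p1, a, b):
    """Point where the line p0->p1 meets the line a->b, or None if parallel."""
    d0 = cross(a, b, p0)  # signed side value of the path start
    d1 = cross(a, b, p1)  # signed side value of the path end
    if d0 == d1:
        return None
    t = d0 / (d0 - d1)  # the side value varies linearly along the path
    return (p0[0] + t * (p1[0] - p0[0]), p0[1] + t * (p1[1] - p0[1]))


# Hypothetical boundary edge of the zone (zone assumed to be the y > 0 side).
edge_a, edge_b = (0.0, 0.0), (10.0, 0.0)
# Hypothetical object path that moves from outside the zone to inside it.
start, end = (2.0, -3.0), (6.0, 5.0)

hit = line_intersection(start, end, edge_a, edge_b)
side_of_hit = cross(edge_a, edge_b, hit)

print("intersection point:", hit)
print("side value at intersection:", side_of_hit)
# A strict "inside the zone" test on the intersection point cannot succeed,
# because the point sits on the boundary by construction.
print("strictly inside:", side_of_hit > 0)
```

In this sketch the signs of the two endpoint values (`d0` negative, `d1` positive) already indicate that the path crossed into the assumed zone. The post does not say what replaced the original approach, so this observation is only one possible direction and not a description of the author's fix.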
I sent my engineers a message about AI adoption and closed it with a warning. The message began with a list of my directives for how they should and should not be using the tools. It was thorough, but it was just for them. The warning I closed with applies to pretty much everyone, though. Here it is: Do not let your skills wane. Your skills are richer than knowing Python syntax, though there's that as well. Design, architecture, orchestration, efficiency, etc. Through prompt writing, your product skills will increase. If you lean too much into the prompt work, your software engineering skills will wane over time. Don't let the rust creep in. With the speed at which code-authoring AI lets us author code, you can write the same prompt four ways, get four outputs, and study and test the different architectural approaches against one another. You don't have to TYPE the code to keep your skills sharp. You just have to keep "thinking in code" if that makes sense. Me, I do not think in code much anymore. So you can look to me as a warning about letting those skills fade lol.
Engineers keep asking the wrong question. “Which AI is better for coding?” That is like asking, “Is Python better than Bash?” The real question is: better for what? Some tools optimize for depth. Large codebases. Long reasoning chains. Architectural thinking. Others optimize for speed. Rapid scaffolding. Quick iterations. Exploring ideas. Neither philosophy is wrong. They reflect different design bets. One treats coding like architecture. The other treats coding like velocity. Prototype fast with one. Reason deeply with another. Ship using both. The best engineers do not pick one AI tool. They build a workflow. Which one is currently your default coding partner? Comment "PDF" and I'll send you a high-res PDF version of the cheatsheet. 👉 Follow me ([Raahul Seshadri](<https://in.linkedin.com/in/raahul-seshadri?trk=public_post-text>)) for AI insights for people who build stuff. ♻ Share/repost to help out a fellow software engineer.

Vibe coding gets the blame for slop. The real culprit is not knowing enough to catch it. Andrej Karpathy coined the term in February 2025 for throwaway weekend projects: "forgetting the code even exists." That was never meant to describe what a senior developer does with Claude Code or Cursor on a production system. But the label stuck to everything. Now "vibe coding" covers the whole spectrum: the non-coder who generated a Lovable app with misconfigured row-level security and shipped it anyway, and the seasoned architect who scaffolded a microservice in an afternoon, reviewed every file, caught the hallucinated auth logic, and fixed the three lines that mattered. Those two people are not doing the same thing. There's an arXiv paper from late 2025 literally titled "Professional Software Developers Don't Vibe, They Control." Field observations of experienced developers found they value agents as productivity multipliers while refusing to give up design and architecture decisions, because they can tell when the output is wrong. That's accelerated engineering with a skill foundation underneath it, not vibing. CodeRabbit studied 470 open-source PRs in 2025. AI-generated code had 1.7x more major issues and 2.74x the security vulnerabilities of human-written code. The gap doesn't close because you're using Cursor. It closes when someone with real fundamentals is auditing the diff. The people shipping slop aren't vibe coding wrong. They're missing the foundation that makes any of this worth shipping. CS fundamentals, architecture instincts, language-agnostic problem-solving: those aren't less valuable in an AI-accelerated workflow. They're more valuable. The AI amplifies what you bring. Bring confusion; get chaos. Bring fundamentals; get velocity. If you can steer, it doesn't matter that you didn't build the engine.
[#vibecoding](<https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Ffeed%2Fhashtag%2Fvibecoding&trk=public_post-text>) [#softwaredevelopment](<https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Ffeed%2Fhashtag%2Fsoftwaredevelopment&trk=public_post-text>) [#AI](<https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Ffeed%2Fhashtag%2Fai&trk=public_post-text>)

- 
    
    [A developer reviewing code diffs on multiple monitors in a dark workspace, with text overlay reading STEER THE SHIP.](<https://media.licdn.com/dms/image/v2/D5622AQGmetAPrHc6Kw/feedshare-shrink_800/B56Z0f7WnKIMAc-/0/1774357146296?e=2147483647&v=beta&t=fZfVDnpIvBx6BVcD1Czp-1PhPz6upG00lW7yBB1pHOw>)
    
People ask me what keeps me up at night. My first answer: usually one of my 3 kids. My real answer: The way we assess engineering talent is not even close to caught up with where the work actually is. We are still interviewing for a world that doesn't exist anymore. Whiteboard a sorting algorithm. Solve this LeetCode problem in 45 minutes. Walk me through your system design on a blank canvas. Meanwhile, the best engineers I talk to every day are building with Cursor, Claude, Codex. They're shipping entire features by describing what they want in plain English. They're debugging by pasting error logs into a chat window and getting working fixes in seconds. The skill that matters most right now is not "can you write a binary search from memory." It's "can you see what needs to be built, tell an AI exactly what you need, and know whether what comes back is good enough to ship." That's taste. That's judgment. That's architecture thinking. None of that shows up in a 45-minute coding screen. The companies that figure out how to test for the engineer of 2026 instead of the engineer of 2018 are going to win the talent war before anyone else realizes the rules changed. We're not close. And it keeps me up at night.