Big Questions About AI Agents and the Future of Software Engineering

As AI collapses the distance between an idea and working software, we have to confront the realities of developing software from first principles.

As I adopt AI more deeply in my own work, and as I try to help everyone around me leverage it more effectively, I find myself wrestling with questions that don't have easy answers. These aren't predictions. They're not pronouncements. They're the things keeping me up at night as I try to make sense of a fundamental shift in how software gets built.


What Does "Competence" Mean When Software Is One Prompt Away?

The debate about whether developers will be replaced misses the point entirely. Software isn't shrinking. It's exploding. Every workflow, every decision, every customer touchpoint is becoming more software-dependent.

The main difference today is that the distance between a good idea and "working software" has collapsed to almost nothing.

That collapsed distance doesn't account for what it means to deploy and operate software over long periods of time. It doesn't account for edge cases that only emerge under real load. Or data migrations. Or the integration you forgot about. Or the compliance requirement you didn't know existed.

And it means you can build something, roll it out, ship it to production so fast that you only realize you've made a massive mistake after it's already affecting customers. The feedback loop that used to slow you down also gave you time to think. Now that loop is gone.

To make matters worse, AI is inconsistent in ways that are hard to predict. Sometimes it considers security implications, handles edge cases, optimizes for performance. You start to trust it. Then it forgets. It generates code with an obvious vulnerability. It ignores a constraint it respected in the last session.

The unpredictability is the problem. You can't rely on it being thorough. You also can't assume it will always miss things. It keeps you off balance.

I keep coming back to this framing: we are now one "good" question away from a breakthrough and one "bad" question away from disaster. When you can generate a thousand lines of code in minutes, the quality of your intent matters more than ever.

So what does "highly competent" look like in this world?

Implementation skill still matters, but for different reasons. You need deep technical knowledge not to write the code yourself, but to recognize when the AI got it wrong. Without that foundation, you'll accept what the AI hands you because it compiles. Because it looks reasonable. Because you don't know enough to see the flaw. The false assumption that AI did the right thing is one of the most dangerous failure modes I'm seeing. The only defense is genuine expertise. Maybe this is what AGI or ASI is really meant to solve, but depending on who you ask, we are not there yet.

But implementation skill alone isn't enough anymore either. You need system-level thinking. How pieces fit together. What the second-order effects are. What breaks when load increases. You need taste and judgment. Knowing what to build, not just how. You need skepticism. The instinct to ask "does this actually solve the problem?" when the AI hands you something it says is "great." And finally, you need domain expertise. AI can write code, but it doesn't understand your business, your customers, or your regulatory environment.

The irony isn't lost on me: we need more human competence, not less. The people who thrive will be those who can direct AI toward the right problems and catch it when it confidently and swiftly builds the wrong thing.


Speed Creates New Failure Modes

"It works" is now table stakes. AI can produce functional code quickly. Once oyu have functional code you then get to the hard parts. Is it secure? Does it introduce vulnerabilities we won't discover until production? Is it maintainable? Will anyone understand why it was built this way in six months? Does it compose cleanly with your existing systems? Will it survive real traffic, real data volumes, real edge cases?

The seductive promise of AI-assisted development is velocity. And we are faster. But speed without direction is just arriving at the wrong destination sooner.

There's a tempting mental model emerging: if AI can regenerate code so quickly, why maintain anything? Just rewrite it when requirements change! For some software, this might actually work. Internal tools. Prototypes. Early-stage products where you can afford downtime. I'm also a big believer that most of the software the world relies on today SHOULD be rewritten with AI assistance. We could make a huge dent in transitioning legacy C code to memory-safe alternatives, for example.

But for systems that run 24/7? The realities are just as difficult as before. Our systems have state, customer data, integrations, regulatory requirements. You can't "burn down" a system that's processing transactions right now. You can't casually regenerate a codebase that took years to accumulate domain knowledge about edge cases and failure modes. The speed of code generation doesn't change the complexity of live migrations, backward compatibility, or the fact that real users depend on the system while you're changing it.

AI makes you feel like everything is disposable. But your obligations to customers, partners, and regulators don't care about that feeling.

Closing the Loop

There's a discipline emerging that separates effective AI-assisted development from chaotic AI-assisted development: keeping the loop closed.

When you set an AI agent on a task, it will work. It will produce output. It will keep going. But without attention, that loop drifts. The agent solves a slightly different problem than you intended. It makes an architectural choice that conflicts with something three files away. It builds something that technically meets the prompt but misses the actual goal.

The developer's job has shifted from writing code to ensuring the loop stays closed. One of the most difficult challenges going forward is reducing how often you need to check in. Constant supervision doesn't scale. If you have to watch every step, you've lost most of the leverage AI offers.

The real skill is building an environment where the agent can verify its own work and fail fast without you. Tests that exercise the intended behavior, not just the happy path. Clear constraints the agent can check against. Fast feedback that surfaces drift early, before the agent builds three more layers on a bad foundation. Success criteria that are machine-verifiable, not just "looks right to a human."
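
To make that concrete, here's a minimal sketch of what a machine-verifiable gate might look like. The specific checks (pytest, ruff, a dependency allowlist) and the PEP 621 pyproject layout are assumptions about a hypothetical Python project; the point is that the agent can run this after every change and fail fast without waiting for a human.

```python
# A minimal sketch of a machine-verifiable gate an agent can run after every
# change. The specific checks (pytest, ruff, a dependency allowlist) and the
# pyproject layout are assumptions about a hypothetical project.
import re
import subprocess
import sys
import tomllib
from pathlib import Path

ALLOWED_DEPENDENCIES = {"requests", "pydantic"}  # hypothetical project constraint


def run(cmd: list[str]) -> bool:
    """Run a command and report whether it succeeded."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"FAILED: {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return result.returncode == 0


def dependencies_unchanged() -> bool:
    """An explicit constraint the agent can check: no new top-level dependencies."""
    pyproject = tomllib.loads(Path("pyproject.toml").read_text())
    declared = {re.split(r"[\[<>=~!; ]", dep, maxsplit=1)[0]
                for dep in pyproject["project"]["dependencies"]}
    unexpected = declared - ALLOWED_DEPENDENCIES
    if unexpected:
        print(f"FAILED: unexpected dependencies introduced: {unexpected}")
    return not unexpected


if __name__ == "__main__":
    checks = [
        run(["pytest", "-q"]),        # intended behavior, not just the happy path
        run(["ruff", "check", "."]),  # style and obvious defects
        dependencies_unchanged(),     # a machine-verifiable architectural constraint
    ]
    sys.exit(0 if all(checks) else 1)
```

None of these checks is sophisticated on its own. The leverage comes from the agent being able to run them constantly, surfacing drift before it compounds.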

The less you invest in that environment, the more you babysit. The more robust that environment, the more autonomy you can grant and the more leverage you get.

This is a new kind of engineering work: building the scaffolding that lets AI run safely, not just directing what it builds.

The Failure Modes I'm Watching For

I keep a running list of the things that worry me most. Building the wrong thing faster than ever, with more confidence. Accumulating AI-generated code that nobody understands. Losing institutional knowledge because "the AI will figure it out." Creating brittle systems that work in demo but fail under real conditions.

The questions I don't have good answers to: How do we capture the speed benefits without the recklessness? What guardrails ensure we're building the right thing quickly, not just a thing? How do we preserve institutional knowledge when the codebase becomes more fluid?


Our Tools Are Starting to Crack

The infrastructure of modern software development was designed for human-scale throughput. CI/CD pipelines. Pull request workflows. Code review processes. Monorepos. Branch strategies. All of it assumes humans are the bottleneck. Humans type at a certain speed. Humans review at a certain speed. Humans hold a certain amount of context in their heads.

AI breaks these assumptions.

When you generate hundreds of lines of code in seconds, a workflow designed around reviewing a human's afternoon of work becomes a bottleneck. When AI touches every file in a codebase simultaneously, branching strategies designed to prevent merge conflicts feel quaint.

I'm already seeing the cracks. Code review queues backing up because reviewers can't keep pace. Testing suites that were "comprehensive" for human-paced changes now full of coverage gaps introduced at AI speed. Documentation that was "good enough" when humans wrote the code becoming useless when AI generates code with different assumptions.

We need a way to quickly figure out which of our current practices are genuinely essential and which are just leftover habits from pre-AI constraints.

Some practices exist because they're fundamentally important. Ensuring code does what we intend. Catching security vulnerabilities before production. Maintaining shared understanding across the team.

But other practices exist because of constraints that AI has removed. Detailed line-by-line code review made sense when code was expensive to produce. Extensive upfront design made sense when rework was costly. Careful manual testing of happy paths made sense when writing tests was slow.

I don't have answers yet, but I find myself wrestling with these tradeoffs constantly. Code review matters for security, correctness, and shared understanding. But is catching typos and enforcing style still worth the time? PR-based workflows provide audit trails. But are they just gating mechanisms designed for human throughput? Comprehensive test suites give us confidence. But are they compensating for feedback loops that no longer need to be slow?

The questions that keep nagging at me: Which of your processes are slowing you down without adding proportional value? What does "code review" even mean when the reviewer can't meaningfully evaluate AI-generated output? Are you investing in the right infrastructure for the next three years, or optimizing for the last ten?


Compliance and Trust When Nobody Reads the Code

Now let's talk about everyone's favorite topics: compliance and regulations.

SOC 2, ISO 27001, and most compliance frameworks never explicitly require human code review. The language is deliberately ambiguous. SOC 2's CC8.1 requires that changes be "tested" and "approved" but uses passive voice throughout. NIST's Secure Software Development Framework frames human review as an organizational choice, not a mandate.

So the policies themselves? Technically, they might allow AI-assisted or even AI-only review.

But here's where it breaks down: auditors are human, and their job is to interpret ambiguous language. That interpretation is shaped by how things have always worked. The engagement manager's training. The sampling methodology. The interview questions. All of it was built on the assumption that a human developer looked at the code, understood it, and consciously approved it. That's what "review" has always meant in practice, even if the policy never said so.

When an auditor pulls a sample of 25 merge requests and sees "Approved by: claude-sonnet-4" in the log, they're going to flag it. They'll escalate. And even if someone agrees it might satisfy the control, they'll probably write it up as an observation because nobody wants to be the first to bless it.
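
To make the evidence problem concrete, here's a tiny sketch, using made-up sample data and a hypothetical naming convention for AI reviewers, of what that sampling exercise actually surfaces: the merge requests where no human ever approved the change.

```python
# A small sketch (hypothetical data shape and naming convention) of what an
# auditor's sample surfaces: merge requests approved only by AI reviewers.
AI_REVIEWER_PREFIXES = ("claude-", "gpt-", "copilot")  # assumption: bot naming convention

sampled_merge_requests = [  # illustrative dummy data, not real audit evidence
    {"id": 1041, "approvers": ["claude-sonnet-4"]},
    {"id": 1042, "approvers": ["jsmith", "claude-sonnet-4"]},
    {"id": 1043, "approvers": ["gpt-4o"]},
]


def is_human(approver: str) -> bool:
    return not approver.startswith(AI_REVIEWER_PREFIXES)


ai_only = [mr["id"] for mr in sampled_merge_requests
           if not any(is_human(a) for a in mr["approvers"])]
print(f"Merge requests with no human approval: {ai_only}")  # -> [1041, 1043]
```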

This is why the current situation is broken and most people don't even realize it.

We have ambiguous policy language that technically might permit AI review. We have auditors whose framework assumes human review. We have organizations checking boxes that appear to satisfy controls. And we have nobody reconciling these two realities.

Let's be honest: that premise—that "eyes on every change" mitigates risk—was already fiction before AI. Developers have always rubber-stamped PRs. Reviewers have always skimmed code they didn't fully understand. The control was never as strong as the compliance narrative suggested. AI hasn't created this problem. It's just expanded it dramatically and made it impossible to ignore.

When AI generates thousands of lines of code and a human "reviews" it by skimming for obvious issues, we've maintained the form of code review without the substance. We check the box. PR approved. Two reviewers signed off. But the control isn't controlling anything. And the auditors who are supposed to catch this? Their playbooks weren't written for this world.

I keep coming back to three questions I don't have good answers to.

First: what replaces code review as the attestation of quality and intent? Automated testing catches some issues but not intent or architectural problems. AI-assisted review means trusting AI to review AI. In my experience, this actually works surprisingly well, especially when you use different models from different companies to review the same code. The models catch different things. They have different blind spots. Using Claude to review code written with GPT, or vice versa, surfaces issues that either model alone would miss. It's not a complete solution, but it's more effective than I expected. The question is whether auditors will accept it. Outcome-based metrics like incident rates tell you something, but they're lagging indicators.
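
As a rough illustration of that cross-model review idea, here's a minimal sketch that sends the same diff to models from two different vendors and collects their findings. The model names and prompt are placeholders, not a recommendation of a specific setup, and this assumes API keys are available in the environment.

```python
# A minimal sketch of cross-vendor review: the same diff goes to models from
# two different companies and their findings are merged. Model names and the
# prompt are placeholders; swap in whatever your organization has access to.
import anthropic
from openai import OpenAI

REVIEW_PROMPT = (
    "You are reviewing a code diff. List concrete problems only: "
    "security issues, incorrect logic, missing edge cases. "
    "If you find nothing, say 'no findings'.\n\n{diff}"
)


def review_with_claude(diff: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
    )
    return msg.content[0].text


def review_with_gpt(diff: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
    )
    return resp.choices[0].message.content


def cross_review(diff: str) -> dict[str, str]:
    """Collect findings from both vendors; different models, different blind spots."""
    return {"claude": review_with_claude(diff), "gpt": review_with_gpt(diff)}
```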

Second: how do we explain to an auditor that the code works but we didn't read it? Auditors ask for evidence of code review. We show timestamps, approvals, PR comments. The reality is that nobody meaningfully evaluated 80% of those lines. This feels like a compliance gap waiting to become a finding. Right now, nobody has a good script for that conversation.

Third: what's the new trust contract between businesses? When you buy our software or integrate with our platform, you're trusting our process. Our process now includes AI-generated code that humans don't fully understand. How do we represent that honestly? What do customers have a right to know?


Where This Leaves Me

I don't have clean answers to these questions. Nobody does. We're in the middle of a transition moving faster than our ability to develop consensus frameworks.

What I do know: this is happening whether we're ready or not. Our competitors are adopting AI-assisted development. Choosing not to participate is choosing to fall behind.

Human judgment matters more, not less. The leverage has shifted from "can you implement this?" to "do you know what to implement and can you tell when it's wrong?"

Our current processes need examination. Some will remain essential. Others are just habits. We need to figure out which is which before they become bottlenecks.

And the compliance question is urgent. We're in a regulatory gray area that won't stay gray forever. Being thoughtful now is better than being reactive later.

I'm genuinely excited to see where this leads, even when it's uncomfortable. For now, I'm focused on navigating this transition thoughtfully. One question at a time.

I'd love to hear how others are thinking about these challenges and what ideas they're using to tackle them.


