> LLMs are pretty good at picking up the style in your repo. So keeping it clean and organized already helps.
At least in my experience, they are good at imitating a "visually" similar style, but they'll hide a lot of coupling that is easy to miss, since they don't understand the concepts they're imitating.
They think "Clean Code" means splitting into tiny functions, rather than cohesive functions. The Uncle Bob style of "Clean Code" is horrifying
They're also very trigger-happy about adding methods to interfaces (or contracts) that leak implementation detail, or adding methods just for testing, which means they end up testing implementation rather than behavior.
Yeah. We have an AI reviewer. Just now I had a PR where I didn't normalize paths in some configuration and then compared them. I.e. let's say the configuration had
file = /foo/bar/
and then my code would do:
if file == other_file:
...
instead of:
if normalized(file) == normalized(other_file):
...
and the AI reviewer suggested a fix by removing the trailing slash instead. I.e. the fix would have "worked", but it would've been a bad fix, because the configuration isn't under the program's control and the program can't ensure paths arrive normalized.
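The robust version of that fix normalizes both sides before comparing. A minimal Python sketch (the `file`/`other_file` names are just the hypothetical values from above):

```python
import os.path

def normalized(path: str) -> str:
    """Collapse redundant separators and trailing slashes before comparing."""
    # normpath turns "/foo/bar/" into "/foo/bar" and "/foo//bar" into "/foo/bar"
    return os.path.normpath(path)

# The comparison no longer depends on how the user happened to write the path:
assert normalized("/foo/bar/") == normalized("/foo/bar")
```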
I've encountered a lot of attempted "fixes" of this kind. So, in my experience, it's good to have AI look at the PR because it's just another free pair of eyes, but it's not very good at fixing the problems when it does find them. Also, it tends to miss conceptual problems, concentrating on borderline irrelevant issues (e.g., an error raised in a use case beyond the scope of the program, like a CI script not handling the case where Git isn't installed, which is of no real merit).
Basically, nothing has changed except the increase in noise. So all the suits who refuse to understand what software is have yet again decided to make things worse for professionals and for people who actually know what they're doing.
The departments / roles that LLMs most deeply need to be pointed at - business development, contracts, requirements, procurement - are the places least likely to get augmented, due to how technology decisions are made, structurally and socially.
I've already heard - many times - that the place that needs the LLMs isn't really inside the code. It's the requirements.
History has a ton of examples of a new technology that gets pushed, but doesn't displace the culture of the makers & shakers. Even though it is more than capable of doing so and indeed probably should.
Not sure: The LLMs seem to be okay at coding recently but still horrible at requirements and interface design. They never seem to get the perspective right.
One example I recently encountered: The LLM designed the client/consumer facing API from the perspective of how it's stored.
The result was a collection of CRUD services for tables in a SQL db rather than something that explained the intended use.
I'm guessing a lot of similar debates were had in the 1970s when we first started compiling C to Assembly, and I wonder if the outcome will be the same.
(BTW: I was not around then, so if I'm guessing wrong here please correct me!)
Over time compilers have gotten better and we're now at the point where we trust them enough that we don't need to review the Assembly or machine code for cleanliness, optimization, etc. And in fact we've even moved at least one abstraction layer up.
Are there mission-critical inner loops in systems these days that DO need hand-written C or Assembly? Sure. Does that matter for 99% of software projects? Negative.
I'm extrapolating that AI-generated code will follow the same path.
The high level language -> assembly comparison seems like an apt analogy for using LLMs, but I would argue it is only a weak one. The reason is that, previously, both the high level language and the assembly language had well-defined semantics and the transform was deterministic, whereas now you are using English or another human language, with ambiguities and without well-defined semantics. Mathematical notation was invented in the first place because human language did not have the required unambiguous precision, and if we hit that hurdle with LLMs, we may need to reinvent it once more.
This debate didn't start with C. Compilers existed pretty much since Algol, and the debate was there too.
There were also similar arguments about certain other mechanisms that programmers expected to become obsolete over time, but which never did. For example, all the "visual programming" that later transformed into "no code" programming never really delivered on the promise. (We still have a bunch of tools and languages that carry the "visual" adjective, but very few people remember why it was there in the first place.) E.g. Visual Basic was supposed to be visual because you'd build interfaces by dragging components into the designer view of your IDE, then positioning and labeling them in a graphical-editor-style interface. Not only did MSVS come with a "designer" view, Eclipse had one too, probably more than one. Swing had its own "designer", as did a bunch of other UI frameworks.
Programmers also believed that flat files would be gone, replaced by some sort of structured database with an advanced query mechanism. Again, it never happened, and the idea is mostly abandoned today, as storage moved in a completely different direction.
Object databases (not to be confused with object stores like S3): the direction that seemed all but assured at the dawn of objects never really happened either. Instead we still struggle with ORMs, and it's not certain that ORMs won't die off eventually.
I wouldn't be so quick to predict that AI prompt libraries will replace source code. There are many problems with this idea besides the quality of the output: the control of the AI agents and their longevity (with source code, this is ensured by standards), and collaborative programming: how are people without access to the agent supposed to work with the prompt library instead of the source code?
I don't think we are anywhere close to the point where we can realistically expect prompt libraries to be a generally accepted replacement for the source code. In some niche cases? -- perhaps, but not more than that.
Here's the thing about clean code: is it really good, or is it just something that people get familiar with, and familiarity is all that actually matters?
You can't really run the experiment because to do it you have to isolate a bunch of software engineers and carefully measure them as they go through parallel test careers. I mean I guess you could measure it but it's expensive and time consuming and likely to have massive experimental issues.
Although now you can sort of run the experiment with an LLM. Clean code vs unclean code. Let's redefine clean code to mean this other thing. Rerun everything from a blank state and then give it identical inputs. Evaluate on tokens used, time spent, propensity for unit tests to fail, and rework.
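The bookkeeping side of that experiment is simple enough to sketch. A Python outline, where the metric names mirror the ones proposed above and everything (including how you'd actually drive the agent) is hypothetical:

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class RunMetrics:
    """One agent run under one coding-style prompt."""
    tokens_used: int
    seconds_spent: float
    failing_tests: int
    rework_rounds: int

@dataclass
class StyleReport:
    style: str
    runs: list = field(default_factory=list)

    def summary(self) -> dict:
        return {
            "style": self.style,
            "avg_tokens": mean(r.tokens_used for r in self.runs),
            "avg_seconds": mean(r.seconds_spent for r in self.runs),
            "avg_failures": mean(r.failing_tests for r in self.runs),
            "avg_rework": mean(r.rework_rounds for r in self.runs),
        }

def compare(style_a: StyleReport, style_b: StyleReport) -> str:
    """Pick the cheaper style by average token cost (one of many possible criteria)."""
    a, b = style_a.summary(), style_b.summary()
    return a["style"] if a["avg_tokens"] <= b["avg_tokens"] else b["style"]
```

The hard part, of course, is holding everything else constant between the two runs, not the aggregation.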
The history of science and technology is people coming up with simple but wrong untestable theories which topple over once someone invents a thingamajig that allows tests to be run.
I'm with you... personally, I always found Clean/Onion to be more annoying than helpful in practice... you're working on a feature or section of an application, only now you have to work across disconnected, mirrored trees of structure in order to work on a given feature.
I tend to prefer a Feature-Centric Layout or even Vertical Slices, where related work is closer together based on what is being worked on, as opposed to the type of work being done. I find it far more discoverable in practice, while staying simpler and easier to maintain over time; no need to add unnecessary complexity at all. In general, you don't need a lot of the patterns introduced by Clean or Onion structures, since you aren't creating multiple in-production implementations of each interface and you don't need that kind of inheritance for testing.
Just my own take... which of course, has been fighting upstream having done a lot of work in the .Net space.
I am in .net as well. The clean code virus runs rampant.
Swimming in DTOs and ViewModels that are exact copies of Models; services with two methods in them: a command method and the actual command the command method calls, when the calling class already has access to the data the command is executing on; 3 layers of generic abstractions that ultimately boil down to a 3-method class.
Debugging anything is a nightmare with all the jumps through all the different classes. Hell, just learning the code base was a nightmare.
Now I'm balls deep in a warehouse migration, which means rewriting the ETL to accommodate both systems until we flip the switch. And the people who originally wrote the ETL apparently didn't read the documentation for any of it.
I have a slightly more practical approach here (write-up at some point): the most important thing is to rethink how you are instructing the agents and not to rely only on your existing codebase, because: 1) you may have some legacy practices, 2) it is a reflection of many hands, 3) it becomes very random based on what files the agent picks up.
Instead, you should approach it as if instructing the agent to write "perfect" code (whatever that means in the context of your patterns and practices, language, etc.).
How should exceptions be handled? How should parameters be named? How should telemetry and logging be added? How should new modules be added? What are the exact steps?
Do not let the agent randomly pick from your existing codebase unless it is already highly consistent; tell it exactly what "perfect" looks like.
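Concretely, that kind of instruction file might contain entries like the following (an illustrative excerpt, not a recommendation of these specific rules):

```
## Error handling
- Never swallow exceptions; wrap and re-raise with context.
- Library code raises; only the CLI entry point logs and exits.

## Naming
- Parameters are descriptive nouns (`timeout_seconds`, not `wait`).
- Booleans read as predicates (`is_retryable`, `has_pending_writes`).

## Telemetry
- Every public entry point emits one structured log line on entry and one
  on exit, including duration.

## Adding a module
1. Create `src/<name>/` with a README describing its single responsibility.
2. Register the module in the dependency-direction check in CI before
   writing any code.
```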
I’ll go against the prevailing wisdom and bet that clean code does not matter any more.
No more than the exact order of items being placed in main memory matters now. This used to be a pretty significant consideration in software engineering until the early 1990s. This is almost completely irrelevant when we have ‘unlimited’ memory.
Similarly generating code, refactoring, implementing large changes are easy to a point now that you can just rewrite stuff later. If you are not happy about how something is designed, a two sentence prompt fixes it in a million line codebase in thirty minutes.
You haven't worked on or serviced any engineering systems, I can tell.
There are fundamental truths about complex systems that go beyond "coding". Patterns can be observed in nature where engineering principles and "prevailing wisdom" are truer than ever.
I suggest you take some time to study systems that are powering critical infrastructure. You'll see and read about grizzled veterans that keep them alive, and how they are even more religious about clean engineering principles, and how "prevailing wisdom" is very much needed and will always be needed.
That said there are a lot of spaces where not following wisdom works temporarily. But at scale, it crashes and crumbles. Web-apps are a good example of this.
It is an interesting possibility that must be considered. Only time will tell. However I disagree.
I think complex systems will still turn into a big ball of mud and AI agents will get just as bogged down as humans when dealing with it. And even though re-build from scratch is cheaper than ever, it can't possibly be done cheaply while also remembering the millions+ of specific characteristics that users will have come to rely on.
Maybe if you pushed spec-driven development to the absolute extreme, but I don't think pushing it that far is easy or cheap. Just as the effort to go from 90% unit test coverage to 100% is hard and possibly not worth it, I expect a similar barrier around extreme spec-driven development.
Clarification: I'm advocating clean code in the generic sense, not Uncle Bob's definition.
> Clean code tends to equal simple code, which tends to equal fast code.
Wat? Approximately every algorithm in CS101 has a clean and simple N^2 version, a long menu of complex N*log(N) versions, and an absolute zoo of special cases grafted onto one of the complex versions if you want the fastest code. This pattern generalizes out of the classroom to every corner of industry, but with less clean names+examples. The universal truth is that speed and simplicity are very quick to become opposing priorities. It happens in nanoseconds, one might say.
Cache-aware optimization in particular tends to create unholy code abominations, it's a strange example to pick for clean=simple=fast wishcasting.
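The minimum-gap problem is a compact example of the pattern: the clean version is quadratic, the fast one drags in sorting plus an invariant you have to know about (only adjacent elements of the sorted list can be closest). A Python sketch:

```python
def min_gap_clean(xs):
    """Readable O(n^2): compare every pair."""
    return min(abs(a - b) for i, a in enumerate(xs) for b in xs[i + 1:])

def min_gap_fast(xs):
    """O(n log n): sort once; only adjacent elements can be closest."""
    s = sorted(xs)
    return min(b - a for a, b in zip(s, s[1:]))
```

Both return 1 for `[10, 2, 14, 3]`, but the fast one is already relying on a non-obvious property, and the truly fast versions (vectorized, cache-blocked) only get uglier from here.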
I'm not sure if you are considering the patterns actually used in "Clean Code" architectures... which create a lot of, admittedly consistent, levels of interface abstractions in practice. This is not what I would consider simple/kiss or particularly easy to maintain over time and feature bloat.
I tend to prefer feature-oriented structures as an alternative, which I do find simpler and easy enough to refactor over time as complexity is required and not before.
Nanoseconds matter in a minuscule number of high-frequency and algorithmic trading use cases. They do not matter in the majority of finance applications. No consumer finance use case cares about nanoseconds. The vast majority of money is moved via ACH, which clears via fixed-width text files shared over SFTP, processed once a day. Nanoseconds do not matter here.
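Fixed-width parsing is exactly as unglamorous as it sounds. A Python sketch with a made-up three-field layout (the real NACHA record layouts are different; this only shows the mechanism):

```python
# Hypothetical layout: cols 0-9 account, 10-19 amount in cents, 20-39 payee name.
FIELDS = {
    "account": slice(0, 10),
    "amount_cents": slice(10, 20),
    "name": slice(20, 40),
}

def parse_record(line: str) -> dict:
    """Slice one fixed-width record into named, typed fields."""
    rec = {name: line[sl].strip() for name, sl in FIELDS.items()}
    rec["amount_cents"] = int(rec["amount_cents"])
    return rec

rec = parse_record("0012345678" + "0000012550" + "JANE DOE".ljust(20))
# rec == {"account": "0012345678", "amount_cents": 12550, "name": "JANE DOE"}
```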
Humans are quite capable of bankrupting financial companies with coding issues. Knight Capital Group introduced a bug into their system while using high frequency trading software. 45 minutes later, they were effectively bankrupt.
The llm is forced to eat its own output. If the output is garbage, its inputs will be garbage in future passes. How code is structured makes the llm implement new features in different ways.
Why would “messy” code be garbage? Also LLMs do a great job even today at assessing what code is trying to do and/or asking you for more context. I think the article is well balanced though: it’s probably worth it for the next few months to try to help the agent out a bit with code quality and high level guidance on coding practices. But as OP says this is clearly temporary.
The definitions of what is messy or clean will change with llms…
But there will always be a spectrum of structures that are better for the llm to code with, and coding with less optimal patterns will have negative feedback effects as the loop goes on.
I agree with you but you can dedicate tokens to fixing the bad code that agents do today. I don’t disagree with anything you’re saying. I think the practical implication is instead of pain and jira we’ll just have dedicated audit and refactor token budgets.
I'm dealing with a situation right now where a critical mass of "messy" code means that nobody, human or LLM, can understand what it is trying to do or how a straightforward user-specified update should be applied to the underlying domain objects. Multiple proposed semantics have failed so far.
On the plus side.. AI is pretty good at creating (often excessive) tests around a given codebase in order to (re)implement the utility using different backends or structures. The one thing to look out for is that the agent does NOT try to change a failing test, where the test is valid, but the code isn't.
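The usual guard here is a characterization (golden-master) test pinned before the rewrite starts. A Python sketch, where `legacy_tax` is a hypothetical stand-in for whatever behavior you're preserving:

```python
def legacy_tax(subtotal_cents: int) -> int:
    """Stand-in for the existing behavior you want to preserve."""
    return subtotal_cents * 8 // 100  # 8% tax, floor-division quirks and all

# Pin current behavior, including the weird edge cases, BEFORE changing backends.
GOLDEN = {0: 0, 1: 0, 99: 7, 100: 8, 12550: 1004}

def test_characterization():
    for subtotal, expected in GOLDEN.items():
        assert legacy_tax(subtotal) == expected
```

If the agent proposes editing `GOLDEN` rather than the implementation, that's exactly the failure mode to reject in review.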
Swift is at least in the TIOBE Top 20 (#20) and Scratch is at #12 but more educational. I'd also add Kotlin and Dart as contenders which sit just outside the top 20.
Rust is still considered by many to be pretty niche... as much as I like Rust and as widely as it is starting to be used. I especially like it with agent/ai use as the agent output tends to be much higher quality than other languages I've tried with them.
I actively use AI to refactor a poorly structured two million line Java codebase. A two-sentence prompt does not work. At all.
I think the OP is right; the problem is context. If you have a nicely modularized codebase where the LLM can neatly process one module at a time, you're in good shape. But two million lines of spaghetti requires too much context. The AI companies may advertise million-token windows, but response quality drops off long before you hit the end.
You still need discipline. Personally I think the biggest gains in my company will not come from smarter AIs, but from getting the codebase modularized enough that LLMs can comfortably digest it. AI is helping in that effort but it's still mostly human driven - and not for lack of trying.
I think clean architecture matters a lot, even more so than before. I get that you can just rewrite stuff, but that comes with inherent risk, even in the age of agents.
Supporting production applications with low MTTR is what matters most to me. If you are relying entirely on your agent to identify and fix a production defect, I'd argue you are out at sea in a very scary place (comprehension debt and all that). It is in these cases where architecture and organization matter, so you can trace the calls and see what's broken. I get that the code is largely a black box as fewer and fewer people review the details, but you still have to review the architecture and design, and that's not going away. To me, things like SRP, SOLID, and DRY are ever more important.
Among other reasons, one reason for clean code is that it avoids bugs. AIs producing dirty code produce more bugs, like humans. AIs iterating on dirty code produce more bugs, like humans.
I'm saying that in the before times, complexity emerged over time (staff changes, feature creep). AI coding (and its volume) is just speedrunning this issue.
So complexity is an issue? I don't get it. SFDC is an incredibly complex system that makes billions of dollars. Tell me why I would NOT want to be able to create a system like that with an automated tool?
> a two sentence prompt fixes it in a million line codebase in thirty minutes.
Could you please create a verifiable and reproducible example of this? In my experience, agents get slower the larger a repository is. Maybe I'm just very strict with my prompts, but while initial changes in a greenfield project might take 5-10 minutes for each change, unless you deeply care about the design and architecture, you'll reach 30 minute change cycles way before you reach a million lines of code.
No, it's valid still in 2026, I'm literally using most agents on a day to day basis.
As mentioned, I'm happy to be proven otherwise; otherwise, please stop just saying "It's not like that anymore" when supposedly it's so easy to prove. PoC||GTFO, as we used to say back in the day when the people shipping software actually knew how to write software.
I am getting the impression that you'd be moving goalposts :)
I just checked out clang+llvm, 24 million lines of code, and implemented a simple C++ language extension with claude 4.6 opus high in less than five minutes. Complete with tests. Debugging and optimizations working seamlessly.
(It's a simple pattern matching implementation, if anyone's curious.)
> I am getting the impression that you'd be moving goalposts :)
Nope, the original goal post you set was "implementing large changes are easy [...] a two sentence prompt fixes it in a million line codebase in thirty minutes", I'm more than happy for you to prove just that, no moving.
> I just checked out clang+llvm, 24 million lines of code, and implemented a simple C++ language extension with claude 4.6 opus high in less than five minutes. Complete with tests. Debugging and optimizations working seamlessly.
Awesome! Please share the steps to reproduce, results and the prompt?
> I’ll go against the prevailing wisdom and bet that clean code does not matter any more. No more than the exact order of items being placed in main memory matters now.
This is a really funny comment to make when the entire Western economy is propped up by computers doing multiplication of extremely large matrices, which is probably the single most obvious CompSci 101 example of when the placement of data in memory is really, really important.
If it's easier for a human to read and grasp, it will use less context and be less error-prone for the LLM. If the entities are better isolated, then you also save context and time when making changes, since the blast radius of a change is contained.
Clean code matters because it saves cycles and tokens.
If you're going to generate the code anyway, why not generate "pristine" code? Why would you want the agent to generate shitty code?
I hope you understand that close to 100% of software developers employed by large companies have effectively unlimited access to the latest models and tools.
Denialism would have presumably worked if this was something not a lot of people had seen or used.
No certainty just like we don’t have certainty when we do it ourselves. What I’m saying is you can audit and use the LLM to help you audit efficiently: finding code, explaining logic, visualizing things etc. it’s just as powerful a tool for understanding as it is for generation
I've been working on a client/server game in Unity the past few years and the LLM constantly forgets to update parts of the UI when I have it make changes. The codebase isn't even particularly large, maybe around 150k LOC in total.
A single complex change (defined as 'touching many parts') can take Claude code a couple hours to do. I could probably do it in a couple hours, but I can have Claude do it (while I steer it) while I also think about other things.
My current guess is that LLMs are really good at web code because its seen a shitload of it. My experience with it in arenas where there's less open source code has been less magical.
Funny you should mention that. I just used a two sentence prompt to do something straightforward. It took careful human consideration and 3 rounds of "two sentence" prompts to arrive at the _correct_ transformation.
I think you're missing the cost of screwing up design-level decision-making. If you fundamentally need to rethink how you're doing data storage, have a production system with other dependent systems, have public-facing APIs, and so on and so forth, you are definitely not talking about "two sentence prompts". You are playing a dangerous game with risk if you are not paying some of it off, or at the very least, accounting for it as you go.
Actually we're going back to caring about the order of atoms in main memory. When your code has good cache locality and prefetching it can run 100 times faster, no joke. Arranging your program so the data stays in a good cache order is called data-oriented design (not to be confused with domain-driven design).
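The usual illustration is array-of-structs vs struct-of-arrays. Python can't demonstrate the actual cache effect (everything is boxed), so this sketch only shows the layout change; in C or Rust the SoA loop is what walks contiguous memory:

```python
# Array-of-structs: each particle's fields live in a separate object,
# so iterating over just x hops all over the heap.
aos = [{"x": float(i), "y": 2.0 * i, "mass": 1.0} for i in range(4)]
total_x_aos = sum(p["x"] for p in aos)

# Struct-of-arrays: all x values sit side by side, which is the layout
# caches and prefetchers reward.
soa = {"x": [float(i) for i in range(4)],
       "y": [2.0 * i for i in range(4)],
       "mass": [1.0] * 4}
total_x_soa = sum(soa["x"])

assert total_x_aos == total_x_soa == 6.0  # same answer, different layout
```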
One thought I've had a few times is "well.. this is good enough, maybe a future model will make it better." so I won't completely disagree.
But my counter argument is that the generated code can easily balloon in size and then if you ever have to manually figure out how something works, it is much harder. You'll also end up with a lot of dead or duplicated code.
I started a side project that was supposed to be 100% vibe coded (because I have a similar view as you). I'm using go and Bubble Tea for a TUI interface. I wanted mouse interaction, etc.. It turns out it defaulted to bubble tea 1.0 (instead of 2.0). The mouse clicks were all between 1 and 3 lines below where the actual buttons were. I kept telling it that the math must be wrong. And then telling it to use Bubble objects to avoid all this crazy math.
I am now hand coding the UI because the vibe coded method does not work.
I then looked at the db-agent I was designing and I explicitly told it to create SQL using the LLM, and it does. But the ACTUAL SQL that it persists to the project is a separate SQL generator that it wrote by hand. The LLM one that gets displayed on the screen looks perfect, then when it comes down to committing it to the database, it runs an alternative DDL generator with lots of hard coded CREATE TABLE syntax etc... It's actually a beautiful DDL generator, for something written in like 2015, but I ONLY wanted the LLM to do it.
I started screaming at the agent. I think when they do take over I might be high up on their hit list.
Just anecdata. I still think in a year or two, we'll be right about clean code not mattering, but 2026 might not be that year.
Our company makes extensive use of architectural linters -- Konsist for Android and Harmonize for Swift. At this point we have hundreds of architectural lint rules that the AI will violate, read, and correct. We also describe our architecture in a variety of skills files. I can't imagine relying solely on markdown files to keep consistency throughout our codebase, the AI still makes too many mistakes or shortcuts.
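For rules those tools don't cover, a minimal homegrown architectural lint is just an AST walk. A Python sketch (the forbidden-edge list is purely illustrative):

```python
import ast

# Illustrative layering rule: "ui" modules must not import "db" directly.
FORBIDDEN = {("ui", "db")}

def imports_of(source: str):
    """Yield the module names imported by a piece of Python source."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            yield from (alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            yield node.module

def violations(module_layer: str, source: str):
    """Return imports that break the layering rules for this layer."""
    return [m for m in imports_of(source)
            if (module_layer, m.split(".")[0]) in FORBIDDEN]

assert violations("ui", "import db.models\nimport logging") == ["db.models"]
```

Hook something like this into CI and the agent gets the same read-violation-correct loop described above, without a dedicated tool.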
Do you have any recommendations for something like Konsist/Harmonize that works across multiple languages, by any chance? I've been looking for a solution to org-wide architectural linting that supports custom rules, so I could add rules for a Dockerfile as well as Python in the same way.
Ever since AI coding became a thing, Clean Code advocates have been trying to get LLMs to conform. I was hoping this submission would declare "Success!" and show how he did it, but sadly it's devoid of anything actionable.
I'm not a fan of Clean Code[1], but the only tip I can give is: Don't instruct the LLM to write code in the form of Clean Code by Robert Martin. Itemize all the things you view as clean code, and put that in CLAUDE.md or wherever. You'll get better luck that way.
4. Iterate on an AGENTS.md (or any similar file you can reuse) that you keep updating every time the agent does something wrong. Don't make an LLM write this file; write it in your own words. Iterate on it whenever the agent does something wrong, then retry the same prompt to verify it actually steers the agent correctly. Eventually you'll build up a relatively concise file with your personal "coding guidelines" that the agent can stick to with relative ease.
The last two weeks with Claude have been a nightmare for code quality; it outright ignores standards (in CLAUDE.md). Just yesterday I reviewed a PR from a coworker where it undid some compliant code and then proceeded to struggle with exactly what the standards were designed to address.
I threw in the towel last night and switched to codex, which has actually been following instructions.
in my experience, as long as you set up a decent set of agent definitions & a good skillset, and work in an already pretty clean codebase with established standards, the code quality an agent outputs is actually really good.
Couple that with a self-correcting loop (design->code->PR review->QA review in playwright MCP->back to code etc), orchestrated by a swarm coordinator agent, and the quality increases even further.
It's important to remember humans have shipped slop too, and code that isn't clean.
When the training is across code with varying styles, it is going to take effort to get this technology performing in a standardized way, especially when what's possible changes every 3 months.
> They're also very trigger-happy to add methods to interfaces (or contracts), that leak implementation detail
And so many helper factories in places that don't need them.
I generally agree - I think anyone who's seriously experimented with agentic coding has encountered both:
1 - Surprising success when an agent can build on top of established patterns & abstractions
2 - A deep hole of "make it work" when an LLM digs a hole it can't get out of, and fails to anticipate edge cases or discover hidden behavior.
The same things that make it easier for humans to contribute code make it easier for LLMs to contribute code.
> The result was a collection of CRUD services for tables in a SQL db rather than something that explained the intended use.
Good for storage, bad for building on it.
Can you expand on what you think software is in this context?
Why do you think that the suits refuse to understand what it is?
To the "suits" AI means "efficiency".
Efficiency also means "lower costs" to them, and when they talk about "costs" they mean "headcount", which is to say employees.
Put it together and the suits want to reduce headcount using AI.
To them, "clean code" is a scam and a waste of time that doesn't yield quick returns, just a weak excuse for software engineers to justify their roles.
our CLAUDE.md / AGENTS.md specifically calls out good engineering books, which I think does help:
From https://github.com/feldera/feldera/blob/main/CLAUDE.md:
- Adhere to rules in "Code Complete" by Steve McConnell.
- Adhere to rules in "The Art of Readable Code" by Dustin Boswell & Trevor Foucher.
- Adhere to rules in "Bugs in Writing: A Guide to Debugging Your Prose" by Lyn Dupre.
- Adhere to rules in "The Elements of Style, Fourth Edition" by William Strunk Jr. & E. B. White
e.g., mentioning Elements of Style and Bugs in Writing certainly has helped our review LLM to make some great suggestions about English documentation PRs in the past.
> - Adhere to rules in "The Elements of Style, Fourth Edition" by William Strunk Jr. & E. B. White
FYI: The third edition was the last one by E. B. White. The fourth edition was revised by someone whose identity is unclear. For something so opinionated, I'd like to know whose opinions I'm reading.
Not that it really matters for your LLM prompt, but it's worth pointing out.
I'm guessing a lot of similar debates were had in the 1970s when we first started compiling C to Assembly, and I wonder if the outcome will be the same.
(BTW: I was not around then, so if I'm guessing wrong here please correct me!)
Over time compilers have gotten better and we're now at the point where we trust them enough that we don't need to review the Assembly or machine code for cleanliness, optimization, etc. And in fact we've even moved at least one abstraction layer up.
Are there mission-critical inner loops in systems these days that DO need hand-written C or Assembly? Sure. Does that matter for 99% of software projects? Negative.
I'm extrapolating that AI-generated code will follow the same path.
The high level language -> assembly analogy seems apt for using LLMs, but I would argue it is only a weak one. The reason is that, previously, both the high level language and the assembly language had well-defined semantics and the transform was deterministic, whereas now you are using English or another human language, with ambiguities and without well-defined semantics. The reason math symbolisms were invented in the first place is that human language did not have the required unambiguous precision, and if we hit hurdles with LLMs, we may need to reinvent this once more.
This debate didn't start with C. Compilers existed pretty much since Algol, and the debate was there too.
There were also similar arguments made about certain other mechanisms that programmers expected to become obsolete over time, but which never did. For example, all the "visual programming" that later transformed into "no code" programming never really delivered on the promise. (We still have a bunch of tools and languages that carry the "visual" adjective in them, but very few people remember why that adjective was there in the first place.) Eg. Visual Basic was supposed to be visual because you'd build interfaces by dragging components into the designer view of your IDE and then using a graphical-editor-style interface to position them, label them etc. Not only did MSVS come with the "designer" view; Eclipse had one too, probably more than one. Swing had its own "designer", as did a bunch of other UI frameworks.
Programmers also believed that flat files would be gone, replaced by some sort of structured database with an advanced query mechanism. Again, it never happened, and the idea is mostly abandoned today, as storage moved in a completely different direction.
Object databases (not to be confused with object stores like S3): the direction that seemed all but assured at the dawn of objects -- it never really happened either. Instead we still struggle with ORM, and it's not certain that ORM won't die off eventually.
I wouldn't be so quick to expect AI prompt libraries to replace source code. There are many problems with this idea besides the quality of the output: the control of the AI agents and their longevity (for source code this is ensured by standards), and collaborative programming: how are people without access to the agent supposed to work with the prompt library instead of the source code?
I don't think we are anywhere close to the point where we can realistically expect prompt libraries to be a generally accepted replacement for the source code. In some niche cases? -- perhaps, but not more than that.
Here's the thing about clean code: is it really good? Or is it just something that people get familiar with, and familiarity is all that actually matters?
You can't really run the experiment because to do it you have to isolate a bunch of software engineers and carefully measure them as they go through parallel test careers. I mean I guess you could measure it but it's expensive and time consuming and likely to have massive experimental issues.
Although now you can sort of run the experiment with an LLM. Clean code vs unclean code. Let's redefine clean code to mean this other thing. Rerun everything from a blank slate and then give it identical inputs. Evaluate on tokens used, time spent, propensity for unit tests to fail, and rework.
The history of science and technology is people coming up with simple but wrong untestable theories which topple over once someone invents a thingamajig that allows tests to be run.
No, it's not really good.
It's a pain in the ass to work in, and it produces slow code.
https://www.computerenhance.com/p/clean-code-horrible-perfor...
I'm with you... personally, I always found Clean/Onion to be more annoying than helpful in practice... you're working on a feature or section of an application, only now you have to work across disconnected, mirrored trees of structure in order to work on a given feature.
I tend to prefer Feature-Centric Layout or even Vertical Slices, where related work is closer together based on what is being worked on as opposed to the type of work being done. I find it to be far more discoverable in practice while able to be simpler and easier to maintain over time... no need to add unnecessary complexity at all. In general, you don't need a lot of the patterns introduced by Clean or Onion structures as you aren't creating multiple, in production, implementations of interfaces and you don't need that type of inheritance for testing.
Just my own take... which of course, has been fighting upstream having done a lot of work in the .Net space.
Applause.
I am in .net as well. The clean code virus runs rampant.
Swimming in DTOs and ViewModels that are exact copies of Models; services that have two methods in them: a command method and then the actual command the command method calls, when the calling class already has access to the data the command method is executing; 3 layers of generic abstractions that ultimately boil down to a 3 method class.
Debugging anything is a nightmare with all the jumps through all the different classes. Hell, just learning the code base was a nightmare.
Now I'm balls deep in a warehouse migration, which means rewriting the ETL to accommodate both systems until we flip the switch. And the people who originally wrote the ETL apparently didn't read the documentation for any of it.
I have a somewhat more practical approach here (write-up at some point): the most important thing is to rethink how you are instructing the agents and not rely only on your existing codebase, because: 1) you may have some legacy practices, 2) it is a reflection of many hands, 3) it becomes very random based on what files the agent picks up.
Instead, you should approach it as if instructing the agent to write "perfect" code (whatever that means in the context of your patterns and practices, language, etc.).
How should exceptions be handled? How should parameters be named? How should telemetry and logging be added? How should new modules to be added? What are the exact steps?
Do not let the agent randomly pick from your existing codebase unless it is already highly consistent; tell it exactly what "perfect" looks like.
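As a hypothetical illustration of "telling it exactly what perfect looks like" (every rule, file name, and path below is invented; the point is the specificity, not these particular rules):

```markdown
## Error handling
- Raise domain-specific exceptions; never return error codes or None on failure.

## Naming
- Parameters that carry a unit include it: `timeout_seconds`, `size_bytes`.

## Telemetry and logging
- Log at the boundary (request handlers), not inside helpers; include the request id.

## Adding a new module
1. Create `src/<area>/<module>.py` with a module docstring stating its single purpose.
2. Add tests in `tests/<area>/test_<module>.py` before wiring it into callers.
```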
I’ll go against the prevailing wisdom and bet that clean code does not matter any more.
No more than the exact order of items being placed in main memory matters now. This used to be a pretty significant consideration in software engineering until the early 1990s. This is almost completely irrelevant when we have ‘unlimited’ memory.
Similarly generating code, refactoring, implementing large changes are easy to a point now that you can just rewrite stuff later. If you are not happy about how something is designed, a two sentence prompt fixes it in a million line codebase in thirty minutes.
You haven't worked or serviced any engineering systems, I can tell.
There are fundamental truths about complex systems that go beyond "coding". Patterns can be observed in nature where engineering principles and "prevailing wisdom" are truer than ever.
I suggest you take some time to study systems that power critical infrastructure. You'll see and read about grizzled veterans who keep them alive, and how they are even more religious about clean engineering principles and how "prevailing wisdom" is very much needed and will always be needed.
That said there are a lot of spaces where not following wisdom works temporarily. But at scale, it crashes and crumbles. Web-apps are a good example of this.
I'd argue clean engineering principles (Gang of Four patterns) and clean code are not the same, and are most definitely not mutually exclusive.
> You haven't worked or serviced any engineering systems, I can tell.
I have worked on compilers and databases the entire world runs on, the code quality (even before AI) is absolutely garbage.
Real systems built by hundreds of engineers over twenty years do not have clean code.
It is an interesting possibility that must be considered. Only time will tell. However I disagree.
I think complex systems will still turn into a big ball of mud and AI agents will get just as bogged down as humans when dealing with it. And even though re-build from scratch is cheaper than ever, it can't possibly be done cheaply while also remembering the millions+ of specific characteristics that users will have come to rely on.
Maybe if you pushed spec-driven development to the absolute extreme, but i don't think pushing it that far is easy/cheap. Just as the effort to go from 90% unit test coverage to 100% is hard and possibly not worth it, I expect a similar barrier around extreme spec-driven.
Clarification: I'm advocating clean code in the generic sense, not Uncle Bob's definition.
In my experience, large-scale automated refactoring of code is something that has only worked reliably for the last 3-4 months.
If you work in finance, you've probably just bankrupted your company.
Nanoseconds matter.
Clean code tends to equal simple code, which tends to equal fast code.
The order of items in memory does matter, as does cache locality: a typical L1 data cache is only 32 KB.
If of course you're talking about web apps then that's just always been the Wild West.
> Clean code tends to equal simple code, which tends to equal fast code.
Wat? Approximately every algorithm in CS101 has a clean and simple N^2 version, a long menu of complex N*log(N) versions, and an absolute zoo of special cases grafted onto one of the complex versions if you want the fastest code. This pattern generalizes out of the classroom to every corner of industry, but with less clean names+examples. The universal truth is that speed and simplicity are very quick to become opposing priorities. It happens in nanoseconds, one might say.
Cache-aware optimization in particular tends to create unholy code abominations, it's a strange example to pick for clean=simple=fast wishcasting.
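The CS101 pattern described above, sketched in Python with inversion counting (pairs i < j with a[i] > a[j]): the O(N^2) version is clean and obvious; the standard O(N log N) version, a merge sort modified to count while merging, is faster and much less obvious.

```python
def inversions_simple(a):
    # The clean, obvious O(N^2) version.
    return sum(1 for i in range(len(a))
                 for j in range(i + 1, len(a))
                 if a[i] > a[j])

def inversions_fast(a):
    # O(N log N): merge sort that counts cross-half inversions while merging.
    def sort_count(xs):
        if len(xs) <= 1:
            return xs, 0
        mid = len(xs) // 2
        left, cl = sort_count(xs[:mid])
        right, cr = sort_count(xs[mid:])
        merged, count, i, j = [], cl + cr, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                # right[j] jumps ahead of every remaining left element
                merged.append(right[j]); j += 1
                count += len(left) - i
        merged.extend(left[i:]); merged.extend(right[j:])
        return merged, count
    return sort_count(list(a))[1]
```

Same answer, very different readability; the fast one is where the "zoo of special cases" starts accumulating.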
I'm not sure if you are considering the patterns actually used in "Clean Code" architectures... which create a lot of, admittedly consistent, levels of interface abstractions in practice. This is not what I would consider simple/kiss or particularly easy to maintain over time and feature bloat.
I tend to prefer feature-oriented structures as an alternative, which I do find simpler and easy enough to refactor over time as complexity is required and not before.
Nanoseconds matter in a minuscule number of high-frequency and algorithmic trading use cases. They do not matter in the majority of finance applications. No consumer finance use case cares about nanoseconds. The vast majority of money is moved via ACH, which clears via fixed-width text files shared over SFTP, processed once a day. Nanoseconds do not matter here.
Clean code does not equal fast code. All those abstractions produce slower code.
https://www.computerenhance.com/p/clean-code-horrible-perfor...
Humans are quite capable of bankrupting financial companies with coding issues. Knight Capital Group introduced a bug into their system while using high frequency trading software. 45 minutes later, they were effectively bankrupt.
Garbage in, garbage out.
The LLM is forced to eat its own output. If the output is garbage, its inputs will be garbage in future passes. How the code is structured changes how the LLM implements new features.
Why would “messy” code be garbage? Also LLMs do a great job even today at assessing what code is trying to do and/or asking you for more context. I think the article is well balanced though: it’s probably worth it for the next few months to try to help the agent out a bit with code quality and high level guidance on coding practices. But as OP says this is clearly temporary.
The definitions of what is messy or clean will change with LLMs…
But there will always be a spectrum of structures that are better for the llm to code with, and coding with less optimal patterns will have negative feedback effects as the loop goes on.
I agree with you but you can dedicate tokens to fixing the bad code that agents do today. I don’t disagree with anything you’re saying. I think the practical implication is instead of pain and jira we’ll just have dedicated audit and refactor token budgets.
I'm dealing with a situation right now where a critical mass of "messy" code means that nobody, human or LLM, can understand what it is trying to do or how a straightforward user-specified update should be applied to the underlying domain objects. Multiple proposed semantics have failed so far.
On the plus side.. AI is pretty good at creating (often excessive) tests around a given codebase in order to (re)implement the utility using different backends or structures. The one thing to look out for is that the agent does NOT try to change a failing test, where the test is valid, but the code isn't.
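A minimal sketch of such a pinned-behavior (characterization) test suite; the `slugify_legacy` function here is invented, standing in for whatever utility is being reimplemented:

```python
def slugify_legacy(title):
    # Stand-in for the existing implementation being replaced.
    return "-".join(title.lower().split())

# Characterization tests: these record what the code *does today*.
# If a reimplementation fails one, fix the code, not the test
# (unless the pinned behavior is itself agreed to be a bug).
CASES = {
    "Hello World": "hello-world",
    "  spaced   out  ": "spaced-out",
    "MiXeD Case": "mixed-case",
}

def check(impl):
    for given, expected in CASES.items():
        assert impl(given) == expected, (given, impl(given), expected)
```

Run `check` against the old backend first to validate the pins, then against each candidate rewrite.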
In the past ~15 years, there are only two new languages that went from "shiny new niche toy" to "mainstream" status. Rust and Go[0].
This fact alone suggests that the idea of having unlimited memory or unlimited CPU cycles is just wrong.
[0]: And TypeScript, technically. But I'd consider TypeScript a fork of JavaScript rather than a new language.
Swift is at least in the TIOBE Top 20 (#20) and Scratch is at #12 but more educational. I'd also add Kotlin and Dart as contenders which sit just outside the top 20.
Rust is still considered by many to be pretty niche... as much as I like Rust and as widely as it is starting to be used. I especially like it with agent/ai use as the agent output tends to be much higher quality than other languages I've tried with them.
Zig. So make that 3.
It is also used in Ghostty, Bun which is the JS runtime that powers OpenCode, and Claude Code.
I actively use AI to refactor a poorly structured two million line Java codebase. A two-sentence prompt does not work. At all.
I think the OP is right; the problem is context. If you have a nicely modularized codebase where the LLM can neatly process one module at a time, you're in good shape. But two million lines of spaghetti requires too much context. The AI companies may advertise million-token windows, but response quality drops off long before you hit the end.
You still need discipline. Personally I think the biggest gains in my company will not come from smarter AIs, but from getting the codebase modularized enough that LLMs can comfortably digest it. AI is helping in that effort but it's still mostly human driven - and not for lack of trying.
Have you tried this in the last few months with an expensive model? (Claude 4.6 opus high, for example)
You might be pleasantly surprised if you haven’t yet.
I'm using Opus 4.6, "Effort level: auto (currently high)". Used it a fair bit this week. Results are still pretty mediocre.
It's useful, but not "give it a two sentence prompt" useful.
Are you using the planning mode?
That's the way to get it to plan an exact set of actions, given a two sentence prompt.
No, but I will definitely try it. Thanks.
I think clean architecture matters a lot, even more so than before. I get that you can just rewrite stuff, but that comes with inherent risk, even in the age of agents.
Supporting production applications with low MTTR is, to me, what matters a lot. If you are relying entirely on your agent to identify and fix a production defect, I'd argue you are out at sea in a very scary place (comprehension debt and all that). It is in these cases where architecture and organization matter, so you can trace the calls and see what's broken. I get that the code is largely a black box as fewer and fewer people review the details, but you still have to review the architecture and design, and that's not going away. To me, things like SRP, SOLID, and DRY are ever more important.
Amongst other reasons, one of the reasons for clean code is that it avoids bugs. AIs producing dirty code produce more bugs, like humans. AIs iterating on dirty code produce more bugs, like humans.
I've seen enough dirty code (900+ tech diligences over the last 12 years) to know that many businesses are successful in spite of having bad code.
It never started that way.
Time, feature changes, bugs, emergent needs of the system all drive these sorts of changes.
No amount of "clean code" is going to eliminate these problems in the long term.
All AI is doing is speed running your code base into a legacy system (like the one you describe).
> All AI is doing is speed running your code base into a legacy system
Are you implying legacy systems stop growing? Because I didn't mean to imply those companies stop growing.
Not at all,
I'm saying that in the before times, complexity emerged over time (staff changes, feature creep). AI coding (and its volume) is just speed-running this issue.
> complexity emerged over time
So complexity is an issue? I don't get it. SFDC is an incredibly complex system that makes billions of dollars. Tell me why I would NOT want to be able to create a system like that with an automated tool?
> a two sentence prompt fixes it in a million line codebase in thirty minutes.
Could you please create a verifiable and reproducible example of this? In my experience, agents get slower the larger a repository is. Maybe I'm just very strict with my prompts, but while initial changes in a greenfield project might take 5-10 minutes for each change, unless you deeply care about the design and architecture, you'll reach 30 minute change cycles way before you reach a million lines of code.
Your observation was valid in 2025.
This is largely a solved problem now with better harnesses and 1M context windows.
No, it's valid still in 2026, I'm literally using most agents on a day to day basis.
As mentioned, I'm happy to be proven otherwise; otherwise please stop just saying "It's not like that anymore" when supposedly it's so easy to prove. PoC||GTFO, as we used to say back in the day, when the people shipping software actually knew how to write software.
> I'm happy to be proven otherwise
I am getting the impression that you'd be moving goalposts :)
I just checked out clang+llvm, 24 million lines of code, and implemented a simple C++ language extension with claude 4.6 opus high in less than five minutes. Complete with tests. Debugging and optimizations working seamlessly.
(It's a simple pattern matching implementation, if anyone's curious.)
> I am getting the impression that you'd be moving goalposts :)
Nope, the original goal post you set was "implementing large changes are easy [...] a two sentence prompt fixes it in a million line codebase in thirty minutes", I'm more than happy for you to prove just that, no moving.
> I just checked out clang+llvm, 24 million lines of code, and implemented a simple C++ language extension with claude 4.6 opus high in less than five minutes. Complete with tests. Debugging and optimizations working seamlessly.
Awesome! Please share the steps to reproduce, results and the prompt?
> I’ll go against the prevailing wisdom and bet that clean code does not matter any more. No more than the exact order of items being placed in main memory matters now.
This is a really funny comment to make when the entire Western economy is propped up by computers doing multiplication of extremely large matrices, which is probably the single most obvious CompSci 101 example of when the placement of data in memory is really, really important.
Clean code still matters.
If it's easier for a human to read and grasp, it will end up using less context and be less error prone for the LLM. If the entities are better isolated, then you also save context and time when making changes since the AoE is isolated.
Clean code matters because it saves cycles and tokens.
If you're going to generate the code anyway, why not generate "pristine" code? Why would you want the agent to generate shitty code?
Yes, but the problem is that its best-known advocate, and the text on it, arrived at the correct conclusion using a bad implementation/set of standards.
c.f.,
https://github.com/johnousterhout/aposd-vs-clean-code
and instead of cleaning your code, design it:
https://www.goodreads.com/en/book/show/39996759-a-philosophy...
> If you are not happy about how something is designed, a two sentence prompt fixes it in a million line codebase in thirty minutes.
This fantasy is so far from reality with current systems and is unlikely to ever be fulfilled, even if they were a lot more capable.
I hope you understand that close to 100% of software developers employed by large companies have effectively unlimited access to the latest models and tools.
Denialism would have presumably worked if this was something not a lot of people had seen or used.
That works until you have to fix a bug
Why does having a bug break this? You can have the LLM guide you through the code, write diagnostics, audit, etc.
And what's the certainty of this fix not introducing another bug?
Then you realize tests are failing, but now you're not sure whether they test the implementation or whether they were genuinely good tests of behavior.
It's a slippery slope that adds more and more cruft over time and LLMs are susceptible to missing important details.
No certainty just like we don’t have certainty when we do it ourselves. What I’m saying is you can audit and use the LLM to help you audit efficiently: finding code, explaining logic, visualizing things etc. it’s just as powerful a tool for understanding as it is for generation
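The implementation-vs-behavior distinction that keeps coming up here, as a small sketch (the `Cache` class is invented for illustration):

```python
class Cache:
    def __init__(self):
        self._store = {}  # implementation detail

    def put(self, key, value):
        self._store[key] = value

    def get(self, key, default=None):
        return self._store.get(key, default)

# Implementation-coupled test: breaks if _store becomes an LRU list, a
# dict of (value, timestamp) tuples, etc., even though the cache still works.
def test_implementation():
    c = Cache()
    c.put("a", 1)
    assert c._store == {"a": 1}

# Behavioral test: survives any internal rewrite that keeps the contract.
def test_behavior():
    c = Cache()
    c.put("a", 1)
    assert c.get("a") == 1
    assert c.get("missing", 0) == 0
```

A suite full of the first kind is exactly the one you can't trust when it fails after a refactor.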
I've been working on a client/server game in Unity the past few years and the LLM constantly forgets to update parts of the UI when I have it make changes. The codebase isn't even particularly large, maybe around 150k LOC in total.
A single complex change (defined as 'touching many parts') can take Claude code a couple hours to do. I could probably do it in a couple hours, but I can have Claude do it (while I steer it) while I also think about other things.
My current guess is that LLMs are really good at web code because its seen a shitload of it. My experience with it in arenas where there's less open source code has been less magical.
I suspect you are not using plan mode?
Funny you should mention that. I just used a two sentence prompt to do something straightforward. It took careful human consideration and 3 rounds of "two sentence" prompts to arrive at the _correct_ transformation.
I think you're missing the cost of screwing up design-level decision-making. If you fundamentally need to rethink how you're doing data storage, have a production system with other dependent systems, have public-facing APIs, and so on and so forth, you are definitely not talking about "two sentence prompts". You are playing a dangerous game with risk if you are not paying some of it off, or at the very least, accounting for it as you go.
I don't think they are, I think they're not talking about that... "It's all about the spec."
Actually we're going back to caring about the order of atoms in main memory. When your code has good cache locality and prefetching it can run 100 times faster, no joke. Arranging your program so the data stays in a good cache order is called data-oriented design - not to be confused with domain-driven design.
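The classic layout move here is array-of-structs vs struct-of-arrays; pure Python only illustrates the shape (the cache win shows up in a systems language or NumPy, where each field is contiguous in memory):

```python
N = 4

# Array-of-structs: each particle's fields are interleaved.
particles_aos = [{"x": float(i), "y": 0.0, "mass": 1.0} for i in range(N)]

# Struct-of-arrays: one contiguous array per field.
particles_soa = {
    "x": [float(i) for i in range(N)],
    "y": [0.0] * N,
    "mass": [1.0] * N,
}

def shift_x_aos(ps, dx):
    for p in ps:          # drags y and mass through the cache for every particle
        p["x"] += dx

def shift_x_soa(ps, dx):
    ps["x"] = [x + dx for x in ps["x"]]  # touches only the x array
```

Same result either way; the SoA pass streams one dense array instead of striding over interleaved fields.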
One thought I've had a few times is "well.. this is good enough, maybe a future model will make it better." so I won't completely disagree.
But my counter argument is that the generated code can easily balloon in size and then if you ever have to manually figure out how something works, it is much harder. You'll also end up with a lot of dead or duplicated code.
I started a side project that was supposed to be 100% vibe coded (because I have a similar view as you). I'm using go and Bubble Tea for a TUI interface. I wanted mouse interaction, etc.. It turns out it defaulted to bubble tea 1.0 (instead of 2.0). The mouse clicks were all between 1 and 3 lines below where the actual buttons were. I kept telling it that the math must be wrong. And then telling it to use Bubble objects to avoid all this crazy math.
I am now hand coding the UI because the vibe coded method does not work.
I then looked at the db-agent I was designing and I explicitly told it to create SQL using the LLM, and it does. But the ACTUAL SQL that it persists to the project is a separate SQL generator that it wrote by hand. The LLM one that gets displayed on the screen looks perfect, then when it comes down to committing it to the database, it runs an alternative DDL generator with lots of hard coded CREATE TABLE syntax etc... It's actually a beautiful DDL generator, for something written in like 2015, but I ONLY wanted the LLM to do it.
I started screaming at the agent. I think when they do take over I might be high up on their hit list.
Just anecdata. I still think in a year or two, we'll be right about clean code not mattering, but 2026 might not be that year.
Our company makes extensive use of architectural linters -- Konsist for Android and Harmonize for Swift. At this point we have hundreds of architectural lint rules that the AI will violate, read, and correct. We also describe our architecture in a variety of skills files. I can't imagine relying solely on markdown files to keep consistency throughout our codebase, the AI still makes too many mistakes or shortcuts.
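A toy version of one such architectural rule in plain Python (the layer names and forbidden-import table are invented; real tools like Konsist check far more than imports):

```python
import ast

# layer -> layers it may not import from
FORBIDDEN = {"domain": {"ui", "http"}}

def violations(layer, source):
    """Return the imports in `source` that the given layer is not allowed to make."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if name.split(".")[0] in FORBIDDEN.get(layer, set()):
                bad.append(name)
    return bad
```

The value is that the agent's violations fail CI mechanically, get read back as error messages, and get corrected, rather than relying on a markdown rule it may ignore.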
Do you have any recommendations of something like konsist/harmonize that work for multiple languages by any chance? I've been looking for a solution to org-wide architectural linting that can has custom rules or something so I could add rules for a Dockerfile as well as python in the same way.
Ever since AI coding became a thing, Clean Code advocates have been trying to get LLMs to conform. I was hoping this submission would declare "Success!" and show how he did it, but sadly it's devoid of anything actionable.
I'm not a fan of Clean Code[1], but the only tip I can give is: Don't instruct the LLM to write code in the form of Clean Code by Robert Martin. Itemize all the things you view as clean code, and put that in CLAUDE.md or wherever. You'll get better luck that way.
[1] I'm also not that anti-Uncle-Bob as some are.
4. Iterate on an AGENTS.md (or any other similar file you can reuse) that you keep updating every time the agent does something wrong. Don't make an LLM write this file; write it in your own words. Iterate on it whenever the agent does something wrong, then retry the same prompt to verify it actually steers the agent correctly. Eventually you'll build up a relatively concise file with your personal "coding guidelines" that the agent can follow with relative ease.
The last two weeks with Claude have been a nightmare for code quality; it outright ignores standards (in CLAUDE.md). Just yesterday I was reviewing a PR from a coworker where it undid some compliant code, and then proceeded to struggle with exactly what the standards were designed to address.
I threw in the towel last night and switched to codex, which has actually been following instructions.
in my experience, as long as you set up a decent set of agent definitions & a good skillset, and work in an already pretty clean codebase with established standards, the code quality an agent outputs is actually really good.
Couple that with a self-correcting loop (design->code->PR review->QA review in playwright MCP->back to code etc), orchestrated by a swarm coordinator agent, and the quality increases even further.
It's important to remember humans have shipped slop too, and code that isn't clean.
When the training is across code with varying styles, it is going to take effort to get this technology performing in a standardized way, especially when what's possible changes every 3 months.