Is AI a Silver Bullet?

June 24, 2024

tl;dr

Brooks' hypothesis in “The Mythical Man Month” is that the nature of software development means that no further “order of magnitude” improvements to productivity remain to be found has proven remarkably solid despite forty years of software history.

Claims made that LLMs will cause significant changes to software development can be tested against Brooks' argument.

Like previous attempts to find a silver bullet such as 4GLs, it does not seem likely that LLMs as a software authoring tool will be succeed. In particular, like 4GLs they suffer from the problem that because software spends most of its life in maintenance, the cost of change makes most improvements to the cost to author irrelevant; in fact, some rapid authoring techniques make maintenance harder and increase the lifetime cost of software.

They can and will add value as an “expert assistant” but this cannot be an order of magnitude improvement due to the fact that it cannot impact the essential complexity only the accidental complexity where no order of magnitude improvement remains to be obtained.

As such whilst AI tools may provide valuable additions to IDEs or other tools used for software development, they will not replace software developers, or the need for high-level languages.

Introduction

When it comes to LLMs as a developer tool or replacement of software developers, popular opinion seem predicts one of three main outcomes to the current boom:

AI Developers: Under this model, software developers no longer need to write code. Instead that task is now done by an AI coder. Subject matter experts interact with an AI that does all the work of writing the code.
AI Assistant: Under this model, software developers get assistance from common or repetitive tasks from an AI assistant. This increases their productivity; by how much becomes the concern. Does this reduce developer headcount?
AI Winter: Under this scenario, the LLM revolution fails to deliver the advertised gains, and may even lower productivity because in the longer run it gets in the way as companies who outsourced their code to an AI are forced to do so much remedial work it would have been less expensive to hire a software engineer write high quality code.

For anyone considering a career in software development the difference between the outcomes is stark. In one the the job of software engineer would seem to disappear for most people, automated into obsolescence like the weavers; in the second the job may change, but the number of people the industry employs declines making it particularly difficult to obtain entry-level roles; in the third wider AI research is set back by the LLM hype cycle's collapse.

The agenda here is the productivity of software development; the hope is that AI will make software development cheaper. For AI maximalists software engineers will no longer be needed; even for the minimalists less of us will be needed for a given task. The optimists spin this as more software engineers available for work; the pessimists as more engineers out of work.

Predicting a future is fraught; a lot of people don't like to make predictions but adopt a “wait and see” attitude. This is particularly true when we begin to talk about a fast moving field like AI. And it is worth pointing out that I am neither involved in frontier AI research, nor do I have any formal background in AI.

So why contribute to the debate?

The productivity of software engineers is not a new problem. For years businesses have been looking for the “silver bullet” that will deliver radical changes in software productivity. The “software crisis” was the term coined at the NATO Software Engineering Conference in 1968 to describe the difficulties in writing software: cost overruns, loss of money and fatal accidents. Whilst we have made many strides in how we do software engineering since that time, history suggests that progress is incremental and not revolutionary.

Software developers are subject to the law of supply and demand. As the Information Age has created an insatiable demand for software engineers so the cost of software engineers has risen. It the costs rise enough it becomes economically attractive to invest in increasing supply or productivity (this is a part of the software crisis).

Retraining people as software engineers (once non computer-science graduates into software engineering; more recently boot camps).
Looking for silver bullets, improvements to productivity through increased language abstraction, OOP, FP, frameworks etc.
Creating low code tools to make Subject Matter Experts (SMEs) software engineers.
Outsourcing to increase the supply via cheaper, foreign labour.

AI is just another attempt to fix the problem.

Not Talking About LLMs as a Component

This post is not about the usage of an LLM as a component in your system to solve a particular problem such as a customer chat bot; it just focuses on how LLMs are likely to impact software developer employment.

A valid criticism is that this post does not consider the possibility of LLMs solving the essential complexity of some problems and the impact of that on software developer employment.

The likelihood is that this impacts specific sectors where LLMs are a good fit for specific challenges over others. That discussion is outside the scope of this post.

What is Software Engineering?

The primary activity for software engineers is the creation of computable models. Presented with a set of functional requirements, we figure out how to turn that into something a computer can do.

This involves two steps: modeling the problem in a computable fashion and turning the outcome of that, the design, into code – instructions the computer can understand. Both of these steps are significant, though it doesn't really matter for this discussion how we define the modeling process: use cases and UML diagrams; CRC Cards; boxes and lines on a whiteboard; or a TDD test list or other code-based exploration of the model.

A key problem here is that productivity improvements typically address only the second step, turning a design into code, and ignore the first step creating the model.

Ironically, this misconception may have been exacerbated by agile software development techniques that tend to prefer exploratory modelling with code, fooling people into believing that modelling was not happening, just turning designs into code. This is not true; agile engineering practices just seek to use code for both modelling and instructions for the computer (and are themselves a reaction to CASE tools, see below).

No Silver Bullet

In a 1986 Fred Brooks wrote a paper “No Silver Bullet” that suggested that further improvements on developer productivity would be hard won. Indeed Brooks offered that there would be no order of magnitude increase within the decade. Brooks identified that this was because there were two facets to the task of software engineering and only one could benefit from productivity improvements:

Essential Complexity: the work involved in modelling a solution to a problem
Accidental Complexity: what is required to turn that model into computer code (in essence writing code)

Brooks' argument is that productivity gains can only come from reducing accidental complexity. The essential complexity of a problem is irreducible:

“I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation.”

“All of the technological attacks on the accidents of the software process are fundamentally limited by the productivity equation:

Time of task = ∑i (Frequency)i x (Time)i

If, as I believe, the conceptual components of the task are now taking most of the time, then no amount of activity on the task components that are merely the expression of the concepts can give large productivity gains.”

Brooks, Frederick P.. The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition (2nd Edition)

Brooks proposed that unless we assume that we have yet to solve 9/10 of the problems from the accidental complexity of software engineering, no innovation in software engineering can produce an order of magnitude productivity jump. For Brooks, we have already crossed the point where we have made the really big gains in productivity. Improvements may chip away at the accidental complexity but not enough complexity remains to radically shape up how we write code.

The reality is that forty years since Brooks wrote his paper most of the tools of programming have changed little since Brooks wrote his paper in 1986. There have been incremental improvements: OOP, frameworks and libraries, package managers, agile engineering practices etc. but no order of magnitude increase. This could be considered proof of Brook's hypothesis. Any hype by the LLM maximalists surely has to overcome Brooks' argument.

Generations of Languages

Most of our efforts to improve productivity have focused turning the design into code.

We first programmed computers by flipping switches; over successive generations of languages we have moved the level of abstraction up from being physical inputs into the machine to something that can be typed and understood by a human. We call each successive raising up of the level of abstraction from the machine a language generation.

The generations can be understood as follows: each higher generation could be implemented by generating code of the lower generation. This is useful as it allows the existing tools from that generation to continue to work. We can thus progress incrementally. So a 3GL compiles to a 2GL which is assembled into to a 1GL which directly instructs the machine.

More formally:

First Generation: Machine Code, binary input of codes understood by the chip, originally by operating switches on the machine:
Second Generation: Assembler, human readable macros that can be assembled into machine code:
Third Generation: High-Level, contain control structures, variables etc. Originally a compiler turned these into assembly which was then turned into machine code. A modern compiler turns statements into object code; an interpreter turns statements into byte code which it then executes. High Level Languages include C, C++, C#, Go, Java, Pascal, Rust, Zig and many other languages used today.
Fourth Generation: Declarative not Imperative statements. Naively the 4GL generates a 3GL output that is then compiled, but just as 3GLs don't strictly work this way any more, not all 4GLs generate 3GL statements but instead directly create object or byte code. 4GLs include: Visual Design Tools (Visual Basic, Access, PowerBuilder, SQLWindows); Domain Specific Languages (HTML, WPF, HCL); Data DSLs (SQL).
Fifth Generation: It's questionable if a 5GL has ever really existed, in the sense of generations; the term was co-opted for AI programming languages in the 90s, such as Prolog, but they are not really raising the abstraction to be a true 'generation'. We will ignore them for the purpose of this discussion as they have had little impact outside academia.

The Triumph of 3GLs

Today, most of us use 3GLs. We talk about compiling (or interpreting) our code, and we rarely need to deal directly with assembly or byte code. This has proved to be a productive level of abstraction. (I entered the industry at the point where inline assembler was still enough of a performance hack that you had to know a little bit of assembler even if you were using C or C++).

Alongside that, we use 4GLs in specific scenarios where they have proven useful. Some of the most common examples that spring to mind are:

SQL to work with data
HTML, CSS to work with web documents
HCL (Terraform) to declare infrastructure

In addition: many developers use visual designers to generate code for desktop and mobile UIs.

Although 4GLs have thrived in these niches, the general purpose programming language, the 3GL, has remained the workhorse of our computing abstractions. 3GLs have increasingly extended their syntax to support a mix of imperative and declarative statements (for example LINQ in C#), and generate code from these 4GL statements.

Why Did 4GLs Only Excel in Niches?

Perhaps the domination of 3GLs is surprising. If a 4GL can raise the level of abstraction it should make us more productive, why have 4GLs remained in niches? Why have we not shifted up a whole abstraction layer. There have been many historical attempts to do just that. In the mid-90s it seemed that 4GLs might dominate software development. It would not have been unusual to find most of the code in an enterprise being written in Visual Basic or PowerBuilder. What happened?

The problem is a classic trade-off: what many innovations in the space took from the accidental complexity of authoring software, they added into the accidental complexity of owning that software.

This is a problem we repeatedly forget, and relearn. Any successful software will be changed as the needs of its users change. Most of the lifetime of any successful piece of software will be in the ownership phase. During ownership our costs will be the cost of change. So to reduce overall costs, we need to reduce the cost to change our software.

“The first term of what I’ve dubbed “Constantine’s Equivalence,” then, is that the cost of software is approximately equal to the cost of changing it.”
Beck, Kent. Tidy First?

Whilst improvements in the productivity of authorship initially prove attractive, and gain traction, because the cycle has not reached the point of maintenance, the problem of ownership inevitably shows up as we have to live with the costs of what we authored.

Optimizing for authorship is usually a poor trade-off in the long-run as we spend more time on maintenance and modification that on creation. (In some scenarios software is highly-disposable, and in those contexts low authorship cost may be an advantage, but for most software if it becomes successful it will spend much long being owned than authored, so that is where the cost lies.)

The problem for many tools in the 4GL era is that whilst they made software easier to author, they made it harder to maintain.

CASE Tools

in the 1990s the large players looking at 4GLs backed Computer Aided Software Engineering (CASE) tools. CASE promised to take use visual designs tools and generate software. CASE tools include what we now call Low Code environments and Model Driven Engineering; these are just current iterations of these tools.

CASE tools included both workbenches and environments.

Workbenches

Their goal was to enable experts to design software without software engineers. Tools like DataEase, MS Access or Hypercard offered simple Line of Business application development where all that was needed was a form editor and a database.

The popularity of many of these tools led to the growth of “shadow IT” where business departments developed their own software, outside of the traditional IT department. Given the backlogs faced by IT departments, this seemed a solution by decentralizing the provision of IT systems.

The problem became that the departments simply created new software developers, albeit ones skilled with Workbenches. Any tool requires training and experience to use; it doesn't come as part of the skill set of being a subject matter expert. Workbenches shifted many subject matter experts into software developer roles.

At first, this was empowering, because decentralized IT empowered end users to solve their own problems, often avoiding long backlogs, are cost. But such departments then hit all the problems of ownership. When individuals left, departments had no backfill or found it difficult to transfer software not written with the idea that others might maintain it. In many cases they were forced to turn to IT for further maintenance.

IT had engineers skilled in 3GLs; these engineers then rewrote these 4GL workbench applications in 3GLs. Overall the cost was higher than having IT written it in the first place.

In addition, compliance and risk concerns emerged as the management of data spread throughout the organization, which required the application of software engineering discipline but was now uncontrolled.

Some organizations even banned end users from writing their own software due to the costs of this “shadow IT” hangover.

(Arguably a spreadsheet is also 4GL Workbench, although it is not usually bucketed alongside other Low Code tools.)

In addition these tools worked well for a range of built-in scenarios, but outside of those scenarios could not work at all. They could not be extended. When these changes were needed, the business had to reach out to IT. Again, to meet these new needs IT often had to rewrite the software created by these workbenches.

As a remedy, workbenches began to transform by adding programming in a high-level language, effectively becoming developer tools.

The developer versions of these workbenches included tools like Visual Foxpro, Visual Basic and PowerBuilder; these developer versions could be extended more readily by a 3GL. They were a compromise, trying to take the benefits of forms based programming in reducing the cost of writing UIs and marry it to the ability to extend with 3GLs. They were often be used to replace tools written by a “shadow IT” workbench. These were successful in the 90s but developers found that their limitations, particularly their orientation around form-based programming, were difficult to escape, particularly to capitalize on changes in technology.

Because these tools could only be used to write form-based software, they were not suitable for other applications. Their skill sets were seen as less flexible than those of true 3GLs, and thus less valuable, often resulting in developers looking down on those using these tools. In turn that made these jobs less desirable.

These issues came to a head with the rise of the web. These workbenches focused on rich-client development. With the move to the web and their implicitly distributed nature these tools were left behind. Developers working in these skill sets needed a path out and onto the web if they were to remain relevant Whilst tools tried to catch up, the pace of change of the web outpaced the re-development of these tools.

Quickly frameworks replaced workbenches. Most of the benefits to developers in using a workbench tool could be provided by a framework. For example, MS created WebForms for .NET as a way to leverage the form-based programming model and help the workbench developers transition onto the web.

The problem was that whilst workbenches offered productivity they could not adapt well to change, either to changes to how the business wanted to work, or to changes in technology.

Of course, the complexities of the form-based, stateful abstraction of Web Forms over the stateless protocol of HTTP soon began to tell in maintenance. Web Forms was rapidly eclipsed by Domain Specific Language oriented frameworks like Ruby on Rails (see DSLs below).

Nonetheless Workbenches continue to exist today for use by end users and developers. For example Microsoft's Power Platform. For the purposes of this blog it is worth noting that few developers today fear that they will lose their jobs because end users can use a Workbench like Power Platform.

Environments

UML extended simple OO diagramming notations to allow them to become a 4GL. (As an aside a lot of people feel that extending UML in this way is what killed UML, and its predecessor notations with it as a useful OO diagram notation).

In this view of software engineering analysis becomes new key role: capturing use cases, developing analysis and design models in UML and generating much of the software (or at least scaffolding much of it to fill in some blanks). The reasoning does make some sense. If code is just the accidental complexity and modelling the essential complexity, why not focus our tooling on modelling, and then just generate the code. The implications, for example no longer caring about the high-level languages you generate the code in, seem to have value.

But the CASE revolution never happened. Why?

The problem turns out to be that this creates two models: the one in the analysis tooling and the one in the code. And they can become out-of-sync. UML could not capture all of the design; a graphical notation had many ambiguities and details that needed filling in via a programmer using a 3GL. This breaks the whole edifice of the environment. As soon as the programmer began to modify the generated model, the two became out-of-alignment. Further change became complex, because the tool had to account for change made by the programmer. This required round-tripping: model –> code and code –> model. But round-tripping never worked – because UML was not complete.

Once again the same problem surface for 4GLs used in this role: the cost of software is in ownership, and thus change, and not in authorship. Because these tools made software harder to change, they were more expensive in the long run.

If we are optimizing for readability, more code faster is not better.

As an aside, two movements can partially be seen as a reaction to the failure of environments, both of which focus on code as the primary artefact:

Agile: which favors working code over modelling tools
Domain Driven Design advocates may note its admonition to use one model, expressed in code, as a reaction to environment led approaches).

These emerge precisely because it proves to be easier to model in code than in a graphical notation.

Code Generation

Software developers have used code generation as technique for many, many years. Typically the goals are both productivity, by removing boilerplate or repetitive work, and heavy-lifting for junior developers by creating a “paint-by-numbers” framework for them.

These work best when they are “run and done”, with no round-tripping issue. Even so, they tend to have risks associated, such as the investments in these tools baking in older frameworks, language versions etc because they become expensive to maintain.

Attempts to do round-tripping with code generation often fail.

Maintenance tends to provide a particular problem as a result. If you find a defect in one application's generated code and it is likely that the defect has propagated to all codebases that used that version of the generator. If you want to add new features or enhancements, then this needs to be added to every codebase, or we have to cope with legacy feature sets in specific applications. This propagation of errors and misalignment on features is an expensive burden once code generation is used at scale.

In many cases code generation is replaced by frameworks. The advantage of a framework is that you can fix the issue once, create new features and simply redeploy, thus reducing the cost of ownership (which as we have stated is the cost worth reasoning about).

Finally, code generation often suffers from the problem that it does not build understanding. Because the team did not build the code up, or use publicly available, well-documented frameworks, the onboarding costs are high.

DSL Revolution

We mentioned earlier that Ruby on Rails created the shift toward the framework as the primary way to reduce the accidental complexity of writing web sites over the Workbenches of the 90s. Part of the success of Rails was due to Ruby's syntax making it possible to use many declarative constructs within the language; Rails created a Domain Specific Language (DSL) for web programming.

In 2010 Martin Fowler wrote a book called Domain Specific Languages; the era of Domain Specific Modeling beckoned. Inspired by developments in the Ruby community and the promise of Language Workbenches, it seemed that developers would write DSLs with Subject Matter Experts for given domains using Language Workbenches. Those DSLs would be used to craft software. No more writing your eCommerce site in HTML and a 3GL, you would write it in a DSL that generated the relevant code for you.

The aspirations of this movement reached further, envisioning a small number of developers and a large number of domain experts glueing together their work: a small group of developers create re-usable modules that can be configured via DSLs and chain them together to create products, the software factory.

But the DSL revolution never happened. We did not all start working with software factories. Instead, we kept using 3GLs.

What happened?

As we will now be familiar from other failures, DSLs suffered key issues:

Problem of Modifiability: when the domain changes you first need to change the DSL before you can express the new concepts. The DSL developers have specific expertise – language parsing and building tools via a language workbench – and work often queues for those developers, slowing throughput in maintenance.
Problem of Onboarding: everybody needs to learn your DSL. That creates a barrier to hiring/onboarding as opposed to a 3GL
Problems of Feature Set: you can do 80% easily but 20% not at all. This becomes particularly acute during technology change – the 4GL tools often have no provision for that technology change, and cannot be re-tooled as quickly as new libraries can extend a 3GL.
Problem of Tooling: Without tooling that allows debugging, tracing or inspection of DSL statements it becomes hard to maintain. DSLs tools often give way to 3GLs because of this.

Ultimately DSLs work best where the domain is stable and we build a tool chain to support them: UIs (HTML and CSS), data access (SQL) etc.

But even in areas like Infrastructure As Code (IaC), hard to debug, feature set etc makes it arguably easier to use something like Pulumi than Terraform. See also: AWS CDK, Aspire etc.

The dream of domain experts writing code over software engineers interpreting the requirement often fails because it tends to create write only software, even assuming domain experts are willing to become DSL programmers in the first place.

In addition, the goal of the business user writing software via a DSL was always a myth:

So is this the hook – business people write the rules themselves? In general I don't think so. It's a lot of work to make an environment that allows business people to write their own rules. You have to make a comfortable editing tool, debugging tools, testing tools, and so on. You get most of the benefit of business facing DSLs by doing enough to allow business people to be able to read the rules. They can then review them for accuracy, talk about them with the developers and draft changes for developers to implement properly. Getting DSLs to be business readable is far less effort than business writable, but yields most of the benefits. There are times where it's worth making the effort to make the DSLs business-writable, but it's a more advanced goal.”
Fowler, Martin, 2008 DSL Q&A. See also Business Readable DSL

Avoid Writing Software Instead?

4GLs also declined because we moved into the era of SaaS. Instead of most businesses building bespoke software, the target of much of the 4GL revolution, many businesses switched to using SaaS offerings via the web. This is because most businesses don't want to turn their subject matter experts into shadow IT, they want to provide them with tools to do their job, and SaaS provided a way to do that.

The demand for “shadow IT” reduces under this model as it is met mostly by SaaS tools. Compute for many common tasks becomes a utility, competing on price and reliability over feature set.

For the SaaS vendor the goal of empowering end users to write code over software developers fades away; there are few advantages and many disadvantages of shifting code into the problem space than remaining in the compute space.

The Arrival of Prompt Engineering

What can the past tell us about the likely trajectory of LLMs driven by Prompt Engineering and their impact on software development. To recap our key points so far:

Essential Complexity cannot be reduced; improvements to tooling just reduce Accidental Complexity
Reductions to Accidental Complexity must focus on Ownership of software and not the Authorship as software spends most of its life in maintenance, being changed.

Interestingly enough, Brooks considered the impact of AI on programming back in 1986 and divided applications into two buckets, Automatic Programming (AI Developers) and Expert Systems (AI Copilots). Both are attempts to find a silver bullet to the software crisis through AI.

(For now I am not going to consider the idea of autonomous agents writing software – an attempt to increase the supply of engineers through the use of AI workers. Nothing about LLMs indicates to me that they can act autonomously. They require humans to give them tasks – in other words to program them).

Automatic Programming

AI developers are a case Brooks calls Automatic Programming. Automatic Programming is an attempt to give a computer natural language instructions to author code.

“In short, automatic programming always has been a euphemism for programming with a higher-level language than was presently available to the programmer.”

Parnas, David 1985. Quoted by Brooks, Frederick P.. The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition (2nd Edition)

Can an LLM instead be used to allow natural language to specify requirements in a given domain that can then be used to generate code? If so, does this lower accidental complexity?

Many proposals for AI replacing developers pitch themselves as a better language workbench – more of the app can be declaratively described in natural language and the code generated. Attempts at natural language programming (NLP) have often only allowed a constrained usage of language, but the promise of an LLM for automatic programming is that you can use Prompt Engineering techniques to write the code.

In this light, the LLM is being used as a 4GL as we are using it to generate 3GL statements. (We do have some the ability to generate 4GL statements too, I'm not sure that really makes an LLM a 5GL). Our 4GL will consist of Prompt Engineering statements. But rather than being truly declarative – describing what we want – Prompt Engineering uses triggering – pushing the LLM to trigger a response based on solutions to the problem it has been trained on.

The question is whether we can devise a toolchain or set of techniques for use with software written via Prompt Engineering statements that can solve the problems that we had with 4GLs to date?

Modifiability
Tools
Feature Set
Onboarding

Modifiability

We use Prompt Engineering to trigger the LLM into generating statements in an existing 3GL (or 4GL); by definition it can't do anything else as it must have existing statements to re-purpose.

This means that either:

Evaluation of that code can only be done by someone skilled in a 3GL (or 4GL)
The tooling needs to reach the point that we can maintain software written via Prompt Engineering without dropping down to the 3GL.

Whilst an LLM can only generate statements in a 3GL or 4GL, as of now they require human action or review; whilst this remains true we eliminate one of the largest benefits that Automatic Programming (and before them Workbenches and Low Code) seeks to obtain – the ability of a subject matter expert to use these tools to directly solve their problem.

Today, LLM based AI developers offer a different way to author the code, but still require software development skills to verify and use the results. This need for verification implies oversight via a developer who must be skilled in the 3GL or 4GL it outputs, due to the non-deterministic nature of the code produced.

We have seen before that the need for oversight by someone skilled in the 3GL tends to result work shifting to the 3GL to avoid multiple models (see Is Design Dead?).

If the technology reaches the point that it can do self-verification, such as self-debugging, then we move closer to enabling the subject matter expert can use the tool directly; at this point translation issues between the individual who understands the essential complexity of the problem and the code are removed. In theory accidental complexity fades away.

The risk is though that Prompt Engineering is just a replacement of one form of accidental complexity – which we are good at solving – with another form of accidental complexity, how to elicit the computable model we want, via Prompt Engineering – which we are lest good at solving (for now). The current state of Prompt Engineering, with its “spells” that must be incanted to work, just looks like another form of accidental complexity.

Despite advances like Chain of Thought we can't be sure that Prompt Engineering will enable us to describe a full set of requirements for an LLM. This remains to be proven particularly with regards to the problem of solving realistic problems at the scale of anything larger than snippets or functions). Even AI maximalists are aware of the problems:

A software engineer that could only write skeleton code for a single function when asked wouldn’t be very useful—software engineers are given a larger task, and they then go make a plan, understand relevant parts of the codebase or technical tools, write different modules and test them incrementally, debug errors, search over the space of possible solutions, and eventually submit a large pull request that’s the culmination of weeks of work. And so on...

Right now, models can’t do this yet. Even with recent advances in long-context, this longer context mostly only works for the consumption of tokens, not the production of tokens—after a while, the model goes off the rails or gets stuck. It’s not yet able to go away for a while to work on a problem or project on its own.
Aschenbrenner, Leopold 2024. The Situational Awareness: From GPT-4 to AGI: Counting the OOMs

Today it is trivial to write a prompt that will solve common coding problems – but these are only accidental complexity of the software engineering problem – and are often solved by frameworks and libraries.

Today HumanEval benchmarks can be useful for AI researchers, but they don't deal with the larger problems of establishing domain requirements for software (as the HumanEval paper itself points out).

There are two problems to relying on HumanEval:

HumanEval assumes that the preferred paradigm is that a software engineer drives with Test Driven Development. This falls far more into AI Assistant over Automatic Programming as a paradigm because you are writing 3GL and using code generation to get the tests to pass. It does not imply Automatic Programming.
HumanEval does not help us predict how good an LLM could become at translating essential complexity expressed as via Prompt Engineering into code.
The problem with zero-shot problems, where the LLM has not encountered the solution in its training set. HumanEval does not predict the ability to work with these zero-shot problems.

But what happens if we assume that the steady advance in Prompt Engineering techniques become good enough to generate a correct first version?

Perhaps we find that Prompt Engineering in steps (analysis, design, code, validate), dividing up the problem, much as we do with code, would allow us to prompt to reduce some of these difficulties. The LLM can then author chunks of 3GL in response to a request. Typically these are used to bridge knowledge gaps. These might be stitched into the full product, or be used for parts of the application, such as the domain logic, with the developers coding around them.

As soon as we have any problems or change requests a developer is faced, once again, with the classic round-tripping problem – they have to choose between:

Amend the code and abandon further change via the Prompt Engineering statements
Alter the Prompt Engineering code to generate the changed code.

This is the trade off we have seen before: eliminating the accidental complexity of authoring code has less value than we think because code spends most of its time in maintenance, so the maintainability of the code is more important than the speed to author it.

This then raises the question: how easy can tooling make it to work with our Prompt Engineering 4GL, over the 3GL that we have developed the language in? As we saw with other 4GLs, success is usually dependent on the toolchain.

Prompt Engineering is non-deterministic in nature; given two runs at the same set of prompts, we may generate different code. On the face of it, this would seem likely to have higher accidental complexity for modification. There are possible solutions to this problem in multi-agent approaches. But Prompt Engineering is not “natural language, it's statements that elicit particular responses from the training set; it is effectively “magic”. The likelihood is that code bases built via Prompt Engineering alone are difficult to maintain.

Even if we use a chunk based approach, the problem is that once we begin to use Prompt Engineering for chunks code, driven by tests written in a 3GL or which we stitch together with a 3GL, the more it will tend to be less accidental complexity to just write the code in the 3GL directly. This assumes someone experienced with a 3GL, but as we have asserted, this is a pre-requisite for using the tool.

By the time we break up the problem enough, we shift to Expert System (AI Copilot) anyway, which is a different goal to automatic programming.

For this to work as anything other than an AI Assistant, we would need a Prompt Engineering 4GL that we can use to reliably specify our requirements, without 3GL tests or stitching.

Nothing we have learned about working in high-level languages (type systems for example) suggests that Prompt Engineering as a 4GL is not likely to more time-consuming and error-prone than working directly in a 3GL when it comes to the ownership phase of our code; the same problem that approaches like XP reacted to. Optimistically we can make this work for narrow fields, just like other 4GLs.

For truly Automatic Programming, LLMs don't seem to offer anything we have not tried before, and abandoned. For modifiability, the AI Assistant approach seems more promising.

Tools

Due to the reliance on code generation, the toolchain for LLM based automatic programming ends up being the same toolchain that developers use to test, debug and observe code today.

If LLM generated code has a bug, how do you generate code without that bug? It is possible that feeding output back into an LLM allows it to correct itself (see self-debugging above), but the process if often slower for the engineer than simply applying the fix directly.

Given we use the existing tooling though there is certainly no “silver bullet” for accidental complexity to be removed by using an LLM tool, we must ultimately switch into using 3GL based tooling.

We know that successful DSLs provided a tool chain to debug that DSL; those that do not run into the wall of context switching from the DSL to the 3GL. Without tooling that can debug running prompt engineered solutions, it seems likely that LLM based automatic programming will meet the same problem.

If the toolchain requires 3GL skills, it does not change the game for software developers.

Feature Set

Of course an LLM driven by a prompt can only ever hope to remove accidental complexity by providing solutions based on prior art in this space. This is the feature set problem.

Many problems of accidental complexity are today solved by frameworks or libraries. Core value comes from solving domain problems (hence the interest in DSLs).

For the LLM to novel domain problems it is necessary to train LLMs on existing domain specific solutions, but these run afoul of the need for an exponential increase in data which we may not have. or the success of using techniques like Retrieval Augmented Generation (RAG) to generate code in response to novel domain problems. And mostly this implies existing code, probably proprietary and presumably even more protected with the rise of LLMs scraping the web, and already maintained by developers within the organization. Someone has to create that code, and organizations will soon starve themselves without a sufficient body of existing 3GL solutions to domain problems.

Of course domain problems can be broken down by functional decomposition – but this leads to the question: is this really easier via Prompt Generation over direct authoring in a 3GL. History suggests that the 3GL will win.

If we already have developers who work in the abstraction of 3GL to model the problem it is unlikely that switching them to Prompt Engineering will do anything other than slow them down. We create new accidental complexity that they are less experienced with.

As we develop more code by Prompt Engineering the question becomes what feeds the LLM with new or innovative solutions to the problem, new ways to model the answer? Prompt Engineering itself is unlikely to provide that. But the LLM can only parrot existing solutions to the problem.

This is particularly acute when we think about innovation in frameworks or libraries, or new targets for compute (such as occurred with the introduction of web or mobile devices even AI itself – simply put there is no corpus of material on which to train the LLM. Like a developer who refuses to accept the evolution of techniques and tools, and LLM will be doomed to keep creating legacy code unless it constantly learns, much like its human counterparts.)

The danger here is that organizations who use an LLM to deliver their core domain logic – their competitive advantage – will tend to halt any innovation on how they model that problem with their usage of an LLM (or be forced to rely on commoditised publicly available solutions).

Where a space is already commoditised and no advantage through an innovative internal product is possible, it is likely that a SaaS offering is a better way of reducing the cost of software, than authoring an LLM based solution. We have already moved away from end user authored software in many such cases.

Onboarding

Onboarding remains an unknown. We don't really have the experience to understand how easy it would be to use Prompt Engineering statements, as opposed to 3GL code, as a means to pass understanding between one team member and another on what the application does and how to maintain it.

But the flaw in prompts as a description of the code to be generated tends to be that natural language is a remarkably poor medium for expressing the model that we want to use in our software. Every time we move away from a 3GL to create a more natural interface, we rapidly get diminishing ability to author software.

Because the developers did not create the code, the LLM did, using the code to onboard is as difficult as any code hand over where the authors cannot explain their decisions, in this case the LLM – the LLM has effectively created legacy code.

Now an LLM is also good at explaining code. So it is possible that we will get better tools that can explain the code they have written, and thus make it more maintainable.

But this does not fix the problem of us knowing how to handover Prompt Engineering code. How good we can get at this, remains an unknown.

This is a difficult one, precisely because an LLM is so new. But the lack of success with DSLs being used to write domain code, or even share that code with subject matter experts, implies that this remains a key challenge to LLMs. Every time we try this, we revert back to high-level code as the shared medium of understanding. So how good are Prompt Engineering statements when compared to high-level code as the shared medium of understanding.

There is also the problem of the amount of code we already have. Retail banks and government agencies still maintain code written in COBOL in 2024. Unless we can reverse engineer that back into Prompt Engineering statements, we will continue to need developers whose paradigm is working with high-level languages for years to come. If it was economic to have re-written it, we would have done so. Those developers will continue to find that unless 4GL tooling improves, when they write new projects it is easier to stick with a 3GL.

The innovation-adoption model suggests that in order to cross the chasm from early adopters to majority you have to be pulled because you solve problems they have today. Developers will be slow to pull a Prompt Engineering language outside of a greenfield project, because most will find it easier to keep working in a 3GL. This will significantly slow adoption. Much as when we introduced other new paradigms in software engineering this may take 10 years. Even then like DSLs or functional programming, it may fail to cross to the majority in full, instead being harvested for ideas by 3GL environments.

Limited to Accidental Complexity

The LLM cannot solve the problems of essential complexity, which must be tackled via the accidental complexity of our Prompt Engineering. Given Brooks assertion around accidental complexity this can't give an order of magnitude increase in productivity. Even if we manage to create a tool chain that resolves many of the problems of DSLs, Prompt Engineering solves accidental complexity but not essential complexity. Developers just use different tools to do the work; but there are still developers.

If an LLM can only ever solve accidental complexity, the question at issues is: does an LLM reduce the accidental complexity of authoring a program, or increase it?

The problem here is that given the organization already has developers skilled with the domain, is writing a prompt to build more code genuinely more productive than writing the code? Certainly not by an order of magnitude.

I remain skeptical. Repeatedly we seem to discover that the 3GL is the sweet-spot between abstracting instructions for the computer and human beings. There is little to suggest that automatic programming of LLMs via Prompt Engineering is a simpler paradigm to create programs than just coding them, particularly once we get to maintenance, which represents the cost of the software. This focus on authoring over ownership is the same set of problems we have previously encountered with 4GLs.

Expert Systems

Brooks opinion on expert systems hold relevance today, when we are finally delivering them via LLMs:

“An expert system is a program containing a generalized inference engine and a rule base, designed to take input data and assumptions and explore the logical consequences through the inferences derivable from the rule base, yielding conclusions and advice, and offering to explain its results by retracing its reasoning for the user. The inference engines typically can deal with fuzzy or probabilistic data and rules in addition to purely deterministic logic.... accurately. I believe the most important advance offered by the technology is the separation of the application complexity from the program itself.
Brooks, Frederick P.. The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition (2nd Edition)

Copilot tools are classic expert systems. With LLMs they provide probabilistic advice on what code to write. Provided the LLM has been trained on similar code it should be able to help, as Brooks predicted, separate application complexity from the program itself.

This is already a proven use case. The capacity of an LLM to act as an expert system via co-pilot seems to offer valuable assistance in reducing accidental complexity. As ever, some developes benefit more from expert assistance than others and the tendency of LLMs to hallucinate libraries that do not exist, means some developers will turn it off as less efficient.

You may note this yourself. Where you know how to solve a problem, the co-pilot may seem to obstruct you over helping you move faster; you have to still know enough to validate its results. Developers have long used helpful tools like Intellisense to capitalize on the “recognition is easier than recall” observation and LLMs can help with that, much as predictive text can help with typing on a phone.

But for many engineers an LLM can effectively short-cut the process of:

Research a solution to a problem of accidental complexity (usually a search of relevant sources).
Read any relevant documentation and code samples.
Implement accordingly (may involve copypasta and fix – this if fine in a TDD green step for example).

An AI Assistant speeds up this process by removing the mechanical elements of searching literature and reading the documentation to solve a well-known problem.

An AI Assistant solves some of the problems of token length in LLMs which inhibit creation of larger projects, because they implicitly work with units of your codebase. Improvements in how these tools perform can capitalize on this fact.

There are also many yak shaving tasks in any project. These just add accidental complexity. Purposing an AI to solve these tasks would help enormously both with our productivity and with the joy of coding.

Unlike you or I, an AI Assistant can only help you with solutions to well-understood and documented problems. If we need to use a new technology that has few posts on Stack Overflow or poor documentation, we can just spike it with a REPL, test, or prototype. An AI Assistant based on an LLM just can't do this.

A problematic anti-pattern here is that AI Assistants hamper innovation because developers begin to avoid using tools that cannot be supported by AI due to a lack of existing documentation and samples. This can become a feedback loop where the lack of new code due to the lack of AI Support starves new frameworks of developer interest. This might mean that an AI Assistant becomes counter-productive reducing innovation in favor of brute force. This will increase the accidental complexity of the code it generates for that most important phase: maintainance.

However, an expert system can provide some help with maintenance, such as explaining code, finding bugs (although humans are currently better at this task) or even suggesting modifications to existing code. As such, it does not suffer from the downsides of code generation where value only exists within authoring not modification.

Of course, it seems unlikely to provide an order of magnitude increase and act as a silver bullet, given that there are unlikely to be order of magnitude increases in accidental complexity. There is probably a point where the accidental complexity of maintaining large amounts of code generated by an LLM becomes greater than the accidental complexity of supplementing your code with snippets created by an expert that creates a ceiling that prevents this being a silver bullet.

Silicon Snake Oil?

What conclusions can we draw from this? The discourse around LLM capabilities is dominated by inflated claims (see above, HumanEval is not representative of the problems human software engineers solve) and snake oil salesman that drown out reasonable assessment of claims for this technology. Sometimes it can feel hard to assess: is it just silicon snake oil, or is there something more real here?

The Nvidia CEO may be successful at driving up his companies share price, (and after all that is his job) but I would be very skeptical about his claims at this point.

AI Developers

This seems unlikely to bear fruit in the next 10 years.

The burden of proof for faster advances than this has to lie with its advocates who need to prove they can solve the problems of: onboarding, modifiability, feature set and tooling.

They would need to prove that we can use Prompt Engineering as a 4GL.

Whilst there are advances in Prompt Engineering, we seem to be a long way, and billions of dollars, from solving the set of problems that would make 4GLs displace 3GLs outside of narrow contexts.

We have already spent billions pursuing this goal with 4GLs, and only previously solved the problem in narrow context. None of these resulted in a silver bullet. Nor can they tackle essential complexity, only accidental complexity.

There is a lot of hype here. But all the reports about capabilities are always a “private demonstration” or “next year”. When we have seen anything in the open, it has quickly been debunked as lies.

Given the track record of the industry making false claims, it is reasonable for us to take a doubtful position and ask for proof. Without evidence we should assume that there is not the progress being claimed.

Even if we succeed at building 4GLs (finally!) using LLMs, then developers will still be needed; the most transformative outcome here is that developers use a 4GL for more tasks, and become more productive at a higher abstraction, not that they disappear.

Personally, it seems unwise to me to spend billions competing at this goal until we have better answers as to how we solve the problems with this approach. Otherwise those billions will be wasted. The first question potential investors ask: have you solved the problem of ownership?

AI Assistant

This has already born fruit.

Perhaps not the full-spectrum of expert system for developers that Brooks envisaged 40-years ago, but certainly progress towards it.

This will reduce the accidental complexity of development, although not by an order of magnitude, and not tackle the essential complexity, for which human developers are still required.

Co-pilot tools, of various stripes, are likely to be an increasingly valuable part of your toolkit. Many tools which might pitch themselves as Automatic Programming are really an AI Assistant. This is true even with tools that promise they will be able to help with the ownership phase not just the authorship phase. There may be a lot of improvements to how we manage legacy code possible through an AI Assitant. That would be an enormous win for many of us. There may be increased joy in our roles by removing toil

It is perhaps worth noting that the value of AI Assistants are not limited to developers. End users driving workbenches/low code tools like Power Platform also have access to AI Assistants, along with spreadsheets like Excel. This will make life easier for low code developers, but is unlikely to be any more of a threat to developer roles than these workbenches/low code tools are today.

Will AI Assistants result in layoffs, or lower hiring? Increased productivity does not always imply layoffs. Subsequent shifts, for example 2GL to 3GL led to increased employment.

“In the end, when you step out of the vacuum of just the specific productivity gain of a particular job function, and look at how the whole system will adapt and improve due to that productivity gain, a very different picture of AI's impact on jobs will emerge. Yes there will absolutely be changes to what jobs become more or less in demand in the future, but the competitive nature of companies inevitably ensures that across the whole system companies will be focused on leveraging AI to become more productive.”
Levie, Aaron @levie

Most developers don't lose sleep that Power Platform or Excel will put them out of a job though in many cases a spreadsheet is exactly the tool that the end user needs, not your code.

The reality is that we have made successive attempts to move developers to the higher ground of domain problems – the essential complexity – and away from the low ground of technical problems – the accidental complexity. Domain Driven Design was centered on the idea that modelling the domain is the key problem, not the accidental complexity of writing software.

It is interesting to note how those old ideas: TDD & Domain Driven Design that many have evangelize thrive in this world:

TDD becomes a powerful tool when you ask the AI to implement code for your tests
Domain Driven Design becomes a powerful tool when you focus your modelling efforts on the domain.

Using TDD to model behaviors, using AI help to implement those behaviors, and then stitching those behaviors together is a powerful technique today – it only becomes more powerful when working with an AI to help you.

What do you need to do today? Learn how to use an AI Assistant. Become familiar with Copilot tools, but also develop Prompt Engineering skills so you know how to get results.

This will be like learning any new skill. At first it will be slow, painful and seem counter-productive, so you may not want to do it “under pressure”. That is the position of all of us right now. But as you play with these tools they will become part of your toolkit, and you can evaluate their usage of “just write the code” based on what is more performant for you.

Understand AI Engineering in the context of how to use AI as middleware; this may be as, if not more, important to your professional development than worrying about a 4GL appearing.

AI Winter

The AI field has certainly grown beyond the disappointments that led to the Lighthill Report and the first AI Winter. The interest in AI has grown with LLMs, but so have the expectations.

The hype cycle that has promised a revolution that is unlikely to be realized in software as it will not result in a silver bullet.

Often the disillusionment that follows the failure of the hype of a technology is followed by a collapse in investment, but we also get a more realistic appraisal of the value of these tools.

It is at this point that we can genuinely understand how to incorporate these tools into our workflow. Of course the evidence of previous waves of innovation and their attempts to find a silver bullet suggest that this point of realization might take some time to emerge.