tl;dr:
- Today’s AI systems lack crucial capabilities (e.g. memory, learning from experience, …)
- Developers find themselves repeatedly engineering workaround solutions for missing capabilities
- Workaround solutions are short-lived, as they are quickly replaced by superior end-to-end versions.
- The engineering effort required to build AI products drops as more capabilities are built directly into the models.
Lukas Petersson recently expressed a similar opinion in his blog post (https://lukaspetersson.com/blog/2025/bitter-vertical/), which served as inspiration for my own reflections. Let me share some experiences from projects my team and I delivered for clients at the AI consultancy I co-founded:
Insurance Claims Analysis
One of our clients, an insurance provider, had a team that manually reviewed long documents (20–500 pages) to assess claims. Our mission was to create a system that automatically extracts key information relevant to assessing the validity of claims.
We first tried a RAG approach, but ended up designing a multi-stage, iterative AI pipeline to ensure the accuracy and completeness of the extracted information. Building this solution took about three months. Just as we finished, OpenAI released Deep Research. With this new feature, our heavily engineered AI pipeline would have been unnecessary: we could have built the same tool in just a few weeks, with even better accuracy.
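To make the "multi-stage, iterative" idea concrete, here is a rough sketch of what such a pipeline can look like. All names are illustrative, not our client code, and `call_llm` stands in for a real model API call:

```python
# Hypothetical sketch of a multi-stage extraction pipeline for long
# claim documents. Stage 1 extracts candidates per chunk, stage 2
# reconciles them, stage 3 re-checks fields that came back empty.
def call_llm(prompt: str) -> str:
    # Placeholder: replace with an actual model API call.
    return "stub"

def chunk(document: str, size: int = 2000) -> list[str]:
    # Split a long document into fixed-size windows.
    return [document[i:i + size] for i in range(0, len(document), size)]

def extract_claims_info(document: str, fields: list[str]) -> dict:
    # Stage 1: extract candidate values for each field from each chunk.
    candidates: dict[str, list[str]] = {f: [] for f in fields}
    for part in chunk(document):
        for field in fields:
            answer = call_llm(f"From this excerpt, extract '{field}':\n{part}")
            if answer:
                candidates[field].append(answer)
    # Stage 2: reconcile conflicting candidates into one value per field.
    result = {}
    for field, values in candidates.items():
        result[field] = call_llm(
            f"Pick the most plausible value for '{field}' from: {values}"
        )
    # Stage 3: completeness check; retry extraction for missing fields.
    for field in [f for f, v in result.items() if not v]:
        result[field] = call_llm(
            f"Search the full document for '{field}':\n{document}"
        )
    return result
```

The iteration in stage 3 is what took most of the engineering effort: every field had to come back non-empty and verifiable before the output was shown to a claims reviewer.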
Key takeaway: the moment a more advanced AI feature or model arrives, any hand-engineered workaround is rapidly overshadowed.
Head Hunting Tool
A headhunting agency tasked us with building a tool that identifies potential candidates for a specific position and conducts extensive web research to collect key data points such as age, languages, experience, expertise, and seniority. The tool then evaluates each candidate against the criteria set by the headhunter.
Again, we missed out on Deep Research. We had to build our own version, which was not as good as OpenAI's end-to-end solution.
Accounting Automation
Several accounting firms asked us to build AI tools that would automate booking of invoices, bank receipts, credit cards, etc. We started to develop AI solutions that plug into their existing accounting software and workflows. These solutions helped boost productivity, but the impact was limited.
Why? It turns out accountants spend most of their time on a small subset of particularly challenging cases. While 90% of invoices are straightforward and the AI's booking suggestions can be accepted as-is, the remaining 10% are much more complex. For example:
- Invoices from foreign countries with complex tax implications
- Missing information about a purchase → Accountant must track down and contact the person who was responsible for the purchase
- Missing invoices → Accountant goes on a paper trail hunt
Truly automating accounting means solving the hard 10%. I believe developing such a system with today’s AI is quite challenging and would require a considerable amount of engineering.
Here is a table of capabilities that I found myself engineering workarounds for again and again, simply because these capabilities were or still are not integrated end-to-end into AI models. Clearly, many other developers face the same limitations, which is why platforms like LangChain have emerged.
| Capability | When It's Needed | Example in Accounting | Workaround Solutions | End-To-End AI Solutions |
|---|---|---|---|---|
| Reasoning | Any non-trivial task (e.g., coding, math, medicine) | Complex decisions, e.g. choosing the correct tax account | Manual chain-of-thought prompting | Reasoning models (o1, Grok 3, R1, …) |
| (Re)search | Finding specific information in vast amounts of data | Finding the correct invoice related to a given bank transaction | (Agentic) RAG | Deep Research (OpenAI) |
| Memory | Retaining context beyond the model's context length | An AI agent asks a human for context about a purchase. When the same purchase occurs again next month, the AI should remember last month's conversation instead of asking again. | Summarizing/compressing past messages; detecting key pieces of information and writing them to an external memory (e.g., a text file) that is later reinjected into the context (ChatGPT's memory feature works like this) | Infinite context; memory-augmented models; Mamba architecture |
| Learning from Experience | Capturing tacit knowledge that, as I mentioned in a previous post, cannot simply be put into words and usually comes from experience or trial and error | The AI makes a mistake in a booking, which a human notices and corrects | Reflection prompting plus adding the result to memory; fine-tuning | End-to-end reflection and memorization; inference-time learning |
| Computer Use | Automating tasks where the involved software tools lack complete APIs | Many (older) accounting systems offer no API | Feeding screenshots to vision models and prompting them to choose the next best action | Models specifically trained with RL for computer use, e.g. OpenAI Operator, Anthropic Computer Use |
| Learning from Demonstration | Learning to complete a task from a human's step-by-step explanation (e.g. via screen share and audio) | The accountant demonstrates where to archive a receipt and which naming conventions to use for the files | One model analyzes the human demo and outputs a text description, which is fed into the context of a second model that performs the task | Likely solved automatically once end-to-end memory and learning from experience exist |
My guess is that model providers are racing to create AIs that incorporate all of the above capabilities natively. This means the continuation of the following two trends that I have observed so far:
- Workaround solutions are short-lived, as they are quickly replaced by superior end-to-end versions.
- The engineering effort required to build AI products drops as more capabilities are built directly into the models.
What does this mean for AI application layer startups?
I think most AI startups today are essentially building three things:
- A nice UI
- (Boilerplate) backend code (databases, APIs, workflow logic, …)
- Engineering workarounds for missing AI capabilities
But what if all of the capabilities I listed above become available out of the box, as seems likely within the next one to two years? Does that mean AI application layer startups are doomed? What we know is that in the short term these startups provide significant value. Take Cursor, for example: I use it daily, and it has made me roughly five times faster at coding. However, once capabilities like computer use become both reliable and cheap enough, tools like Cursor will be obsolete. Instead, I will simply task my computer-use AI agent to code directly in my favorite development environment.
So, how can AI application layer startups remain relevant?
One way I see is to leverage proprietary data gathered during this short transitional period, before end-to-end AI agents become available. With that unique data, startups could either enhance future third-party AI agents or build their own agents from scratch. The feasibility of the latter hinges on whether capable AI eventually becomes a commodity, which currently seems possible.
But there is one caveat: emerging research increasingly suggests that pure RL approaches to training agents may eventually outperform hybrid methods that incorporate human demonstration data. It remains to be seen how valuable proprietary data will ultimately be.
The AI landscape is changing fast. My experience over the past one and a half years suggests that while engineering workarounds can bridge today's gaps, the future belongs to AI systems that integrate crucial capabilities end-to-end. Less manual engineering isn't just a design preference; it's essential for unlocking AI's full potential. In the long run, only true end-to-end AI solutions will thrive, while today's patchwork approaches will quickly become relics of the past.
For me, the key question remains: how can we build lasting value as foundation models become ever more powerful?