The AI problems hype won’t solve

From opaque data and fragile agents to compliance gaps and accountability, the hardest AI problems begin where the demos end.

Jul 05, 2026

Every day brings another promise about AI: autonomous agents, exponential productivity, one-person companies, software that builds itself.

Some of this is real. The models are improving, and many of the products built around them are genuinely useful. But the public conversation mostly still focuses on what a model can do. That is only the beginning. The conversation needs to expand. Beyond the demos, prompts, skills, MCPs, connectors, and plugins, the harder conversation is about what these systems can reliably, safely, legally, and economically support once they become infrastructure.

Once AI becomes part of a company, hospital, law firm, government agency, or financial system, a different set of questions takes over. Can people trust the data behind it? Can it forget information it should not have? Will it behave the same way next month? Who is responsible when an agent makes ten small mistakes and causes one large failure?

Researchers and engineers are working on these problems, but they rarely fit into a product launch or a new model card. To make sense of them, I group them into four areas: (1) the data behind the models, (2) the way models learn and reason, (3) the difficulty of running them in production, and (4) the rules needed to use them responsibly.

1. The data foundation

AI systems begin with data. We know that in the abstract, but we often know surprisingly little about the actual material used to train a particular model.

1a. We cannot see where the data came from

Training data shapes what a model knows, whose culture it represents, which biases it repeats, what it memorizes, and which copyright or privacy risks it carries.

The Foundation Model Transparency Index has tracked how much major AI companies disclose about their models. Its 2024 report found some improvement over 2023, but continued secrecy around training data, copyright, data labor, downstream effects, and monitoring. In 2025, the index found that transparency had declined again. Companies revealed particularly little about their training data, computing resources, and how deployed models were being used.

This matters because models may learn from books, scientific papers, open-source code, journalism, images, websites, user conversations, company records, and synthetic content produced by other models. For much of that material, outsiders cannot answer basic questions. Who created it? Was it licensed or scraped? Was it private or copyrighted? Did a person write it, or did another model generate it? Can anyone independently check the answers?

Too often, the public is asked to trust the company that built the model. That is a weak foundation for infrastructure.

1b. We cannot reliably prove what a model saw

Even when a company says that it did or did not use a particular dataset, verification is difficult.

Suppose a publisher wants to know whether a copyrighted book helped train a model, or a company wants to check whether its private source code was included. Researchers still lack a dependable way to inspect a closed model and prove that a specific dataset influenced it. They must also separate memorization, where a model can reproduce material, from generalization, where the material changed a broader pattern the model learned. E.g., see data detection, Membership Inference Attacks, Shokri et al. 2017.

This leaves copyright holders, companies, researchers, and regulators in an awkward position. Policies may require documentation, but the underlying claims can remain hard to test. Compliance becomes heavy on paperwork and light on evidence.

1c. Deleting a record does not make a model forget

Privacy law is built around the idea that data can be deleted. That works reasonably well in traditional software. A company can remove a database record, delete a file, or allow an old log to expire.

A neural network does not keep each piece of information in a neat, isolated record. Training data can affect model weights, embeddings, fine-tuned versions, retrieval indexes, evaluation sets, logs, and later models created through distillation. Removing the original file may leave many of those effects intact.

Researchers call the attempt to remove those effects “model unlearning” (e.g., see Machine Unlearning, by Bourtoule et al. (2021)). The phrase sounds simple, but the test is not. Has a model forgotten something when it stops repeating the exact text? What if it can still infer the information? What if the data changed an association or capability? What if that knowledge has already passed into another model (i.e., model provenance)?

This unresolved problem sits underneath privacy, copyright, compliance, and the right to be forgotten. The industry likes to describe memory as a feature. It has said much less about forgetting as an obligation.

1d. The people who produce the data rarely share the gains

Modern AI depends on human work at an enormous scale: writing, code, research, music, art, photographs, videos, documentation, forum posts, annotations, and everyday online activity. Most of the economic value, however, flows to the companies that build and operate the models.

The people and institutions that produced the source material usually receive no payment, attribution, consent mechanism, or bargaining power. Jaron Lanier and E. Glen Weyl have argued for “data dignity”, an approach in which people have more control over the data they create and can share in its value. Foundation models make that old proposal much more urgent.

A search engine indexes the web and usually directs readers back to the source. A model can absorb patterns from the same material and produce an answer that competes with the writer, artist, publisher, or programmer who created it.

Copyright lawsuits are one part of this debate. The larger question is whether we need better systems for consent, attribution, licensing, royalties, data trusts, collective bargaining, or markets for high-quality data.

There is a practical concern too. Future models will need fresher, more specialized, and more carefully maintained information. If people have little reason to produce or license that material, the quality of the data supply will decline. AI companies need access to data, but society also needs an economic model that keeps human knowledge production alive.

2. Learning, memory, and understanding

More data and larger models do not automatically create systems that can keep learning, use long documents reliably, or explain how they reached a result.

2a. Continuous learning creates a moving target

Most foundation models are released in as a series of fixed versions. A lab trains a model, adjusts it, evaluates it, deploys it, and replaces it with a new version months later. Fixed versions are easier to test, compare, reproduce, and govern.

The world, of course, does not wait for the next model release. Laws change. New scientific results appear. Tools and user needs evolve. A useful AI system should adapt to this information without absorbing malicious content, losing old capabilities, or changing in ways nobody can trace.

That creates a difficult tradeoff. Who decides what a model learns? How do we test a system that changes every day? Can we reproduce an answer it gave six months ago? Can we reverse a harmful update? How do we tell useful adaptation from contamination?

AI companies talk often about model updates. A continuously learning model (e.g., Continuous Thought Machines, Darlow et al. 2025, Continuum Memory Systems (CMS) proposed in Nested Learning: The Illusion of Deep Learning Architecture, Behrouz et al. 2025) is a harder proposition because the thing being evaluated is never quite fixed.

2b. A large context window is not a reliable memory

Model providers often advertise how many words, pages, or tokens a model can accept at once. This is useful, but capacity is not the same as comprehension.

Research on the “Lost in the Middle” problem (Liu et al., 2024) found that models can miss relevant information when it appears in the middle of a long input. A model may accept an entire legal case, codebase, or research archive without giving equal attention to every part.

This creates an easy trap for users: “I gave the model everything, so it must have considered everything”. The better question is whether the model found the right information, treated it as important, reasoned correctly about it, and cited it accurately.

That difference matters in legal work, software engineering, medicine, science, finance, compliance, and company search. A bigger context window can hold more text. It does not, by itself, provide a trustworthy working memory.

2c. We still do not understand what models learn internally

Mechanistic interpretability is the effort to reverse-engineer the calculations inside a neural network. Instead of judging only its inputs and outputs, researchers look for the internal features, circuits, and algorithms that produce its behavior.

Work by Neel Nanda and collaborators on “grokking” (Nanda et al., 2023) conveys both the promise and the difficulty. The researchers studied small transformers trained on modular addition and reconstructed the algorithm those models had learned. Their analysis showed that an apparent leap in performance had been building internally over time.

That is impressive work on small models solving a narrow mathematical task. Frontier models are vastly larger and more general. Their internal representations remain mostly opaque (e.g., Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention, Huang et al., 2026).

Benchmarks can show what a model appears capable of doing. Interpretability aims to explain how it works and why it sometimes fails. We need both if these systems are going to influence safety-critical decisions.

3. From a good demo to a dependable system

A model can perform well in a controlled test and still be a fragile part of a real product. Production systems add changing vendors, hostile inputs, external tools, sensitive data, and long chains of actions.

3a. AI applications drift

Models change. Safety filters change (e.g., Anthropic Fable 5's safeguards). Prompts that worked last month stop working. Retrieval quality declines. Tool responses shift. Latency and prices move. A fine-tuned model drifts away from the behavior of its base model. Users also change how they interact with the product.

Traditional software teams use versioning, regression tests, monitoring, stable interfaces, and rollback plans to manage change. AI products and teams need the same discipline, but their failures are harder to spot because they are often semantic.

A function either compiles or it does not. A model can produce a polished answer that is subtly wrong. The system may keep running while its quality slowly deteriorates.

3b. Language is both the interface and the attack surface

Mark Russinovich groups three recurring risks under “The Price of Intelligence” (2025): hallucination, indirect prompt injection, and jailbreaks.

Hallucination is more than an occasional factual mistake. Language models generate probable continuations, even when their information is incomplete or uncertain. The answer can sound confident because fluency and accuracy are different properties.

Indirect prompt injection appears when a model reads hostile instructions hidden in outside content. That content might come from a website, email, document, support ticket, code repository, PDF, or company chat. The model may confuse text it should analyze with an instruction it should follow.

Jailbreaks exploit a related weakness. Natural language is the instruction layer, but it is also where attackers try to bypass the safety rules.

These risks grow when a model can act. A chatbot that invents a fact is frustrating. An agent that invents a fact and then uses it to send email, change code, access customer records, move money, or modify cloud infrastructure can cause direct harm.

3c. Agents can turn small mistakes into a large failure

AI agents are expected to break a goal into steps, choose tools, inspect the results, adjust the plan, and keep going. Each step creates another chance for error (e.g., Where LLM Agents fail and how they can learn from failures, Zhu et al., 2025).

An agent may misunderstand the request and call the wrong tool. That tool returns a misleading state. The agent treats the state as valid, makes another decision, and then gives a confident explanation of a path that was wrong from the start.

The resulting failure may not contain one dramatic hallucination. It can emerge from a chain of small, plausible mistakes that gradually push the task off course. In a demo, this may be funny. In engineering, finance, healthcare, law, infrastructure, or security, it can be dangerous.

Agent reliability, therefore, depends on more than a better model. The surrounding system needs permissions, checks, clear stopping conditions, independent verification, and a way to recover when a step goes wrong.

4. The rules for serious use

The most valuable AI applications often involve the most sensitive information. That is where consumer product assumptions collide with professional duties and public accountability.

4a. A warning label is not a compliance system

Lawyers must protect privileged information. Healthcare workers handle confidential medical records. Tax authorities, financial advisers, and banks operate under their own secrecy and regulatory duties. Companies hold personal data, source code, trade secrets, security records, and internal strategy.

Telling these users not to paste sensitive information into a chatbot/assistant does not solve the problem (see OpenAI's disclaimer on Privacy controls in ChatGPT and Anthropic, which follow the same line). Professional AI needs clear commitments about data retention, training use, jurisdiction, encryption, access control, vendor access, audit logs, deletion, incident response, and liability. In such a scenario, AI insurance becomes a thing (e.g., Klaimee, Kinro).

Without those protections, many high-value uses remain legally uncertain or operationally unsafe. The distance between a weekend AI prototype and a system a large organization can trust is filled with these unglamorous requirements.

4b. Public benchmarks show only part of the picture

Benchmarks are useful, but they are not reality. Test sets can leak into training data, become saturated, reward shallow pattern matching, and miss rare failures with serious consequences. Most benchmarks say little about compliance, security, reliability over time, or an agent’s ability to complete a long task.

Some of the evaluations that matter most are also private. How do AI labs test deception, autonomy, cyber capability, persuasion, biological risk, tool misuse, data leakage, or long-range planning? Which failures do they find internally? Which risks do they decide are acceptable before release? E.g., see for yourself by checking the official documents they release: Anthropic Transparency Hub - Anthropic System Card: Claude Fable 5 & Claude Mythos 5, and OpenAI Preparedness Framework - GPT-5.6 Preview System card.

The public sees leaderboards (e.g., Artificial Analysis, Arena.ai, HuggingFace), product demos, and selected safety reports. It rarely sees the full range and frequency of failures. That makes it difficult for customers, researchers, and governments to judge whether a model is ready for a particular use.

Capability is only the first test

The divide in AI is not between believers and skeptics. It is between what a model can demonstrate and what a responsible institution can depend on and trust.

More computing power may improve capability. More data may broaden coverage. Larger context windows may help with some tasks. Better training may improve behavior. None of those advances, on their own, tell us where the data came from, whether it can be removed, how an agent will behave after a chain of mistakes, or who carries responsibility when the system causes harm.

If AI is going to become infrastructure, progress cannot mean capability alone. It must also mean that these systems become easier to inspect, govern, secure, correct, and hold accountable. Those problems are less exciting than a polished demo. They are also the work that determines whether the demo can survive contact with the real world.

grayscale photo of person holding flower — Photo by Benjamin Zanatta on Unsplash

If you found this useful, please cite this write-up as:

Müller, Lucas. (Jul 2026). The AI problems hype won’t solve. lucasmuller.com. https://notes.lucasmuller.com/p/the-ai-problems-hype-wont-solve

@article{lucasmuller2026default,
  title   = {The AI problems hype won’t solve},
  author  = {Müller, Lucas},
  journal = {lucasmuller.com},
  year    = {2026},
  month   = {Jul},
  url     = {https://notes.lucasmuller.com/p/the-ai-problems-hype-wont-solve}
}

Lucas Müller Notes

Discussion about this post

Ready for more?