Prompting Is Programming: Unlocking the Hidden Intelligence of LLMs
Beyond Pattern Matching: Towards Genuine AI Understanding
You’ve probably seen the headlines: “GPT-5 achieves PhD-level intelligence.”
And you’ve also seen the screenshots of its comically dumb mistakes.
How can the same model swing between brilliance and blunders?
That’s the puzzle I’ll tackle in this post. And here’s what I’ll argue:
With the right structure, even a weaker LLM can perform like a top-tier reasoning model. You don’t need fine-tuning—just carefully designed prompts.
LLMs need symbolic grounding and structural guidance. Treating them as free-form text generators guarantees instability.
For practitioners and industry applications: don’t just ask questions casually—program your prompts. Structure them algorithmically and localize critical information (even with redundancy). This dramatically improves reliability.
The Distribution of Intelligence
The key to understanding these contradictions is this: an LLM is not a single, static intelligence. It’s a distribution of behaviors that depends on the input context. Given the same model:
Sometimes you’ll get an answer that feels brilliant.
Other times, you’ll get something that looks careless or even nonsensical.
Both are “true” samples from the distribution. Quoting a single failure or single success isn’t representative.
We need better qualifiers, each of which can be measured empirically (see the sketch after this list):
Intelligence capacity (at max): For a given task, how far could the model go if we optimized the prompt perfectly?
Intelligence on average: How well does it perform with a typical, unoptimized prompt?
Intelligence variance: How stable is it when you rephrase the same intent with small syntactic changes?
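These are not just rhetorical labels. Here is a minimal sketch of how you might estimate all three, assuming a hypothetical call_model(prompt) wrapper around whatever LLM API you use: score the same test cases under several paraphrases of the same intent, then look at the best score (a lower bound on capacity), the mean, and the variance.

```python
import statistics

def call_model(prompt: str) -> str:
    """Hypothetical wrapper around your LLM API of choice; plug in a real client here."""
    raise NotImplementedError

def accuracy(prompt_template: str, cases: list[tuple[str, str]], samples: int = 10) -> float:
    """Fraction of correct answers over `samples` runs per (question, expected) case."""
    correct = total = 0
    for question, expected in cases:
        for _ in range(samples):
            correct += int(expected in call_model(prompt_template.format(question=question)))
            total += 1
    return correct / total

def profile(paraphrases: list[str], cases: list[tuple[str, str]]) -> dict:
    """Estimate the three qualifiers across paraphrases of the same intent."""
    scores = [accuracy(p, cases) for p in paraphrases]
    return {
        "capacity (lower bound)": max(scores),      # best prompt we happened to try
        "average": statistics.mean(scores),         # typical, unoptimized behavior
        "variance": statistics.pvariance(scores),   # sensitivity to rephrasing
    }
```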
The new prompting theory result matters because it tells us that capacity is there. The challenge is accessing it.
Is Prompting Turing-Complete with a Given LLM?
When people say “LLMs are universal computers in disguise,” they’re not just speaking metaphorically. A recent ICLR 2025 paper — Ask, and It Shall Be Given: On the Turing Completeness of Prompting — proved that, in theory, you can encode any computable function into a finite prompt, and a single Transformer can execute it.
This rests on the idea of Turing completeness. Put simply:
A system is Turing-complete if, given enough time and memory, it can compute anything that is computable.
Alan Turing formalized this in the 1930s with the notion of the Turing machine, which defined the limits of computation.
Many things that look simple — like Python, spreadsheets, or even Conway’s Game of Life — are Turing-complete.
But here’s the catch:
👉 Being Turing-complete doesn’t mean being efficient or practical. It just means “possible.” You could, in theory, build a skyscraper in Manhattan out of spaghetti.
That distinction matters. It’s one thing to say a model could check whether a 10,000-digit number is prime. It’s another to actually make it do so reliably, today, under realistic constraints.
Theory Meets Friction
So, in theory, LLMs can do anything. But in practice, they stumble on surprisingly simple tasks.
Take prime checking: Turing-completeness tells us that with the right prompt, an LLM could decide whether a 10,000-digit number is prime. That’s the “possible.”
Now take multiplication: ask GPT-4o to multiply two 10-digit numbers, and you’ll often get garbage. That’s the “practical.”
Why the gap?
It’s not arithmetic. Models are rock-solid on single-digit multiplications.
It’s composition and memory. Long computations require chaining many small steps, carrying state forward correctly. That’s where today’s autoregressive models struggle: attention fades, errors accumulate, and intermediate results slip out of reach.
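A rough back-of-the-envelope calculation shows why this bites so hard: if each elementary step succeeds with probability p and a computation chains n dependent steps, the whole thing succeeds with probability about p^n. With p = 0.99 and n = 200 (an illustrative step count for a 10×10-digit multiplication, not a measured one), that is 0.99^200 ≈ 0.13. Near-perfect local accuracy still collapses end to end.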
This is where the cracks in “just prompting” show up — and where structure begins to matter.
The Long Multiplication Benchmark
Ramus Lindahl created a benchmark repo that tests CoT prompting across models on digit-by-digit multiplication. The results (percentage of correct answers out of 50 samples per digit length) show just how quickly models collapse without explicit scaffolding.
Even state-of-the-art models degrade sharply past 5–6 digits under CoT prompting alone.
A similar observation was reported by Yuntian Deng on X:

My Experiment: Multiplying 7–10 Digit Numbers
I tested both GPT-4o and GPT-4.1-mini with three prompt styles (sketched in code after this list):
Simple Prompt: “Multiply these two numbers.”
Chain-of-Thought (CoT) Prompt: “Think step by step.”
Algorithm Prompt: Explicitly provided a step-by-step multiplication algorithm, with instructions to rewrite intermediate results at each step.
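For concreteness, here is a condensed, illustrative sketch of the three styles and how you could run them with the OpenAI Python SDK; the exact prompts and harness are in the GitHub repo mentioned at the end.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()

SIMPLE = "Multiply these two numbers: {a} x {b}. Give only the final product."

COT = "Multiply these two numbers: {a} x {b}. Think step by step, then give the final product."

# Condensed, illustrative version of the algorithmic prompt: the real one is longer,
# but the key idea is forcing the model to rewrite its intermediate state at every step.
ALGORITHMIC = """Multiply {a} x {b} using long multiplication. Follow this procedure exactly:
1. Write {b} as a list of digits, least significant first.
2. For each digit d of {b}, multiply {a} by d one digit at a time,
   writing the running partial product and the current carry after EVERY single-digit step.
3. After finishing each digit of {b}, rewrite ALL partial products computed so far,
   each shifted by the correct number of zeros.
4. Add the partial products column by column, rewriting the running total after each column.
5. Give the final product on its own line as: ANSWER = <number>.
"""

def ask(prompt_template: str, a: int, b: int, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt_template.format(a=a, b=b)}],
    )
    return response.choices[0].message.content
```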
Expectation vs Reality
Previous reports (CoT prompting only) suggested:
gpt-4o collapses past 7×7 (0–12% accuracy).
o1-mini and o3-mini achieve 87–97% on 7×7, but degrade at 10×10.
My experiments show a similar picture for CoT prompting, but a very different one for structured prompting. When given algorithmic prompts:
gpt-4.1-mini (a non-reasoning model) reached 100% on 7×7 and 70% on 10×10.
Even gpt-4o (an earlier non-reasoning model) jumped from essentially 0% under CoT to 100% on 7×7 and 70% on 10×10, with exactly the same structured prompt.
👉 The lesson: don’t just blame the model. With explicit structure, even “weaker” models can leapfrog their expected performance.
Why Explicit Structure Works
The improvement didn’t come from “making the model smarter.” It came from scaffolding the task into small, reliable pieces:
LLMs are flawless at single-digit multiplication.
Long multiplication reduces big problems into small local multiplications plus simple memory (carry tracking).
Attention is unreliable for recalling old intermediate states.
The algorithmic prompt forced the model to refresh memory after each step by rewriting partial results. It never had to juggle the full problem — only small subproblems, updated iteratively.
The prompt itself became a program template (a plain-Python version of the same procedure follows below).
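Here is that procedure as a plain Python reference implementation, a sketch to show that every step reduces to a single-digit multiply plus carry bookkeeping. Since Python’s built-in integers are exact, it also doubles as a ground-truth check for the model’s answers.

```python
def long_multiply(a: int, b: int) -> int:
    """Digit-by-digit long multiplication: only single-digit products and carries."""
    a_digits = [int(d) for d in str(a)][::-1]   # least significant digit first
    b_digits = [int(d) for d in str(b)][::-1]

    partials = []
    for shift, bd in enumerate(b_digits):
        carry = 0
        row = []
        for ad in a_digits:
            product = ad * bd + carry   # the only "hard" step: one single-digit multiply
            row.append(product % 10)    # write down one digit...
            carry = product // 10       # ...and carry the rest forward
        if carry:
            row.append(carry)
        # Rewrite the full partial product (shifted), instead of relying on memory.
        partials.append(int("".join(map(str, row[::-1]))) * 10 ** shift)

    return sum(partials)  # column-wise addition, done by Python here for brevity

assert long_multiply(1234567, 7654321) == 1234567 * 7654321
```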
This isn’t just about math; it’s about how we design AI systems.
Broader Implications
This bridges theory and practice:
Naive prompting (“just multiply”) fails because the model has no scaffold.
CoT prompting (“think step by step”) helps a little, but variance remains high.
Algorithmic prompting — explicit, structured, memory-updating scaffolds — lets even smaller models generalize beyond their “default” limits.
This mirrors Yuntian Deng’s observation: supervised-only models (like GPT-4o) collapse quickly, and RL-trained reasoning models (like o1) extend further, but scaffolding via prompt structure alone can push weaker, non-reasoning models into stronger performance regimes.
The Real Bottleneck: Structure, Not Models
This gap between capacity and practicality isn’t unique to arithmetic. A recent MIT study found that for 95% of companies, generative AI implementations are falling short. Strikingly, the problem wasn’t model performance or regulation, but a learning gap in tools and organizations.
Executives often blame model quality, but the research suggests otherwise:
Generic, flexible tools like ChatGPT shine for individuals,
Yet they stall in enterprises because they don’t adapt to workflows or provide structural scaffolds.
This resonates directly with what we saw in multiplication. The models may already have the capacity. What’s missing is structure. When given scaffolds, rules, or templates, they can suddenly leap far beyond their “average” performance.
Beyond Prompting: Structure as First-Class Citizen
What these experiments reveal is a deeper principle:
AI needs to be symbolically grounded and structurally guided.
Current LLMs are auto-regressive token generators: each new token depends only on the previous context. If an early mistake slips in, it cascades, derailing the whole computation.
But when we provide explicit structure in the prompt, we constrain the model’s path and prevent drift. This lets us unlock its true intelligence capacity.
Ideally, this structural guidance should not just be hacked in via long prompts — it should be part of the model architecture itself. Instead of sampling from the entire vocabulary at each step, generation could be guided by a symbolic contract.
One promising way forward is to integrate domain-specific grammars (a minimal sketch follows this list):
Use EBNF-style generative grammars (or probabilistic CFGs) as contracts.
They are human-accessible and controllable, but also learnable from data.
They ensure that outputs always respect the intended structure, turning prompting into programming with guarantees.
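As a minimal illustration (not a production system), here is what a “grammar as contract” could look like at the output-validation level today, using the lark parsing library as one possible tool; the trace format and grammar below are entirely illustrative.

```python
from lark import Lark  # pip install lark

# Illustrative contract: the output must be a sequence of carry-tracking steps
# followed by exactly one final answer line.
TRACE_GRAMMAR = r"""
    start: step+ answer
    step: "step" INT ":" INT "*" INT "+" INT "=" INT "carry" INT NEWLINE
    answer: "answer" "=" INT NEWLINE*

    %import common.INT
    %import common.NEWLINE
    %import common.WS_INLINE
    %ignore WS_INLINE
"""

parser = Lark(TRACE_GRAMMAR)

def respects_contract(output: str) -> bool:
    """True iff the model's output parses under the grammar."""
    try:
        parser.parse(output)
        return True
    except Exception:  # any lexing/parsing error means the contract was violated
        return False

good = "step 1: 7 * 8 + 0 = 56 carry 5\nstep 2: 6 * 8 + 5 = 53 carry 5\nanswer = 536\n"
bad = "The answer is probably 536, give or take."
assert respects_contract(good) and not respects_contract(bad)
```

Validation after the fact is the weakest form of the idea; the stronger version constrains decoding itself, so that invalid continuations are never sampled in the first place.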

In other words: the future of AI isn’t just larger models — it’s structurally grounded models where language, symbols, and reasoning cohere by design.
PS: If you’re interested, let’s chat; I have a few ideas on how to actually build it.
The Principle for Industrial AI
Most industrial applications don’t require general-purpose reasoning across everything. They live in limited domains: regulatory workflows, scientific reasoning, customer support. Within such domains, careful prompt engineering can yield dramatic gains.
Two principles stand out (a template sketch follows the list):
Structure the response. Give explicit templates/examples. Force step-by-step reasoning into stable scaffolds.
Localize the information. Place necessary data near the point of inference. For long tasks, require repeated summaries or scratchpads so models don’t rely solely on attention.
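As a concrete (and deliberately generic) illustration of both principles, here is one possible prompt template; the section names and the scratchpad convention are just one scheme among many, not a standard.

```python
# A generic template applying both principles. Section names, the scratchpad
# convention, and the placeholder fields are illustrative, not a standard.
STRUCTURED_PROMPT = """You are assisting with {domain_task}.

RELEVANT FACTS (use ONLY these; they are restated here so you never have to
search the rest of the conversation):
{relevant_facts}

PROCEDURE (follow exactly, one numbered step at a time):
{procedure_steps}

OUTPUT FORMAT:
- After EACH step, rewrite a SCRATCHPAD section containing every intermediate
  result so far, even if it repeats earlier text.
- End with a single line: FINAL ANSWER: <answer>.

TASK:
{task_input}
"""

prompt = STRUCTURED_PROMPT.format(
    domain_task="regulatory document review",
    relevant_facts="- The policy applies to filings after 2023-01-01.\n- ...",
    procedure_steps="1. Extract the filing date.\n2. Check it against the policy window.\n3. ...",
    task_input="Does the attached filing fall under the policy?",
)
```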
Takeaway
LLMs are universal computers in theory. But in practice, their performance depends on whether we treat prompting as casual conversation or as programming.
When we design prompts as structured programs — with scaffolds, explicit steps, and localized memory — we unlock the upper bound of model intelligence, not just its average behavior.
So yes: prompting is Turing-complete.
But more importantly: prompting is programming.
Note: I’d like to thank Giacomo Bocchese for the references to other resources.
The code and the prompts are on GitHub.





