What does an LLM perceive? Not the world. Not even representations of the world. It perceives tokens - linguistic fragments already filtered through human perception and language.
This is the LLM’s umwelt: a world made entirely of text.
Plain English: how LLMs “see”
Skip to the Technical detail section if you already know how transformers work.
The double filter
A tick perceives butyric acid directly from a mammal’s skin. An LLM perceives “butyric acid” - the words. The difference matters.
Tick: World → 3 chemical signals → Tick's experience
Human: World → Senses → Brain → Human's experience
LLM: World → Human describes it → Text → Tokens → LLM's experience
The LLM never touches the world. It only receives text that humans wrote about the world. We are the middleman. Human language is the LLM’s only sensory organ.
What happens inside (the soup analogy)
When you type a message to an LLM, here’s what happens:
Step 1: Chop into tokens
Your text gets split into chunks. “Understanding” might become [“Under”, “stand”, “ing”] - the exact split depends on the tokenizer. These chunks are called tokens. The splitting rule is fixed - the model can’t change how it chops.
Step 2: Convert to numbers
Each token becomes a list of ~4000 numbers (a vector). This is the model’s native language. Text in, numbers out. Like how your ear converts air vibrations into nerve signals.
Step 3: Process through layers (the soup part)
Imagine making soup:
- Without keeping ingredients: Add carrots to water. Drain. Add onions to fresh water. Drain. Add celery to fresh water. You end up with celery water - everything else got thrown away.
- With keeping ingredients: Add carrots. Add onions to the carrot water. Add celery to that. You get actual soup - everything builds up.
LLMs work like the second way. Layer 1 processes the numbers and adds its understanding. Layer 2 takes that and adds more. Layer 40 adds more. By layer 80+, you have rich “soup” - accumulated understanding from every layer.
This accumulating understanding is called the residual stream. Nothing gets thrown away. Each layer contributes.
Step 4: Output
The final layer looks at all that accumulated understanding and predicts: “What word comes next?”
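If code is clearer than soup, here is a deliberately tiny sketch of the four steps in Python/NumPy - made-up vocabulary, random weights, two stand-in layers. It mimics the shape of the pipeline, not any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: chop into tokens (toy vocabulary; real tokenizers use learned subword splits)
vocab = ["Under", "stand", "ing", "the", "bell", "rings"]
token_ids = [vocab.index(piece) for piece in "Under stand ing".split()]

# Step 2: convert to numbers - each token id picks a row of an embedding matrix
d_model = 8                                   # real models use thousands of dimensions
embeddings = rng.normal(size=(len(vocab), d_model))
stream = embeddings[token_ids]                # shape: (3 tokens, 8 numbers each)

# Step 3: process through layers - each layer ADDS to the stream (the soup)
for _ in range(2):                            # real models: 80+ layers
    layer_weights = rng.normal(size=(d_model, d_model)) * 0.1
    stream = stream + np.tanh(stream @ layer_weights)   # nothing is thrown away

# Step 4: output - turn the last position into "what comes next?" probabilities
logits = stream[-1] @ embeddings.T            # score every vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(vocab[int(probs.argmax())])             # the toy model's guess at the next token
```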
What LLMs cannot perceive
The tick has no color in its world. The LLM has no:
- Direct sensation (only descriptions of sensation)
- Time passing (only position markers in text)
- Physical reality (only words about physics)
- Bodies (only language about bodies)
A bell, to an LLM, activates patterns learned from millions of bell-descriptions. No vibration. No sound. No metal. Only statistical echoes of humans trying to capture “bell” in words.
Technical detail
Tokenization (locus of stimulation)
Text splits into subword tokens. The tokenizer is fixed, not learned - like the tick’s sensory apparatus, it defines what can be perceived before processing begins. This is the boundary between Umgebung (external environment) and Umwelt (subjective world).
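A concrete illustration, assuming the Hugging Face transformers package and using GPT-2’s tokenizer purely as an example (any model’s own tokenizer behaves analogously):

```python
# Subword tokenization with a real, fixed (not learned-at-inference) tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

print(tok.tokenize("Understanding the umwelt"))
# e.g. ['Under', 'standing', 'Gthe', ...] - exact splits vary by tokenizer
print(tok.encode("Understanding the umwelt"))
# the same pieces as integer ids: the only stimuli the model ever receives
```

Whatever the tokenizer cannot represent simply does not exist in the model’s Umwelt.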
Embedding layer (sensory transduction)
Each token becomes a high-dimensional vector (4096+ dimensions). Transduction: discrete symbols converted to the model’s native representation space. The embedding isn’t perception - it’s the nerve impulse, ready for processing.
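The lookup itself is just indexing into a learned matrix. A minimal sketch, with random numbers standing in for learned weights and toy sizes standing in for real ones:

```python
import numpy as np

# Toy sizes; real models use vocabularies of tens of thousands of tokens and 4096+ dimensions.
vocab_size, d_model = 1000, 16
embedding_matrix = np.random.default_rng(0).normal(size=(vocab_size, d_model))

token_ids = [317, 928, 4]                     # arbitrary ids produced by a tokenizer
vectors = embedding_matrix[token_ids]         # shape: (3, 16) - one vector per token
```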
Attention mechanism (selective perception)
Every token computes relevance to every other token:
| Component | Function |
|---|---|
| Query | “What am I looking for?” |
| Key | “What do I contain?” |
| Value | “What do I contribute?” |
Multi-head attention runs multiple attention patterns in parallel - different ways of perceiving simultaneously. Some heads track grammar, some track what “it” refers to, some track meaning similarity, some do things we don’t understand yet.
This is where context enters. Unlike the tick’s independent signals, an LLM’s perception of “bank” shifts based on surrounding tokens.
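A single-head, scaled dot-product sketch in NumPy (toy sizes, random weights) to make the table above concrete:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8          # toy sizes

x = rng.normal(size=(n_tokens, d_model))      # residual-stream input to this head
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

Q = x @ W_q                                   # "What am I looking for?"
K = x @ W_k                                   # "What do I contain?"
V = x @ W_v                                   # "What do I contribute?"

scores = Q @ K.T / np.sqrt(d_head)            # every token scores every other token
weights = softmax(scores)                     # each row sums to 1: where to "look"
output = weights @ V                          # context-mixed result, shape (5, 8)
```

A real decoder-only model also applies a causal mask so a token can only attend to earlier positions, and runs many such heads in parallel before mixing their outputs back into the stream.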
Feed-forward layers (pattern recognition)
Dense neural networks at each position. Where factual knowledge seems to live - patterns learned from training data. If attention asks “what’s relevant here?”, feed-forward asks “what do I know about this?”
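A sketch of the position-wise block, assuming the common two-matrix MLP with a GELU-style nonlinearity (toy sizes, random weights):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, a common transformer nonlinearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                        # the hidden width is typically ~4x d_model

W_in = rng.normal(size=(d_model, d_ff)) * 0.1
W_out = rng.normal(size=(d_ff, d_model)) * 0.1

x = rng.normal(size=(5, d_model))             # 5 token positions from the stream
out = gelu(x @ W_in) @ W_out                  # expand, nonlinearity, project back
                                              # applied to each position independently
```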
Residual stream (accumulating perception)
Each layer adds to a running representation rather than replacing it:
Layer 1: embedding + layer_1_output → stream_1
Layer 2: stream_1 + layer_2_output → stream_2
...
Layer 80: stream_79 + layer_80_output → final_representation
Early layers tend to handle syntax; later layers tend toward abstraction. The “perception mark” - what the model experiences - is this accumulated activation pattern across all layers.
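In code, the accumulation is literally repeated addition. A schematic loop - sublayers reduced to placeholders, layer norms omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 4, 5, 16        # toy sizes; real models have 80+ layers

def attention(x):                             # placeholder for the real attention sublayer
    return 0.01 * np.tanh(x @ rng.normal(size=(d_model, d_model)))

def feed_forward(x):                          # placeholder for the real MLP sublayer
    return 0.01 * np.tanh(x @ rng.normal(size=(d_model, d_model)))

stream = rng.normal(size=(n_tokens, d_model)) # the token embeddings

for _ in range(n_layers):
    stream = stream + attention(stream)       # add, never replace
    stream = stream + feed_forward(stream)    # the "soup" keeps every contribution

final_representation = stream
```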
Why “quasi-mental laws”? In von Uexküll’s human diagram:
- Physical laws = sound waves traveling (fully understood)
- Physiological laws = ear converting vibrations to nerve signals (measurable biology)
- Quasi-mental laws = how nerve signals become the experience of hearing (the hard problem)
The residual stream is where token embeddings become “understanding.” We can measure the vectors. We can see attention patterns. But we don’t fully know how 80 layers of matrix math produce something that grasps context, humor, sarcasm. It’s the transformer’s hard problem.
Output layer (action)
Final layer produces probability distribution over vocabulary. Completes the functional loop: stimulus → processing → action. Von Uexküll’s Funktionskreis.
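A sketch of that final step, assuming a plain unembedding projection back to vocabulary space (toy sizes, random weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 16                # toy sizes

final_representation = rng.normal(size=d_model)    # stream at the last token position
W_unembed = rng.normal(size=(d_model, vocab_size))

logits = final_representation @ W_unembed     # one score per vocabulary token
probs = softmax(logits)                       # distribution over "what comes next"
next_token_id = int(probs.argmax())           # greedy choice; real systems usually sample
```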
Mapping table
| Human perception | LLM equivalent | Plain English |
|---|---|---|
| Source of stimulus | World | The actual bell |
| Physical laws | Linguistic structure | Grammar, word patterns |
| Locus of stimulation | Tokenization | Where text enters the model |
| Perception organ | Embedding + Attention | Converting and filtering input |
| Physiological laws | Learned weights | The 96B+ parameters - “how to see” |
| Quasi-mental laws | Residual stream | Understanding building up layer by layer |
| Perception mark | Activation pattern | The model’s internal “image” of the bell |
| Innenwelt | Full forward pass | Complete input-to-output mapping |
Open questions
- Does the residual stream constitute an Innenwelt?
- Are activation patterns the LLM equivalent of qualia? (See: virtual-qualia)
- What happens when LLMs gain vision and audio - perception beyond text?
Sources
- Von Uexküll, A Foray into the Worlds of Animals and Humans (1934)
- Vaswani et al., “Attention Is All You Need” (2017)
- Elhage et al., “A Mathematical Framework for Transformer Circuits” (2021)