Toward a Deeper Understanding: How Attention Types Help Us (Traditional vs. Linear)
Introduction
In the world of Artificial Intelligence, “Attention” is the mechanism that allows models to focus on the most relevant parts of an input. Whether it’s a chatbot answering a query or an AI analyzing a medical report, the model must decide which words or data points matter most.
However, as we push AI to solve deeper, more complex problems—like analyzing entire libraries of books or thousands of lines of code—the way we calculate attention becomes a make-or-break factor. To understand why we are moving toward Linear Attention (RWKV), we first need to understand the limitations of the Traditional Attention that powers models like GPT.
Limitations of Matrified Attention
The fundamental flaw of traditional "Matrified" Attention is its Quadratic Complexity ($O(N^2)$). In a standard Transformer, the attention mechanism is calculated using the formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Mathematically, the term $QK^\top$ represents a dot-product between every Query ($q_i$) and every Key ($k_j$). If your input has $N$ tokens, this operation results in an $N \times N$ matrix. As $N$ grows, the number of individual calculations and the memory required to store this matrix explode quadratically.
For example, doubling the input length (from 2,000 to 4,000 tokens) doesn’t just double the workload—it quadruples it. This leads to the “Memory Wall”: even high-end GPUs eventually run out of VRAM because the KV-Cache (the memory stored to remember previous tokens) grows linearly with every new word generated. Consequently, Matrified Attention becomes a bottleneck for long-form tasks like analyzing legal documents, entire code repositories, or genomic sequences, where the “Attention Matrix” simply becomes too massive to compute or store efficiently.
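To make the scaling concrete, here is a minimal NumPy sketch of the formula above (the sequence length and head dimension are illustrative values, not from any particular model). The `scores` array is exactly the $N \times N$ attention matrix that causes the blow-up:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (N, N) -- quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (N, d)

N, d = 4_000, 64                                      # illustrative sizes
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(naive_attention(Q, K, V).shape)                 # (4000, 64)
# The (N, N) score matrix alone holds 4,000^2 = 16,000,000 floats;
# doubling N to 8,000 quadruples that to 64,000,000 -- the quadratic wall.
```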
How Linear Attention Solves It!
Linear Attention solves the efficiency crisis by leveraging the Associative Property of matrix multiplication. In traditional Matrified Attention, the Softmax function acts as a "block" that forces us to compute the $(QK^\top)$ matrix first. Linear Attention replaces the Softmax with a kernel feature map $\phi$, which allows us to change the order of operations: instead of computing $(\phi(Q)\,\phi(K)^\top)\,V$, we can compute $\phi(Q)\,(\phi(K)^\top V)$, so the expensive $N \times N$ product is never formed.
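As a rough sketch of that reordering (assuming the $\phi(x) = \mathrm{elu}(x) + 1$ feature map from Katharopoulos et al., 2020, and illustrative sizes), we first build the small $d \times d$ summary $\phi(K)^\top V$, so the $N \times N$ matrix is never materialized:

```python
import numpy as np

def phi(x):
    """elu(x) + 1: a positive kernel feature map commonly used for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)                  # (N, d) each
    KV = Kf.T @ V                            # (d, d) summary of keys and values -- independent of N
    Z = Kf.sum(axis=0)                       # (d,) normalizer summary
    return (Qf @ KV) / (Qf @ Z)[:, None]     # (N, d); the N x N matrix is never formed

N, d = 4_000, 64                             # illustrative sizes
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)       # (4000, 64)
```

(This is the non-causal form; autoregressive variants keep running prefix sums of the same summaries, which is exactly the recurrent view RWKV builds on.)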
In the context of RWKV, this is implemented as a Time-Decayed Linear Attention. It processes tokens one by one, updating a "running summary" state. Mathematically, it replaces the explicit matrix multiplication with a weighted summation that updates over time:

$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i \;+\; e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \;+\; e^{u + k_t}}$$

where $k_i$ and $v_i$ are the key and value of token $i$, $w$ is a learned time decay, and $u$ is a "bonus" applied to the current token. This allows the model to achieve expressive power comparable to a Transformer while keeping the memory footprint fixed ($O(1)$ per token) and the total compute linear ($O(N)$) over the whole context.
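A minimal sketch of that recurrent view, under the same assumed feature map as above: the entire history is folded into a fixed-size state, so per-token memory never grows. RWKV adds the learned time decay $w$ and current-token bonus $u$ on top of this basic recurrence, which is exactly what the walkthrough below steps through:

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def recurrent_linear_attention(queries, keys, values):
    """Causal linear attention as an RNN: the whole past lives in a fixed (d x d) state."""
    d = keys.shape[-1]
    S = np.zeros((d, d))                 # running sum of phi(k_i) v_i^T
    z = np.zeros(d)                      # running sum of phi(k_i)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        S += np.outer(phi(k), v)         # O(d^2) work per token, regardless of history length
        z += phi(k)
        outputs.append(phi(q) @ S / (phi(q) @ z))
    return np.stack(outputs)

N, d = 8, 4                              # tiny illustrative sizes
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(recurrent_linear_attention(Q, K, V).shape)   # (8, 4)
```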
The WKV Engine: A Step-by-Step Numerical Walkthrough
To truly understand how RWKV replaces the $N \times N$ attention matrix with a linear stream, let's walk step by step through the sentence: "AI is cool".
Phase 1: Input Embeddings
For simplicity, we use 1D embeddings (single numbers), so each token contributes a scalar key $k_t$ and value $v_t$:
- Token 1 ("AI"): key $k_1$, value $v_1$
- Token 2 ("is"): key $k_2$, value $v_2$
- Token 3 ("cool"): key $k_3$, value $v_3$
Constants:
- Decay ($w$): determines how quickly information from the past loses importance.
- Bonus ($u$): an extra weight given to the current token (simplified).
Phase 2: The Recurrent Processing
RWKV maintains two running states: $a$ (the decayed sum of Weighted Values) and $b$ (the decayed sum of the Total Weights). At each step, both states are decayed by $e^{-w}$ and updated with the current token, while the output $wkv_t$ combines the state from before the update ($a_{t-1}$, $b_{t-1}$) with the bonus-weighted current token.
Step 1: Processing "AI" ($t = 1$), starting from an empty state ($a_0 = 0$, $b_0 = 0$)
- Update $a$: $a_1 = e^{-w} a_0 + e^{k_1} v_1 = e^{k_1} v_1$
- Update $b$: $b_1 = e^{-w} b_0 + e^{k_1} = e^{k_1}$
- Output 1 ($wkv_1$): $\dfrac{a_0 + e^{u+k_1} v_1}{b_0 + e^{u+k_1}} = v_1$ (with no past, the output is just the current value)
Step 2: Processing "is" ($t = 2$)
- Update $a$: $a_2 = e^{-w} a_1 + e^{k_2} v_2$
- Update $b$: $b_2 = e^{-w} b_1 + e^{k_2}$
- Output 2 ($wkv_2$): $\dfrac{a_1 + e^{u+k_2} v_2}{b_1 + e^{u+k_2}}$
Step 3: Processing "cool" ($t = 3$)
- Update $a$: $a_3 = e^{-w} a_2 + e^{k_3} v_3$
- Update $b$: $b_3 = e^{-w} b_2 + e^{k_3}$
- Output 3 ($wkv_3$): $\dfrac{a_2 + e^{u+k_3} v_3}{b_2 + e^{u+k_3}}$
Result: The final context embedding is a single number, $wkv_3$ (4.47 with the example values). This becomes the new context vector that is passed on to the rest of the network to predict the continuation of the sentence (e.g., "and it helps").
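The same recurrence as a runnable sketch. The 1-D keys, values, decay $w$, and bonus $u$ below are placeholder numbers chosen purely for illustration, so the printed outputs will not reproduce the 4.47 above:

```python
import math

def wkv_stream(ks, vs, w=0.5, u=0.3):
    """Scalar RWKV-style WKV recurrence: two running numbers replace the attention matrix."""
    a, b = 0.0, 0.0                                     # a: decayed sum of e^k * v, b: decayed sum of e^k
    outputs = []
    for k, v in zip(ks, vs):
        bonus = math.exp(u + k)
        outputs.append((a + bonus * v) / (b + bonus))   # output = previous state + bonus-weighted current token
        a = math.exp(-w) * a + math.exp(k) * v          # decay the past, fold in the current token
        b = math.exp(-w) * b + math.exp(k)
    return outputs

# Placeholder 1-D keys/values for "AI", "is", "cool" (illustrative, not the article's numbers).
keys, values = [0.4, 0.1, 0.9], [1.0, 2.0, 3.0]
for token, out in zip(["AI", "is", "cool"], wkv_stream(keys, values)):
    print(f"{token}: wkv = {out:.3f}")
```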
Comparison: RWKV (Linear) vs. Multi-Head Attention
| Feature | Multi-Head Attention (MHA) | RWKV (Linear Attention) |
|---|---|---|
| Complexity (Time) | $O(N^2)$ (Quadratic) | $O(N)$ (Linear) |
| Complexity (Memory) | Grows linearly with sequence length (KV-Cache) | $O(1)$ per step (Constant State) |
| Inference Scaling | Slower as sequence grows | Constant speed regardless of length |
| KV-Cache | Required (Grows with length) | Not Required (Fixed State) |
| Context Limit | Hard limit by VRAM | Theoretically Infinite |
| Hardware | Optimized for Matrix engines | Very friendly for CPUs and Edge devices |

Conclusion
- Fixed Memory: The model only ever needed to store two numbers ($a$ and $b$).
- No KV-Cache: Unlike Transformers, where the memory usage grows as the sentence gets longer, RWKV’s memory usage is constant.
- Linear Time: The amount of work done at Step 3 was identical to Step 1. The per-token cost is constant, so the total complexity is $O(N)$, making it far more scalable than traditional attention.
References
- Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762.
- Peng, B., et al. (2023). RWKV: Reinventing RNNs for the Transformer Era. arXiv:2305.13045.
- Katharopoulos, A., et al. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
- RWKV Open Source Foundation. Official RWKV Technical Wiki. Available at: https://wiki.rwkv.com/.