Toward a Deeper Understanding: How Attention Types Help Us (Traditional vs. Linear)
Introduction
In the world of Artificial Intelligence, “Attention” is the mechanism that allows models to focus on the most relevant parts of an input. Whether it’s a chatbot answering a query or an AI analyzing a medical report, the model must decide which words or data points matter most.
However, as we push AI to solve deeper, more complex problems—like analyzing entire libraries of books or thousands of lines of code—the way we calculate attention becomes a make-or-break factor. To understand why we are moving toward Linear Attention (RWKV), we first need to understand the limitations of the Traditional Attention that powers models like GPT.
Limitations of Matrified Attention
The fundamental flaw of traditional "Matrified" Attention is its Quadratic Complexity ($O(N^2)$). In a standard Transformer, the attention mechanism is calculated using the formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Mathematically, the term $QK^\top$ represents a dot-product between every Query ($q_i$) and every Key ($k_j$). If your input has $N$ tokens, this operation results in an $N \times N$ matrix. As $N$ grows, the number of individual calculations and the memory required to store this matrix explode quadratically.
For example, doubling the input length (from 2,000 to 4,000 tokens) doesn’t just double the workload—it quadruples it. This leads to the “Memory Wall”: even high-end GPUs eventually run out of VRAM because the KV-Cache (the memory stored to remember previous tokens) grows linearly with every new word generated. Consequently, Matrified Attention becomes a bottleneck for long-form tasks like analyzing legal documents, entire code repositories, or genomic sequences, where the “Attention Matrix” simply becomes too massive to compute or store efficiently.
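To make the scaling concrete, here is a minimal NumPy sketch of the formula above (the sequence length and head dimension are illustrative values, not from any particular model). The `scores` array is exactly the $N \times N$ attention matrix that causes the blow-up:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (N, N) -- quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (N, d)

N, d = 4_000, 64                                      # illustrative sizes
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(naive_attention(Q, K, V).shape)                 # (4000, 64)
# The (N, N) score matrix alone holds 4,000^2 = 16,000,000 floats;
# doubling N to 8,000 quadruples that to 64,000,000 -- the quadratic wall.
```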
How Linear Attention Solves It!
Linear Attention solves the efficiency crisis by leveraging the Associative Property of matrix multiplication. In traditional Matrified Attention, the Softmax function acts as a "block" that forces us to compute the $(QK^\top)$ matrix first. Linear Attention replaces the Softmax with a kernel feature map $\phi$, which allows us to change the order of operations: instead of computing $(\phi(Q)\,\phi(K)^\top)\,V$, we can compute $\phi(Q)\,(\phi(K)^\top V)$, so the expensive $N \times N$ product is never formed.
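As a rough sketch of that reordering (assuming the $\phi(x) = \mathrm{elu}(x) + 1$ feature map from Katharopoulos et al., 2020, and illustrative sizes), we first build the small $d \times d$ summary $\phi(K)^\top V$, so the $N \times N$ matrix is never materialized:

```python
import numpy as np

def phi(x):
    """elu(x) + 1: a positive kernel feature map commonly used for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)                  # (N, d) each
    KV = Kf.T @ V                            # (d, d) summary of keys and values -- independent of N
    Z = Kf.sum(axis=0)                       # (d,) normalizer summary
    return (Qf @ KV) / (Qf @ Z)[:, None]     # (N, d); the N x N matrix is never formed

N, d = 4_000, 64                             # illustrative sizes
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)       # (4000, 64)
```

(This is the non-causal form; autoregressive variants keep running prefix sums of the same summaries, which is exactly the recurrent view RWKV builds on.)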
In the context of RWKV, this is implemented as a Time-Decayed Linear Attention. It processes tokens one by one, updating a "running summary" state. Mathematically, it replaces the explicit matrix multiplication with a weighted summation that updates over time:

$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i \;+\; e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \;+\; e^{u + k_t}}$$

where $k_i$ and $v_i$ are the key and value of token $i$, $w$ is a learned time decay, and $u$ is a "bonus" applied to the current token. This allows the model to achieve expressive power comparable to a Transformer while keeping the memory footprint fixed ($O(1)$ per token) and the total compute linear ($O(N)$) over the whole context.
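A minimal sketch of that recurrent view, under the same assumed feature map as above: the entire history is folded into a fixed-size state, so per-token memory never grows. RWKV adds the learned time decay $w$ and current-token bonus $u$ on top of this basic recurrence, which is exactly what the walkthrough below steps through:

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def recurrent_linear_attention(queries, keys, values):
    """Causal linear attention as an RNN: the whole past lives in a fixed (d x d) state."""
    d = keys.shape[-1]
    S = np.zeros((d, d))                 # running sum of phi(k_i) v_i^T
    z = np.zeros(d)                      # running sum of phi(k_i)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        S += np.outer(phi(k), v)         # O(d^2) work per token, regardless of history length
        z += phi(k)
        outputs.append(phi(q) @ S / (phi(q) @ z))
    return np.stack(outputs)

N, d = 8, 4                              # tiny illustrative sizes
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(recurrent_linear_attention(Q, K, V).shape)   # (8, 4)
```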
The WKV Engine: A Step-by-Step Numerical Walkthrough
To truly understand how RWKV replaces the $N \times N$ attention matrix with a linear stream, let's walk step by step through the sentence: "AI is cool".
Phase 1: Input Embeddings
For simplicity, we use 1D embeddings (single numbers), so each token contributes a scalar key $k_t$ and value $v_t$:
- Token 1 ("AI"): key $k_1$, value $v_1$
- Token 2 ("is"): key $k_2$, value $v_2$
- Token 3 ("cool"): key $k_3$, value $v_3$
Constants:
- Decay ($w$): determines how quickly information from the past loses importance.
- Bonus ($u$): an extra weight given to the current token (simplified).
Phase 2: The Recurrent Processing
RWKV maintains two running states: $a$ (the decayed sum of Weighted Values) and $b$ (the decayed sum of the Total Weights). At each step, both states are decayed by $e^{-w}$ and updated with the current token, while the output $wkv_t$ combines the state from before the update ($a_{t-1}$, $b_{t-1}$) with the bonus-weighted current token.
Step 1: Processing "AI" ($t = 1$), starting from an empty state ($a_0 = 0$, $b_0 = 0$)
- Update $a$: $a_1 = e^{-w} a_0 + e^{k_1} v_1 = e^{k_1} v_1$
- Update $b$: $b_1 = e^{-w} b_0 + e^{k_1} = e^{k_1}$
- Output 1 ($wkv_1$): $\dfrac{a_0 + e^{u+k_1} v_1}{b_0 + e^{u+k_1}} = v_1$ (with no past, the output is just the current value)
Step 2: Processing "is" ($t = 2$)
- Update $a$: $a_2 = e^{-w} a_1 + e^{k_2} v_2$
- Update $b$: $b_2 = e^{-w} b_1 + e^{k_2}$
- Output 2 ($wkv_2$): $\dfrac{a_1 + e^{u+k_2} v_2}{b_1 + e^{u+k_2}}$
Step 3: Processing "cool" ($t = 3$)
- Update $a$: $a_3 = e^{-w} a_2 + e^{k_3} v_3$
- Update $b$: $b_3 = e^{-w} b_2 + e^{k_3}$
- Output 3 ($wkv_3$): $\dfrac{a_2 + e^{u+k_3} v_3}{b_2 + e^{u+k_3}}$
Result: The final context embedding is a single number, $wkv_3$ (4.47 with the example values). This becomes the new context vector that is passed on to the rest of the network to predict the continuation of the sentence (e.g., "and it helps").
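The same recurrence as a runnable sketch. The 1-D keys, values, decay $w$, and bonus $u$ below are placeholder numbers chosen purely for illustration, so the printed outputs will not reproduce the 4.47 above:

```python
import math

def wkv_stream(ks, vs, w=0.5, u=0.3):
    """Scalar RWKV-style WKV recurrence: two running numbers replace the attention matrix."""
    a, b = 0.0, 0.0                                     # a: decayed sum of e^k * v, b: decayed sum of e^k
    outputs = []
    for k, v in zip(ks, vs):
        bonus = math.exp(u + k)
        outputs.append((a + bonus * v) / (b + bonus))   # output = previous state + bonus-weighted current token
        a = math.exp(-w) * a + math.exp(k) * v          # decay the past, fold in the current token
        b = math.exp(-w) * b + math.exp(k)
    return outputs

# Placeholder 1-D keys/values for "AI", "is", "cool" (illustrative, not the article's numbers).
keys, values = [0.4, 0.1, 0.9], [1.0, 2.0, 3.0]
for token, out in zip(["AI", "is", "cool"], wkv_stream(keys, values)):
    print(f"{token}: wkv = {out:.3f}")
```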
Comparison: RWKV (Linear) vs. Multi-Head Attention
| Feature | Multi-Head Attention (MHA) | RWKV (Linear Attention) |
|---|---|---|
| Complexity (Time) | $O(N^2)$ (Quadratic) | $O(N)$ (Linear) |
| Complexity (Memory) | Grows linearly with sequence length (KV-Cache) | $O(1)$ per step (Constant State) |
| Inference Scaling | Slower as sequence grows | Constant speed regardless of length |
| KV-Cache | Required (Grows with length) | Not Required (Fixed State) |
| Context Limit | Hard limit by VRAM | Theoretically Infinite |
| Hardware | Optimized for Matrix engines | Very friendly for CPUs and Edge devices |

Conclusion
- Fixed Memory: The model only ever needed to store two numbers ($a$ and $b$).
- No KV-Cache: Unlike Transformers, where the memory usage grows as the sentence gets longer, RWKV’s memory usage is constant.
- Linear Time: The amount of work done at Step 3 was identical to Step 1. The per-token cost is constant, so the total complexity is $O(N)$, making it far more scalable than traditional attention.
References
- Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762.
- Peng, B., et al. (2023). RWKV: Reinventing RNNs for the Transformer Era. arXiv:2305.13045.
- Katharopoulos, A., et al. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
- RWKV Open Source Foundation. Official RWKV Technical Wiki. Available at: https://wiki.rwkv.com/.