Deep Learning

Demystifying Transformers: How LLMs Think

Beyond the hype: A technical breakdown of the Self-Attention mechanism, Positional Encodings, and why "Attention is All You Need".

Shubham Kulkarni, AI Researcher

I remember the exact moment I realized everything had changed. It was late 2022, and I asked ChatGPT to explain a piece of PyTorch code I'd been struggling with. It didn't just explain it — it rewrote it, added comments, and suggested an optimization I hadn't considered. The model behind that moment? A Transformer. And understanding how it works — truly understanding it — is, I believe, the most valuable thing an engineer can invest time in right now.

This article isn't a surface-level overview. I'm going to walk you through the Transformer architecture from first principles, starting with why we needed it, all the way to modern variants like Mamba that might replace it. I'll include the actual PyTorch code you can run yourself.

175B Parameters (GPT-3)
128K Context (GPT-4)
O(N²) Attention Complexity

1. The RNN Bottleneck: Why Transformers Were Inevitable

Before 2017, the dominant architectures for language tasks were the Recurrent Neural Network (RNN) and its more sophisticated cousin, the LSTM (Long Short-Term Memory). These models had a fatal flaw: they processed text strictly sequentially, one token at a time, like reading a book word by word without ever looking ahead.

This creates two problems:

  • Vanishing Gradients: In the sentence "The cat, which was already full from eating three cans of sardines and a bowl of cream while lounging on the windowsill... slept.", an RNN effectively "forgets" the word "cat" by the time it reaches "slept". The gradient signal decays exponentially with distance.
  • No Parallelism: You can't process word 100 until you've processed words 1–99. This makes training on large datasets agonizingly slow, because you can't leverage GPU parallelism.

The Key Insight

What if, instead of reading sequentially, a model could look at every word simultaneously and dynamically decide which words are important for understanding each other word? This is the core idea behind Self-Attention, and it's what makes Transformers fundamentally different from everything that came before.

2. Self-Attention: The Core Mechanism

Self-attention allows the model to look at every other position in the input sequence when encoding a particular word. It doesn't just look at neighbors — it looks at everything, in parallel.

Consider the sentence: "The animal didn't cross the street because it was too tired."

What does "it" refer to? As a human, you instantly know: the animal. But this requires understanding long-range context. Self-attention handles this by computing an attention weight between every pair of words. For the word "it", the model assigns high attention to "animal" and low attention to "street" — exactly the right behavior.

3. Query, Key, Value: The Database Analogy

The math of self-attention revolves around three learned projections: Query (Q), Key (K), and Value (V). The best analogy is a search engine:

  • Query (Q): "What am I looking for?" — Each word asks a question about what context it needs.
  • Key (K): "What do I contain?" — Each word advertises what information it carries.
  • Value (V): "Here's my actual content." — The information that gets retrieved when a match is found.

The attention score between two words is the dot product of word A's Query and word B's Key. If this score is high, word A will "pay attention to" word B, pulling in more of its Value. The formula:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

The √dₖ scaling factor prevents the dot products from growing too large (which would push the softmax into regions with tiny gradients). This is often overlooked, but it's critical for stable training.
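To see the effect, here's a toy illustration with made-up score values; pretend dₖ = 16, so scaling divides the scores by √dₖ = 4:

```python
import torch
import torch.nn.functional as F

# The same relative preferences at two magnitudes: dividing by √d_k
# keeps scores small, so the softmax stays soft and gradients flow.
large = torch.tensor([8.0, 4.0, 2.0, 0.0])   # unscaled-style scores
small = large / 4.0                           # after scaling by √d_k = 4

print(F.softmax(large, dim=0).max())   # ≈ 0.98 — nearly one-hot
print(F.softmax(small, dim=0).max())   # ≈ 0.58 — attention is spread out
```

With the unscaled scores, almost all the probability mass collapses onto one token; after scaling, the distribution stays spread out and every token receives a usable gradient.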

attention.py
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Computes Scaled Dot-Product Attention.
    Args:
        query: (batch, heads, seq_len, d_k)
        key:   (batch, heads, seq_len, d_k)
        value: (batch, heads, seq_len, d_v)
        mask:  Optional causal mask for autoregressive decoding
    """
    d_k = query.size(-1)
    
    # Step 1: Compute raw attention scores
    scores = torch.matmul(query, key.transpose(-2, -1))  # (batch, heads, seq, seq)
    
    # Step 2: Scale so the softmax doesn't saturate (keeps gradients healthy)
    scores = scores / math.sqrt(d_k)
    
    # Step 3: Apply causal mask (for GPT-style models)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    
    # Step 4: Softmax normalizes scores to probabilities
    attention_weights = F.softmax(scores, dim=-1)
    
    # Step 5: Weighted sum of values
    output = torch.matmul(attention_weights, value)
    
    return output, attention_weights  # Return weights for visualization
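As a quick sanity check, the same five steps can be run inline on random tensors with a causal mask built from torch.tril, confirming that each position attends only to itself and earlier positions:

```python
import math
import torch
import torch.nn.functional as F

batch, heads, seq, d_k = 1, 2, 4, 8
q = torch.randn(batch, heads, seq, d_k)
k = torch.randn(batch, heads, seq, d_k)
v = torch.randn(batch, heads, seq, d_k)

# Causal mask: position i may only attend to positions <= i
mask = torch.tril(torch.ones(seq, seq))

# The same five steps as scaled_dot_product_attention above
scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
out = weights @ v

print(out.shape)      # torch.Size([1, 2, 4, 8])
print(weights[0, 0])  # upper triangle is exactly zero, rows sum to 1
```

Because the masked positions are set to -inf before the softmax, their exp() is exactly 0 — future tokens contribute nothing, which is what makes GPT-style left-to-right generation possible.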

4. Multi-Head Attention: Seeing Multiple Perspectives

A single attention head can only capture one type of relationship. But language is multi-dimensional — you might need one head to track syntactic structure (subject-verb agreement) and another to track semantic meaning (coreference resolution).

Multi-Head Attention solves this by running h parallel attention heads, each with its own Q, K, V projections. The outputs are concatenated and projected back to the model dimension:

multi_head_attention.py
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 64 per head
        
        # Learned projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # Output projection
    
    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
        
        # Project and reshape into multiple heads
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        
        # Run attention on all heads in parallel
        attn_output, weights = scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads and project
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
        return self.W_o(attn_output)
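PyTorch ships its own implementation of this module, nn.MultiheadAttention, which is handy for cross-checking shapes against the class above. Passing the same tensor as query, key, and value gives you self-attention:

```python
import torch
import torch.nn as nn

# PyTorch's built-in equivalent of the MultiHeadAttention module above
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out, weights = mha(x, x, x)   # self-attention: Q = K = V = x

print(out.shape)      # torch.Size([2, 10, 512])
print(weights.shape)  # torch.Size([2, 10, 10]) — averaged over heads by default
```

Note that the built-in returns attention weights averaged across heads by default; pass average_attn_weights=False if you want per-head maps for visualization.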

What Do Different Heads Learn?

Research has shown that different attention heads specialize naturally during training. In GPT-2, certain heads consistently track positional patterns ("attend to the previous word"), while others track semantic relationships ("attend to the subject of this verb"). This emergent specialization is one of the most fascinating properties of multi-head attention.


5. Positional Encoding: Teaching Order to a Parallel Model

Here's a subtle but critical problem: since attention processes all words simultaneously, it has no inherent sense of word order. "The cat sat on the mat" and "The mat sat on the cat" would produce identical attention patterns!

The solution is Positional Encoding — injecting a unique "position signal" into each word embedding before it enters the Transformer. The original paper uses sinusoidal functions at different frequencies:

positional_encoding.py
import numpy as np

def positional_encoding(max_seq_len, d_model):
    """
    Generates sinusoidal positional encoding.
    Each position gets a unique pattern across dimensions,
    similar to how binary numbers encode position.
    """
    pe = np.zeros((max_seq_len, d_model))
    
    for pos in range(max_seq_len):
        for i in range(0, d_model, 2):
            wavelength = 10000 ** (i / d_model)
            pe[pos, i]     = np.sin(pos / wavelength)  # Even dimensions
            pe[pos, i + 1] = np.cos(pos / wavelength)  # Odd dimensions
    
    return pe  # Shape: (max_seq_len, d_model)

# Why sinusoidal? Because PE[pos+k] can be expressed as a
# linear function of PE[pos], letting the model learn
# RELATIVE positions, not just absolute ones.
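The double loop above is clear but slow; a vectorized NumPy equivalent (same formula, same shapes) doubles as a sanity check on the encoding's properties:

```python
import numpy as np

def positional_encoding_vec(max_seq_len, d_model):
    # Vectorized version of the loop-based function above
    pos = np.arange(max_seq_len)[:, None]    # (seq, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

pe = positional_encoding_vec(50, 128)
print(pe.shape)              # (50, 128)
print(np.abs(pe).max())      # <= 1.0: bounded, so it adds cleanly to embeddings
```

Two properties matter in practice: every position gets a distinct vector, and all values stay in [-1, 1], so the encoding can simply be added to the token embeddings without swamping them.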

6. The Full Architecture: Encoder vs Decoder

The original 2017 Transformer has both an Encoder (processes input) and a Decoder (generates output). But the models you use daily have diverged:

  • BERT = Encoder only — reads the entire input bidirectionally. Great for understanding tasks (classification, NER).
  • GPT = Decoder only — generates text left-to-right, one token at a time. Great for generation.
  • T5 / BART = Full Encoder-Decoder — maps input sequence to output sequence. Great for translation, summarization.
flowchart TD
    Input["📝 Input Tokens"] --> Embed["Token Embedding\nd_model = 512"]
    Embed --> PE["➕ Positional Encoding\nsin/cos waves"]
    subgraph Block ["🔁 Transformer Block × N"]
        MHA["🧠 Multi-Head\nSelf-Attention\nh=8 heads"]
        AN1["Add and LayerNorm"]
        FFN["⚡ Feed-Forward\n2048 → 512"]
        AN2["Add and LayerNorm"]
        MHA --> AN1 --> FFN --> AN2
    end
    PE --> Block
    AN2 --> LN["Final LayerNorm"]
    LN --> Linear["Linear Projection\nvocab_size"]
    Linear --> SM["Softmax"]
    SM --> Output["🎯 Next Token\nProbabilities"]
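The repeating block in that diagram can be sketched directly in PyTorch. This is a minimal post-norm variant (attention and feed-forward, each wrapped in a residual connection and LayerNorm), built on the library's nn.MultiheadAttention rather than the custom module above so it stays self-contained; d_ff = 2048 matches the diagram:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One Transformer block: self-attention + FFN, each followed
    by a residual connection and LayerNorm (post-norm layout)."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: 512 -> 2048
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # contract: 2048 -> 512
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)     # Add & LayerNorm
        x = self.norm2(x + self.ffn(x))  # Add & LayerNorm
        return x

block = TransformerBlock()
x = torch.randn(2, 16, 512)
print(block(x).shape)   # torch.Size([2, 16, 512])
```

Because the output shape equals the input shape, N of these blocks can be stacked back-to-back, which is exactly the "× N" in the diagram. (Most modern GPT-style models move the LayerNorm before each sublayer, the "pre-norm" layout, for more stable training.)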

7. Training vs Inference: Two Very Different Games

One thing that tripped me up initially: training and inference work completely differently in Transformers.

Training
  • Process the entire sequence in parallel
  • Use teacher forcing: feed the real answer
  • Loss on all tokens simultaneously
  • GPU utilization: ~95%

Inference
  • Generate one token at a time
  • Use a KV Cache to avoid recomputation
  • Autoregressive: each token depends on all previous tokens
  • GPU utilization: ~30% (memory-bound)

This efficiency gap is why inference optimization (quantization, speculative decoding, KV-cache compression) is one of the hottest research areas in 2026. Training a model is expensive, but serving it to millions of users is where the real cost lives.
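To make the KV-cache idea concrete, here's a minimal single-head sketch (toy weight matrices, no batching, names like decode_step are my own): at each decoding step, the new token's Key and Value are appended to the cache, so attention over the prefix never recomputes old projections.

```python
import math
import torch
import torch.nn.functional as F

d_k = 64
W_q = torch.randn(d_k, d_k) / math.sqrt(d_k)  # toy projection weights
W_k = torch.randn(d_k, d_k) / math.sqrt(d_k)
W_v = torch.randn(d_k, d_k) / math.sqrt(d_k)

cache_k, cache_v = [], []

def decode_step(x_t):
    """x_t: (d_k,) embedding of the newest token."""
    q = x_t @ W_q
    cache_k.append(x_t @ W_k)   # computed once, then cached forever
    cache_v.append(x_t @ W_v)
    K = torch.stack(cache_k)    # (t, d_k) — the whole prefix
    V = torch.stack(cache_v)
    w = F.softmax(K @ q / math.sqrt(d_k), dim=0)
    return w @ V                # (d_k,) attention output for this step

for t in range(5):
    out = decode_step(torch.randn(d_k))

print(len(cache_k))  # 5 — one cached K/V pair per generated token
```

The trade-off is memory: the cache grows linearly with the sequence, which is why KV-cache compression shows up alongside quantization and speculative decoding in the inference-optimization toolbox.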

8. Beyond Transformers: What Comes Next?

The elephant in the room is O(N²) complexity. Doubling the context window quadruples the compute. This is why GPT-4's 128K context is incredibly expensive to run. Several approaches are competing to solve this:

  • Mamba (State Space Models): Instead of computing attention between all pairs of tokens, SSMs process sequences in linear time using a learned state transition. Mamba has matched similarly sized Transformers on language modeling at a fraction of the inference compute.
  • Ring Attention: Distributes the attention computation across multiple GPUs in a ring topology, enabling 1M+ token context windows on existing hardware.
  • Mixture of Experts (MoE): Models like Mixtral activate only a subset of parameters per token, achieving massive model capacity with manageable compute budgets.
  • RWKV: A hybrid RNN-Transformer that can be trained like a Transformer (parallelized) but runs inference like an RNN (linear time). Best of both worlds.
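To make the linear-time claim above concrete, here's a toy state-space recurrence (not Mamba itself — Mamba makes the transition input-dependent and learned; these matrices are made up): a fixed-size state is updated once per token, so a length-N sequence costs O(N) instead of O(N²).

```python
import torch

# Toy linear SSM: h_t = A·h_{t-1} + B·x_t,  y_t = C·h_t
d_state, d_in, seq_len = 16, 8, 100
A = torch.eye(d_state) * 0.9          # fixed decay (learned in real SSMs)
B = torch.randn(d_state, d_in) * 0.1
C = torch.randn(d_in, d_state) * 0.1

h = torch.zeros(d_state)              # constant-size state, regardless of N
ys = []
for x_t in torch.randn(seq_len, d_in):
    h = A @ h + B @ x_t               # one O(1) update per token
    ys.append(C @ h)

y = torch.stack(ys)
print(y.shape)   # torch.Size([100, 8])
```

Doubling seq_len here doubles the work and leaves the state size unchanged — the opposite of attention, where doubling the context quadruples the score matrix.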

Key Takeaways

  • Attention is not magic — it's a weighted average with learned projections. Once you internalize Q, K, V, everything else follows.
  • Multi-head attention is embarrassingly parallel — this is why GPUs are perfect for Transformers, and why NVIDIA's stock price 10x'd.
  • Positional encoding is the unsung hero — without it, your model literally can't distinguish "dog bites man" from "man bites dog".
  • The future is hybrid — pure Transformers will likely be replaced by architectures that combine attention with linear-time SSMs.