Ignito

[Launching LLM System Design ] Large Language Models: From Tokens to Optimization: How They Actually Work - Part 1

All the technical details you need to know...

Jul 22, 2025

Table of Contents

Tokens
Embeddings
Transformers
Variants
Optimization
LLMs

1. Tokens

What are Tokens?

Tokens are the fundamental units that language models use to process text. Think of tokens as the "words" that a computer understands - they can be actual words, parts of words (subwords), or even individual characters, depending on the tokenization strategy used.

Diagram 1: Text-to-Token Conversion Process

Input Text: "Hello world! How are you?"

Step 1: Raw Text Processing
"Hello world! How are you?" → ["Hello", " world", "!", " How", " are", " you", "?"]

Step 2: Token ID Mapping
["Hello", " world", "!", " How", " are", " you", "?"] → [15496, 995, 0, 1374, 389, 345, 30]

Step 3: Model Input
[15496, 995, 0, 1374, 389, 345, 30] → Neural Network Processing
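
The same three steps can be traced in code. A minimal sketch, assuming the Hugging Face transformers library and the "gpt2" checkpoint; the exact token IDs depend on the tokenizer used:

# Text -> token IDs -> model input, using Hugging Face Transformers (gpt2 assumed)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Steps 1-2: raw text -> token strings -> token IDs
text = "Hello world! How are you?"
print(tokenizer.tokenize(text))      # token strings
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])           # token IDs as a tensor

# Step 3: token IDs -> neural network processing
outputs = model(**inputs)
print(outputs.logits.shape)          # (batch, sequence_length, vocab_size)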

Diagram 2: Different Tokenization Strategies

Word-level Tokenization:
"running" → ["running"] (1 token)

Subword Tokenization (BPE):
"running" → ["run", "ning"] (2 tokens)

Character-level Tokenization:
"running" → ["r", "u", "n", "n", "i", "n", "g"] (7 tokens)

Byte-level Tokenization:
"running" → [114, 117, 110, 110, 105, 110, 103] (7 byte tokens)

Diagram 3: BPE (Byte Pair Encoding) Algorithm Working

Step 1: Initialize with characters
Vocabulary: {h, e, l, o, w, r, d}
Text: "hello world" → [h, e, l, l, o, w, o, r, l, d]

Step 2: Count adjacent pairs (within each word)
Pairs: {he: 1, el: 1, ll: 1, lo: 1, wo: 1, or: 1, rl: 1, ld: 1}
All pairs tie at count 1 in this toy example (real BPE counts pairs across the whole training corpus); pick "el" for illustration

Step 3: Merge the pair "e" + "l" into the new symbol "el"
New vocab: {h, el, l, o, w, r, d}
Text: "hello world" → [h, el, l, o, w, o, r, l, d]

Step 4: Repeat until desired vocab size
Final result: [hel, lo, wor, ld] (4 tokens)
Efficient balance between vocabulary size and sequence length
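
The merge loop above fits in a few lines. A toy sketch of the core BPE idea (count adjacent pairs, merge the most frequent, repeat); real tokenizers add pre-tokenization, word boundaries, and byte-level handling, and with every pair tied in this tiny example the exact merges depend on tie-breaking:

from collections import Counter

def bpe_merge(text, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    symbols = list(text)                                # start from characters
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))      # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]             # most frequent pair
        merged, i = [], 0
        while i < len(symbols):                         # replace every (a, b) with "ab"
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_merge("hello world", num_merges=4))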

Diagram 4: Token Vocabulary and OOV Handling

Standard Vocabulary (50K tokens):
┌─────────────────────────────────┐
│ Common words: "the", "and", "is"│
│ Subwords: "ing", "tion", "pre"  │
│ Rare words: "antidisestablish"  │
│ Special: [PAD], [UNK], [MASK]   │
└─────────────────────────────────┘

Out-of-Vocabulary (OOV) Word: "supercalifragilisticexpialidocious"
                ↓
Subword Decomposition Process:
"super" → Found in vocab ✓
"cal" → Found in vocab ✓
"ifrag" → Not found, break down further
"i" → Found ✓, "frag" → Found ✓
...continuing...

Final tokens: ["super", "cal", "i", "frag", "il", "istic", "exp", "ial", "id", "oci", "ous"]
A byte-level BPE tokenizer never produces an [UNK] token - any input is always decomposable into known pieces! (Word-level and some WordPiece vocabularies may still fall back to [UNK].)
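
You can watch this decomposition happen with any subword tokenizer. A quick check, assuming the gpt2 tokenizer; the exact pieces will differ from the illustration above because they depend on the learned merge table:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A word that is almost certainly not a single vocabulary entry
word = "supercalifragilisticexpialidocious"
pieces = tokenizer.tokenize(word)
print(pieces)                 # known subword pieces, no [UNK]
print(len(pieces), "tokens")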

Why Tokens are Used and Where

Why Tokens are Essential:

  • Computational Efficiency: Neural networks work with numbers, not raw text

  • Vocabulary Management: Handle millions of possible words with a fixed vocabulary size

  • Cross-lingual Support: Subword tokens can handle multiple languages efficiently

  • Out-of-vocabulary Handling: Break unknown words into known subword pieces


Where Tokens are Used:

  • Input Processing: Converting user queries into model-readable format

  • Training Data: All training text is tokenized before feeding to the model

  • Inference: Every prediction starts with tokenization

  • Multilingual Models: Handling diverse languages with shared token vocabularies

How to Use Tokens

Practical Implementation:

# Using Hugging Face Transformers
from transformers import AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize text
text = "Hello world! How are you?"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Get token-to-text mapping
token_text_pairs = [(token, tokenizer.decode([token])) 
                   for token in tokens]
print(f"Token mappings: {token_text_pairs}")

Best Practices:

  • Choose appropriate tokenization strategy based on your use case

  • Consider vocabulary size vs. sequence length trade-offs

  • Handle special tokens (padding, start/end markers) properly (a padding sketch follows this list)

  • Be aware of tokenization differences between models
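
As an example of the special-token point above, a minimal padding sketch, assuming the gpt2 tokenizer (which ships without a pad token, so one must be assigned before batching):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no pad token by default; reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

batch = ["Hello world!", "How are you doing today?"]
encoded = tokenizer(batch, padding=True, return_tensors="pt")

print(encoded["input_ids"])       # the shorter sequence is padded to the longest
print(encoded["attention_mask"])  # 0s mark the padded positions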


2. Embeddings

What are Embeddings?

Embeddings are dense vector representations that capture semantic meaning of tokens in a continuous mathematical space. They transform discrete tokens into continuous vectors that neural networks can effectively process and learn from.

Diagram 1: Token to Embedding Transformation

Token ID: 15496 ("Hello")
         ↓
Embedding Matrix (Vocab Size × Embedding Dim)
[15496] → Row 15496: [0.2, -0.1, 0.8, 0.3, -0.5, ..., 0.4]
         ↓
Dense Vector (512-dimensional)
Embedding: [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, -0.2, ..., 0.4]
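
The lookup is literally a row selection from a learned matrix. A minimal sketch, assuming PyTorch; the 50,257 × 512 shape is illustrative (GPT-2 itself uses 768-dimensional embeddings):

import torch
import torch.nn as nn

vocab_size, embed_dim = 50_257, 512    # illustrative sizes
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([15496])      # token ID for "Hello" from the example above
vector = embedding(token_ids)          # selects row 15496 of the embedding matrix
print(vector.shape)                    # torch.Size([1, 512])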

Diagram 2: Semantic Space Representation

2D Visualization (actual embeddings are 512+ dimensional):

    cat •     • kitten
        \   /
         pet
          |
    dog • | • puppy
        \|/
      animal

    car •     • vehicle
        \   /
      automobile
          |
   truck •|• bus
        \|/
     transport

Similar concepts cluster together in embedding space
Distance represents semantic similarity
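
"Distance" here is usually measured with cosine similarity between embedding vectors. A toy sketch with hand-made 2-D vectors, assuming NumPy; in practice you would compare real model embeddings with hundreds of dimensions:

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made 2-D vectors mimicking the clusters above (purely illustrative)
cat = np.array([0.9, 0.1])
dog = np.array([0.8, 0.2])
car = np.array([0.1, 0.9])

print(cosine_similarity(cat, dog))  # high: both sit in the "animal" cluster
print(cosine_similarity(cat, car))  # much lower: different clusters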
