# Development History

## Technology Development Timeline
```mermaid
timeline
    title Vector Database Technology Evolution
    section Germination Period
        1970s-1990s : NNS research begins
            : Naive nearest neighbor search
    section Index Algorithm Breakthrough
        2003 : KD-Tree optimization
            : Space-partitioning index
        2006 : LSH (Locality-Sensitive Hashing)
            : Random projection method
        2011 : PQ (Product Quantization)
            : Lossy compression index
    section Modern Vector Databases
        2017 : FAISS released
            : Facebook open source
        2019 : HNSW algorithm matures
            : Hierarchical Navigable Small World
        2020 : Pinecone cloud service
            : Managed vector database
    section Outbreak Period
        2021-2022 : Milvus, Qdrant
            : Open-source databases emerge
        2023-2024 : LLM-driven
            : RAG application explosion
```
## 1. Germination Period: From Academic Research to Concept Formation

### What is Nearest Neighbor Search?
Nearest Neighbor Search (NNS) is the core problem of vector databases: given a query vector, find the most similar vector in a dataset.
The earliest research dates back to the 1970s, when computer scientists began studying how to efficiently solve the problem of "finding the closest point among N points".
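The definition above translates directly into a brute-force scan, which is exactly what the early methods tried to beat. A minimal sketch (pure Python, toy 2-D data):

```python
import math

def nearest_neighbor(query, dataset):
    """Exact (naive) nearest neighbor: scan every point - O(N * d)."""
    best_idx, best_dist = -1, math.inf
    for i, point in enumerate(dataset):
        # Euclidean (L2) distance between the query and this point
        dist = math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (5.0, 5.0)]
idx, dist = nearest_neighbor((1.1, 0.9), points)
print(idx, round(dist, 3))  # point at index 1 is closest
```

The linear scan is exact but touches every point, which is why all later index structures in this chapter exist.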
### Limitations of Early Methods

Early methods such as KD-Tree suffer from the "curse of dimensionality": once data has more than roughly 20 dimensions, retrieval efficiency drops dramatically.
```python
# Curse of dimensionality illustration (numbers are qualitative, not measured)
dimensions = [2, 10, 50, 100, 500, 1000]
efficiency = [1.0, 0.8, 0.3, 0.1, 0.01, 0.001]  # relative retrieval efficiency
# Higher dimensions = sparser data distribution,
# causing "all points to be approximately equidistant"
```
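The "approximately equidistant" effect can be simulated directly with the standard library (a sketch; exact numbers vary with the random seed):

```python
import math
import random

def distance_spread(dim, n_points=200, seed=0):
    """Contrast between the farthest and nearest random point from the origin.
    As dim grows this contrast shrinks: 'near' and 'far' become hard to tell apart."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return (max(dists) - min(dists)) / min(dists)

for dim in [2, 10, 100, 1000]:
    print(dim, round(distance_spread(dim), 3))  # spread shrinks as dim grows
```

In low dimensions the nearest point is much closer than the farthest; in high dimensions the ratio collapses toward zero, which is what defeats space-partitioning trees like KD-Tree.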
## 2. Index Algorithm Breakthrough Period

### 1. Locality-Sensitive Hashing (LSH) - 2003-2006
Locality-Sensitive Hashing (LSH) was the first algorithm that could effectively perform approximate nearest neighbor search on large-scale high-dimensional data.
Core Idea: Similar vectors are mapped to the same "bucket" with high probability after hashing.
```mermaid
graph LR
    subgraph LSH["LSH Working Principle"]
        A["Vector A (similar)"] -->|Hash function| C["Bucket 1"]
        B["Vector B (similar)"] -->|Hash function| C
        D["Vector C (not similar)"] -->|Hash function| E["Bucket 2"]
        C --> F["Precise search within bucket"]
    end
    style A fill:#b3e5fc
    style B fill:#b3e5fc
    style D fill:#ffccbc
    style C fill:#c8e6c9
```
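One concrete LSH family is random-hyperplane hashing for cosine similarity: each signature bit records which side of a random hyperplane the vector falls on, so nearby vectors share most bits. A minimal sketch (pure Python; the vectors and parameters are illustrative):

```python
import random

def make_signature(dim, n_planes, seed=0):
    """Random-hyperplane LSH: each bit is the side of one random hyperplane."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    def signature(v):
        return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0)
                     for p in planes)
    return signature

def agreement(s, t):
    """Matching signature bits; more matches = more likely to share a bucket."""
    return sum(x == y for x, y in zip(s, t))

sig = make_signature(dim=4, n_planes=16)
a = [0.9, 0.1, 0.2, 0.0]     # similar to b
b = [0.8, 0.2, 0.1, 0.1]
c = [-0.9, -0.1, 0.3, -0.5]  # roughly opposite direction

# Similar vectors agree on far more hyperplane sides than dissimilar ones
print(agreement(sig(a), sig(b)), agreement(sig(a), sig(c)))
```

In a real LSH index the signature (or a prefix of it) is the bucket key, and only vectors in the query's bucket are compared exactly, as in the diagram above.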
LSH Advantages and Disadvantages:
| Advantages | Disadvantages |
|---|---|
| Theoretically mature | Number of hash buckets needs to be preset |
| Works on high-dimensional data | High memory usage |
| Fast query speed | Limited precision |
### 2. Product Quantization (PQ) - 2011

Product Quantization (PQ), proposed by researchers at INRIA in France, is a lossy compression method that packs high-dimensional vectors into much smaller storage.

Core Idea:

1. Divide the vector into multiple sub-vectors
2. Cluster each sub-vector space (e.g. with k-means)
3. Store cluster-center IDs instead of the original sub-vectors
```python
# PQ compression example
original_vector = [0.23, -0.45, 0.89, -0.12, 0.56, -0.78]  # 6-dimensional original vector

# After PQ compression: represented by cluster-center IDs.
# Divided into 3 groups of 2 dimensions each:
# [0.23, -0.45] -> cluster-center ID: 42
# [0.89, -0.12] -> cluster-center ID: 17
# [0.56, -0.78] -> cluster-center ID: 89
compressed = [42, 17, 89]  # only 3 small integers are stored

# Compression: 6 x 4-byte floats (24 bytes) -> 3 x 1-byte codes (3 bytes) = 8x
```
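The three steps can be made runnable in miniature. Here the two-entry codebooks are hand-made for illustration; real PQ learns a much larger codebook per subspace with k-means:

```python
# Hypothetical 2-entry codebook per 2-D subspace (real PQ learns these)
codebooks = [
    [(0.2, -0.4), (0.9, 0.1)],   # sub-space 0
    [(-0.3, 0.6), (0.8, -0.1)],  # sub-space 1
    [(0.0, 0.0), (0.5, -0.8)],   # sub-space 2
]

def pq_encode(vector, codebooks, sub_dim=2):
    """Replace each sub-vector by the id of its nearest codebook entry."""
    codes = []
    for m, book in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        dists = [sum((s - c) ** 2 for s, c in zip(sub, center)) for center in book]
        codes.append(dists.index(min(dists)))
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    out = []
    for m, code in enumerate(codes):
        out.extend(codebooks[m][code])
    return out

vec = [0.23, -0.45, 0.89, -0.12, 0.56, -0.78]
codes = pq_encode(vec, codebooks)
print(codes)                        # → [0, 1, 1]: 3 small integers, not 6 floats
print(pq_decode(codes, codebooks))  # lossy approximation of vec
```

Decoding never recovers the exact vector, only the nearest centroids, which is why PQ is a lossy index.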
## 3. Deep Learning Driven Period

### Rise of Word Embeddings

In 2013, Google released the Word2Vec model, the first widely adopted technique for converting words into dense vectors. This "embedding" technology enabled machines to capture semantic relationships between words.
```python
# Word2Vec example: word-to-vector mapping (toy 4-dimensional vectors)
word_vectors = {
    "king": [0.5, 0.3, -0.2, 0.8],
    "queen": [0.4, 0.2, -0.1, 0.9],
    "man": [0.6, -0.1, 0.4, 0.2],
    "woman": [0.3, 0.1, 0.5, 0.3]
}
# king - man + woman ≈ queen (vector arithmetic!)
# This suggests the vectors encode semantic information
```
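The king - man + woman ≈ queen claim can be checked numerically with cosine similarity, reusing the toy vectors above (a sketch; real Word2Vec vectors have hundreds of dimensions):

```python
import math

word_vectors = {
    "king": [0.5, 0.3, -0.2, 0.8],
    "queen": [0.4, 0.2, -0.1, 0.9],
    "man": [0.6, -0.1, 0.4, 0.2],
    "woman": [0.3, 0.1, 0.5, 0.3],
}

def cosine(u, v):
    """Cosine similarity: angle-based closeness of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, component-wise
target = [k - m + w for k, m, w in zip(word_vectors["king"],
                                       word_vectors["man"],
                                       word_vectors["woman"])]

best = max(word_vectors, key=lambda w: cosine(word_vectors[w], target))
print(best)  # → queen
```

Among the four toy words, "queen" really is the nearest to the computed vector, which is the analogy property the original Word2Vec paper reported at scale.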
### Transformer and Vector Representations
In 2017, the Transformer architecture was proposed, subsequently producing large language models like BERT and GPT. These models can generate high-quality context-dependent vector representations, greatly promoting the development of vector databases.
## 4. Modern Vector Database Birth Period

### 1. FAISS - 2017
FAISS (Facebook AI Similarity Search) was open-sourced by Facebook (now Meta) and is the first industrial-grade vector retrieval library.
```python
# FAISS basic usage example
import faiss
import numpy as np

# Create 10,000 random 128-dimensional vectors as the dataset
dimension = 128
vectors = np.random.random((10000, dimension)).astype('float32')

# Build an exact L2-distance index
index = faiss.IndexFlatL2(dimension)
index.add(vectors)

# Query: find the 5 nearest neighbors
query = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query, k=5)
print(f"Nearest neighbor indices: {indices}")
print(f"Distances: {distances}")
```
FAISS Contributions:

- Proved the feasibility of industrial-grade vector retrieval
- Provided multiple index algorithms (IVF, PQ, HNSW, etc.)
- Open-sourcing drove development of the entire field
### 2. HNSW Algorithm Maturation - 2019

HNSW (Hierarchical Navigable Small World), proposed by the Russian researchers Yury Malkov and Dmitry Yashunin, is currently one of the most mainstream vector index algorithms.
Core Idea: Build a multi-layer graph structure with sparse upper layers and dense lower layers, achieving fast search through "highways".
```mermaid
graph TB
    subgraph Layer2["Layer 2 - Upper (Sparse)"]
        L2_A((A)) --> L2_C((C))
        L2_B((B)) --> L2_C
        L2_C --> L2_E((E))
    end
    subgraph Layer1["Layer 1 - Middle"]
        L1_A((A)) --> L1_B((B))
        L1_B --> L1_C((C))
        L1_C --> L1_D((D))
        L1_D --> L1_E((E))
    end
    subgraph Layer0["Layer 0 - Bottom (Dense)"]
        L0_A((A)) --> L0_B((B))
        L0_B --> L0_C((C))
        L0_C --> L0_D((D))
        L0_D --> L0_E((E))
        L0_B --> L0_E
        L0_A --> L0_D
    end
    style L2_A fill:#bbdefb
    style L2_B fill:#bbdefb
    style L2_C fill:#bbdefb
    style L2_E fill:#bbdefb
    style Layer0 fill:#fff3e0,color:#000
    style Layer1 fill:#e8f5e9,color:#000
    style Layer2 fill:#fce4ec,color:#000
```
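The "highway" idea is greedy descent: start at an entry point on the sparse top layer, walk to whichever neighbor is closest to the query until no neighbor improves, then continue from that node one layer down. A toy 1-D sketch mirroring the diagram (real HNSW also keeps a candidate list controlled by an ef parameter):

```python
# Toy layered graph over 1-D points; node id -> coordinate
coords = {"A": 0.0, "B": 2.0, "C": 5.0, "D": 7.0, "E": 9.0}

# Adjacency per layer: sparse on top, dense at the bottom
layers = [
    {"A": ["C"], "B": ["C"], "C": ["A", "B", "E"], "E": ["C"]},                   # layer 2
    {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]},  # layer 1
    {"A": ["B", "D"], "B": ["A", "C", "E"], "C": ["B", "D"],
     "D": ["A", "C", "E"], "E": ["B", "D"]},                                      # layer 0
]

def greedy_layered_search(query, layers, entry="A"):
    """Greedy descent: in each layer, move to the neighbor closest to the
    query until stuck, then reuse that node as the next layer's entry point."""
    current = entry
    for graph in layers:
        improved = True
        while improved:
            improved = False
            for nb in graph.get(current, []):
                if abs(coords[nb] - query) < abs(coords[current] - query):
                    current, improved = nb, True
    return current

print(greedy_layered_search(6.5, layers))  # → D (coordinate 7.0)
```

The sparse top layer lets the search cross the whole space in a couple of hops; the dense bottom layer then refines the answer locally.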
HNSW Advantages:

- Extremely fast queries (millisecond level)
- High precision (recall close to 100% with suitable settings)
- Only a few tuning parameters (M, efConstruction, efSearch), with sensible defaults
## 5. Cloud Services and Open Source Co-Existence Period

### 2020-2022: Rise of Cloud-Native Vector Databases
| Product | Company | Features |
|---|---|---|
| Pinecone | Pinecone | Fully managed, cloud-native |
| Milvus | Zilliz | Open source, cloud-native |
| Qdrant | Qdrant team | Open source, Rust implementation |
| Weaviate | SeMI Technologies | Hybrid search |
| Chroma | Chroma | Lightweight, LLM-oriented |
### 2023-2024: Explosion in the LLM Era
The success of ChatGPT demonstrated the powerful capabilities of LLMs, and also fueled the RAG (Retrieval-Augmented Generation) architecture. Vector databases, as core components of RAG, experienced explosive growth.
```mermaid
flowchart LR
    A[Large Language Models] -->|Drive| B[RAG Applications]
    B -->|Depend on| C[Vector Databases]
    C -->|Massive Demand| D[Market Explosion]
    D -->|Capital Influx| E[Product Maturation]
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e8
    style D fill:#ffccbc
    style E fill:#c8e6c9
```
## 6. Summary of Technology Evolution Patterns

### Evolution Drivers
```mermaid
mindmap
  root((Vector Database Evolution Drivers))
    Algorithm Breakthroughs
      KD-Tree
      LSH
      HNSW
      PQ
    Hardware Advances
      GPU Acceleration
      Large Memory
      SSD Popularization
    Deep Learning
      Word2Vec
      Transformer
      Multi-Modal Models
    Market Demand
      Semantic Search
      RAG Explosion
      Multi-Modal AI
```
### Key Technology Milestones Review
| Year | Technology/Product | Significance |
|---|---|---|
| 2003 | LSH | Solved high-dimensional data retrieval problem |
| 2011 | PQ | Achieved vector compression storage |
| 2013 | Word2Vec | Proved the feasibility of semantic vector representations |
| 2017 | FAISS | First industrial-grade vector retrieval library |
| 2019 | HNSW | Achieved ultra-fast, high-precision vector search |
| 2020 | Pinecone | Opened the era of cloud-native vector databases |
| 2023 | RAG Explosion | Vector databases became AI infrastructure |
## Chapter Summary
Key Points:
- Germination Period (1970-2000): Nearest neighbor search problem was proposed, early algorithms encountered curse of dimensionality
- Algorithm Breakthrough Period (2003-2011): LSH and PQ algorithms laid the theoretical foundation
- Deep Learning Driven Period (2013-2017): Word2Vec and Transformer proved the feasibility of vector representations
- Modern Database Period (2017-Present): FAISS open source, HNSW maturation, cloud services emergence, RAG explosion
Next Chapter Preview: Chapter 3: Core Principles - Deep understanding of vector representations, distance calculations, and index structures →