# Development History

## Technology Development Timeline
```mermaid
timeline
    title Vector Database Technology Evolution
    section Germination Period
        1970s-1990s : NNS research begins
            : Naive nearest neighbor search
    section Index Algorithm Breakthrough
        2003 : KD-Tree optimization
            : Space-partitioning index
        2006 : LSH (Locality-Sensitive Hashing)
            : Random projection method
        2011 : PQ (Product Quantization)
            : Lossy compression index
    section Modern Vector Databases
        2017 : FAISS released
            : Facebook open source
        2019 : HNSW algorithm matures
            : Hierarchical Navigable Small World
        2020 : Pinecone cloud service
            : Managed vector database
    section Outbreak Period
        2021-2022 : Milvus, Qdrant
            : Open-source databases emerge
        2023-2024 : LLM-driven
            : RAG application explosion
```
## 1. Germination Period: From Academic Research to Concept Formation

### What is Nearest Neighbor Search?
Nearest Neighbor Search (NNS) is the core problem of vector databases: given a query vector, find the most similar vector in a dataset.
The earliest research dates back to the 1970s, when computer scientists began studying how to efficiently solve the problem of "finding the closest point among N points".
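The definition above translates directly into a brute-force scan, which is exactly what the early methods tried to beat. A minimal sketch (pure Python, toy 2-D data):

```python
import math

def nearest_neighbor(query, dataset):
    """Exact (naive) nearest neighbor: scan every point - O(N * d)."""
    best_idx, best_dist = -1, math.inf
    for i, point in enumerate(dataset):
        # Euclidean (L2) distance between the query and this point
        dist = math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (5.0, 5.0)]
idx, dist = nearest_neighbor((1.1, 0.9), points)
print(idx, round(dist, 3))  # point at index 1 is closest
```

The linear scan is exact but touches every point, which is why all later index structures in this chapter exist.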
### Limitations of Early Methods

Early methods such as KD-Tree suffer from the "curse of dimensionality": once data has more than roughly 20 dimensions, retrieval efficiency drops dramatically.
```python
# Curse of dimensionality illustration (numbers are qualitative, not measured)
dimensions = [2, 10, 50, 100, 500, 1000]
efficiency = [1.0, 0.8, 0.3, 0.1, 0.01, 0.001]  # relative retrieval efficiency
# Higher dimensions = sparser data distribution,
# causing "all points to be approximately equidistant"
```
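The "approximately equidistant" effect can be simulated directly with the standard library (a sketch; exact numbers vary with the random seed):

```python
import math
import random

def distance_spread(dim, n_points=200, seed=0):
    """Contrast between the farthest and nearest random point from the origin.
    As dim grows this contrast shrinks: 'near' and 'far' become hard to tell apart."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return (max(dists) - min(dists)) / min(dists)

for dim in [2, 10, 100, 1000]:
    print(dim, round(distance_spread(dim), 3))  # spread shrinks as dim grows
```

In low dimensions the nearest point is much closer than the farthest; in high dimensions the ratio collapses toward zero, which is what defeats space-partitioning trees like KD-Tree.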
## 2. Index Algorithm Breakthrough Period

### 1. Locality-Sensitive Hashing (LSH) - 2003-2006
Locality-Sensitive Hashing (LSH) was the first algorithm that could effectively perform approximate nearest neighbor search on large-scale high-dimensional data.
Core Idea: Similar vectors are mapped to the same "bucket" with high probability after hashing.
```mermaid
graph LR
    subgraph LSH["LSH Working Principle"]
        A["Vector A (similar)"] -->|Hash function| C["Bucket 1"]
        B["Vector B (similar)"] -->|Hash function| C
        D["Vector C (not similar)"] -->|Hash function| E["Bucket 2"]
        C --> F["Precise search within bucket"]
    end
    style A fill:#b3e5fc
    style B fill:#b3e5fc
    style D fill:#ffccbc
    style C fill:#c8e6c9
```
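One concrete LSH family is random-hyperplane hashing for cosine similarity: each signature bit records which side of a random hyperplane the vector falls on, so nearby vectors share most bits. A minimal sketch (pure Python; the vectors and parameters are illustrative):

```python
import random

def make_signature(dim, n_planes, seed=0):
    """Random-hyperplane LSH: each bit is the side of one random hyperplane."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    def signature(v):
        return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0)
                     for p in planes)
    return signature

def agreement(s, t):
    """Matching signature bits; more matches = more likely to share a bucket."""
    return sum(x == y for x, y in zip(s, t))

sig = make_signature(dim=4, n_planes=16)
a = [0.9, 0.1, 0.2, 0.0]     # similar to b
b = [0.8, 0.2, 0.1, 0.1]
c = [-0.9, -0.1, 0.3, -0.5]  # roughly opposite direction

# Similar vectors agree on far more hyperplane sides than dissimilar ones
print(agreement(sig(a), sig(b)), agreement(sig(a), sig(c)))
```

In a real LSH index the signature (or a prefix of it) is the bucket key, and only vectors in the query's bucket are compared exactly, as in the diagram above.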
LSH Advantages and Disadvantages:
| Advantages | Disadvantages |
|---|---|
| Theoretically mature | Number of hash buckets needs to be preset |
| Works on high-dimensional data | High memory usage |
| Fast query speed | Limited precision |
### 2. Product Quantization (PQ) - 2011

Product Quantization (PQ), proposed by researchers at INRIA in France, is a lossy compression method that packs high-dimensional vectors into much smaller storage.

Core Idea:

1. Divide the vector into multiple sub-vectors
2. Cluster each sub-vector space (e.g. with k-means)
3. Store cluster-center IDs instead of the original sub-vectors
```python
# PQ compression example
original_vector = [0.23, -0.45, 0.89, -0.12, 0.56, -0.78]  # 6-dimensional original vector

# After PQ compression: represented by cluster-center IDs.
# Divided into 3 groups of 2 dimensions each:
# [0.23, -0.45] -> cluster-center ID: 42
# [0.89, -0.12] -> cluster-center ID: 17
# [0.56, -0.78] -> cluster-center ID: 89
compressed = [42, 17, 89]  # only 3 small integers are stored

# Compression: 6 x 4-byte floats (24 bytes) -> 3 x 1-byte codes (3 bytes) = 8x
```
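The three steps can be made runnable in miniature. Here the two-entry codebooks are hand-made for illustration; real PQ learns a much larger codebook per subspace with k-means:

```python
# Hypothetical 2-entry codebook per 2-D subspace (real PQ learns these)
codebooks = [
    [(0.2, -0.4), (0.9, 0.1)],   # sub-space 0
    [(-0.3, 0.6), (0.8, -0.1)],  # sub-space 1
    [(0.0, 0.0), (0.5, -0.8)],   # sub-space 2
]

def pq_encode(vector, codebooks, sub_dim=2):
    """Replace each sub-vector by the id of its nearest codebook entry."""
    codes = []
    for m, book in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        dists = [sum((s - c) ** 2 for s, c in zip(sub, center)) for center in book]
        codes.append(dists.index(min(dists)))
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    out = []
    for m, code in enumerate(codes):
        out.extend(codebooks[m][code])
    return out

vec = [0.23, -0.45, 0.89, -0.12, 0.56, -0.78]
codes = pq_encode(vec, codebooks)
print(codes)                        # → [0, 1, 1]: 3 small integers, not 6 floats
print(pq_decode(codes, codebooks))  # lossy approximation of vec
```

Decoding never recovers the exact vector, only the nearest centroids, which is why PQ is a lossy index.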
## 3. Deep Learning Driven Period

### Rise of Word Embeddings

In 2013, Google released the Word2Vec model, the first widely adopted technique for converting words into dense vectors. This "embedding" technology enabled machines to capture semantic relationships between words.
```python
# Word2Vec example: word-to-vector mapping (toy 4-dimensional vectors)
word_vectors = {
    "king": [0.5, 0.3, -0.2, 0.8],
    "queen": [0.4, 0.2, -0.1, 0.9],
    "man": [0.6, -0.1, 0.4, 0.2],
    "woman": [0.3, 0.1, 0.5, 0.3]
}
# king - man + woman ≈ queen (vector arithmetic!)
# This suggests the vectors encode semantic information
```
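The king - man + woman ≈ queen claim can be checked numerically with cosine similarity, reusing the toy vectors above (a sketch; real Word2Vec vectors have hundreds of dimensions):

```python
import math

word_vectors = {
    "king": [0.5, 0.3, -0.2, 0.8],
    "queen": [0.4, 0.2, -0.1, 0.9],
    "man": [0.6, -0.1, 0.4, 0.2],
    "woman": [0.3, 0.1, 0.5, 0.3],
}

def cosine(u, v):
    """Cosine similarity: angle-based closeness of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, component-wise
target = [k - m + w for k, m, w in zip(word_vectors["king"],
                                       word_vectors["man"],
                                       word_vectors["woman"])]

best = max(word_vectors, key=lambda w: cosine(word_vectors[w], target))
print(best)  # → queen
```

Among the four toy words, "queen" really is the nearest to the computed vector, which is the analogy property the original Word2Vec paper reported at scale.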
### Transformer and Vector Representations
In 2017, the Transformer architecture was proposed, subsequently producing large language models like BERT and GPT. These models can generate high-quality context-dependent vector representations, greatly promoting the development of vector databases.
## 4. Modern Vector Database Birth Period

### 1. FAISS - 2017
FAISS (Facebook AI Similarity Search) was open-sourced by Facebook (now Meta) and is the first industrial-grade vector retrieval library.
```python
# FAISS basic usage example
import faiss
import numpy as np

# Create 10,000 random 128-dimensional vectors as the dataset
dimension = 128
vectors = np.random.random((10000, dimension)).astype('float32')

# Build an exact L2-distance index
index = faiss.IndexFlatL2(dimension)
index.add(vectors)

# Query: find the 5 nearest neighbors
query = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query, k=5)
print(f"Nearest neighbor indices: {indices}")
print(f"Distances: {distances}")
```
FAISS Contributions:

- Proved the feasibility of industrial-grade vector retrieval
- Provided multiple index algorithms (IVF, PQ, HNSW, etc.)
- Open-sourcing drove development of the entire field
### 2. HNSW Algorithm Maturation - 2019

HNSW (Hierarchical Navigable Small World), proposed by the Russian researchers Yury Malkov and Dmitry Yashunin, is currently one of the most mainstream vector index algorithms.
Core Idea: Build a multi-layer graph structure with sparse upper layers and dense lower layers, achieving fast search through "highways".
```mermaid
graph TB
    subgraph Layer2["Layer 2 - Upper (Sparse)"]
        L2_A((A)) --> L2_C((C))
        L2_B((B)) --> L2_C
        L2_C --> L2_E((E))
    end
    subgraph Layer1["Layer 1 - Middle"]
        L1_A((A)) --> L1_B((B))
        L1_B --> L1_C((C))
        L1_C --> L1_D((D))
        L1_D --> L1_E((E))
    end
    subgraph Layer0["Layer 0 - Bottom (Dense)"]
        L0_A((A)) --> L0_B((B))
        L0_B --> L0_C((C))
        L0_C --> L0_D((D))
        L0_D --> L0_E((E))
        L0_B --> L0_E
        L0_A --> L0_D
    end
    style L2_A fill:#bbdefb
    style L2_B fill:#bbdefb
    style L2_C fill:#bbdefb
    style L2_E fill:#bbdefb
    style Layer0 fill:#fff3e0,color:#000
    style Layer1 fill:#e8f5e9,color:#000
    style Layer2 fill:#fce4ec,color:#000
```
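The "highway" idea is greedy descent: start at an entry point on the sparse top layer, walk to whichever neighbor is closest to the query until no neighbor improves, then continue from that node one layer down. A toy 1-D sketch mirroring the diagram (real HNSW also keeps a candidate list controlled by an ef parameter):

```python
# Toy layered graph over 1-D points; node id -> coordinate
coords = {"A": 0.0, "B": 2.0, "C": 5.0, "D": 7.0, "E": 9.0}

# Adjacency per layer: sparse on top, dense at the bottom
layers = [
    {"A": ["C"], "B": ["C"], "C": ["A", "B", "E"], "E": ["C"]},                   # layer 2
    {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]},  # layer 1
    {"A": ["B", "D"], "B": ["A", "C", "E"], "C": ["B", "D"],
     "D": ["A", "C", "E"], "E": ["B", "D"]},                                      # layer 0
]

def greedy_layered_search(query, layers, entry="A"):
    """Greedy descent: in each layer, move to the neighbor closest to the
    query until stuck, then reuse that node as the next layer's entry point."""
    current = entry
    for graph in layers:
        improved = True
        while improved:
            improved = False
            for nb in graph.get(current, []):
                if abs(coords[nb] - query) < abs(coords[current] - query):
                    current, improved = nb, True
    return current

print(greedy_layered_search(6.5, layers))  # → D (coordinate 7.0)
```

The sparse top layer lets the search cross the whole space in a couple of hops; the dense bottom layer then refines the answer locally.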
HNSW Advantages:

- Extremely fast queries (millisecond level)
- High precision (recall close to 100% with suitable settings)
- Only a few tuning parameters (M, efConstruction, efSearch), with sensible defaults
## 5. Cloud Services and Open Source Co-Existence Period

### 2020-2022: Rise of Cloud-Native Vector Databases
| Product | Company | Features |
|---|---|---|
| Pinecone | Pinecone | Fully managed, cloud-native |
| Milvus | Zilliz | Open source, cloud-native |
| Qdrant | Qdrant team | Open source, Rust implementation |
| Weaviate | SeMI Technologies | Hybrid search |
| Chroma | Chroma | Lightweight, LLM-oriented |
### 2023-2024: Explosion in the LLM Era
The success of ChatGPT demonstrated the powerful capabilities of LLMs, and also fueled the RAG (Retrieval-Augmented Generation) architecture. Vector databases, as core components of RAG, experienced explosive growth.
```mermaid
flowchart LR
    A[Large Language Models] -->|Drive| B[RAG Applications]
    B -->|Depend on| C[Vector Databases]
    C -->|Massive Demand| D[Market Explosion]
    D -->|Capital Influx| E[Product Maturation]
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e8
    style D fill:#ffccbc
    style E fill:#c8e6c9
```
## 6. Summary of Technology Evolution Patterns

### Evolution Drivers
```mermaid
mindmap
  root((Vector Database Evolution Drivers))
    Algorithm Breakthroughs
      KD-Tree
      LSH
      HNSW
      PQ
    Hardware Advances
      GPU Acceleration
      Large Memory
      SSD Popularization
    Deep Learning
      Word2Vec
      Transformer
      Multi-Modal Models
    Market Demand
      Semantic Search
      RAG Explosion
      Multi-Modal AI
```
### Key Technology Milestones Review
| Year | Technology/Product | Significance |
|---|---|---|
| 2003 | LSH | Solved high-dimensional data retrieval problem |
| 2011 | PQ | Achieved vector compression storage |
| 2013 | Word2Vec | Proved the feasibility of semantic vector representations |
| 2017 | FAISS | First industrial-grade vector retrieval library |
| 2019 | HNSW | Achieved ultra-fast, high-precision vector search |
| 2020 | Pinecone | Opened the era of cloud-native vector databases |
| 2023 | RAG Explosion | Vector databases became AI infrastructure |
## Chapter Summary
Key Points:
- Germination Period (1970-2000): Nearest neighbor search problem was proposed, early algorithms encountered curse of dimensionality
- Algorithm Breakthrough Period (2003-2011): LSH and PQ algorithms laid the theoretical foundation
- Deep Learning Driven Period (2013-2017): Word2Vec and Transformer proved the feasibility of vector representations
- Modern Database Period (2017-Present): FAISS open source, HNSW maturation, cloud services emergence, RAG explosion
Next Chapter Preview: Chapter 3: Core Principles - Deep understanding of vector representations, distance calculations, and index structures →