Development History

Technology Development Timeline

timeline
    title Vector Database Technology Evolution

    section Germination Period
        1970s-2000 : NNS Research Begins
                        Nearest Neighbor Search

    section Index Algorithm Breakthrough
        2003 : KD-Tree Optimization
               Space Partition Index
        2006 : LSH Locality-Sensitive Hashing
               Random Projection Method
        2011 : PQ Product Quantization
               Lossy Compression Index

    section Modern Vector Databases
        2017 : FAISS Released
               Facebook Open Source
        2019 : HNSW Algorithm Matures
               Hierarchical Navigable Small World
        2020 : Pinecone Cloud Service
               Managed Vector Database

    section Explosion Period
        2021-2022 : Milvus, Qdrant
                    Open Source Databases Emerge
        2023-2024 : LLM Driven
                    RAG Application Explosion

1. Germination Period: From Academic Research to Concept Formation

Nearest Neighbor Search (NNS) is the core problem of vector databases: given a query vector, find the most similar vector in a dataset.

The earliest research dates back to the 1970s, when computer scientists began studying how to efficiently solve the problem of "finding the closest point among N points".

Limitations of Early Methods

Early methods like KD-Tree suffer from the "curse of dimensionality" when dealing with high-dimensional data (dimension > 20), causing retrieval efficiency to drop dramatically.

# Curse of dimensionality demo: as dimension grows, the nearest and
# farthest points are almost equally far from the query, so
# space-partitioning indexes like KD-Tree degrade toward a linear scan
import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 10, 50, 100, 500, 1000]:
    points = rng.random((2000, dim))
    dists = np.linalg.norm(points - rng.random(dim), axis=1)
    # Ratio -> 1 with rising dim: "all points approximately equidistant"
    print(dim, round(dists.min() / dists.max(), 3))

2. Index Algorithm Breakthrough Period

1. Locality-Sensitive Hashing (LSH) - 2003-2006

Locality-Sensitive Hashing (LSH) was the first algorithm that could effectively perform approximate nearest neighbor search on large-scale high-dimensional data.

Core Idea: Similar vectors are mapped to the same "bucket" with high probability after hashing.

graph LR
    subgraph LSH Working Principle
        A[Vector A: Similar] -->|Hash Function| C[Bucket 1]
        B[Vector B: Similar] -->|Hash Function| C
        D[Vector C: Not Similar] -->|Hash Function| E[Bucket 2]

        C --> F[Precise Search Within Bucket]
    end

    style A fill:#b3e5fc
    style B fill:#b3e5fc
    style D fill:#ffccbc
    style C fill:#c8e6c9
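The bucket idea can be sketched with random-hyperplane hashing (SimHash), one form of the random projection method named in the timeline. This is a minimal illustration, not a production LSH index: the vectors and the number of hyperplanes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
dim, n_planes = 4, 8
planes = rng.normal(size=(n_planes, dim))  # random hyperplanes

def lsh_bucket(v):
    # One bit per hyperplane: which side of the plane does v fall on?
    return tuple(bool(b) for b in (planes @ v) > 0)

a = np.array([1.0, 0.5, 0.2, 0.1])
b = 1.5 * a   # same direction as a -> identical bits
c = -a        # opposite direction -> every bit flips

print(lsh_bucket(a) == lsh_bucket(b))  # True: lands in the same bucket
print(lsh_bucket(a) == lsh_bucket(c))  # False: lands in a different bucket
```

A real LSH index combines several such bit signatures (multiple hash tables), then re-ranks the candidates that share a bucket with the query by exact distance.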

LSH Advantages and Disadvantages:

| Advantages | Disadvantages |
| --- | --- |
| Theoretically well understood | Number of hash buckets must be preset |
| Scales to high-dimensional data | High memory usage |
| Fast query speed | Limited precision |

2. Product Quantization (PQ) - 2011

Product Quantization (PQ), proposed by researchers at INRIA in France (Jégou et al., 2011), is a lossy compression method that can compress high-dimensional vectors into a much smaller storage space.

Core Idea: 1. Divide the vector into multiple sub-vectors 2. Cluster each sub-vector space 3. Use cluster center IDs instead of original sub-vectors

# PQ compression example
original_vector = [0.23, -0.45, 0.89, -0.12, 0.56, -0.78]  # 6-dimensional original vector

# After PQ compression: represented by cluster center IDs
# Divided into 3 groups, 2 dimensions each
# [0.23, -0.45] -> Cluster center ID: 42
# [0.89, -0.12] -> Cluster center ID: 17
# [0.56, -0.78] -> Cluster center ID: 89

compressed = [42, 17, 89]  # Only store 3 small integers
# Compression: 6 x 4-byte floats (24 bytes) -> 3 x 1-byte IDs (3 bytes) = 8x
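The three steps can be sketched end to end. Note a simplification: the codebooks below are random rather than learned by k-means clustering, so this shows only the encode/decode mechanics, not real quantization quality.

```python
import numpy as np

rng = np.random.default_rng(0)

# 6-dim vectors split into 3 sub-vectors of 2 dims each;
# each sub-space gets a 4-entry codebook (real PQ trains these with k-means)
codebooks = rng.random((3, 4, 2))  # (sub-space, centroid ID, sub-dim)

def pq_encode(v):
    # Replace each sub-vector with the ID of its nearest codebook centroid
    return [int(np.linalg.norm(codebooks[i] - sub, axis=1).argmin())
            for i, sub in enumerate(v.reshape(3, 2))]

def pq_decode(codes):
    # Lossy reconstruction: concatenate the chosen centroids
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])

v = rng.random(6)
codes = pq_encode(v)       # three small integers (2 bits each here)
approx = pq_decode(codes)  # an approximation of v, not v itself
```

Distances computed on the reconstructed vectors are approximate, which is exactly the precision-for-space trade-off PQ makes.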

3. Deep Learning Driven Period

Rise of Word Embeddings

In 2013, Google released the Word2Vec model, a breakthrough in converting words into dense vectors efficiently at scale. This "embedding" technology enabled machines to capture semantic relationships between words.

# Word2Vec example: word to vector mapping
word_vectors = {
    "king": [0.5, 0.3, -0.2, 0.8],
    "queen": [0.4, 0.2, -0.1, 0.9],
    "man": [0.6, -0.1, 0.4, 0.2],
    "woman": [0.3, 0.1, 0.5, 0.3]
}

# king - man + woman ≈ queen (vector operations!)
# This demonstrates that vectors truly encode semantic information
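With the toy vectors above, the analogy can actually be computed: do the vector arithmetic, then pick the closest remaining word by cosine similarity. (Here the answer is trivially queen, since it is the only candidate left, but the same code works on real embeddings.)

```python
import numpy as np

# Toy 4-dimensional "embeddings" from the example above
word_vectors = {
    "king":  np.array([0.5, 0.3, -0.2, 0.8]),
    "queen": np.array([0.4, 0.2, -0.1, 0.9]),
    "man":   np.array([0.6, -0.1, 0.4, 0.2]),
    "woman": np.array([0.3, 0.1, 0.5, 0.3]),
}

def cosine(a, b):
    # Cosine similarity: angle-based, ignores vector magnitude
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = word_vectors["king"] - word_vectors["man"] + word_vectors["woman"]
best = max((w for w in word_vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, word_vectors[w]))
print(best)  # -> queen
```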

Transformer and Vector Representations

In 2017, the Transformer architecture was proposed, subsequently producing large language models like BERT and GPT. These models can generate high-quality context-dependent vector representations, greatly promoting the development of vector databases.

4. Modern Vector Database Birth Period

1. FAISS - 2017

FAISS (Facebook AI Similarity Search) was open-sourced by Facebook (now Meta) and is one of the first industrial-grade vector retrieval libraries.

# FAISS basic usage example
import faiss
import numpy as np

# Create 10,000 random 128-dimensional vectors as the dataset
dimension = 128
vectors = np.random.random((10000, dimension)).astype('float32')

# Build index
index = faiss.IndexFlatL2(vectors.shape[1])  # L2 distance index
index.add(vectors)

# Query
query = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query, 5)  # Find the 5 nearest neighbors

print(f"Nearest neighbor indices: {indices}")
print(f"Distances: {distances}")

FAISS Contributions:

- Proved the feasibility of industrial-grade vector retrieval
- Provided multiple index algorithms (IVF, PQ, HNSW, etc.)
- Its open-source release drove development of the entire field

2. HNSW Algorithm Maturation - 2019

HNSW (Hierarchical Navigable Small World) is currently one of the most mainstream vector index algorithms, proposed by Yury Malkov and Dmitry Yashunin.

Core Idea: Build a multi-layer graph structure with sparse upper layers and dense lower layers, achieving fast search through "highways".

graph TB
    subgraph Layer 2 [Upper Layer - Sparse]
        L2_A((A)) --> L2_C((C))
        L2_B((B)) --> L2_C
        L2_C --> L2_E((E))
    end

    subgraph Layer 1 [Middle Layer]
        L1_A((A)) --> L1_B((B))
        L1_B --> L1_C((C))
        L1_C --> L1_D((D))
        L1_D --> L1_E((E))
    end

    subgraph Layer 0 [Bottom Layer - Dense]
        L0_A((A)) --> L0_B((B))
        L0_B --> L0_C((C))
        L0_C --> L0_D((D))
        L0_D --> L0_E((E))
        L0_B --> L0_E
        L0_A --> L0_D
    end

    style L2_A fill:#bbdefb
    style L2_C fill:#bbdefb
    style L2_B fill:#bbdefb
    style L2_E fill:#bbdefb

    style Layer 0 fill:#fff3e0,color:#000
    style Layer 1 fill:#e8f5e9,color:#000
    style Layer 2 fill:#fce4ec,color:#000

HNSW Advantages:

- Extremely fast queries (millisecond latency)
- High precision (recall often above 95%)
- Easy to use well, though parameters such as M, efConstruction, and efSearch still tune the speed/recall trade-off
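The "highway" idea reduces, on each layer, to greedy routing: hop to whichever neighbor is closest to the query until no neighbor improves. The toy single-layer graph below is made up for illustration; real HNSW repeats this across layers, entering each lower layer at the best node found above.

```python
import numpy as np

# Toy single-layer graph: 5 points on a line, node -> neighbor list
points = {i: np.array([float(i), 0.0]) for i in range(5)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def greedy_search(query, entry=0):
    # Greedily move to the neighbor closest to the query;
    # stop at a local minimum (no neighbor beats the current node)
    dist = lambda i: float(np.linalg.norm(points[i] - query))
    current = entry
    while True:
        best = min(graph[current], key=dist)
        if dist(best) >= dist(current):
            return current
        current = best

print(greedy_search(np.array([3.2, 0.0])))  # -> 3
```

The sparse upper layers exist precisely so this greedy walk starts near the right region, keeping the number of hops logarithmic rather than linear.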

5. Cloud Services and Open Source Co-Existence Period

2020-2022: Rise of Cloud-Native Vector Databases

| Product | Company | Features |
| --- | --- | --- |
| Pinecone | Pinecone | Fully managed, cloud-native |
| Milvus | Zilliz | Open source, cloud-native |
| Qdrant | Qdrant team | Open source, Rust implementation |
| Weaviate | SeMI Technologies | Hybrid search |
| Chroma | Chroma | Lightweight, LLM-oriented |

2023-2024: Explosion in the LLM Era

The success of ChatGPT demonstrated the powerful capabilities of LLMs, and also fueled the RAG (Retrieval-Augmented Generation) architecture. Vector databases, as core components of RAG, experienced explosive growth.

flowchart LR
    A[Large Language Models] -->|Drive| B[RAG Applications]
    B -->|Depend on| C[Vector Databases]

    C -->|Massive Demand| D[Market Explosion]
    D -->|Capital Influx| E[Product Maturation]

    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e8
    style D fill:#ffccbc
    style E fill:#c8e6c9

6. Summary of Technology Evolution Patterns

Evolution Drivers

mindmap
  root((Vector Database Evolution Drivers))
    Algorithm Breakthroughs
      KD-Tree
      LSH
      HNSW
      PQ
    Hardware Advances
      GPU Acceleration
      Large Memory
      SSD Popularization
    Deep Learning
      Word2Vec
      Transformer
      Multi-Modal Models
    Market Demand
      Semantic Search
      RAG Explosion
      Multi-Modal AI

Key Technology Milestones Review

| Year | Technology/Product | Significance |
| --- | --- | --- |
| 2003 | LSH | Made approximate high-dimensional retrieval practical |
| 2011 | PQ | Achieved compressed vector storage |
| 2013 | Word2Vec | Proved the feasibility of semantic vector representations |
| 2017 | FAISS | Industrial-grade open-source vector retrieval library |
| 2019 | HNSW | Fast, high-recall vector search |
| 2020 | Pinecone | Opened the era of cloud-native vector databases |
| 2023 | RAG explosion | Vector databases became AI infrastructure |

Chapter Summary

Key Points:

  1. Germination Period (1970-2000): the nearest neighbor search problem was posed; early algorithms ran into the curse of dimensionality
  2. Algorithm Breakthrough Period (2003-2011): LSH and PQ laid the theoretical foundation
  3. Deep Learning Driven Period (2013-2017): Word2Vec and the Transformer proved the feasibility of vector representations
  4. Modern Database Period (2017-Present): FAISS open-sourced, HNSW matured, cloud services emerged, RAG exploded

Next Chapter Preview: Chapter 3: Core Principles - Deep understanding of vector representations, distance calculations, and index structures →