Core Principles¶
Vector Representations, Distance Calculations, and Index Structures
3.1 Vector Representations¶
What is a Vector¶
A Vector is a fundamental mathematical concept that can be understood as an ordered array of numbers with "direction and magnitude". In vector databases, each data item is represented as a high-dimensional numerical array.
Plain Language Explanation: Imagine you are describing the features of an article:

- The first number represents "technology content" (0.8 means very high)
- The second number represents "entertainment value" (0.2 means not very entertaining)
- The third number represents "political sensitivity" (-0.3 means slightly negative)
- ...
This article can be represented as a vector like [0.8, 0.2, -0.3, ...].
# Vector example: vector representations of different texts
text_vectors = {
    "The weather is nice today": [0.1, 0.9, 0.3, -0.1],
    "Artificial intelligence will change the world": [0.8, 0.2, 0.7, 0.5],
    "Watched an exciting movie": [0.2, 0.8, 0.4, 0.3],
}
# Dimension: each vector has 4 numbers, so it's a 4-dimensional vector
# [0.1, 0.9, 0.3, -0.1] # 4-dimensional vector
Intuitive Understanding of Vectors¶
graph TD
subgraph Low-Dimensional to High-Dimensional Mapping
A[1D: Point on a line] --> B[2D: Point on a plane]
B --> C[3D: Point in space]
C --> D[High-D: Cannot be visualized
But mathematically consistent]
end
subgraph Vector Representation Examples
E["Text: 'Apple'"] --> F["Vector: [0.8, 0.1, 0.3, ...]"]
G["Image: Cat photo"] --> H["Vector: [0.2, 0.7, 0.4, ...]"]
H --> J["All data can be
represented as numerical vectors"]
end
style D fill:#e1f5fe
style J fill:#c8e6c9
Mathematical Properties of Vectors¶
| Property | Description | Example |
|---|---|---|
| Direction | Vectors have direction, different directions represent different semantics | [1,0] and [0,1] have completely different directions |
| Distance | Vectors with closer directions are more semantically similar | [0.9, 0.1] and [0.8, 0.2] are very close |
| Computability | Arithmetic operations can be performed between vectors | Famous example: king - man + woman ≈ queen |
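The "computability" row can be demonstrated with hand-picked toy vectors. The values below are invented purely for illustration; real word embeddings are learned by a model and have hundreds of dimensions, where this analogy holds only approximately:

```python
import numpy as np

# Toy 3-dimensional "embeddings" with hand-picked values.
# Dimensions (illustrative): [royalty, masculinity, femininity]
king = np.array([0.9, 0.8, 0.1])
man = np.array([0.5, 0.8, 0.1])
woman = np.array([0.5, 0.1, 0.9])
queen = np.array([0.9, 0.1, 0.9])

# king - man removes "masculinity", + woman adds "femininity"
result = king - man + woman
print(np.allclose(result, queen))  # True for these hand-picked values
```

With real embeddings the result vector is not exactly equal to "queen"; instead, "queen" is typically the *nearest* vector to the result, which is why this is usually stated with ≈ rather than =.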
3.2 Distance Calculation¶
What is Vector Distance¶
Vector distance is a metric for measuring the "degree of similarity" between two vectors. The smaller the distance, the more similar the two vectors are.
Common Distance Metrics¶
1. Euclidean Distance (L2 Distance)¶
Definition: The straight-line distance between two points, calculated as the square root of the sum of squared differences across all dimensions.
Plain Language Understanding: Like calculating the straight-line distance between two points on a map.
import numpy as np
def euclidean_distance(v1, v2):
    """Calculate Euclidean distance"""
    return np.sqrt(np.sum((v1 - v2) ** 2))
# Example
vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 6, 3])
distance = euclidean_distance(vector_a, vector_b)
print(f"Euclidean distance: {distance:.2f}")  # Output: 5.00
Geometric Interpretation:
graph TD
subgraph Euclidean Distance in 2D Space
A["Point A (1,2)"] --> B["Point B (4,6)"]
A --> C["(4,2)"]
C --> B
style A fill:#e1f5fe
style B fill:#ffccbc
style C fill:#b3e5fc
end
NOTE["Euclidean distance = √[(4-1)² + (6-2)²] = √[9 + 16] = √25 = 5"]
Use Cases:

- Image/audio feature vectors
- When the importance of each dimension is equal
- Scenarios where vector magnitude needs to be considered
2. Cosine Similarity¶
Definition: The cosine of the angle between two vectors, with a range of [-1, 1].
Plain Language Understanding: Comparing whether two vectors point in the same direction, regardless of their length.
import numpy as np
def cosine_similarity(v1, v2):
    """Calculate cosine similarity"""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)
# Example
vector_a = np.array([1, 1, 0])
vector_b = np.array([1, 1, 1])
vector_c = np.array([-1, -1, 0])
print(f"a vs b similarity: {cosine_similarity(vector_a, vector_b):.3f}") # 0.816
print(f"a vs c similarity: {cosine_similarity(vector_a, vector_c):.3f}") # -1.000
Geometric Interpretation:
graph TD
subgraph Cosine Similarity Diagram
O["Origin"]
A["A [1,1]"]
B["B [1,1,1]"]
C["C [-1,-1]"]
O --> A
O --> B
O --> C
AA["Angle 45°
Similarity 0.816"]
CC["Angle 180°
Similarity -1.0"]
end
style A fill:#b3e5fc
style B fill:#c8e6c9
style C fill:#ffccbc
Use Cases:

- Text semantic similarity (most commonly used!)
- Document topic similarity
- Ignoring vector magnitude, focusing only on direction
3. Dot Product Similarity¶
Definition: The sum of products of corresponding elements from two vectors.
def dot_product_similarity(v1, v2):
    """Calculate dot product"""
    return np.dot(v1, v2)
# Example
vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])
dot = dot_product_similarity(vector_a, vector_b)
print(f"Dot product: {dot}") # Output: 32 (1*4 + 2*5 + 3*6)
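The three metrics are closely related: the dot product equals the cosine similarity multiplied by the two vector norms, so after L2-normalizing the vectors, dot product and cosine similarity give identical values. A quick check using the vectors from the example above:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# dot(a, b) = cos(a, b) * |a| * |b|
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(np.dot(a, b), cos * np.linalg.norm(a) * np.linalg.norm(b)))  # True

# After L2 normalization, dot product and cosine similarity coincide
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(np.isclose(np.dot(a_unit, b_unit), cos))  # True
```

This is why many vector databases ask you to normalize embeddings before insertion: it lets them use the cheaper dot product while returning cosine-similarity rankings.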
Distance Metric Comparison¶
graph LR
subgraph Distance Metric Comparison
A[Vector A] -->|Same direction| B[Vector B]
A -->|Opposite direction| C[Vector C]
A -->|Perpendicular| D[Vector D]
end
subgraph Similarity Values
B --> E["Cosine: 0.95 (very similar)"]
C --> F["Cosine: -0.95 (opposite)"]
D --> G["Cosine: 0 (orthogonal)"]
end
subgraph Distance Values
B --> H["Euclidean: 0.3 (very close)"]
C --> I["Euclidean: 2.5 (far)"]
D --> J["Euclidean: 1.4 (moderate)"]
end
style E fill:#c8e6c9
style F fill:#ffccbc
style G fill:#fff3e0
How to Choose Distance Metrics¶
| Scenario | Recommended Method | Reason |
|---|---|---|
| Text semantic search | Cosine similarity | Text vector direction is more important than length |
| Image feature matching | Euclidean distance | Need to consider feature intensity |
| Recommendation systems | Dot product | Consider both user preference intensity and item attributes |
| Face recognition | Cosine similarity | Focus on feature direction, unaffected by lighting |
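A useful fact behind this table: for unit-length vectors, squared Euclidean distance and cosine similarity are interchangeable, since ||a - b||² = 2 - 2·cos(a, b). A small numerical check (random vectors used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)  # normalize to unit length
b /= np.linalg.norm(b)

cos = np.dot(a, b)            # cosine similarity of unit vectors
l2_sq = np.sum((a - b) ** 2)  # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(l2_sq, 2 - 2 * cos))  # True
```

So if your vectors are normalized, choosing between L2 and cosine changes the distance values but not the ranking of search results.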
3.3 Index Structures¶
Why Indices are Needed¶
Without an index, a vector database must compute the distance between the query and every stored vector, giving O(N) time complexity per query, which is unacceptable for massive datasets.
# Linear search without index
def linear_search(query_vector, all_vectors, top_k=10):
    distances = []
    for vector in all_vectors:
        dist = euclidean_distance(query_vector, vector)
        distances.append((dist, vector))
    distances.sort(key=lambda x: x[0])
    return distances[:top_k]
# Problem: 1 million records = 1 million distance calculations = several seconds
# Solution: Build an index to reduce calculations to hundreds
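Before reaching for an index, the Python loop above can at least be vectorized with NumPy, computing all N distances in one batched operation. This is still O(N), so it does not replace an index at billion scale, but it shows the brute-force baseline that index structures are measured against (the data here is random, for illustration only):

```python
import numpy as np

def linear_search_vectorized(query_vector, all_vectors, top_k=10):
    """Brute-force search over an (N, d) matrix in one batched NumPy pass."""
    diffs = all_vectors - query_vector           # broadcast to (N, d)
    dists = np.sqrt(np.sum(diffs ** 2, axis=1))  # (N,) Euclidean distances
    idx = np.argsort(dists)[:top_k]              # ids of the top_k nearest
    return idx, dists[idx]

# Example: 10,000 random 4-dimensional vectors
rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 4))
query = rng.normal(size=4)
idx, dists = linear_search_vectorized(query, vectors, top_k=5)
print(idx, dists)
```

Vectorization buys a large constant-factor speedup over the Python loop, but the index structures below change the asymptotics, which is what matters at scale.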
Approximate Nearest Neighbor Search (ANN)¶
Approximate Nearest Neighbor (ANN) search is the core technology of vector databases. By building index structures, it trades a small, controlled loss of precision for a dramatic improvement in search speed.
flowchart LR
subgraph Exact vs Approximate
A[Exact Search] -->|100% accurate| B["Time: O(N)
unusable when N=1 billion"]
C[Approximate Search] -->|99% accurate| D["Time: O(log N)
millisecond response"]
end
style A fill:#ffccbc
style C fill:#c8e6c9
style B fill:#ffccbc
style D fill:#c8e6c9
Mainstream Index Algorithms¶
1. HNSW (Hierarchical Navigable Small World)¶
HNSW (Hierarchical Navigable Small World) is currently the most popular high-performance vector index algorithm.
Core Ideas:

1. Build a multi-layer graph structure with sparse upper layers and dense lower layers
2. During a query, start from the uppermost layer and quickly locate the target region through "greedy search"
3. Use "small world" characteristics to ensure short search paths
HNSW Index Structure:
graph TB
subgraph Layer 2 [L2 Layer - Top Navigation]
L2_1(( )) --> L2_2(( ))
L2_2(( )) --> L2_3(( ))
L2_3(( )) --> L2_1(( ))
end
subgraph Layer 1 [L1 Layer - Middle]
L1_1(( )) --> L1_2(( ))
L1_2(( )) --> L1_3(( ))
L1_3(( )) --> L1_4(( ))
L1_4(( )) --> L1_1(( ))
L1_2(( )) --> L1_5(( ))
end
subgraph Layer 0 [L0 Layer - Bottom Data]
L0_1(( )) --> L0_2(( ))
L0_2(( )) --> L0_3(( ))
L0_3(( )) --> L0_4(( ))
L0_4(( )) --> L0_5(( ))
L0_5(( )) --> L0_6(( ))
L0_6(( )) --> L0_7(( ))
L0_7(( )) --> L0_8(( ))
L0_1(( )) --> L0_4(( ))
L0_3(( )) --> L0_6(( ))
L0_5(( )) --> L0_8(( ))
end
subgraph Query Path
Q((Query)) -->|1. Enter L2| L2_2
L2_2 -->|2. Descend to L1| L1_3
L1_3 -->|3. Descend to L0| L0_4
L0_4 -->|4. Local search| L0_5
end
style L2_1 fill:#90caf9
style L2_2 fill:#90caf9
style L2_3 fill:#90caf9
style Q fill:#ff5722,color:#fff
HNSW Search Process Example:
# HNSW search pseudocode (simplified)
def hnsw_search(query_vector, index, top_k=10):
    # 1. Enter the graph at the highest layer
    current_layer = index.max_layer
    entry_point = index.get_entry_point(current_layer)
    # 2. Greedy search within each upper layer until no closer
    #    neighbor exists, then descend one layer
    while current_layer > 0:
        improved = True
        while improved:
            improved = False
            neighbors = index.get_neighbors(entry_point, current_layer)
            best_next = find_nearest(query_vector, neighbors)
            if distance(query_vector, best_next) < distance(query_vector, entry_point):
                entry_point = best_next
                improved = True
        current_layer -= 1
    # 3. Fine search at the bottom layer (layer 0) using the efSearch candidate list
    results = layer0_search(query_vector, entry_point, top_k)
    return results
HNSW Parameter Description:
| Parameter | Description | Recommended Value |
|---|---|---|
| M | Maximum connections per node | 16-64 |
| efConstruction | Dynamic list size during construction | 100-400 |
| efSearch | Dynamic list size during search | 100-500 |
| max_level | Maximum number of layers | log(N) |
2. IVF (Inverted File Index)¶
IVF (Inverted File Index) is a clustering-based index method.
Core Ideas:

1. Cluster all vectors (e.g., with K-Means)
2. Maintain an inverted list for each cluster
3. During a query, only search the target vector's cluster and neighboring clusters
graph TD
subgraph Cluster Space
C1["Cluster Center 1"]
C2["Cluster Center 2"]
C3["Cluster Center 3"]
C1 -.->|Inverted List| I1["Vectors 1,5,8,12..."]
C2 -.->|Inverted List| I2["Vectors 2,3,7,15..."]
C3 -.->|Inverted List| I3["Vectors 4,6,9,11..."]
end
subgraph Query Process
Q["Query Vector"] -->|Find nearest center| C2
C2 -->|Search this cluster| I2
I2 --> R["Candidate vectors"]
end
style C1 fill:#e1f5fe
style C2 fill:#c8e6c9
style C3 fill:#fff3e0
style I2 fill:#c8e6c9
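The three steps above can be sketched in plain NumPy. This is a toy illustration, not a production IVF implementation: the mini k-means loop, the cluster count, and the `nprobe` value (how many nearby clusters to scan) are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))  # 1000 random 8-dim vectors

# 1. Cluster all vectors (a few iterations of plain k-means)
n_clusters = 10
centroids = data[rng.choice(len(data), n_clusters, replace=False)]
for _ in range(10):
    assign = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
    for c in range(n_clusters):
        if np.any(assign == c):
            centroids[c] = data[assign == c].mean(axis=0)

# 2. Build an inverted list per cluster: centroid id -> vector ids
inverted = {c: np.where(assign == c)[0] for c in range(n_clusters)}

# 3. Query: scan only the nprobe nearest clusters instead of all N vectors
def ivf_search(query, nprobe=2, top_k=5):
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([inverted[c] for c in nearest])
    dists = np.linalg.norm(data[cand] - query, axis=1)
    order = np.argsort(dists)[:top_k]
    return cand[order], dists[order]

ids, dists = ivf_search(rng.normal(size=8))
print(ids, dists)
```

Note the precision trade-off: if the true nearest neighbor falls in a cluster that is not probed, it is missed, which is why increasing `nprobe` raises recall at the cost of speed.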
3. PQ (Product Quantization) - Lossy Compression¶
PQ (Product Quantization) divides high-dimensional vectors into multiple sub-vectors and performs cluster compression separately.
graph TD
subgraph Original Vector
O["Original Vector [0.23, -0.45, 0.89, -0.12, 0.56, -0.78]"]
end
subgraph Segmentation
S1["Segment 1: [0.23, -0.45]"]
S2["Segment 2: [0.89, -0.12]"]
S3["Segment 3: [0.56, -0.78]"]
end
subgraph Quantization
Q1["→ Cluster Center ID: 42"]
Q2["→ Cluster Center ID: 17"]
Q3["→ Cluster Center ID: 89"]
end
O --> S1
O --> S2
O --> S3
S1 --> Q1
S2 --> Q2
S3 --> Q3
Q1 --> R["Compressed: [42, 17, 89]"]
Q2 --> R
Q3 --> R
style O fill:#e1f5fe
style R fill:#c8e6c9
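The segmentation and quantization steps above can be sketched as follows. This is a toy illustration: real PQ trains each per-segment codebook with k-means, whereas here random centroids stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_segments, n_centroids = 6, 3, 256  # 3 segments of 2 dims each
seg_dim = dim // n_segments

# Stand-in codebooks: one table of 256 centroids per segment
# (in real PQ these come from k-means on training data)
codebooks = rng.normal(size=(n_segments, n_centroids, seg_dim))

def pq_encode(vector):
    """Replace each segment by the id of its nearest codebook centroid."""
    codes = []
    for s in range(n_segments):
        seg = vector[s * seg_dim:(s + 1) * seg_dim]
        codes.append(np.argmin(np.linalg.norm(codebooks[s] - seg, axis=1)))
    return np.array(codes, dtype=np.uint8)  # six floats compressed to 3 bytes

def pq_decode(codes):
    """Approximate (lossy) reconstruction from the centroid ids."""
    return np.concatenate([codebooks[s][codes[s]] for s in range(n_segments)])

v = np.array([0.23, -0.45, 0.89, -0.12, 0.56, -0.78])
codes = pq_encode(v)
print(codes, np.linalg.norm(v - pq_decode(codes)))
```

The compression is what matters: 6 float32 values (24 bytes) become 3 one-byte codes, and the reconstruction error is the "lossy" part of the trade-off.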
Index Algorithm Comparison¶
| Index Algorithm | Query Speed | Precision | Memory Usage | Use Case |
|---|---|---|---|---|
| HNSW | Extremely fast | High | Medium | General scenarios, recommended first choice |
| IVF | Fast | Medium-High | Low | Large-scale data |
| PQ | Fast | Medium | Very low | Memory-constrained scenarios |
3.4 Nearest Neighbor Search¶
Nearest Neighbor Search Diagram¶
graph TD
subgraph 2D Vector Space Example
Q["Query Vector Q"]
A["A [0.2, 0.3]"]
B["B [0.8, 0.7]"]
C["C [0.5, 0.9]"]
D["D [0.9, 0.2]"]
E["E [0.3, 0.8]"]
end
subgraph Distance Calculation
Q -->|Distance to A| DA["0.36"]
Q -->|Distance to B| DB["0.79"]
Q -->|Distance to C| DC["0.28"]
Q -->|Distance to D| DD["0.70"]
Q -->|Distance to E| DE["0.22"]
end
subgraph Sorting Results
DE -->|Nearest| R1["1. E (0.22)"]
DC -->|Second nearest| R2["2. C (0.28)"]
DA -->|Third| R3["3. A (0.36)"]
end
style Q fill:#ff5722,color:#fff
style E fill:#4caf50
style C fill:#4caf50
ANN vs KNN¶
graph LR
subgraph KNN Exact Search
A[K=5] -->|Find| B["5 exact nearest neighbors
100% accurate
But slow"]
end
subgraph ANN Approximate Search
C[K=5] -->|Find| D["5 approximate nearest neighbors
99% accurate
But fast"]
end
style A fill:#e1f5fe
style C fill:#c8e6c9
style B fill:#ffccbc
style D fill:#c8e6c9
| Comparison | KNN | ANN |
|---|---|---|
| Accuracy | 100% | 95%-99% |
| Speed | Slow | Fast |
| Applicable data volume | <100K | 100K-1B+ |
| Recall | 1.0 | 0.95-0.99 |
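The recall figure in the last row is measured by comparing an ANN result set against the exact KNN ground truth. A minimal sketch with made-up result ids:

```python
def recall_at_k(exact_ids, ann_ids):
    """Fraction of the exact top-k that the ANN index also returned."""
    return len(set(exact_ids) & set(ann_ids)) / len(exact_ids)

exact = [3, 7, 12, 25, 40]   # ground-truth top-5 from brute-force KNN
approx = [3, 7, 12, 25, 99]  # top-5 returned by an ANN index
print(recall_at_k(exact, approx))  # 0.8
```

In practice this is averaged over many queries; recall@10 of 0.95-0.99 means that, on average, the ANN index returns 95%-99% of the true nearest neighbors.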
3.5 Filtering and Hybrid Search¶
What is Filtered Search¶
In practical applications, we often need to add additional filter conditions during search, such as "only search articles published in 2024".
# Filtered search example
results = vector_db.search(
    query_vector=query_embedding,
    top_k=10,
    filter={
        "category": "technology",  # Only search tech category
        "year": {"$gte": 2024},    # Published >= 2024
        "is_published": True       # Already published
    }
)
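The `vector_db.search` call above is a generic illustration; filter syntax differs between engines. Conceptually, one common strategy (pre-filtering) applies the metadata conditions first and only ranks the surviving vectors. A minimal sketch over a toy in-memory corpus (documents and field values invented for illustration):

```python
import numpy as np

# Toy corpus: each document carries a vector plus metadata
docs = [
    {"vec": np.array([0.8, 0.2]), "category": "technology", "year": 2024, "is_published": True},
    {"vec": np.array([0.7, 0.3]), "category": "technology", "year": 2023, "is_published": True},
    {"vec": np.array([0.1, 0.9]), "category": "sports", "year": 2024, "is_published": True},
    {"vec": np.array([0.9, 0.1]), "category": "technology", "year": 2024, "is_published": False},
]

def filtered_search(query, top_k=2):
    # 1. Pre-filter: apply the metadata conditions first
    candidates = [d for d in docs
                  if d["category"] == "technology"
                  and d["year"] >= 2024
                  and d["is_published"]]
    # 2. Rank only the surviving candidates by vector distance
    candidates.sort(key=lambda d: np.linalg.norm(d["vec"] - query))
    return candidates[:top_k]

results = filtered_search(np.array([0.8, 0.2]))
print([d["year"] for d in results])  # only published 2024 technology docs remain
```

The alternative, post-filtering, runs the vector search first and discards non-matching hits afterwards; it is simpler but can return fewer than `top_k` results when the filter is selective, which is why many engines integrate the filter into the index traversal itself.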
Hybrid Search Architecture¶
flowchart TD
subgraph Query Processing
Q["Query Vector"] -->|1| FE[Filter Expression]