
Core Principles

Vector Representations, Distance Calculations, and Index Structures

3.1 Vector Representations

What is a Vector

A Vector is a fundamental mathematical concept that can be understood as an ordered array of numbers with "direction and magnitude". In vector databases, each data item is represented as a high-dimensional numerical array.

Plain Language Explanation: Imagine you are describing the features of an article:

- The first number represents "technology content" (0.8 means very high)
- The second number represents "entertainment value" (0.2 means not very entertaining)
- The third number represents "political sensitivity" (-0.3 means slightly negative)
- ...

This article can be represented as a vector like [0.8, 0.2, -0.3, ...].

# Vector example: vector representations of different texts
text_vectors = {
    "The weather is nice today": [0.1, 0.9, 0.3, -0.1],
    "Artificial intelligence will change the world": [0.8, 0.2, 0.7, 0.5],
    "Watched an exciting movie": [0.2, 0.8, 0.4, 0.3],
}

# Dimension: each vector has 4 numbers, so it's a 4-dimensional vector
# [0.1, 0.9, 0.3, -0.1]  # 4-dimensional vector

Intuitive Understanding of Vectors

graph TD
    subgraph Low-Dimensional to High-Dimensional Mapping
        A[1D: Point on a line] --> B[2D: Point on a plane]
        B --> C[3D: Point in space]
        C --> D["High-D: Cannot be visualized
But mathematically consistent"]
    end

    subgraph Vector Representation Examples
        E["Text: 'Apple'"] --> F["Vector: [0.8, 0.1, 0.3, ...]"]
        G["Image: Cat photo"] --> H["Vector: [0.2, 0.7, 0.4, ...]"]
        H --> J["All data can be
represented as numerical vectors"]
    end

    style D fill:#e1f5fe
    style J fill:#c8e6c9

Mathematical Properties of Vectors

| Property | Description | Example |
| --- | --- | --- |
| Direction | Vectors have direction; different directions represent different semantics | [1,0] and [0,1] point in completely different directions |
| Distance | Vectors with closer directions are more semantically similar | [0.9, 0.1] and [0.8, 0.2] are very close |
| Computability | Arithmetic operations can be performed between vectors | Famous example: king - man + woman ≈ queen |
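The "computability" property can be demonstrated with toy vectors. The numbers below are invented purely for illustration, not real embeddings; in this hand-made space, the classic analogy works out exactly:

```python
import numpy as np

# Toy 3-dimensional "word embeddings" -- invented values, illustration only
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.8, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

# king - man + woman lands on the "queen" vector in this toy space
result = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the word whose vector is closest to the result
closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - result))
print(closest)  # queen
```

Real embedding models exhibit the same behavior only approximately, which is why the table says "≈ queen" rather than "= queen".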

3.2 Distance Calculation

What is Vector Distance

Vector distance is a metric for measuring the "degree of similarity" between two vectors. The smaller the distance, the more similar the two vectors are.

Common Distance Metrics

1. Euclidean Distance (L2 Distance)

Definition: The straight-line distance between two points, calculated as the square root of the sum of squared differences across all dimensions.

Plain Language Understanding: Like calculating the straight-line distance between two points on a map.

import numpy as np

def euclidean_distance(v1, v2):
    """Calculate Euclidean distance"""
    return np.sqrt(np.sum((v1 - v2) ** 2))

# Example
vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 6, 3])

distance = euclidean_distance(vector_a, vector_b)
print(f"Euclidean distance: {distance:.2f}")  # Output: 5.00

Geometric Interpretation:

graph TD
    subgraph Euclidean Distance in 2D Space
        A["Point A (1,2)"] --> B["Point B (4,6)"]
        A --> C["(4,2)"]
        C --> B

        style A fill:#e1f5fe
        style B fill:#ffccbc
        style C fill:#b3e5fc
    end

    NOTE["Euclidean distance = √[(4-1)² + (6-2)²] = √[9 + 16] = √25 = 5"]

Use Cases:

- Image/audio feature vectors
- When the importance of each dimension is equal
- Scenarios where vector magnitude needs to be considered

2. Cosine Similarity

Definition: The cosine of the angle between two vectors, with a range of [-1, 1].

Plain Language Understanding: Comparing whether two vectors point in the same direction, regardless of their length.

import numpy as np

def cosine_similarity(v1, v2):
    """Calculate cosine similarity"""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

# Example
vector_a = np.array([1, 1, 0])
vector_b = np.array([1, 1, 1])
vector_c = np.array([-1, -1, 0])

print(f"a vs b similarity: {cosine_similarity(vector_a, vector_b):.3f}")  # 0.816
print(f"a vs c similarity: {cosine_similarity(vector_a, vector_c):.3f}")  # -1.000

Geometric Interpretation:

graph TD
    subgraph Cosine Similarity Diagram
        O["Origin"]
        A["A [1,1,0]"]
        B["B [1,1,1]"]
        C["C [-1,-1,0]"]

        O --> A
        O --> B
        O --> C

        AA["Angle ≈ 35°
Similarity 0.816"]
        CC["Angle 180°
Similarity -1.0"]
    end

    style A fill:#b3e5fc
    style B fill:#c8e6c9
    style C fill:#ffccbc

Use Cases:

- Text semantic similarity (most commonly used!)
- Document topic similarity
- Ignoring vector magnitude, focusing only on direction

3. Dot Product Similarity

Definition: The sum of products of corresponding elements from two vectors.

import numpy as np

def dot_product_similarity(v1, v2):
    """Calculate dot product"""
    return np.dot(v1, v2)

# Example
vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])

dot = dot_product_similarity(vector_a, vector_b)
print(f"Dot product: {dot}")  # Output: 32 (1*4 + 2*5 + 3*6)

Distance Metric Comparison

graph LR
    subgraph Distance Metric Comparison
        A[Vector A] -->|Same direction| B[Vector B]
        A -->|Opposite direction| C[Vector C]
        A -->|Perpendicular| D[Vector D]
    end

    subgraph Similarity Values
        B --> E["Cosine: 0.95 (very similar)"]
        C --> F["Cosine: -0.95 (opposite)"]
        D --> G["Cosine: 0 (orthogonal)"]
    end

    subgraph Distance Values
        B --> H["Euclidean: 0.3 (very close)"]
        C --> I["Euclidean: 2.5 (far)"]
        D --> J["Euclidean: 1.4 (moderate)"]
    end

    style E fill:#c8e6c9
    style F fill:#ffccbc
    style G fill:#fff3e0

How to Choose Distance Metrics

| Scenario | Recommended Method | Reason |
| --- | --- | --- |
| Text semantic search | Cosine similarity | Text vector direction matters more than length |
| Image feature matching | Euclidean distance | Feature intensity needs to be considered |
| Recommendation systems | Dot product | Considers both user preference intensity and item attributes |
| Face recognition | Cosine similarity | Focuses on feature direction, unaffected by lighting |
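A quick sketch of why the choice matters: two vectors pointing the same way but with different lengths are identical under cosine similarity, yet clearly separated under Euclidean distance.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])  # same direction as a, but twice the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(f"Cosine: {cosine:.2f}")        # 1.00 -- identical direction
print(f"Euclidean: {euclidean:.2f}")  # 2.24 -- magnitudes differ
```

This is exactly the text-search case from the table: a short and a long document about the same topic should count as similar, so cosine is the safer default there.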

3.3 Index Structures

Why Indices are Needed

Without an index, a vector database must compute the distance between the query and every single stored vector, giving a time complexity of O(N), which is unacceptable for massive data.

# Linear search without index
def linear_search(query_vector, all_vectors, top_k=10):
    distances = []
    for vector in all_vectors:
        dist = euclidean_distance(query_vector, vector)
        distances.append((dist, vector))

    distances.sort(key=lambda x: x[0])
    return distances[:top_k]

# Problem: 1 million records = 1 million distance calculations = several seconds
# Solution: Build an index to reduce calculations to hundreds
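Even before adding an index, the Python loop above can be replaced by a single vectorized pass with NumPy broadcasting. It is still O(N), but far faster per scan; this is a generic sketch, not part of any particular database:

```python
import numpy as np

def brute_force_topk(query, all_vectors, top_k=10):
    """Vectorized linear scan over an (N, d) matrix of vectors."""
    dists = np.linalg.norm(all_vectors - query, axis=1)  # broadcast over rows
    idx = np.argsort(dists)[:top_k]                       # nearest first
    return idx, dists[idx]

data = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
idx, dists = brute_force_topk(np.array([0.0, 0.0]), data, top_k=2)
print(idx)  # [0 2]
```

Vectorization buys a constant factor; only an index structure changes the asymptotic cost.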

Approximate Nearest Neighbor Search (ANN)

Approximate Nearest Neighbor (ANN) search is the core technology of vector databases: by building index structures, it dramatically improves search speed at the cost of an acceptable loss in precision.

flowchart LR
    subgraph Exact vs Approximate
        A[Exact Search] -->|100% accurate| B["Time: O(N)
unusable when N=1 billion"]
        C[Approximate Search] -->|99% accurate| D["Time: O(log N)
millisecond response"]
    end

    style A fill:#ffccbc
    style C fill:#c8e6c9
    style B fill:#ffccbc
    style D fill:#c8e6c9

Mainstream Index Algorithms

1. HNSW (Hierarchical Navigable Small World)

HNSW (Hierarchical Navigable Small World) is currently the most popular high-performance vector index algorithm.

Core Ideas:

1. Build a multi-layer graph structure with sparse upper layers and dense lower layers
2. During a query, start from the top layer and locate quickly via greedy search
3. Use "small world" properties to keep search paths short

HNSW Index Structure:

graph TB
    subgraph Layer 2 [L2 Layer - Top Navigation]
        L2_1(( )) --> L2_2(( ))
        L2_2(( )) --> L2_3(( ))
        L2_3(( )) --> L2_1(( ))
    end

    subgraph Layer 1 [L1 Layer - Middle]
        L1_1(( )) --> L1_2(( ))
        L1_2(( )) --> L1_3(( ))
        L1_3(( )) --> L1_4(( ))
        L1_4(( )) --> L1_1(( ))
        L1_2(( )) --> L1_5(( ))
    end

    subgraph Layer 0 [L0 Layer - Bottom Data]
        L0_1(( )) --> L0_2(( ))
        L0_2(( )) --> L0_3(( ))
        L0_3(( )) --> L0_4(( ))
        L0_4(( )) --> L0_5(( ))
        L0_5(( )) --> L0_6(( ))
        L0_6(( )) --> L0_7(( ))
        L0_7(( )) --> L0_8(( ))
        L0_1(( )) --> L0_4(( ))
        L0_3(( )) --> L0_6(( ))
        L0_5(( )) --> L0_8(( ))
    end

    subgraph Query Path
        Q((Query)) -->|1. Enter L2| L2_2
        L2_2 -->|2. Descend to L1| L1_3
        L1_3 -->|3. Descend to L0| L0_4
        L0_4 -->|4. Local search| L0_5
    end

    style L2_1 fill:#90caf9
    style L2_2 fill:#90caf9
    style L2_3 fill:#90caf9

    style Q fill:#ff5722,color:#fff

HNSW Search Process Example:

# HNSW search pseudocode (illustrative; `index`, `distance`, and
# `layer0_search` are assumed helpers, not a real library API)
def hnsw_search(query_vector, index, top_k=10):
    # 1. Start from the entry point on the highest layer
    entry_point = index.get_entry_point(index.max_layer)

    # 2. On each upper layer, greedily move to the neighbor closest to
    #    the query; descend one layer once no neighbor improves
    for layer in range(index.max_layer, 0, -1):
        improved = True
        while improved:
            improved = False
            for neighbor in index.get_neighbors(entry_point, layer):
                if distance(query_vector, neighbor) < distance(query_vector, entry_point):
                    entry_point = neighbor
                    improved = True

    # 3. Fine-grained search at the bottom layer (layer 0)
    return layer0_search(query_vector, entry_point, top_k)
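The per-layer greedy step can be made concrete with a runnable single-layer sketch, where a plain adjacency list stands in for one HNSW layer:

```python
import numpy as np

def greedy_walk(query, vectors, neighbors, entry):
    """Move to whichever neighbor is closest to the query;
    stop when no neighbor improves on the current node."""
    current = entry
    best = np.linalg.norm(vectors[current] - query)
    while True:
        improved = False
        for n in neighbors[current]:
            d = np.linalg.norm(vectors[n] - query)
            if d < best:
                current, best = n, d
                improved = True
        if not improved:
            return current, best

# A tiny chain graph: points on a line, each linked to its neighbors
points = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
links = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

node, dist = greedy_walk(np.array([3.2]), points, links, entry=0)
print(node)  # 3
```

HNSW runs this walk on every layer, using the coarse upper layers to land near the answer before the fine-grained search at layer 0.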

HNSW Parameter Description:

| Parameter | Description | Recommended Value |
| --- | --- | --- |
| M | Maximum connections per node | 16-64 |
| efConstruction | Dynamic candidate list size during construction | 100-400 |
| efSearch | Dynamic candidate list size during search | 100-500 |
| max_level | Maximum number of layers | log(N) |

2. IVF (Inverted File Index)

IVF (Inverted File Index) is a clustering-based index method.

Core Ideas:

1. Cluster all vectors (e.g., with K-Means)
2. Maintain an inverted list for each cluster
3. During a query, search only the target vector's cluster and neighboring clusters

graph TD
    subgraph Cluster Space
        C1["Cluster Center 1"]
        C2["Cluster Center 2"]
        C3["Cluster Center 3"]

        C1 -.->|Inverted List| I1["Vectors 1,5,8,12..."]
        C2 -.->|Inverted List| I2["Vectors 2,3,7,15..."]
        C3 -.->|Inverted List| I3["Vectors 4,6,9,11..."]
    end

    subgraph Query Process
        Q["Query Vector"] -->|Find nearest center| C2
        C2 -->|Search this cluster| I2
        I2 --> R["Candidate vectors"]
    end

    style C1 fill:#e1f5fe
    style C2 fill:#c8e6c9
    style C3 fill:#fff3e0
    style I2 fill:#c8e6c9
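The clustering and query steps above can be sketched in a few lines. The cluster centers are given up front here to keep the example short; a real system would learn them with K-Means, and `nprobe` (the number of clusters scanned) is the usual speed/recall knob:

```python
import numpy as np

def build_ivf(vectors, centroids):
    """Assign every vector to its nearest centroid (inverted lists)."""
    lists = {i: [] for i in range(len(centroids))}
    for idx, v in enumerate(vectors):
        c = int(np.argmin(np.linalg.norm(centroids - v, axis=1)))
        lists[c].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Scan only the nprobe clusters whose centers are nearest the query."""
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return min(candidates, key=lambda i: np.linalg.norm(vectors[i] - query))

data = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
lists = build_ivf(data, centers)
print(ivf_search(np.array([5.08, 5.0]), data, centers, lists))  # 3
```

With `nprobe=1` only one cluster (2 of the 4 vectors) is scanned; raising `nprobe` trades speed for recall near cluster boundaries.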

3. PQ (Product Quantization) - Lossy Compression

PQ (Product Quantization) divides high-dimensional vectors into multiple sub-vectors and performs cluster compression separately.

graph TD
    subgraph Original Vector
        O["Original Vector [0.23, -0.45, 0.89, -0.12, 0.56, -0.78]"]
    end

    subgraph Segmentation
        S1["Segment 1: [0.23, -0.45]"]
        S2["Segment 2: [0.89, -0.12]"]
        S3["Segment 3: [0.56, -0.78]"]
    end

    subgraph Quantization
        Q1["→ Cluster Center ID: 42"]
        Q2["→ Cluster Center ID: 17"]
        Q3["→ Cluster Center ID: 89"]
    end

    O --> S1
    O --> S2
    O --> S3
    S1 --> Q1
    S2 --> Q2
    S3 --> Q3

    Q1 --> R["Compressed: [42, 17, 89]"]
    Q2 --> R
    Q3 --> R

    style O fill:#e1f5fe
    style R fill:#c8e6c9
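A minimal encode/decode sketch of the same idea, with tiny hand-made codebooks (real systems learn, say, 256 centroids per segment with K-Means, so each segment compresses to a single byte):

```python
import numpy as np

def pq_encode(vector, codebooks):
    """Split the vector into equal segments; store only the ID of the
    nearest centroid in each segment's codebook."""
    seg = len(vector) // len(codebooks)
    return [int(np.argmin(np.linalg.norm(cb - vector[i * seg:(i + 1) * seg], axis=1)))
            for i, cb in enumerate(codebooks)]

def pq_decode(codes, codebooks):
    """Lossy reconstruction: concatenate the chosen centroids."""
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])

# Three 2-dim segments, two centroids per codebook (toy sizes)
codebooks = [
    np.array([[0.0, 0.0], [0.2, -0.4]]),
    np.array([[0.9, -0.1], [0.0, 0.0]]),
    np.array([[0.0, 0.0], [0.5, -0.8]]),
]
v = np.array([0.23, -0.45, 0.89, -0.12, 0.56, -0.78])
codes = pq_encode(v, codebooks)
print(codes)  # [1, 0, 1]
print(pq_decode(codes, codebooks))  # close to v, but not exact
```

The decoded vector only approximates the original, which is exactly the "lossy compression" trade-off named in the heading.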

Index Algorithm Comparison

| Index Algorithm | Query Speed | Precision | Memory Usage | Use Case |
| --- | --- | --- | --- | --- |
| HNSW | Extremely fast | High | Medium | General scenarios; recommended first choice |
| IVF | Fast | Medium-High | Low | Large-scale data |
| PQ | Fast | Medium | Very low | Memory-constrained scenarios |
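A rough back-of-the-envelope for the memory column. The figures assume 1 million 768-dimensional float32 vectors and 8-byte PQ codes; both numbers are illustrative, not benchmarks:

```python
n, d = 1_000_000, 768

raw_bytes = n * d * 4  # float32 vectors: ~3.1 GB
pq_bytes = n * 8       # PQ with 8 sub-segments at 1 byte each: ~8 MB

print(f"Raw: {raw_bytes / 1e9:.1f} GB")  # 3.1 GB
print(f"PQ:  {pq_bytes / 1e6:.1f} MB")   # 8.0 MB
```

A few hundred-fold reduction like this is why PQ is the default answer in memory-constrained scenarios, despite the precision loss.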

Nearest Neighbor Search Diagram

graph TD
    subgraph 2D Vector Space Example
        Q["Query Vector Q"]
        A["A [0.2, 0.3]"]
        B["B [0.8, 0.7]"]
        C["C [0.5, 0.9]"]
        D["D [0.9, 0.2]"]
        E["E [0.3, 0.8]"]
    end

    subgraph Distance Calculation
        Q -->|Distance to A| DA["0.36"]
        Q -->|Distance to B| DB["0.79"]
        Q -->|Distance to C| DC["0.28"]
        Q -->|Distance to D| DD["0.70"]
        Q -->|Distance to E| DE["0.22"]
    end

    subgraph Sorting Results
        DE -->|Nearest| R1["1. E (0.22)"]
        DC -->|Second nearest| R2["2. C (0.28)"]
        DA -->|Third| R3["3. A (0.36)"]
    end

    style Q fill:#ff5722,color:#fff
    style E fill:#4caf50
    style C fill:#4caf50

ANN vs KNN

graph LR
    subgraph KNN Exact Search
        A[K=5] -->|Find| B["5 exact nearest neighbors
100% accurate
But slow"]
    end

    subgraph ANN Approximate Search
        C[K=5] -->|Find| D["5 approximate nearest neighbors
99% accurate
But fast"]
    end

    style A fill:#e1f5fe
    style C fill:#c8e6c9
    style B fill:#ffccbc
    style D fill:#c8e6c9

| Comparison | KNN | ANN |
| --- | --- | --- |
| Accuracy | 100% | 95%-99% |
| Speed | Slow | Fast |
| Applicable data volume | <100K | 100K-1B+ |
| Recall | 1.0 | 0.95-0.99 |
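Recall in the last row can be computed directly: it is the fraction of the true top-k neighbors that the approximate search actually returned. A generic sketch, independent of any particular database:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k found by the approximate search."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# ANN returned 4 of the 5 true nearest neighbors
print(recall_at_k([1, 2, 3, 5, 8], [1, 2, 3, 4, 5]))  # 0.8
```

In practice this is measured against a brute-force ground truth on a sample of queries when tuning parameters such as efSearch or nprobe.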

In practical applications, we often need to add additional filter conditions during search, such as "only search articles published in 2024".

# Filtered search example
results = vector_db.search(
    query_vector=query_embedding,
    top_k=10,
    filter={
        "category": "technology",  # Only search tech category
        "year": {"$gte": 2024},     # Published >= 2024
        "is_published": True        # Already published
    }
)

Hybrid Search Architecture

flowchart TD
    subgraph Query Processing
        Q["Query Vector"] -->|1| FE[Filter Expression]