Core Principles¶
Vector Representations, Distance Calculations, and Index Structures
3.1 Vector Representations¶
What is a Vector¶
A Vector is a fundamental mathematical concept that can be understood as an ordered array of numbers with "direction and magnitude". In vector databases, each data item is represented as a high-dimensional numerical array.
Plain Language Explanation: Imagine you are describing the features of an article:

- The first number represents "technology content" (0.8 means very high)
- The second number represents "entertainment value" (0.2 means not very entertaining)
- The third number represents "political sensitivity" (-0.3 means slightly negative)
- ...
This article can be represented as a vector like [0.8, 0.2, -0.3, ...].
# Vector example: vector representations of different texts
text_vectors = {
    "The weather is nice today": [0.1, 0.9, 0.3, -0.1],
    "Artificial intelligence will change the world": [0.8, 0.2, 0.7, 0.5],
    "Watched an exciting movie": [0.2, 0.8, 0.4, 0.3],
}
# Dimension: each vector has 4 numbers, so it's a 4-dimensional vector
# [0.1, 0.9, 0.3, -0.1] # 4-dimensional vector
Intuitive Understanding of Vectors¶
graph TD
subgraph Low-Dimensional to High-Dimensional Mapping
A[1D: Point on a line] --> B[2D: Point on a plane]
B --> C[3D: Point in space]
C --> D[High-D: Cannot be visualized
But mathematically consistent]
end
subgraph Vector Representation Examples
E["Text: 'Apple'"] --> F["Vector: [0.8, 0.1, 0.3, ...]"]
G["Image: Cat photo"] --> H["Vector: [0.2, 0.7, 0.4, ...]"]
H --> J["All data can be
represented as numerical vectors"]
end
style D fill:#e1f5fe
style J fill:#c8e6c9
Mathematical Properties of Vectors¶
| Property | Description | Example |
|---|---|---|
| Direction | Vectors have direction, different directions represent different semantics | [1,0] and [0,1] have completely different directions |
| Distance | Vectors with closer directions are more semantically similar | [0.9, 0.1] and [0.8, 0.2] are very close |
| Computability | Arithmetic operations can be performed between vectors | Famous example: king - man + woman ≈ queen |
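The "computability" row can be demonstrated with hand-picked toy vectors. The values below are invented purely for illustration; real word embeddings are learned by a model and have hundreds of dimensions, where this analogy holds only approximately:

```python
import numpy as np

# Toy 3-dimensional "embeddings" with hand-picked values.
# Dimensions (illustrative): [royalty, masculinity, femininity]
king = np.array([0.9, 0.8, 0.1])
man = np.array([0.5, 0.8, 0.1])
woman = np.array([0.5, 0.1, 0.9])
queen = np.array([0.9, 0.1, 0.9])

# king - man removes "masculinity", + woman adds "femininity"
result = king - man + woman
print(np.allclose(result, queen))  # True for these hand-picked values
```

With real embeddings the result vector is not exactly equal to "queen"; instead, "queen" is typically the *nearest* vector to the result, which is why this is usually stated with ≈ rather than =.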
3.2 Distance Calculation¶
What is Vector Distance¶
Vector distance is a metric for measuring the "degree of similarity" between two vectors. The smaller the distance, the more similar the two vectors are.
Common Distance Metrics¶
1. Euclidean Distance (L2 Distance)¶
Definition: The straight-line distance between two points, calculated as the square root of the sum of squared differences across all dimensions.
Plain Language Understanding: Like calculating the straight-line distance between two points on a map.
import numpy as np
def euclidean_distance(v1, v2):
    """Calculate Euclidean distance"""
    return np.sqrt(np.sum((v1 - v2) ** 2))
# Example
vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 6, 3])
distance = euclidean_distance(vector_a, vector_b)
print(f"Euclidean distance: {distance:.2f}")  # Output: 5.00
Geometric Interpretation:
graph TD
subgraph Euclidean Distance in 2D Space
A["Point A (1,2)"] --> B["Point B (4,6)"]
A --> C["(4,2)"]
C --> B
style A fill:#e1f5fe
style B fill:#ffccbc
style C fill:#b3e5fc
end
NOTE["Euclidean distance = √[(4-1)² + (6-2)²] = √[9 + 16] = √25 = 5"]
Use Cases:

- Image/audio feature vectors
- When the importance of each dimension is equal
- Scenarios where vector magnitude needs to be considered
2. Cosine Similarity¶
Definition: The cosine of the angle between two vectors, with a range of [-1, 1].
Plain Language Understanding: Comparing whether two vectors point in the same direction, regardless of their length.
import numpy as np
def cosine_similarity(v1, v2):
    """Calculate cosine similarity"""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)
# Example
vector_a = np.array([1, 1, 0])
vector_b = np.array([1, 1, 1])
vector_c = np.array([-1, -1, 0])
print(f"a vs b similarity: {cosine_similarity(vector_a, vector_b):.3f}") # 0.816
print(f"a vs c similarity: {cosine_similarity(vector_a, vector_c):.3f}") # -1.000
Geometric Interpretation:
graph TD
subgraph Cosine Similarity Diagram
O["Origin"]
A["A [1,1]"]
B["B [1,1,1]"]
C["C [-1,-1]"]
O --> A
O --> B
O --> C
AA["Angle 45°
Similarity 0.816"]
CC["Angle 180°
Similarity -1.0"]
end
style A fill:#b3e5fc
style B fill:#c8e6c9
style C fill:#ffccbc
Use Cases:

- Text semantic similarity (most commonly used!)
- Document topic similarity
- Ignoring vector magnitude, focusing only on direction
3. Dot Product Similarity¶
Definition: The sum of products of corresponding elements from two vectors.
def dot_product_similarity(v1, v2):
    """Calculate dot product"""
    return np.dot(v1, v2)
# Example
vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])
dot = dot_product_similarity(vector_a, vector_b)
print(f"Dot product: {dot}") # Output: 32 (1*4 + 2*5 + 3*6)
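The three metrics are closely related: the dot product equals the cosine similarity multiplied by the two vector norms, so after L2-normalizing the vectors, dot product and cosine similarity give identical values. A quick check using the vectors from the example above:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# dot(a, b) = cos(a, b) * |a| * |b|
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(np.dot(a, b), cos * np.linalg.norm(a) * np.linalg.norm(b)))  # True

# After L2 normalization, dot product and cosine similarity coincide
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(np.isclose(np.dot(a_unit, b_unit), cos))  # True
```

This is why many vector databases ask you to normalize embeddings before insertion: it lets them use the cheaper dot product while returning cosine-similarity rankings.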
Distance Metric Comparison¶
graph LR
subgraph Distance Metric Comparison
A[Vector A] -->|Same direction| B[Vector B]
A -->|Opposite direction| C[Vector C]
A -->|Perpendicular| D[Vector D]
end
subgraph Similarity Values
B --> E["Cosine: 0.95 (very similar)"]
C --> F["Cosine: -0.95 (opposite)"]
D --> G["Cosine: 0 (orthogonal)"]
end
subgraph Distance Values
B --> H["Euclidean: 0.3 (very close)"]
C --> I["Euclidean: 2.5 (far)"]
D --> J["Euclidean: 1.4 (moderate)"]
end
style E fill:#c8e6c9
style F fill:#ffccbc
style G fill:#fff3e0
How to Choose Distance Metrics¶
| Scenario | Recommended Method | Reason |
|---|---|---|
| Text semantic search | Cosine similarity | Text vector direction is more important than length |
| Image feature matching | Euclidean distance | Need to consider feature intensity |
| Recommendation systems | Dot product | Consider both user preference intensity and item attributes |
| Face recognition | Cosine similarity | Focus on feature direction, unaffected by lighting |
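A useful fact behind this table: for unit-length vectors, squared Euclidean distance and cosine similarity are interchangeable, since ||a - b||² = 2 - 2·cos(a, b). A small numerical check (random vectors used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)  # normalize to unit length
b /= np.linalg.norm(b)

cos = np.dot(a, b)            # cosine similarity of unit vectors
l2_sq = np.sum((a - b) ** 2)  # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(l2_sq, 2 - 2 * cos))  # True
```

So if your vectors are normalized, choosing between L2 and cosine changes the distance values but not the ranking of search results.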
3.3 Index Structures¶
Why Indices are Needed¶
Without an index, a vector database must compute the distance between the query and every stored vector, giving O(N) time complexity per query, which is unacceptable for massive datasets.
# Linear search without index
def linear_search(query_vector, all_vectors, top_k=10):
    distances = []
    for vector in all_vectors:
        dist = euclidean_distance(query_vector, vector)
        distances.append((dist, vector))
    distances.sort(key=lambda x: x[0])
    return distances[:top_k]
# Problem: 1 million records = 1 million distance calculations = several seconds
# Solution: Build an index to reduce calculations to hundreds
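Before reaching for an index, the Python loop above can at least be vectorized with NumPy, computing all N distances in one batched operation. This is still O(N), so it does not replace an index at billion scale, but it shows the brute-force baseline that index structures are measured against (the data here is random, for illustration only):

```python
import numpy as np

def linear_search_vectorized(query_vector, all_vectors, top_k=10):
    """Brute-force search over an (N, d) matrix in one batched NumPy pass."""
    diffs = all_vectors - query_vector           # broadcast to (N, d)
    dists = np.sqrt(np.sum(diffs ** 2, axis=1))  # (N,) Euclidean distances
    idx = np.argsort(dists)[:top_k]              # ids of the top_k nearest
    return idx, dists[idx]

# Example: 10,000 random 4-dimensional vectors
rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 4))
query = rng.normal(size=4)
idx, dists = linear_search_vectorized(query, vectors, top_k=5)
print(idx, dists)
```

Vectorization buys a large constant-factor speedup over the Python loop, but the index structures below change the asymptotics, which is what matters at scale.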
Approximate Nearest Neighbor Search (ANN)¶
Approximate Nearest Neighbor (ANN) search is the core technology of vector databases. By building index structures, it trades a small, controlled loss of precision for a dramatic improvement in search speed.
flowchart LR
subgraph Exact vs Approximate
A[Exact Search] -->|100% accurate| B["Time: O(N)
unusable when N=1 billion"]
C[Approximate Search] -->|99% accurate| D["Time: O(log N)
millisecond response"]
end
style A fill:#ffccbc
style C fill:#c8e6c9
style B fill:#ffccbc
style D fill:#c8e6c9
Mainstream Index Algorithms¶
1. HNSW (Hierarchical Navigable Small World)¶
HNSW (Hierarchical Navigable Small World) is currently the most popular high-performance vector index algorithm.
Core Ideas:

1. Build a multi-layer graph structure with sparse upper layers and dense lower layers
2. During a query, start from the uppermost layer and quickly locate the target region through "greedy search"
3. Use "small world" characteristics to ensure short search paths
HNSW Index Structure:
graph TB
subgraph Layer 2 [L2 Layer - Top Navigation]
L2_1(( )) --> L2_2(( ))
L2_2(( )) --> L2_3(( ))
L2_3(( )) --> L2_1(( ))
end
subgraph Layer 1 [L1 Layer - Middle]
L1_1(( )) --> L1_2(( ))
L1_2(( )) --> L1_3(( ))
L1_3(( )) --> L1_4(( ))
L1_4(( )) --> L1_1(( ))
L1_2(( )) --> L1_5(( ))
end
subgraph Layer 0 [L0 Layer - Bottom Data]
L0_1(( )) --> L0_2(( ))
L0_2(( )) --> L0_3(( ))
L0_3(( )) --> L0_4(( ))
L0_4(( )) --> L0_5(( ))
L0_5(( )) --> L0_6(( ))
L0_6(( )) --> L0_7(( ))
L0_7(( )) --> L0_8(( ))
L0_1(( )) --> L0_4(( ))
L0_3(( )) --> L0_6(( ))
L0_5(( )) --> L0_8(( ))
end
subgraph Query Path
Q((Query)) -->|1. Enter L2| L2_2
L2_2 -->|2. Descend to L1| L1_3
L1_3 -->|3. Descend to L0| L0_4
L0_4 -->|4. Local search| L0_5
end
style L2_1 fill:#90caf9
style L2_2 fill:#90caf9
style L2_3 fill:#90caf9
style Q fill:#ff5722,color:#fff
HNSW Search Process Example:
# HNSW search pseudocode (simplified)
def hnsw_search(query_vector, index, top_k=10):
    # 1. Enter the graph at the highest layer
    current_layer = index.max_layer
    entry_point = index.get_entry_point(current_layer)
    # 2. Greedy search within each upper layer until no closer
    #    neighbor exists, then descend one layer
    while current_layer > 0:
        improved = True
        while improved:
            improved = False
            neighbors = index.get_neighbors(entry_point, current_layer)
            best_next = find_nearest(query_vector, neighbors)
            if distance(query_vector, best_next) < distance(query_vector, entry_point):
                entry_point = best_next
                improved = True
        current_layer -= 1
    # 3. Fine search at the bottom layer (layer 0) using the efSearch candidate list
    results = layer0_search(query_vector, entry_point, top_k)
    return results
HNSW Parameter Description:
| Parameter | Description | Recommended Value |
|---|---|---|
| M | Maximum connections per node | 16-64 |
| efConstruction | Dynamic list size during construction | 100-400 |
| efSearch | Dynamic list size during search | 100-500 |
| max_level | Maximum number of layers | log(N) |
2. IVF (Inverted File Index)¶
IVF (Inverted File Index) is a clustering-based index method.
Core Ideas:

1. Cluster all vectors (e.g., with K-Means)
2. Maintain an inverted list for each cluster
3. During a query, only search the target vector's cluster and neighboring clusters
graph TD
subgraph Cluster Space
C1["Cluster Center 1"]
C2["Cluster Center 2"]
C3["Cluster Center 3"]
C1 -.->|Inverted List| I1["Vectors 1,5,8,12..."]
C2 -.->|Inverted List| I2["Vectors 2,3,7,15..."]
C3 -.->|Inverted List| I3["Vectors 4,6,9,11..."]
end
subgraph Query Process
Q["Query Vector"] -->|Find nearest center| C2
C2 -->|Search this cluster| I2
I2 --> R["Candidate vectors"]
end
style C1 fill:#e1f5fe
style C2 fill:#c8e6c9
style C3 fill:#fff3e0
style I2 fill:#c8e6c9
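The three steps above can be sketched in plain NumPy. This is a toy illustration, not a production IVF implementation: the mini k-means loop, the cluster count, and the `nprobe` value (how many nearby clusters to scan) are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))  # 1000 random 8-dim vectors

# 1. Cluster all vectors (a few iterations of plain k-means)
n_clusters = 10
centroids = data[rng.choice(len(data), n_clusters, replace=False)]
for _ in range(10):
    assign = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
    for c in range(n_clusters):
        if np.any(assign == c):
            centroids[c] = data[assign == c].mean(axis=0)

# 2. Build an inverted list per cluster: centroid id -> vector ids
inverted = {c: np.where(assign == c)[0] for c in range(n_clusters)}

# 3. Query: scan only the nprobe nearest clusters instead of all N vectors
def ivf_search(query, nprobe=2, top_k=5):
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([inverted[c] for c in nearest])
    dists = np.linalg.norm(data[cand] - query, axis=1)
    order = np.argsort(dists)[:top_k]
    return cand[order], dists[order]

ids, dists = ivf_search(rng.normal(size=8))
print(ids, dists)
```

Note the precision trade-off: if the true nearest neighbor falls in a cluster that is not probed, it is missed, which is why increasing `nprobe` raises recall at the cost of speed.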
3. PQ (Product Quantization) - Lossy Compression¶
PQ (Product Quantization) divides high-dimensional vectors into multiple sub-vectors and performs cluster compression separately.
graph TD
subgraph Original Vector
O["Original Vector [0.23, -0.45, 0.89, -0.12, 0.56, -0.78]"]
end
subgraph Segmentation
S1["Segment 1: [0.23, -0.45]"]
S2["Segment 2: [0.89, -0.12]"]
S3["Segment 3: [0.56, -0.78]"]
end
subgraph Quantization
Q1["→ Cluster Center ID: 42"]
Q2["→ Cluster Center ID: 17"]
Q3["→ Cluster Center ID: 89"]
end
O --> S1
O --> S2
O --> S3
S1 --> Q1
S2 --> Q2
S3 --> Q3
Q1 --> R["Compressed: [42, 17, 89]"]
Q2 --> R
Q3 --> R
style O fill:#e1f5fe
style R fill:#c8e6c9
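The segmentation and quantization steps above can be sketched as follows. This is a toy illustration: real PQ trains each per-segment codebook with k-means, whereas here random centroids stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_segments, n_centroids = 6, 3, 256  # 3 segments of 2 dims each
seg_dim = dim // n_segments

# Stand-in codebooks: one table of 256 centroids per segment
# (in real PQ these come from k-means on training data)
codebooks = rng.normal(size=(n_segments, n_centroids, seg_dim))

def pq_encode(vector):
    """Replace each segment by the id of its nearest codebook centroid."""
    codes = []
    for s in range(n_segments):
        seg = vector[s * seg_dim:(s + 1) * seg_dim]
        codes.append(np.argmin(np.linalg.norm(codebooks[s] - seg, axis=1)))
    return np.array(codes, dtype=np.uint8)  # six floats compressed to 3 bytes

def pq_decode(codes):
    """Approximate (lossy) reconstruction from the centroid ids."""
    return np.concatenate([codebooks[s][codes[s]] for s in range(n_segments)])

v = np.array([0.23, -0.45, 0.89, -0.12, 0.56, -0.78])
codes = pq_encode(v)
print(codes, np.linalg.norm(v - pq_decode(codes)))
```

The compression is what matters: 6 float32 values (24 bytes) become 3 one-byte codes, and the reconstruction error is the "lossy" part of the trade-off.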
Index Algorithm Comparison¶
| Index Algorithm | Query Speed | Precision | Memory Usage | Use Case |
|---|---|---|---|---|
| HNSW | Extremely fast | High | Medium | General scenarios, recommended first choice |
| IVF | Fast | Medium-High | Low | Large-scale data |
| PQ | Fast | Medium | Very low | Memory-constrained scenarios |
3.4 Nearest Neighbor Search¶
Nearest Neighbor Search Diagram¶
graph TD
subgraph 2D Vector Space Example
Q["Query Vector Q"]
A["A [0.2, 0.3]"]
B["B [0.8, 0.7]"]
C["C [0.5, 0.9]"]
D["D [0.9, 0.2]"]
E["E [0.3, 0.8]"]
end
subgraph Distance Calculation
Q -->|Distance to A| DA["0.36"]
Q -->|Distance to B| DB["0.79"]
Q -->|Distance to C| DC["0.28"]
Q -->|Distance to D| DD["0.70"]
Q -->|Distance to E| DE["0.22"]
end
subgraph Sorting Results
DE -->|Nearest| R1["1. E (0.22)"]
DC -->|Second nearest| R2["2. C (0.28)"]
DA -->|Third| R3["3. A (0.36)"]
end
style Q fill:#ff5722,color:#fff
style E fill:#4caf50
style C fill:#4caf50
ANN vs KNN¶
graph LR
subgraph KNN Exact Search
A[K=5] -->|Find| B["5 exact nearest neighbors
100% accurate
But slow"]
end
subgraph ANN Approximate Search
C[K=5] -->|Find| D["5 approximate nearest neighbors
99% accurate
But fast"]
end
style A fill:#e1f5fe
style C fill:#c8e6c9
style B fill:#ffccbc
style D fill:#c8e6c9
| Comparison | KNN | ANN |
|---|---|---|
| Accuracy | 100% | 95%-99% |
| Speed | Slow | Fast |
| Applicable data volume | <100K | 100K-1B+ |
| Recall | 1.0 | 0.95-0.99 |
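The recall figure in the last row is measured by comparing an ANN result set against the exact KNN ground truth. A minimal sketch with made-up result ids:

```python
def recall_at_k(exact_ids, ann_ids):
    """Fraction of the exact top-k that the ANN index also returned."""
    return len(set(exact_ids) & set(ann_ids)) / len(exact_ids)

exact = [3, 7, 12, 25, 40]   # ground-truth top-5 from brute-force KNN
approx = [3, 7, 12, 25, 99]  # top-5 returned by an ANN index
print(recall_at_k(exact, approx))  # 0.8
```

In practice this is averaged over many queries; recall@10 of 0.95-0.99 means that, on average, the ANN index returns 95%-99% of the true nearest neighbors.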
3.5 Filtering and Hybrid Search¶
What is Filtered Search¶
In practical applications, we often need to add additional filter conditions during search, such as "only search articles published in 2024".
# Filtered search example
results = vector_db.search(
    query_vector=query_embedding,
    top_k=10,
    filter={
        "category": "technology",  # Only search tech category
        "year": {"$gte": 2024},    # Published >= 2024
        "is_published": True       # Already published
    }
)
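The `vector_db.search` call above is a generic illustration; filter syntax differs between engines. Conceptually, one common strategy (pre-filtering) applies the metadata conditions first and only ranks the surviving vectors. A minimal sketch over a toy in-memory corpus (documents and field values invented for illustration):

```python
import numpy as np

# Toy corpus: each document carries a vector plus metadata
docs = [
    {"vec": np.array([0.8, 0.2]), "category": "technology", "year": 2024, "is_published": True},
    {"vec": np.array([0.7, 0.3]), "category": "technology", "year": 2023, "is_published": True},
    {"vec": np.array([0.1, 0.9]), "category": "sports", "year": 2024, "is_published": True},
    {"vec": np.array([0.9, 0.1]), "category": "technology", "year": 2024, "is_published": False},
]

def filtered_search(query, top_k=2):
    # 1. Pre-filter: apply the metadata conditions first
    candidates = [d for d in docs
                  if d["category"] == "technology"
                  and d["year"] >= 2024
                  and d["is_published"]]
    # 2. Rank only the surviving candidates by vector distance
    candidates.sort(key=lambda d: np.linalg.norm(d["vec"] - query))
    return candidates[:top_k]

results = filtered_search(np.array([0.8, 0.2]))
print([d["year"] for d in results])  # only published 2024 technology docs remain
```

The alternative, post-filtering, runs the vector search first and discards non-matching hits afterwards; it is simpler but can return fewer than `top_k` results when the filter is selective, which is why many engines integrate the filter into the index traversal itself.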
Hybrid Search Architecture¶
flowchart TD
subgraph Query Processing
Q["Query Vector"] -->|1| FE[Filter Expression]