Basic Concepts¶
Embeddings, Dimensions, Similarity Metrics
4.1 Vector Embedding¶
What is Vector Embedding¶
Vector Embedding is the process of converting discrete, high-dimensional raw data (such as text, images, audio) into continuous, low-dimensional dense vectors.
Core Objective: Make "semantically similar" data "close in distance" within the vector space.
# Vectorization example
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Text embedding
texts = ["I want to buy an iPhone", "iPhone has great camera", "The weather is nice today"]
embeddings = model.encode(texts)
print(f"Vector shape: {embeddings.shape}") # (3, 384) - 3 texts, each 384 dimensions
print(f"First text vector: {embeddings[0][:5]}...") # [0.12, -0.34, 0.56, ...]
Intuitive Understanding of Embeddings¶
graph TD
subgraph Embedding Process
A["Raw Data"] -->|Embedding Model| B["Dense Vector"]
end
subgraph Different Data Embeddings
B -->|Text| T["Text Vectors"]
B -->|Images| I["Image Vectors"]
B -->|Audio| S["Audio Vectors"]
B -->|Video| V["Video Vectors"]
end
subgraph Vector Space
T -->|"Close positions"| T1["'Dog' → [0.2, 0.8]"]
T -->|"Close positions"| T2["'Cat' → [0.2, 0.7]"]
T1 -->|"Far apart"| T3["'Car' → [0.9, 0.1]"]
end
style A fill:#e1f5fe
style B fill:#fff3e0
style T fill:#c8e6c9
style I fill:#c8e6c9
style S fill:#c8e6c9
style V fill:#c8e6c9
Key Characteristics of Embeddings¶
1. Semantic Preservation¶
Semantically similar content is close in the vector space:
# Semantic similarity → close vector distance
# (embed() is shorthand for encoding a single text, e.g. embed = model.encode;
#  cosine_similarity is defined in section 4.3)
result = cosine_similarity(
    embed("king"),   # [0.8, 0.2, 0.1, ...]
    embed("queen")   # [0.7, 0.3, 0.2, ...]
)
print(f"king vs queen: {result:.3f}")  # high, e.g. ~0.95 (very close)
result = cosine_similarity(
    embed("king"),
    embed("car")
)
print(f"king vs car: {result:.3f}")  # low, e.g. ~0.1 (far apart)
2. Computability¶
Mathematical operations can be performed between vectors:
# Classic example: king - man + woman ≈ queen
# This illustrates the arithmetic structure of the embedding space
# (the relation holds best for word embeddings such as Word2Vec)
king = embed("king")
man = embed("man")
woman = embed("woman")
# Vector operations
result = king - man + woman
# Verify similarity with queen
queen = embed("queen")
similarity = cosine_similarity(result, queen)
print(f"king - man + woman is close to queen: {similarity:.3f}") # ~0.8
3. Dimensionality Reduction¶
Converting high-dimensional sparse data into low-dimensional dense vectors:
| Original Data | One-Hot Encoding Dimension | Post-Embedding Dimension | Compression Ratio |
|---|---|---|---|
| English words | ~50,000 | 128-512 | 100-400x |
| Image pixels | 224×224×3 | 512-2048 | 60-300x |
| Document vocabulary | 10,000+ | 768-1536 | 10-15x |
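To make the table concrete, here is a quick back-of-the-envelope calculation for the first row (the vocabulary size and embedding width are the illustrative figures from the table, not measurements):

```python
# One-hot vs. dense embedding footprint for a 50,000-word vocabulary
# (figures are the illustrative ones from the table above)
vocab_size = 50_000   # one-hot: one slot per word, almost all zeros
embed_dim = 300       # dense embedding: every slot carries information

compression = vocab_size / embed_dim
print(f"Compression ratio: {compression:.0f}x")  # ~167x
```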
Mainstream Embedding Models¶
| Model | Developer | Vector Dimension | Features |
|---|---|---|---|
| Word2Vec | Google | 100-300 | Classic word embedding |
| BERT | Google | 768-1024 | Context-dependent |
| Sentence-BERT | UKPLab | 384-1024 | Sentence-level |
| OpenAI Embeddings | OpenAI | 1536 | API call |
| CLIP | OpenAI | 512-1024 | Multi-modal |
| text-embedding-3 | OpenAI | 1536-3072 | Latest generation (small/large) |
# Different embedding model comparison
from sentence_transformers import SentenceTransformer
# Lightweight model
light_model = SentenceTransformer('all-MiniLM-L6-v2')
light_dim = light_model.get_sentence_embedding_dimension() # 384
# High-quality model
quality_model = SentenceTransformer('all-mpnet-base-v2')
quality_dim = quality_model.get_sentence_embedding_dimension() # 768
print(f"Lightweight model dimension: {light_dim}, Quality model dimension: {quality_dim}")
4.2 Dimension¶
What is Dimension¶
Dimension refers to the number of numerical elements contained in a vector.
# One-dimensional vector
v1 = [0.5] # 1D
# Two-dimensional vector (point on a plane)
v2 = [0.3, 0.7] # 2D
# Three-dimensional vector (point in space)
v3 = [0.2, 0.5, 0.8] # 3D
# High-dimensional vector (common form in AI)
v384 = [0.1] * 384 # 384D
v1536 = [0.1] * 1536 # 1536D
Dimension and Information Content¶
graph LR
subgraph Dimension and Expressiveness
A["Low Dimension (2-50)"] -->|"Limited information
Weak expressiveness"| B["Can only capture simple features"]
C["Medium Dimension (100-500)"] -->|"Balanced choice
High cost-effectiveness"| D["Rich semantic information"]
E["High Dimension (1000+)"] -->|"Rich information
High storage cost"| F["Fine semantic differences"]
end
style A fill:#ffccbc
style C fill:#c8e6c9
style E fill:#e1f5fe
Practical Impact of Dimension¶
| Dimension Range | Typical Application | Storage (1M vectors) | Search Speed |
|---|---|---|---|
| 128 | Simple semantics | ~512MB | Extremely fast |
| 384 | General text | ~1.5GB | Fast |
| 768 | High-quality text | ~3GB | Medium |
| 1536 | Fine semantics | ~6GB | Relatively slow |
| 3072 | Highest quality | ~12GB | Slow |
# Calculate vector storage space
def calculate_storage(num_vectors, dimension, bytes_per_float=4):
"""Calculate storage space"""
total_bytes = num_vectors * dimension * bytes_per_float
return total_bytes / (1024**3) # Convert to GB
print(f"1M vectors x 384 dimensions = {calculate_storage(1_000_000, 384):.2f} GB")
print(f"1M vectors x 1536 dimensions = {calculate_storage(1_000_000, 1536):.2f} GB")
Curse of Dimensionality¶
When dimensions become very high, the discriminability of distances between vectors decreases, which is the "curse of dimensionality".
# Curse of dimensionality illustration
import numpy as np
def distance_concentration(dim, n_samples=2000, seed=0):
    """Measure how pairwise distances concentrate for random points in the unit hypercube"""
    rng = np.random.default_rng(seed)
    points = rng.uniform(-1, 1, (n_samples, dim))
    # Distances from the first point to all the others (vectorized)
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return dists.mean(), dists.std() / dists.mean()
dims = [2, 10, 50, 100, 500, 1000, 5000]
for d in dims:
    mean, rel_spread = distance_concentration(d)
    print(f"Dimension {d:4d}: mean distance {mean:7.3f}, relative spread {rel_spread:.3f}")
# The mean distance grows with dimension, but the relative spread shrinks:
# all points end up almost equally far apart, which makes it difficult
# to distinguish "similar" from "dissimilar"
graph TD
subgraph Curse of Dimensionality Illustration
A["Low-dimensional space"] -->|"Dense point distribution
Obvious distance differences"| B["Easy to distinguish similar/dissimilar"]
C["High-dimensional space"] -->|"Sparse point distribution
Distances converge"| D["Difficult to distinguish similar/dissimilar"]
end
style A fill:#c8e6c9
style B fill:#c8e6c9
style C fill:#ffccbc
style D fill:#ffccbc
4.3 Similarity Metrics¶
Similarity Metrics Comparison¶
graph TD
subgraph "Euclidean Distance (L2)"
L2_1["√[(a₁-b₁)² + (a₂-b₂)² + ...]"]
L2_2["Smaller values = more similar"]
L2_3["Range: [0, +∞)"]
end
subgraph "Cosine Similarity"
C_1["(A · B) / (|A| × |B|)"]
C_2["Larger values = more similar"]
C_3["Range: [-1, 1]"]
end
subgraph "Dot Product"
D_1["A₁B₁ + A₂B₂ + ..."]
D_2["Larger values = more similar"]
D_3["Range: (-∞, +∞)"]
end
subgraph "Manhattan Distance (L1)"
M_1["|a₁-b₁| + |a₂-b₂| + ..."]
M_2["Smaller values = more similar"]
M_3["Range: [0, +∞)"]
end
1. Euclidean Distance (L2 Distance)¶
Definition: The square root of the sum of squared differences across all dimensions.
import numpy as np
def euclidean_distance(a, b):
"""Euclidean distance"""
return np.linalg.norm(a - b)
# Example
a = np.array([1, 2, 3])
b = np.array([4, 6, 3])
dist = euclidean_distance(a, b)
print(f"Euclidean distance: {dist:.2f}") # 5.00 (√(3² + 4² + 0²))
# Geometric explanation: in 2D space, this is the straight-line distance between two points
2. Cosine Similarity¶
Definition: The cosine of the angle between two vectors.
def cosine_similarity(a, b):
"""Cosine similarity"""
dot = np.dot(a, b)
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
return dot / (norm_a * norm_b)
# Example
a = np.array([1, 1, 0])
b = np.array([1, 1, 1])
c = np.array([-1, -1, 0])
print(f"a vs b: {cosine_similarity(a, b):.3f}") # 0.816 (~35° angle)
print(f"a vs c: {cosine_similarity(a, c):.3f}") # -1.000 (180° angle)
3. Dot Product Similarity¶
Definition: Sum of products of corresponding elements.
def dot_product(a, b):
"""Dot product"""
return np.dot(a, b)
# Example
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = dot_product(a, b)
print(f"Dot product: {dot}") # 32 (1*4 + 2*5 + 3*6)
4. Relationship After Normalization¶
When vectors are L2 normalized, dot product is equivalent to cosine similarity:
def normalized_dot_product(a, b):
"""Normalized dot product = cosine similarity"""
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
return np.dot(a_norm, b_norm)
a = np.array([3, 4]) # Magnitude = 5
b = np.array([4, 3]) # Magnitude = 5
print(f"Cosine similarity: {cosine_similarity(a, b):.3f}") # 0.960
print(f"Normalized dot product: {normalized_dot_product(a, b):.3f}") # 0.960
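As a sanity check, the four metrics from the comparison diagram can be computed side by side on a single vector pair, using plain NumPy:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

l2 = np.linalg.norm(a - b)                                     # Euclidean distance
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity
dot = np.dot(a, b)                                             # dot product
l1 = np.sum(np.abs(a - b))                                     # Manhattan distance

print(f"L2: {l2:.3f}")       # 3.742 - nonzero, the magnitudes differ
print(f"Cosine: {cos:.3f}")  # 1.000 - identical direction
print(f"Dot: {dot:.1f}")     # 28.0
print(f"L1: {l1:.1f}")       # 6.0
```

Because `b` is just a scaled copy of `a`, cosine similarity treats them as identical while the distance metrics do not; that distinction is exactly what drives the metric choice in the next section.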
How to Choose Similarity Metrics¶
| Scenario | Recommended Metric | Reason |
|---|---|---|
| Text semantic search | Cosine similarity | Text vector direction is more important than length |
| Face recognition | Cosine similarity | Ignore lighting effects on vector length |
| Recommendation systems | Dot product | Need to consider user preference intensity |
| Image feature matching | Euclidean distance | Feature intensity itself has meaning |
| Music recommendation | Cosine similarity | Style direction is more important than popularity |
# Practical selection examples
# (assumes model is the SentenceTransformer from above and
#  corpus is a list of document strings)
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Scenario 1: Semantic text search
query = "Recommend a good laptop for programming"
doc_embeddings = model.encode(corpus)
query_embedding = model.encode(query)
# Use cosine similarity
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
top_k = np.argsort(similarities)[-5:][::-1]
# Scenario 2: Recommendation system
user_preference = user_embedding # User preference vector
item_features = item_embeddings # Item feature vectors
# Use dot product (considering intensity)
scores = np.dot(user_preference, item_features.T)
top_k = np.argsort(scores)[-5:][::-1]
4.4 Batch Processing and Real-Time Processing¶
Batch Vectorization¶
When processing large amounts of data, batch processing can significantly improve efficiency:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Single processing (slow)
texts = ["Text 1", "Text 2", "Text 3"]
for text in texts:
embedding = model.encode(text) # Encode each separately
# Batch processing (fast)
batch_embeddings = model.encode(texts, batch_size=32) # Encode all at once
print(f"Batch shape: {batch_embeddings.shape}") # (3, 384)
Batch Processing Performance Comparison¶
| Method | 100 texts | 10,000 texts | 1,000,000 texts |
|---|---|---|---|
| Single processing | 2 seconds | 200 seconds | 20,000 seconds |
| Batch processing (32) | 0.5 seconds | 8 seconds | 800 seconds |
| Speedup | 4x | 25x | 25x |
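For corpora too large to encode in a single call, a small chunking helper keeps memory bounded while preserving the batch speedup. This is a minimal sketch; `batched` is a hypothetical helper name, and in practice `model.encode` would be called once per chunk:

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

corpus = [f"document {i}" for i in range(10)]
batches = list(batched(corpus, 4))
print([len(b) for b in batches])  # [4, 4, 2]
# In practice: embeddings per chunk via model.encode(chunk, batch_size=32)
```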
4.5 Concept Quick Reference¶
graph TD
subgraph Core Concepts
E["Embedding"]
D["Dimension"]
S["Similarity Metric"]
I["Index"]
N["NNS (Nearest Neighbor Search)"]
ANN["ANN (Approximate Nearest Neighbor)"]
end
subgraph Relationships
E -->|Generates| V["Vector"]
V -->|Has| D
D -->|Affects| S
V -->|Builds| I
I -->|Accelerates| N
N -->|Approximated by| ANN
end
style E fill:#e1f5fe
style D fill:#fff3e0
style S fill:#c8e6c9
style I fill:#e1f5fe
style N fill:#fff3e0
style ANN fill:#c8e6c9
style V fill:#ffccbc
Key Term Definitions¶
| Term | Definition | Example |
|---|---|---|
| Vector Embedding | Technology that converts data into numerical vectors | embed("apple") → [0.8, 0.1, ...] |
| Dimension | Number of elements in a vector | 384D, 1536D |
| Similarity Metric | Method to measure similarity between vectors | Cosine similarity, Euclidean distance |
| Index | Data structure that accelerates retrieval | HNSW, IVF |
| Nearest Neighbor | The stored vector most similar to the query vector | K=1 returns the single nearest neighbor |
| Approximate Nearest Neighbor (ANN) | Approximate search that trades accuracy for speed | ~99% accuracy at ~10x speed |
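The NNS entry above can be made concrete with a brute-force search; real vector databases replace this exhaustive scan with an ANN index such as HNSW or IVF. A minimal sketch on random toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))  # toy database of 1000 stored vectors
query = rng.normal(size=64)

# Cosine similarity of the query against every stored vector
# (an exhaustive scan = exact nearest neighbor search)
sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
top_k = np.argsort(sims)[-3:][::-1]  # indices of the 3 nearest neighbors, best first
print(top_k.shape)  # (3,)
```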
Chapter Summary¶
Key Points:
- Vector Embedding:
  - Converts text, images, and other data into numerical vectors
  - Semantically similar content lies close together in vector space
  - Pre-trained models can quickly produce high-quality embeddings
- Dimension:
  - Higher dimension = stronger expressiveness, but higher storage cost
  - Balance quality against performance
  - Common dimensions: 384 (lightweight), 768 (standard), 1536 (high quality)
- Similarity Metrics:
  - Cosine similarity: the most common choice, suited to text semantic search
  - Euclidean distance: suited to scenarios where vector magnitude matters
  - Dot product: suited to recommendation systems that weigh preference intensity
- Performance Optimization:
  - Batch processing significantly improves vectorization throughput
  - Choose vector dimensions that balance quality against performance
Next Chapter Preview: Chapter 5: Typical Application Scenarios - Learn how to use vector databases in practical applications →