Basic Concepts

Embeddings, Dimensions, Similarity Metrics

4.1 Vector Embedding

What is Vector Embedding

Vector Embedding is the process of converting discrete, high-dimensional raw data (such as text, images, audio) into continuous, low-dimensional dense vectors.

Core Objective: Make "semantically similar" data "close in distance" within the vector space.

# Vectorization example
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Text embedding
texts = ["I want to buy an iPhone", "iPhone has great camera", "The weather is nice today"]
embeddings = model.encode(texts)

print(f"Vector shape: {embeddings.shape}")  # (3, 384) - 3 texts, each 384 dimensions
print(f"First text vector: {embeddings[0][:5]}...")  # [0.12, -0.34, 0.56, ...]

Intuitive Understanding of Embeddings

graph TD
    subgraph Embedding Process
        A["Raw Data"] -->|Embedding Model| B["High-Dimensional Vector"]
    end

    subgraph Different Data Embeddings
        B -->|Text| T["Text Vectors"]
        B -->|Images| I["Image Vectors"]
        B -->|Audio| S["Audio Vectors"]
        B -->|Video| V["Video Vectors"]
    end

    subgraph Vector Space
        T -->|"Close positions"| T1["'Dog' → [0.2, 0.8]"]
        T -->|"Close positions"| T2["'Cat' → [0.2, 0.7]"]
        T1 -->|"Far apart"| T3["'Car' → [0.9, 0.1]"]
    end

    style A fill:#e1f5fe
    style B fill:#fff3e0
    style T fill:#c8e6c9
    style I fill:#c8e6c9
    style S fill:#c8e6c9
    style V fill:#c8e6c9

Key Characteristics of Embeddings

1. Semantic Preservation

Semantically similar content is close in the vector space:

# Semantic similarity → close vector distance
# (embed and cosine_similarity are defined here so the snippet runs on its own)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def embed(text):
    return model.encode(text)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

result = cosine_similarity(embed("king"), embed("queen"))
print(f"king vs queen: {result:.3f}")  # high (semantically related)

result = cosine_similarity(embed("king"), embed("car"))
print(f"king vs car: {result:.3f}")  # noticeably lower (unrelated)

2. Computability

Mathematical operations can be performed between vectors:

# Famous example: king - man + woman ≈ queen
# This classic result comes from word-level embeddings such as Word2Vec;
# sentence-embedding models may not reproduce it as cleanly

king = embed("king")
man = embed("man")
woman = embed("woman")

# Vector arithmetic in the embedding space
result = king - man + woman

# Verify similarity with queen
queen = embed("queen")
similarity = cosine_similarity(result, queen)
print(f"king - man + woman is close to queen: {similarity:.3f}")  # typically high

3. Dimensionality Reduction

Converting high-dimensional sparse data into low-dimensional dense vectors:

| Original Data | One-Hot Encoding Dimension | Post-Embedding Dimension | Compression Ratio |
|---|---|---|---|
| English words | ~50,000 | 128-512 | 100-400x |
| Image pixels | 224×224×3 | 512-2048 | 60-300x |
| Document vocabulary | 10,000+ | 768-1536 | 10-15x |
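
To make the compression concrete, here is a small sketch comparing a one-hot vector against a dense embedding. The 50,000-word vocabulary and the 300-dimensional embedding are illustrative numbers, not taken from any specific model:

```python
import numpy as np

# One-hot encoding: one slot per vocabulary word, almost entirely zeros
vocab_size = 50_000
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[1234] = 1.0  # a single word occupies a single slot

# Dense embedding: the same word as a compact 300-dim vector
dense = np.random.rand(300).astype(np.float32)  # stand-in for a learned embedding

ratio = vocab_size / dense.shape[0]
print(f"Compression ratio: {ratio:.0f}x")  # 167x
```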

Mainstream Embedding Models

| Model | Developer | Vector Dimension | Features |
|---|---|---|---|
| Word2Vec | Google | 100-300 | Classic word embedding |
| BERT | Google | 768-1024 | Context-dependent |
| Sentence-BERT | UKPLab | 384-1024 | Sentence-level |
| OpenAI Embeddings | OpenAI | 1536 | API call |
| CLIP | OpenAI | 512-1024 | Multi-modal |
| text-embedding-3 | OpenAI | 3072 | Latest version |

# Different embedding model comparison
from sentence_transformers import SentenceTransformer

# Lightweight model
light_model = SentenceTransformer('all-MiniLM-L6-v2')
light_dim = light_model.get_sentence_embedding_dimension()  # 384

# High-quality model
quality_model = SentenceTransformer('all-mpnet-base-v2')
quality_dim = quality_model.get_sentence_embedding_dimension()  # 768

print(f"Lightweight model dimension: {light_dim}, Quality model dimension: {quality_dim}")

4.2 Dimension

What is Dimension

Dimension refers to the number of numerical elements contained in a vector.

# One-dimensional vector
v1 = [0.5]  # 1D

# Two-dimensional vector (point on a plane)
v2 = [0.3, 0.7]  # 2D

# Three-dimensional vector (point in space)
v3 = [0.2, 0.5, 0.8]  # 3D

# High-dimensional vector (common form in AI)
v384 = [0.1] * 384  # 384D
v1536 = [0.1] * 1536  # 1536D

Dimension and Information Content

graph LR
    subgraph Dimension and Expressiveness
        A["Low Dimension (2-50)"] -->|"Limited information, weak expressiveness"| B["Can only capture simple features"]
        C["Medium Dimension (100-500)"] -->|"Balanced choice, high cost-effectiveness"| D["Rich semantic information"]
        E["High Dimension (1000+)"] -->|"Rich information, high storage cost"| F["Fine semantic differences"]
    end

    style A fill:#ffccbc
    style C fill:#c8e6c9
    style E fill:#e1f5fe

Practical Impact of Dimension

| Dimension | Typical Application | Storage (1M vectors) | Search Speed |
|---|---|---|---|
| 128 | Simple semantics | ~512MB | Extremely fast |
| 384 | General text | ~1.5GB | Fast |
| 768 | High-quality text | ~3GB | Medium |
| 1536 | Fine semantics | ~6GB | Relatively slow |
| 3072 | Highest quality | ~12GB | Slow |

# Calculate vector storage space
def calculate_storage(num_vectors, dimension, bytes_per_float=4):
    """Calculate storage space"""
    total_bytes = num_vectors * dimension * bytes_per_float
    return total_bytes / (1024**3)  # Convert to GB

print(f"1M vectors x 384 dimensions = {calculate_storage(1_000_000, 384):.2f} GB")
print(f"1M vectors x 1536 dimensions = {calculate_storage(1_000_000, 1536):.2f} GB")

Curse of Dimensionality

When dimensions become very high, the discriminability of distances between vectors decreases, which is the "curse of dimensionality".

# Curse of dimensionality illustration
import numpy as np

def distance_stats(dim, n_samples=1000):
    """Mean and relative spread of pairwise distances between random points
    in the unit hypercube [-1, 1]^dim."""
    rng = np.random.default_rng(0)
    points = rng.uniform(-1, 1, (n_samples, dim))
    # Sample random point pairs and measure their distances
    i = rng.integers(0, n_samples, 5000)
    j = rng.integers(0, n_samples, 5000)
    mask = i != j  # skip pairs of a point with itself
    distances = np.linalg.norm(points[i[mask]] - points[j[mask]], axis=1)
    return distances.mean(), distances.std() / distances.mean()

dims = [2, 10, 50, 100, 500, 1000, 5000]
for d in dims:
    mean_dist, spread = distance_stats(d)
    print(f"Dimension {d:4d}: mean distance {mean_dist:.3f}, relative spread {spread:.3f}")

# As dimension grows, the relative spread (std/mean) of distances shrinks:
# all pairs become nearly equidistant, making "similar" hard to tell from "dissimilar"

graph TD
    subgraph Curse of Dimensionality Illustration
        A["Low-dimensional space"] -->|"Dense point distribution, obvious distance differences"| B["Easy to distinguish similar/dissimilar"]
        C["High-dimensional space"] -->|"Sparse point distribution, distances converge"| D["Difficult to distinguish similar/dissimilar"]
    end

    style A fill:#c8e6c9
    style B fill:#c8e6c9
    style C fill:#ffccbc
    style D fill:#ffccbc

4.3 Similarity Metrics

Similarity Metrics Comparison

graph TD
    subgraph Euclidean Distance L2
        L2_1["√[(a₁-b₁)² + (a₂-b₂)² + ...]"]
        L2_2["Smaller values = more similar"]
        L2_3["Range: [0, +∞)"]
    end

    subgraph Cosine Similarity
        C_1["(A · B) / (|A| × |B|)"]
        C_2["Larger values = more similar"]
        C_3["Range: [-1, 1]"]
    end

    subgraph Dot Product
        D_1["A₁B₁ + A₂B₂ + ..."]
        D_2["Larger values = more similar"]
        D_3["Range: (-∞, +∞)"]
    end

    subgraph Manhattan Distance L1
        M_1["|a₁-b₁| + |a₂-b₂| + ..."]
        M_2["Smaller values = more similar"]
        M_3["Range: [0, +∞)"]
    end
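
Before walking through each metric, here are all four computed side by side on one pair of vectors (Manhattan distance has no dedicated example below, so it is included here; the vectors are arbitrary illustrations):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

l2  = np.linalg.norm(a - b)                                    # Euclidean: sqrt(9 + 16 + 0) = 5.0
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine: 25 / (sqrt(14) * sqrt(61)) ≈ 0.856
dot = np.dot(a, b)                                             # dot product: 1*4 + 2*6 + 3*3 = 25
l1  = np.sum(np.abs(a - b))                                    # Manhattan: 3 + 4 + 0 = 7

print(f"L2={l2:.2f}  cosine={cos:.3f}  dot={dot:.0f}  L1={l1:.0f}")
```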

1. Euclidean Distance (L2 Distance)

Definition: The square root of the sum of squared differences across all dimensions.

import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance"""
    return np.linalg.norm(a - b)

# Example
a = np.array([1, 2, 3])
b = np.array([4, 6, 3])

dist = euclidean_distance(a, b)
print(f"Euclidean distance: {dist:.2f}")  # 5.00

# Geometric explanation: in 2D space, this is the straight-line distance between two points

2. Cosine Similarity

Definition: The cosine of the angle between two vectors.

def cosine_similarity(a, b):
    """Cosine similarity"""
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot / (norm_a * norm_b)

# Example
a = np.array([1, 1, 0])
b = np.array([1, 1, 1])
c = np.array([-1, -1, 0])

print(f"a vs b: {cosine_similarity(a, b):.3f}")  # 0.816 (≈35° angle)
print(f"a vs c: {cosine_similarity(a, c):.3f}")  # -1.000 (180° angle)

3. Dot Product Similarity

Definition: Sum of products of corresponding elements.

def dot_product(a, b):
    """Dot product"""
    return np.dot(a, b)

# Example
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

dot = dot_product(a, b)
print(f"Dot product: {dot}")  # 32 (1*4 + 2*5 + 3*6)

4. Relationship After Normalization

When vectors are L2 normalized, dot product is equivalent to cosine similarity:

def normalized_dot_product(a, b):
    """Normalized dot product = cosine similarity"""
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b)
    return np.dot(a_norm, b_norm)

a = np.array([3, 4])  # Magnitude = 5
b = np.array([4, 3])  # Magnitude = 5

print(f"Cosine similarity: {cosine_similarity(a, b):.3f}")  # 0.960
print(f"Normalized dot product: {normalized_dot_product(a, b):.3f}")  # 0.960

How to Choose Similarity Metrics

| Scenario | Recommended Metric | Reason |
|---|---|---|
| Text semantic search | Cosine similarity | Text vector direction matters more than length |
| Face recognition | Cosine similarity | Ignores lighting effects on vector magnitude |
| Recommendation systems | Dot product | Needs to capture user preference intensity |
| Image feature matching | Euclidean distance | Feature intensity itself has meaning |
| Music recommendation | Cosine similarity | Style direction matters more than popularity |

# Practical selection examples
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = [  # sample corpus for illustration
    "Lightweight laptop with long battery life",
    "Mechanical keyboard buying guide",
    "Best laptops for software development",
]

# Scenario 1: Semantic text search
query = "Recommend a good laptop for programming"
doc_embeddings = model.encode(corpus)
query_embedding = model.encode(query)

# Use cosine similarity
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
top_k = np.argsort(similarities)[-5:][::-1]

# Scenario 2: Recommendation system
# (user_embedding and item_embeddings assumed to come from a trained model)
user_preference = user_embedding  # User preference vector
item_features = item_embeddings   # Item feature vectors

# Use dot product (captures preference intensity)
scores = np.dot(user_preference, item_features.T)
top_k = np.argsort(scores)[-5:][::-1]

4.4 Batch Processing and Real-Time Processing

Batch Vectorization

When processing large amounts of data, batch processing can significantly improve efficiency:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Single processing (slow)
texts = ["Text 1", "Text 2", "Text 3"]
for text in texts:
    embedding = model.encode(text)  # Encode each separately

# Batch processing (fast)
batch_embeddings = model.encode(texts, batch_size=32)  # Encode the whole list in internal batches of 32
print(f"Batch shape: {batch_embeddings.shape}")  # (3, 384)

Batch Processing Performance Comparison

| Method | 100 texts | 10,000 texts | 1,000,000 texts |
|---|---|---|---|
| Single processing | 2 seconds | 200 seconds | 20,000 seconds |
| Batch processing (32) | 0.5 seconds | 8 seconds | 800 seconds |
| Speedup | 4x | 25x | 25x |
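
For corpora too large to encode in a single call, a common pattern is to process the texts in chunks and stack the results. A minimal sketch of that pattern, using a stand-in encoder so the batching logic itself is runnable (in practice you would pass `model.encode` instead):

```python
import numpy as np

def encode_in_chunks(texts, encode_fn, chunk_size=1000):
    """Encode a large corpus chunk by chunk to bound peak memory use."""
    parts = []
    for start in range(0, len(texts), chunk_size):
        parts.append(encode_fn(texts[start:start + chunk_size]))
    return np.vstack(parts)

# Stand-in encoder for illustration: maps each text to a fixed 384-dim vector
fake_encode = lambda batch: np.ones((len(batch), 384), dtype=np.float32)

corpus = [f"document {i}" for i in range(2500)]
embeddings = encode_in_chunks(corpus, fake_encode, chunk_size=1000)
print(embeddings.shape)  # (2500, 384)
```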

4.5 Concept Quick Reference

graph TD
    subgraph Core Concepts
        E["Embedding"]
        D["Dimension"]
        S["Similarity Metric"]
        I["Index Structure"]
        N["NNS (Nearest Neighbor Search)"]
        ANN["ANN (Approximate Nearest Neighbor)"]
    end

    subgraph Relationships
        E -->|Generate| V["Vector"]
        V -->|Has| D
        D -->|Affects| S
        V -->|Build| I
        I -->|Accelerate| N
        N -->|"Approximate version"| ANN
    end

    style E fill:#e1f5fe
    style D fill:#fff3e0
    style S fill:#c8e6c9
    style I fill:#e1f5fe
    style N fill:#fff3e0
    style ANN fill:#c8e6c9
    style V fill:#ffccbc

Key Term Definitions

| Term | Definition | Example |
|---|---|---|
| Vector Embedding | Technology that converts data into numerical vectors | embed("apple") → [0.8, 0.1, ...] |
| Dimension | Number of elements in a vector | 384D, 1536D |
| Similarity Metric | Method to measure similarity between vectors | Cosine similarity, Euclidean distance |
| Index | Data structure that accelerates retrieval | HNSW, IVF |
| Nearest Neighbor | Most similar vector to the query vector | K=1 returns the single nearest neighbor |
| Approximate Nearest Neighbor (ANN) | Approximate search trading accuracy for speed | 99% accuracy at ~10x speed |
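
The Nearest Neighbor entry above can be made concrete with a brute-force search, which is the exact baseline that ANN indexes such as HNSW and IVF approximate (the vectors here are random, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.random((1000, 384)).astype(np.float32)   # 1,000 stored vectors
query = rng.random(384).astype(np.float32)

# Normalize so that dot product equals cosine similarity
db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
q_n = query / np.linalg.norm(query)

scores = db_n @ q_n                      # cosine similarity to every stored vector
top_k = np.argsort(scores)[-5:][::-1]    # indices of the K=5 nearest neighbors
print(f"Top-5 neighbor indices: {top_k}")
```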

Chapter Summary

Key Points:

  1. Vector Embedding:
     - Converts text, images, etc. into numerical vectors
     - Semantically similar content ends up close together in vector space
     - Pre-trained models make it easy to obtain high-quality embeddings

  2. Dimension:
     - Higher dimension means stronger expressiveness, but higher storage cost
     - Find the balance between quality and performance
     - Common dimensions: 384 (lightweight), 768 (standard), 1536 (high-quality)

  3. Similarity Metrics:
     - Cosine similarity: most commonly used, well suited to text semantic search
     - Euclidean distance: suited to scenarios where vector magnitude matters
     - Dot product: suited to recommendation systems that must account for preference intensity

  4. Performance Optimization:
     - Batch processing significantly improves vectorization efficiency
     - Choose vector dimensions that balance quality and performance

Next Chapter Preview: Chapter 5: Typical Application Scenarios - Learn how to use vector databases in practical applications →