Skip to content

Vector Database FundamentalsΒΆ

This document provides an introduction to the fundamental concepts of vector databases, with a focus on how GoVector implements these concepts.

πŸ“‹ What is a Vector Database?ΒΆ

A vector database is a specialized type of database designed to store and retrieve vector embeddings efficiently. Unlike traditional databases that store structured data, vector databases are optimized for similarity search operations.

Key CharacteristicsΒΆ

  • Vector Storage: Optimized for storing high-dimensional vectors
  • Similarity Search: Fast nearest neighbor search based on vector similarity
  • Metadata Filtering: Combine vector search with traditional metadata filtering
  • Scalability: Handle large collections of vectors efficiently

πŸš€ Vector EmbeddingsΒΆ

What are Vector Embeddings?ΒΆ

Vector embeddings are numerical representations of objects (text, images, audio, etc.) in a high-dimensional space. They capture semantic meaning, where similar objects are represented by vectors that are close to each other.

Common Embedding ModelsΒΆ

Model Type Example Models Vector Dimension Use Case
Text BERT, Sentence-BERT, GPT 768-1536 Text similarity, search
Image ResNet, CLIP 512-2048 Image search, classification
Audio VGGish, Whisper 128-768 Audio search, transcription

πŸ”§ Distance MetricsΒΆ

Distance metrics measure the similarity between vectors. GoVector supports three primary metrics:

Cosine SimilarityΒΆ

Formula: cos(ΞΈ) = (A Β· B) / (||A|| ||B||)

Range: [-1, 1], where 1 means identical

Best for: Normalized vectors, direction-sensitive tasks

Euclidean DistanceΒΆ

Formula: d(A, B) = √(Σ(Ai - Bi)²)

Range: [0, ∞), where 0 means identical

Best for: Unnormalized vectors, absolute distance-sensitive tasks

Dot ProductΒΆ

Formula: A Β· B = Ξ£(Ai Γ— Bi)

Range: (-∞, ∞), where higher values mean more similar

Best for: Some specialized use cases, normalized vectors

πŸ“Š Indexing AlgorithmsΒΆ

Indexing algorithms optimize the search process for large collections of vectors.

Flat IndexΒΆ

Description: Brute-force search that calculates distance to every vector

Pros: Exact results, simple implementation

Cons: Slow for large datasets (O(n) complexity)

Best for: Small datasets (< 10,000 vectors)

HNSW (Hierarchical Navigable Small World)ΒΆ

Description: Multi-layer graph structure for approximate nearest neighbor search

Pros: Fast search (O(log n) complexity), high recall

Cons: Higher memory usage, slower insertion

Best for: Large datasets (> 10,000 vectors)

πŸ’‘ Vector QuantizationΒΆ

Vector quantization reduces the memory footprint of vectors by compressing them.

SQ8 QuantizationΒΆ

Description: Scalar quantization that reduces each float32 (4 bytes) to int8 (1 byte)

Compression Ratio: 4:1 (75% reduction)

Impact: Minimal impact on search quality, significant memory savings

Use Case: Large datasets where memory is constrained

🎯 Use Cases¢

Vector databases are used in a variety of applications:

  • Text Search: Find documents similar to a query
  • Image Search: Find images similar to a query image
  • Recommendation Systems: Recommend items based on user preferences

Clustering and ClassificationΒΆ

  • Anomaly Detection: Identify outliers in vector space
  • Clustering: Group similar items together
  • Classification: Assign labels based on nearest neighbors

Natural Language ProcessingΒΆ

  • Semantic Search: Understand the meaning behind queries
  • Question Answering: Find relevant information for questions
  • Text Summarization: Identify key concepts in text

πŸ”— How GoVector Implements These ConceptsΒΆ

Core ComponentsΒΆ

  • Collection: Container for vectors with specific configuration
  • Point: Individual vector with metadata
  • Index: Data structure for efficient search
  • Storage: Persistence layer for vector data

ArchitectureΒΆ

graph TD
    A[Client Application] --> B[GoVector API]
    B --> C[Collection Management]
    C --> D[Indexing Engine]
    C --> E[Storage Engine]
    D --> F[HNSW Index]
    D --> G[Flat Index]
    E --> H[BoltDB Storage]
    E --> I[Protocol Buffers]
    E --> J[SQ8 Quantization]

Key FeaturesΒΆ

  • Qdrant Compatibility: Drop-in replacement for Qdrant API
  • Dual Modes: Embedded library and standalone microservice
  • High Performance: HNSW indexing for fast search
  • Memory Efficiency: SQ8 quantization for memory optimization
  • Persistence: BoltDB for reliable storage

πŸ“ˆ Performance ConsiderationsΒΆ

Vector DimensionΒΆ

  • Smaller dimensions (128-512): Faster search, lower memory usage
  • Larger dimensions (768+): Better representation, higher memory usage

Index ParametersΒΆ

  • HNSW M: Number of connections per node (4-64)
  • EfConstruction: Index quality vs. build time (100-1000)
  • EfSearch: Search speed vs. recall (10-100)

Batch OperationsΒΆ

  • Upsert: Batch multiple points for better performance
  • Search: Limit result set size for faster responses
  • Delete: Use filters for efficient bulk deletion

🚩 Best Practices¢

Data PreparationΒΆ

  • Normalize vectors for cosine similarity
  • Use appropriate embedding models for your use case
  • Clean and preprocess data before embedding

Collection DesignΒΆ

  • Choose appropriate vector dimension based on your embedding model
  • Select the right distance metric for your use case
  • Configure index parameters based on dataset size

Query OptimizationΒΆ

  • Use filters to narrow search space
  • Limit result set size to improve performance
  • Batch multiple queries when possible