Collection Management

Overview

Collection is the core abstraction for managing vector data in GoVector. It provides unified interfaces for vector storage, indexing, and search operations, supporting multiple index types and distance metrics.

Type Definition

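type Collection struct {
    mu         sync.RWMutex // guards the fields below
    name       string
    vectorLen  int          // dimension every stored vector must have
    metric     Distance
    indexType  IndexType
    hnswParams *HNSWParams
    flatIndex  *FlatIndex   // set when indexType is IndexTypeFlat
    hnswIndex  *HNSWIndex   // set when indexType is IndexTypeHNSW
    storage    *Storage
    quantizer  *SQ8Quantizer
    useQuant   bool
    closed     bool         // checked by Upsert, Search, and Delete
}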

Core Operations

Collection Initialization

func NewCollection(opt CollectionOptions) (*Collection, error) {
    if opt.Name == "" {
        return nil, errors.New("collection name cannot be empty")
    }
    if opt.VectorLen <= 0 {
        return nil, errors.New("vector length must be positive")
    }
    if opt.Metric != Cosine && opt.Metric != Euclidean && opt.Metric != Dot {
        return nil, errors.New("invalid distance metric")
    }

    col := &Collection{
        name:       opt.Name,
        vectorLen:  opt.VectorLen,
        metric:     opt.Metric,
        indexType:  opt.IndexType,
        hnswParams: opt.HnswParams,
    }

    // Initialize storage
    storage, err := NewStorage(opt.StoragePath, opt.UseQuantization)
    if err != nil {
        return nil, fmt.Errorf("failed to initialize storage: %w", err)
    }
    col.storage = storage

    // Initialize index based on type
    switch opt.IndexType {
    case IndexTypeFlat:
        col.flatIndex = NewFlatIndex(opt.Metric)
    case IndexTypeHNSW:
        if col.hnswParams == nil {
            // Copy the defaults so later changes never touch the shared value.
            params := DefaultHNSWParams
            col.hnswParams = &params
        }
        hnswIndex, err := NewHNSWIndex(opt.Metric, col.hnswParams)
        if err != nil {
            return nil, fmt.Errorf("failed to initialize HNSW index: %w", err)
        }
        col.hnswIndex = hnswIndex
    default:
        // Without an index, Search would silently return no results.
        return nil, errors.New("unsupported index type")
    }

    // Initialize quantization
    if opt.UseQuantization {
        col.quantizer = NewSQ8Quantizer()
        col.useQuant = true
    }

    // Load existing data from storage
    if err := col.loadFromStorage(); err != nil {
        return nil, fmt.Errorf("failed to load from storage: %w", err)
    }

    return col, nil
}

Upsert Operation

The Upsert operation adds or updates vectors: a point whose ID already exists is updated; otherwise it is inserted. Every point in the batch is validated before any index is touched, so a single vector length mismatch rejects the whole batch.

func (c *Collection) Upsert(points []*PointStruct) error {
    c.mu.Lock()
    defer c.mu.Unlock()

    if c.closed {
        return errors.New("collection is closed")
    }

    // Validate all points before touching any index
    for i, p := range points {
        if p == nil {
            return fmt.Errorf("point at index %d is nil", i)
        }
        if len(p.Vector) != c.vectorLen {
            return fmt.Errorf("vector length mismatch: expected %d, got %d",
                c.vectorLen, len(p.Vector))
        }
    }

    // Add to indexes
    if c.flatIndex != nil {
        if err := c.flatIndex.AddPoints(points); err != nil {
            return err
        }
    }
    if c.hnswIndex != nil {
        if err := c.hnswIndex.AddPoints(points); err != nil {
            return err
        }
    }

    // Persist to storage
    if err := c.storage.UpsertPoints(c.name, points); err != nil {
        return fmt.Errorf("failed to persist points: %w", err)
    }

    return nil
}
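
The insert-or-update semantics described above can be illustrated without the library. The following is a minimal sketch, not GoVector's implementation: Point and upsert are hypothetical stand-ins, a plain map replaces the index and storage, and the insert/update counts are returned only to make the behavior visible.

```go
package main

import "fmt"

// Point is a hypothetical stand-in for PointStruct; only ID and Vector are modeled.
type Point struct {
	ID     string
	Vector []float32
}

// upsert validates the whole batch first, then inserts or overwrites by ID.
// It reports how many points were new and how many replaced an existing ID.
func upsert(store map[string]Point, dim int, points []Point) (inserted, updated int, err error) {
	for _, p := range points {
		if len(p.Vector) != dim {
			return 0, 0, fmt.Errorf("vector length mismatch: expected %d, got %d", dim, len(p.Vector))
		}
	}
	for _, p := range points {
		if _, exists := store[p.ID]; exists {
			updated++
		} else {
			inserted++
		}
		store[p.ID] = p
	}
	return inserted, updated, nil
}

func main() {
	store := map[string]Point{}
	ins, upd, err := upsert(store, 2, []Point{
		{ID: "a", Vector: []float32{1, 0}},
		{ID: "b", Vector: []float32{0, 1}},
	})
	fmt.Println(ins, upd, err)

	// Reusing ID "a" overwrites the stored vector instead of adding a point.
	ins, upd, err = upsert(store, 2, []Point{{ID: "a", Vector: []float32{0.5, 0.5}}})
	fmt.Println(ins, upd, err, len(store))
}
```

Because validation runs before the first write, a rejected batch leaves the map unchanged.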

Search Operation

The Search operation finds the top-K vectors most similar to the query vector.

func (c *Collection) Search(query []float32, filter *Filter, topK int) ([]ScoredPoint, error) {
    c.mu.RLock()
    defer c.mu.RUnlock()

    if c.closed {
        return nil, errors.New("collection is closed")
    }

    if len(query) != c.vectorLen {
        return nil, fmt.Errorf("query vector length mismatch: expected %d, got %d",
            c.vectorLen, len(query))
    }

    if topK <= 0 {
        return nil, errors.New("topK must be positive")
    }

    var results []ScoredPoint
    var err error

    if c.flatIndex != nil {
        results, err = c.flatIndex.Search(query, filter, topK)
    } else if c.hnswIndex != nil {
        results, err = c.hnswIndex.Search(query, filter, topK)
    }

    if err != nil {
        return nil, err
    }

    return results, nil
}

Delete Operation

The Delete operation removes vectors from the collection based on ID or filter conditions.

func (c *Collection) Delete(pointIDs []string, filter *Filter) (int, error) {
    c.mu.Lock()
    defer c.mu.Unlock()

    if c.closed {
        return 0, errors.New("collection is closed")
    }

    deletedCount := 0

    // Delete from indexes
    if c.flatIndex != nil {
        count, err := c.flatIndex.Delete(pointIDs, filter)
        if err != nil {
            return deletedCount, err
        }
        deletedCount += count
    }
    if c.hnswIndex != nil {
        count, err := c.hnswIndex.Delete(pointIDs, filter)
        if err != nil {
            return deletedCount, err
        }
        deletedCount += count
    }

    // Delete from storage
    if err := c.storage.DeletePoints(c.name, pointIDs, filter); err != nil {
        return deletedCount, fmt.Errorf("failed to delete from storage: %w", err)
    }

    return deletedCount, nil
}
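
As an illustration of deleting by ID or by filter, here is a small sketch over a plain map. It is not GoVector's code: Point, deletePoints, and the predicate-style filter are hypothetical, and treating the ID list and the filter as a union (a point is removed if either selects it) is an assumption about the intended semantics.

```go
package main

import "fmt"

// Point is a hypothetical stand-in with a string payload used for filtering.
type Point struct {
	ID      string
	Payload map[string]string
}

// deletePoints removes every point whose ID is listed or that matches the
// predicate (nil means no filter) and returns how many points were removed.
func deletePoints(store map[string]Point, ids []string, match func(Point) bool) int {
	deleted := 0
	for _, id := range ids {
		if _, ok := store[id]; ok {
			delete(store, id)
			deleted++
		}
	}
	if match != nil {
		for id, p := range store {
			if match(p) {
				delete(store, id) // deleting during range is safe in Go
				deleted++
			}
		}
	}
	return deleted
}

func main() {
	store := map[string]Point{
		"a": {ID: "a", Payload: map[string]string{"lang": "en"}},
		"b": {ID: "b", Payload: map[string]string{"lang": "de"}},
		"c": {ID: "c", Payload: map[string]string{"lang": "de"}},
	}
	isGerman := func(p Point) bool { return p.Payload["lang"] == "de" }

	// "missing" is ignored; "a" goes by ID, "b" and "c" go by filter.
	n := deletePoints(store, []string{"a", "missing"}, isGerman)
	fmt.Println(n, len(store))
}
```

Unknown IDs are skipped rather than reported as errors, so the returned count reflects only points that actually existed.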

Count Operation

Returns the total number of vectors in the collection.

func (c *Collection) Count() int {
    c.mu.RLock()
    defer c.mu.RUnlock()

    if c.flatIndex != nil {
        return c.flatIndex.Size()
    }
    if c.hnswIndex != nil {
        return c.hnswIndex.Size()
    }
    return 0
}

Collection Configuration

CollectionOptions

type CollectionOptions struct {
    Name            string
    VectorLen       int
    Metric          Distance
    IndexType       IndexType
    StoragePath     string
    HnswParams      *HNSWParams
    UseQuantization bool
}
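
UseQuantization enables the SQ8 quantizer. To show what 8-bit scalar quantization does in general, the sketch below maps each float32 component to one of 256 levels between the vector's minimum and maximum. This is a generic illustration under that assumption, not the SQ8Quantizer's actual algorithm; Quantized, quantize, and dequantize are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// Quantized holds one byte per component plus the values needed to decode it.
type Quantized struct {
	Codes []uint8
	Min   float32
	Scale float32 // width of one quantization step
}

// quantize maps each component to the nearest of 256 evenly spaced levels.
func quantize(v []float32) Quantized {
	if len(v) == 0 {
		return Quantized{}
	}
	lo, hi := v[0], v[0]
	for _, x := range v[1:] {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	codes := make([]uint8, len(v))
	scale := (hi - lo) / 255
	if scale == 0 { // constant vector: every code stays 0
		return Quantized{Codes: codes, Min: lo}
	}
	for i, x := range v {
		level := math.Round(float64((x - lo) / scale))
		if level > 255 { // guard against floating-point overshoot
			level = 255
		}
		codes[i] = uint8(level)
	}
	return Quantized{Codes: codes, Min: lo, Scale: scale}
}

// dequantize reconstructs an approximation of the original vector.
func dequantize(q Quantized) []float32 {
	out := make([]float32, len(q.Codes))
	for i, c := range q.Codes {
		out[i] = q.Min + float32(c)*q.Scale
	}
	return out
}

func main() {
	v := []float32{-1, -0.25, 0.3, 1}
	q := quantize(v)
	fmt.Println(q.Codes, dequantize(q))
}
```

Each component shrinks from four bytes to one, and the reconstruction error per component is bounded by about half a quantization step, which is the precision traded for memory.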

Distance Metrics

GoVector supports three distance metrics:

| Metric | Description | Use Case |
|---|---|---|
| Cosine | Cosine similarity | Text embeddings, semantic search |
| Euclidean | Euclidean distance | Image features, general similarity |
| Dot | Dot product | Unnormalized vectors, recommendation systems |
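
For reference, the three metrics listed above can be written out directly. This is a minimal sketch using the standard definitions, not GoVector's internal code; the function names are illustrative, both slices are assumed to have the same length, and returning 0 for a zero-length vector in cosine is a choice made here.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the sum of component-wise products.
func dot(a, b []float32) float32 {
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return float32(sum)
}

// euclidean returns the straight-line distance between a and b.
func euclidean(a, b []float32) float32 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return float32(math.Sqrt(sum))
}

// cosine returns the cosine of the angle between a and b,
// or 0 when either vector has zero magnitude.
func cosine(a, b []float32) float32 {
	var ab, aa, bb float64
	for i := range a {
		ab += float64(a[i]) * float64(b[i])
		aa += float64(a[i]) * float64(a[i])
		bb += float64(b[i]) * float64(b[i])
	}
	if aa == 0 || bb == 0 {
		return 0
	}
	return float32(ab / (math.Sqrt(aa) * math.Sqrt(bb)))
}

func main() {
	a := []float32{1, 2, 3}
	b := []float32{4, 5, 6}
	fmt.Println(dot(a, b), euclidean(a, b), cosine(a, b))
}
```

Cosine ignores vector magnitude while the dot product does not, which is why the dot product suits unnormalized vectors. For Euclidean, smaller values mean closer vectors; for cosine and dot product, larger values do.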

Index Types

| Index Type | Description | Best For |
|---|---|---|
| Flat | Brute-force exact search | Small datasets, accuracy-critical scenarios |
| HNSW | Hierarchical Navigable Small World | Large datasets, latency-critical scenarios |
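
To make the Flat row concrete, brute-force search compares the query against every stored vector and keeps the best topK. The sketch below is an illustration under simplifying assumptions, not the FlatIndex implementation: it uses squared Euclidean distance only, a plain map as the store, and hypothetical names (Scored, flatSearch).

```go
package main

import (
	"fmt"
	"sort"
)

// Scored pairs a point ID with its squared distance to the query.
type Scored struct {
	ID   string
	Dist float32
}

// flatSearch scans every vector, sorts by distance, and returns the topK nearest.
// All vectors are assumed to have the same length as the query.
func flatSearch(points map[string][]float32, query []float32, topK int) []Scored {
	if topK <= 0 {
		return nil
	}
	results := make([]Scored, 0, len(points))
	for id, v := range points {
		var d float32
		for i := range query {
			diff := v[i] - query[i]
			d += diff * diff
		}
		results = append(results, Scored{ID: id, Dist: d})
	}
	sort.Slice(results, func(i, j int) bool {
		if results[i].Dist != results[j].Dist {
			return results[i].Dist < results[j].Dist
		}
		return results[i].ID < results[j].ID // break ties deterministically
	})
	if topK < len(results) {
		results = results[:topK]
	}
	return results
}

func main() {
	points := map[string][]float32{
		"a": {0, 0},
		"b": {1, 1},
		"c": {5, 5},
	}
	for _, r := range flatSearch(points, []float32{0.9, 0.9}, 2) {
		fmt.Println(r.ID, r.Dist)
	}
}
```

The scan is exact but its cost grows linearly with the number of stored vectors, which is the trade-off HNSW avoids by searching a graph approximately.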

HNSW Parameter Configuration

HNSW index performance depends on the following key parameters:

type HNSWParams struct {
    M              int  // Maximum connections per node (default: 16)
    EfConstruction int  // Candidate list size during construction (default: 200)
    EfSearch       int  // Candidate list size during search (default: 64)
    K              int  // Number of neighbors (default: 10)
}

Parameter tuning recommendations:

  • M: Higher values improve accuracy but increase memory usage and construction time
  • EfConstruction: Higher values improve index quality but increase construction time
  • EfSearch: Higher values improve search accuracy but increase latency
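
When tuning a single parameter, it helps to start from the documented defaults and override only what is needed. The helper below is a hypothetical sketch: the struct mirrors HNSWParams as documented above, but withDefaults is not part of the API shown here, and treating non-positive fields as unset is an assumption.

```go
package main

import "fmt"

// HNSWParams mirrors the struct documented above.
type HNSWParams struct {
	M              int
	EfConstruction int
	EfSearch       int
	K              int
}

// withDefaults replaces non-positive fields with the documented defaults
// (M=16, EfConstruction=200, EfSearch=64, K=10).
func withDefaults(p HNSWParams) HNSWParams {
	if p.M <= 0 {
		p.M = 16
	}
	if p.EfConstruction <= 0 {
		p.EfConstruction = 200
	}
	if p.EfSearch <= 0 {
		p.EfSearch = 64
	}
	if p.K <= 0 {
		p.K = 10
	}
	return p
}

func main() {
	// Trade some latency for accuracy by raising EfSearch only.
	p := withDefaults(HNSWParams{EfSearch: 128})
	fmt.Printf("%+v\n", p)
}
```

Changing one field at a time and measuring recall and latency after each change makes the effect of each parameter easier to attribute.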

Thread Safety

The Collection type is safe for concurrent use. A sync.RWMutex lets read operations (Search, Count) run concurrently under a shared read lock, while write operations (Upsert, Delete) hold the exclusive write lock.
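
The locking pattern can be shown in isolation. This is a minimal sketch of the read-write lock idiom using the standard sync package, not the Collection code itself; safeStore and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// safeStore guards a map with a read-write lock.
type safeStore struct {
	mu    sync.RWMutex
	items map[string][]float32
}

func newSafeStore() *safeStore {
	return &safeStore{items: make(map[string][]float32)}
}

// Put takes the exclusive lock because it mutates the map.
func (s *safeStore) Put(id string, v []float32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[id] = v
}

// Len takes the shared lock, so many readers can run at once.
func (s *safeStore) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.items)
}

// fillConcurrently starts n goroutines that each write one point and read the size.
func fillConcurrently(s *safeStore, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Put("p"+strconv.Itoa(i), []float32{float32(i)})
			_ = s.Len()
		}(i)
	}
	wg.Wait()
}

func main() {
	s := newSafeStore()
	fillConcurrently(s, 100)
	fmt.Println(s.Len())
}
```

Running a program like this with the race detector (go run -race) is a quick way to confirm that no access bypasses the lock.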

Error Handling

Common errors and handling strategies:

| Error | Cause | Handling |
|---|---|---|
| Empty collection name | Invalid parameter | Check before initialization |
| Vector length mismatch | Data inconsistency | Validate before upsert |
| Storage failure | Disk issues | Check disk space and permissions |
| Index failure | Memory issues | Reduce data volume or parameters |
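
The functions shown earlier wrap underlying errors with %w, which keeps the original cause available to callers. The sketch below shows how that pattern supports programmatic handling with errors.Is. It is an illustration only: ErrLengthMismatch, validate, and store are hypothetical, and the API as shown above does not document exported sentinel errors.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrLengthMismatch is a hypothetical sentinel error used for illustration.
var ErrLengthMismatch = errors.New("vector length mismatch")

// validate reports a wrapped sentinel when the vector has the wrong length.
func validate(dim int, v []float32) error {
	if len(v) != dim {
		return fmt.Errorf("%w: expected %d, got %d", ErrLengthMismatch, dim, len(v))
	}
	return nil
}

// store adds a second layer of context, as a persistence layer might.
func store(dim int, v []float32) error {
	if err := validate(dim, v); err != nil {
		return fmt.Errorf("failed to persist points: %w", err)
	}
	return nil
}

// isLengthMismatch sees through every %w layer to find the sentinel.
func isLengthMismatch(err error) bool {
	return errors.Is(err, ErrLengthMismatch)
}

func main() {
	err := store(3, []float32{1, 2})
	if isLengthMismatch(err) {
		fmt.Println("fix the input and retry:", err)
	}
}
```

Matching on a sentinel rather than on the message text keeps caller code stable if the wording of an error changes.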

Best Practices

  1. Batch Operations: Pass many points to a single Upsert call rather than calling it once per point; each call takes the write lock and writes to storage
  2. Index Selection: Choose Flat for datasets under 10,000 vectors, HNSW for larger datasets
  3. Memory Management: Enable quantization for large datasets to reduce memory usage
  4. Filter Optimization: Use early filtering to reduce search scope
  5. Regular Persistence: Important data should be persisted to disk regularly
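
For the batch-operations advice, a large input can be split into fixed-size batches, each of which would be passed to one Upsert call. The generic helper below is a hypothetical sketch (chunk is not part of GoVector), requires Go 1.18 or later, and leaves the choice of batch size to measurement.

```go
package main

import "fmt"

// chunk splits items into consecutive batches of at most size elements.
// The batches share memory with items; a non-positive size yields one batch.
func chunk[T any](items []T, size int) [][]T {
	if size <= 0 {
		size = len(items)
	}
	var batches [][]T
	for len(items) > 0 {
		n := size
		if n > len(items) {
			n = len(items)
		}
		batches = append(batches, items[:n])
		items = items[n:]
	}
	return batches
}

func main() {
	ids := []string{"a", "b", "c", "d", "e"}
	for i, batch := range chunk(ids, 2) {
		// In real code, each batch would go to a single Upsert call.
		fmt.Println(i, batch)
	}
}
```

Larger batches amortize locking and storage overhead, while smaller ones keep the write lock held for shorter periods, so the right size depends on how much concurrent reading the workload needs.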