Collection Management

Overview

Collection is the core abstraction for managing vector data in GoVector. It provides unified interfaces for vector storage, indexing, and search operations, supporting multiple index types and distance metrics.

Type Definition

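type Collection struct {
    mu         sync.RWMutex  // guards the fields below
    name       string
    vectorLen  int           // required dimension of every vector
    metric     Distance
    indexType  IndexType
    hnswParams *HNSWParams   // used only when indexType is IndexTypeHNSW
    flatIndex  *FlatIndex    // set only for IndexTypeFlat
    hnswIndex  *HNSWIndex    // set only for IndexTypeHNSW
    storage    *Storage
    quantizer  *SQ8Quantizer // set only when useQuant is true
    useQuant   bool
    closed     bool          // checked by Upsert, Search, and Delete
}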

Core Operations

Collection Initialization

func NewCollection(opt CollectionOptions) (*Collection, error) {
    if opt.Name == "" {
        return nil, errors.New("collection name cannot be empty")
    }
    if opt.VectorLen <= 0 {
        return nil, errors.New("vector length must be positive")
    }
    if opt.Metric != Cosine && opt.Metric != Euclidean && opt.Metric != Dot {
        return nil, errors.New("invalid distance metric")
    }
    if opt.IndexType != IndexTypeFlat && opt.IndexType != IndexTypeHNSW {
        return nil, errors.New("invalid index type")
    }

    col := &Collection{
        name:       opt.Name,
        vectorLen:  opt.VectorLen,
        metric:     opt.Metric,
        indexType:  opt.IndexType,
        hnswParams: opt.HnswParams,
    }

    // Initialize storage
    storage, err := NewStorage(opt.StoragePath, opt.UseQuantization)
    if err != nil {
        return nil, fmt.Errorf("failed to initialize storage: %w", err)
    }
    col.storage = storage

    // Initialize index based on type
    if opt.IndexType == IndexTypeFlat {
        col.flatIndex = NewFlatIndex(opt.Metric)
    } else if opt.IndexType == IndexTypeHNSW {
        if col.hnswParams == nil {
            col.hnswParams = &DefaultHNSWParams
        }
        hnswIndex, err := NewHNSWIndex(opt.Metric, col.hnswParams)
        if err != nil {
            return nil, fmt.Errorf("failed to initialize HNSW index: %w", err)
        }
        col.hnswIndex = hnswIndex
    }

    // Initialize quantization
    if opt.UseQuantization {
        col.quantizer = NewSQ8Quantizer()
        col.useQuant = true
    }

    // Load existing data from storage
    if err := col.loadFromStorage(); err != nil {
        return nil, fmt.Errorf("failed to load from storage: %w", err)
    }

    return col, nil
}

Upsert Operation

Upsert adds or updates vectors. A point whose ID already exists in the collection is updated; any other point is inserted.

func (c *Collection) Upsert(points []*PointStruct) error {
    c.mu.Lock()
    defer c.mu.Unlock()

    if c.closed {
        return errors.New("collection is closed")
    }

    // Validate all points before touching any index
    for _, p := range points {
        if p == nil {
            return errors.New("point cannot be nil")
        }
        if len(p.Vector) != c.vectorLen {
            return fmt.Errorf("vector length mismatch: expected %d, got %d",
                c.vectorLen, len(p.Vector))
        }
    }

    // Add to indexes
    if c.flatIndex != nil {
        if err := c.flatIndex.AddPoints(points); err != nil {
            return err
        }
    }
    if c.hnswIndex != nil {
        if err := c.hnswIndex.AddPoints(points); err != nil {
            return err
        }
    }

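    // Persist to storage. If this step fails, the in-memory indexes above
    // already contain the new points, so index and storage may be out of sync.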
    if err := c.storage.UpsertPoints(c.name, points); err != nil {
        return fmt.Errorf("failed to persist points: %w", err)
    }

    return nil
}
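The add-or-update behavior described above can be sketched with a plain map keyed by point ID. This is a minimal illustration, not GoVector's implementation: `point`, `memIndex`, and `newMemIndex` are hypothetical stand-ins, and the string `ID` field is an assumption based on `Delete` accepting `[]string` IDs.

```go
package main

import "fmt"

// point is a hypothetical stand-in for GoVector's PointStruct.
type point struct {
	ID     string
	Vector []float32
}

// memIndex is a toy ID-keyed store, not GoVector's FlatIndex.
type memIndex struct {
	dim    int
	points map[string][]float32
}

func newMemIndex(dim int) *memIndex {
	return &memIndex{dim: dim, points: make(map[string][]float32)}
}

// upsert validates the whole batch first and only then writes it,
// so one bad point leaves the store untouched.
func (m *memIndex) upsert(batch []point) error {
	for _, p := range batch {
		if len(p.Vector) != m.dim {
			return fmt.Errorf("vector length mismatch: expected %d, got %d",
				m.dim, len(p.Vector))
		}
	}
	for _, p := range batch {
		m.points[p.ID] = p.Vector // existing ID is overwritten, new ID is inserted
	}
	return nil
}

func main() {
	idx := newMemIndex(2)
	_ = idx.upsert([]point{{ID: "a", Vector: []float32{1, 0}}})
	_ = idx.upsert([]point{
		{ID: "a", Vector: []float32{0, 1}}, // updates "a"
		{ID: "b", Vector: []float32{1, 1}}, // inserts "b"
	})
	fmt.Println(len(idx.points), idx.points["a"])
}
```

Validating before writing mirrors the order used in `Upsert`: the length check runs over every point before any index is modified.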

Search Operation

The Search operation finds the top-K vectors most similar to the query vector.

func (c *Collection) Search(query []float32, filter *Filter, topK int) ([]ScoredPoint, error) {
    c.mu.RLock()
    defer c.mu.RUnlock()

    if c.closed {
        return nil, errors.New("collection is closed")
    }

    if len(query) != c.vectorLen {
        return nil, fmt.Errorf("query vector length mismatch: expected %d, got %d",
            c.vectorLen, len(query))
    }

    if topK <= 0 {
        return nil, errors.New("topK must be positive")
    }

    var results []ScoredPoint
    var err error

    if c.flatIndex != nil {
        results, err = c.flatIndex.Search(query, filter, topK)
    } else if c.hnswIndex != nil {
        results, err = c.hnswIndex.Search(query, filter, topK)
    }

    if err != nil {
        return nil, err
    }

    return results, nil
}
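To make "top-K most similar" concrete, here is a minimal brute-force sketch of the kind of exact search a Flat index performs: score every stored vector against the query, sort, and keep the best K. The names `scored`, `dot`, and `topK` are hypothetical, dot product is used as the score purely for illustration, and filtering is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// scored is a hypothetical stand-in for GoVector's ScoredPoint.
type scored struct {
	ID    string
	Score float64
}

// dot returns the inner product of two equal-length vectors.
func dot(a, b []float32) float64 {
	var s float64
	for i := range a {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

// topK scores every stored vector against the query and returns the
// k highest-scoring entries, best first.
func topK(vectors map[string][]float32, query []float32, k int) []scored {
	if k <= 0 {
		return nil
	}
	results := make([]scored, 0, len(vectors))
	for id, v := range vectors {
		results = append(results, scored{ID: id, Score: dot(v, query)})
	}
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].ID < results[j].ID // deterministic tie-break
	})
	if k < len(results) {
		results = results[:k]
	}
	return results
}

func main() {
	vectors := map[string][]float32{"a": {1, 0}, "b": {0.5, 0.5}, "c": {0, 1}}
	for _, r := range topK(vectors, []float32{1, 0}, 2) {
		fmt.Println(r.ID, r.Score)
	}
}
```

This approach costs one distance computation per stored vector, which is why exact search suits small collections and approximate indexes such as HNSW are used for large ones.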

Delete Operation

The Delete operation removes vectors from the collection based on ID or filter conditions.

func (c *Collection) Delete(pointIDs []string, filter *Filter) (int, error) {
    c.mu.Lock()
    defer c.mu.Unlock()

    if c.closed {
        return 0, errors.New("collection is closed")
    }

    deletedCount := 0

    // Delete from indexes
    if c.flatIndex != nil {
        count, err := c.flatIndex.Delete(pointIDs, filter)
        if err != nil {
            return deletedCount, err
        }
        deletedCount += count
    }
    if c.hnswIndex != nil {
        count, err := c.hnswIndex.Delete(pointIDs, filter)
        if err != nil {
            return deletedCount, err
        }
        deletedCount += count
    }

    // Delete from storage
    if err := c.storage.DeletePoints(c.name, pointIDs, filter); err != nil {
        return deletedCount, fmt.Errorf("failed to delete from storage: %w", err)
    }

    return deletedCount, nil
}
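The sketch below illustrates one way deletion by ID or by condition can return a count. It is an assumption-heavy toy: the document does not show `Filter`, so the condition is modeled as a hypothetical predicate function `match`, and treating IDs and the condition as an OR is a guess rather than confirmed GoVector behavior.

```go
package main

import "fmt"

// point is a hypothetical stand-in with a string payload.
type point struct {
	ID      string
	Payload map[string]string
}

// deletePoints removes every point whose ID is listed, plus every remaining
// point for which match returns true. match may be nil. It returns the
// number of points actually removed.
func deletePoints(store map[string]point, ids []string, match func(point) bool) int {
	deleted := 0
	for _, id := range ids {
		if _, ok := store[id]; ok { // unknown IDs are ignored, not counted
			delete(store, id)
			deleted++
		}
	}
	if match != nil {
		for id, p := range store {
			if match(p) {
				delete(store, id) // deleting during range is safe in Go
				deleted++
			}
		}
	}
	return deleted
}

func main() {
	store := map[string]point{
		"a": {ID: "a", Payload: map[string]string{"lang": "en"}},
		"b": {ID: "b", Payload: map[string]string{"lang": "de"}},
		"c": {ID: "c", Payload: map[string]string{"lang": "en"}},
	}
	n := deletePoints(store, []string{"a"}, func(p point) bool {
		return p.Payload["lang"] == "de"
	})
	fmt.Println(n, len(store))
}
```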

Count Operation

Count returns the total number of vectors in the collection.

func (c *Collection) Count() int {
    c.mu.RLock()
    defer c.mu.RUnlock()

    if c.flatIndex != nil {
        return c.flatIndex.Size()
    }
    if c.hnswIndex != nil {
        return c.hnswIndex.Size()
    }
    return 0
}

Collection Configuration

CollectionOptions

type CollectionOptions struct {
    Name            string
    VectorLen       int
    Metric          Distance
    IndexType       IndexType
    StoragePath     string
    HnswParams      *HNSWParams
    UseQuantization bool
}

Distance Metrics

GoVector supports three distance metrics:

Metric	Description	Use Case
Cosine	Cosine Similarity	Text embeddings, semantic search
Euclidean	Euclidean Distance	Image features, general similarity
Dot	Dot Product	Unnormalized vectors, recommendation systems
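For reference, the three metrics in the table can be written out directly. This is a generic sketch of the standard formulas, not GoVector's code; the function names are hypothetical and both vectors are assumed to have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the inner product of a and b.
func dot(a, b []float32) float64 {
	var s float64
	for i := range a {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

// euclidean returns the L2 distance between a and b (smaller is closer).
func euclidean(a, b []float32) float64 {
	var s float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		s += d * d
	}
	return math.Sqrt(s)
}

// cosine returns the cosine similarity of a and b (larger is closer),
// or 0 if either vector has zero norm.
func cosine(a, b []float32) float64 {
	na, nb := math.Sqrt(dot(a, a)), math.Sqrt(dot(b, b))
	if na == 0 || nb == 0 {
		return 0
	}
	return dot(a, b) / (na * nb)
}

func main() {
	a, b := []float32{1, 2, 3}, []float32{4, 5, 6}
	fmt.Println(dot(a, b))
	fmt.Println(euclidean([]float32{0, 0}, []float32{3, 4}))
	fmt.Println(cosine([]float32{1, 0}, []float32{0, 1}))
}
```

Cosine similarity ignores vector magnitude while dot product does not, which is why dot product is listed for unnormalized vectors. On unit-length vectors the two give the same value.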

Index Types

Index Type	Description	Best For
Flat	Brute-force exact search	Small datasets, accuracy-critical scenarios
HNSW	Hierarchical Navigable Small World	Large datasets, latency-critical scenarios

HNSW Parameter Configuration

HNSW index performance depends on the following key parameters:

type HNSWParams struct {
    M              int  // Maximum connections per node (default: 16)
    EfConstruction int  // Candidate list size during construction (default: 200)
    EfSearch       int  // Candidate list size during search (default: 64)
    K              int  // Number of neighbors (default: 10)
}

Parameter tuning recommendations:

M: Higher values improve accuracy but increase memory usage and construction time
EfConstruction: Higher values improve index quality but increase construction time
EfSearch: Higher values improve search accuracy but increase latency
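The trade-offs above can be expressed as two example configurations. This is a sketch under stated assumptions: `withDefaults` is a hypothetical helper (the document only shows a nil `HnswParams` being replaced by `DefaultHNSWParams`, not per-field defaulting), and the non-default numbers are illustrative rather than tuned recommendations.

```go
package main

import "fmt"

// HNSWParams mirrors the struct shown above.
type HNSWParams struct {
	M, EfConstruction, EfSearch, K int
}

// withDefaults is a hypothetical helper that returns a copy of p with
// zero-valued fields replaced by the defaults documented above.
func withDefaults(p HNSWParams) HNSWParams {
	if p.M == 0 {
		p.M = 16
	}
	if p.EfConstruction == 0 {
		p.EfConstruction = 200
	}
	if p.EfSearch == 0 {
		p.EfSearch = 64
	}
	if p.K == 0 {
		p.K = 10
	}
	return p
}

func main() {
	// Favor accuracy: denser graph and wider candidate lists,
	// paid for with memory, build time, and query latency.
	highRecall := withDefaults(HNSWParams{M: 32, EfConstruction: 400, EfSearch: 128})

	// Favor latency: keep the default graph but narrow the search candidate list.
	lowLatency := withDefaults(HNSWParams{EfSearch: 32})

	fmt.Printf("%+v\n%+v\n", highRecall, lowLatency)
}
```

Of the three tuned parameters, only `EfSearch` is described as a search-time setting, so it is the natural first knob to adjust when trading accuracy against latency on an index that is already built.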

Thread Safety

Collection is safe for concurrent use. A read-write lock (sync.RWMutex) lets read operations such as Search and Count run concurrently, while write operations such as Upsert and Delete take exclusive access.
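The locking pattern can be sketched with a toy store: writers take `Lock`, readers take `RLock`. `safeStore` and `fill` are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// safeStore is a toy type guarded by a read-write lock.
type safeStore struct {
	mu     sync.RWMutex
	points map[string][]float32
}

// put takes the write lock, so it excludes all readers and other writers.
func (s *safeStore) put(id string, v []float32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.points[id] = v
}

// count takes the read lock, so many count calls can run at once.
func (s *safeStore) count() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.points)
}

// fill writes n distinct points from n goroutines while n readers run
// concurrently, then returns the final count.
func fill(n int) int {
	s := &safeStore{points: make(map[string][]float32)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func(i int) {
			defer wg.Done()
			s.put(fmt.Sprintf("p%d", i), []float32{float32(i)})
		}(i)
		go func() {
			defer wg.Done()
			_ = s.count()
		}()
	}
	wg.Wait()
	return s.count()
}

func main() {
	fmt.Println(fill(100))
}
```

Without the lock, concurrent writes to a Go map are a data race and can crash the program. With it, every write is applied and none is lost.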

Error Handling

Common errors and handling strategies:

Error	Cause	Handling
Empty collection name	Invalid parameter	Check before initialization
Vector length mismatch	Data inconsistency	Validate before upsert
Storage failure	Disk issues	Check disk space and permissions
Index failure	Memory issues	Reduce data volume or parameters
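Because the collection methods wrap underlying errors with `%w`, callers can inspect the cause with `errors.Is`. The sketch below shows the pattern only: `errDiskFull`, `persist`, `upsert`, and `isStorageError` are hypothetical, since this document does not list the error values that GoVector's storage layer returns.

```go
package main

import (
	"errors"
	"fmt"
)

// errDiskFull is a hypothetical sentinel error used for illustration.
var errDiskFull = errors.New("disk full")

// persist simulates a storage layer that fails.
func persist() error { return errDiskFull }

// upsert wraps the storage error with %w, as Collection.Upsert does.
func upsert() error {
	if err := persist(); err != nil {
		return fmt.Errorf("failed to persist points: %w", err)
	}
	return nil
}

// isStorageError reports whether err was caused by errDiskFull,
// looking through any number of %w wrappers.
func isStorageError(err error) bool {
	return errors.Is(err, errDiskFull)
}

func main() {
	err := upsert()
	if isStorageError(err) {
		fmt.Println("check disk space and permissions:", err)
	}
}
```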

Best Practices

Batch Operations: Pass many points to a single Upsert call instead of upserting one point at a time; each call takes the write lock and persists to storage once
Index Selection: Choose Flat for datasets under 10,000 vectors and HNSW for larger datasets
Memory Management: Enable quantization (UseQuantization) for large datasets to reduce memory usage
Filter Optimization: Pass a filter to Search to narrow the search scope instead of filtering the results afterwards
Regular Persistence: Important data should be persisted to disk regularly
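For the batching advice, a small helper that splits a large slice into fixed-size batches is often enough. This is a minimal sketch: `point` and `chunk` are hypothetical names, and the right batch size depends on the workload.

```go
package main

import "fmt"

// point is a hypothetical stand-in for GoVector's PointStruct.
type point struct {
	ID     string
	Vector []float32
}

// chunk splits points into consecutive batches of at most size elements.
// The batches share memory with the input slice.
func chunk(points []point, size int) [][]point {
	if size <= 0 {
		return nil
	}
	var batches [][]point
	for start := 0; start < len(points); start += size {
		end := start + size
		if end > len(points) {
			end = len(points)
		}
		batches = append(batches, points[start:end])
	}
	return batches
}

func main() {
	var points []point
	for i := 0; i < 5; i++ {
		points = append(points, point{ID: fmt.Sprintf("p%d", i), Vector: []float32{float32(i)}})
	}
	for _, batch := range chunk(points, 2) {
		// In real code, each batch would be passed to one Upsert call.
		fmt.Println(len(batch))
	}
}
```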