Collection Management
Overview
Collection is the core abstraction for managing vector data in GoVector. It provides unified interfaces for vector storage, indexing, and search operations, supporting multiple index types and distance metrics.
Type Definition
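```go
// Collection groups the index, storage, and settings for one set of vectors.
type Collection struct {
	mu         sync.RWMutex // guards the fields below
	name       string
	vectorLen  int // required length of every stored and query vector
	metric     Distance
	indexType  IndexType
	hnswParams *HNSWParams
	flatIndex  *FlatIndex // set when indexType is IndexTypeFlat
	hnswIndex  *HNSWIndex // set when indexType is IndexTypeHNSW
	storage    *Storage
	quantizer  *SQ8Quantizer // set when quantization is enabled
	useQuant   bool
	closed     bool // checked by Upsert, Search, and Delete
}
```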
Core Operations
Collection Initialization
```go
// NewCollection validates opt, opens storage, builds the configured index,
// and reloads any previously persisted points.
func NewCollection(opt CollectionOptions) (*Collection, error) {
	if opt.Name == "" {
		return nil, errors.New("collection name cannot be empty")
	}
	if opt.VectorLen <= 0 {
		return nil, errors.New("vector length must be positive")
	}
	if opt.Metric != Cosine && opt.Metric != Euclidean && opt.Metric != Dot {
		return nil, errors.New("invalid distance metric")
	}
	// Reject unknown index types before opening storage; otherwise the
	// collection would be created without an index and Search would
	// silently return no results.
	if opt.IndexType != IndexTypeFlat && opt.IndexType != IndexTypeHNSW {
		return nil, errors.New("invalid index type")
	}

	col := &Collection{
		name:       opt.Name,
		vectorLen:  opt.VectorLen,
		metric:     opt.Metric,
		indexType:  opt.IndexType,
		hnswParams: opt.HnswParams,
	}

	// Initialize storage.
	storage, err := NewStorage(opt.StoragePath, opt.UseQuantization)
	if err != nil {
		return nil, fmt.Errorf("failed to initialize storage: %w", err)
	}
	col.storage = storage

	// Initialize the index for the requested type.
	switch opt.IndexType {
	case IndexTypeFlat:
		col.flatIndex = NewFlatIndex(opt.Metric)
	case IndexTypeHNSW:
		if col.hnswParams == nil {
			// Copy the defaults so this collection never shares (or mutates)
			// the package-level DefaultHNSWParams value.
			params := DefaultHNSWParams
			col.hnswParams = &params
		}
		hnswIndex, err := NewHNSWIndex(opt.Metric, col.hnswParams)
		if err != nil {
			return nil, fmt.Errorf("failed to initialize HNSW index: %w", err)
		}
		col.hnswIndex = hnswIndex
	}

	// Initialize quantization.
	if opt.UseQuantization {
		col.quantizer = NewSQ8Quantizer()
		col.useQuant = true
	}

	// Load existing data from storage.
	if err := col.loadFromStorage(); err != nil {
		return nil, fmt.Errorf("failed to load from storage: %w", err)
	}
	return col, nil
}
```
Upsert Operation
Upsert adds or updates vectors in a single call: a point whose ID already exists is overwritten; otherwise the point is inserted.
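```go
// Upsert inserts new points and overwrites points whose IDs already exist.
func (c *Collection) Upsert(points []*PointStruct) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.closed {
		return errors.New("collection is closed")
	}

	// Validate every point before touching the index, so an invalid point
	// cannot leave the batch partially applied.
	for _, p := range points {
		if p == nil {
			return errors.New("point cannot be nil")
		}
		if len(p.Vector) != c.vectorLen {
			return fmt.Errorf("vector length mismatch: expected %d, got %d",
				c.vectorLen, len(p.Vector))
		}
	}

	// Add to whichever index is configured.
	if c.flatIndex != nil {
		if err := c.flatIndex.AddPoints(points); err != nil {
			return err
		}
	}
	if c.hnswIndex != nil {
		if err := c.hnswIndex.AddPoints(points); err != nil {
			return err
		}
	}

	// Persist to storage. The in-memory index is already updated here, so a
	// failure at this step leaves the index ahead of storage.
	if err := c.storage.UpsertPoints(c.name, points); err != nil {
		return fmt.Errorf("failed to persist points: %w", err)
	}
	return nil
}
```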
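The insert-or-overwrite behavior described above can be illustrated with a small, self-contained sketch. This is not GoVector's implementation: `point`, `store`, and `upsert` are hypothetical stand-ins that use a plain map in place of the real index and storage.

```go
package main

import "fmt"

// point is a hypothetical stand-in for PointStruct with only the fields
// this sketch needs.
type point struct {
	ID     string
	Vector []float32
}

// store is a toy in-memory index keyed by point ID (an assumption for
// illustration, not GoVector's data structure).
type store struct {
	dim    int
	points map[string][]float32
}

func newStore(dim int) *store {
	return &store{dim: dim, points: make(map[string][]float32)}
}

// upsert validates the whole batch before writing anything, then inserts
// new IDs and overwrites existing ones. It reports how many points fell
// into each group.
func (s *store) upsert(batch []point) (inserted, updated int, err error) {
	for _, p := range batch {
		if len(p.Vector) != s.dim {
			return 0, 0, fmt.Errorf("vector length mismatch: expected %d, got %d",
				s.dim, len(p.Vector))
		}
	}
	for _, p := range batch {
		if _, exists := s.points[p.ID]; exists {
			updated++
		} else {
			inserted++
		}
		// Copy the vector so later changes by the caller do not leak in.
		s.points[p.ID] = append([]float32(nil), p.Vector...)
	}
	return inserted, updated, nil
}

func main() {
	s := newStore(2)
	ins, upd, _ := s.upsert([]point{{"a", []float32{1, 0}}, {"b", []float32{0, 1}}})
	fmt.Println(ins, upd) // 2 0
	ins, upd, _ = s.upsert([]point{{"a", []float32{0.5, 0.5}}})
	fmt.Println(ins, upd) // 0 1
}
```

Validating the full batch first mirrors the validation loop in Upsert: a single bad vector rejects the call before any point is written.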
Search Operation
Search returns the topK vectors most similar to the query vector, optionally restricted by a filter.
```go
// Search returns up to topK points ranked by similarity to query.
func (c *Collection) Search(query []float32, filter *Filter, topK int) ([]ScoredPoint, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()

	if c.closed {
		return nil, errors.New("collection is closed")
	}
	if len(query) != c.vectorLen {
		return nil, fmt.Errorf("query vector length mismatch: expected %d, got %d",
			c.vectorLen, len(query))
	}
	if topK <= 0 {
		return nil, errors.New("topK must be positive")
	}

	// Delegate to whichever index is configured.
	var results []ScoredPoint
	var err error
	if c.flatIndex != nil {
		results, err = c.flatIndex.Search(query, filter, topK)
	} else if c.hnswIndex != nil {
		results, err = c.hnswIndex.Search(query, filter, topK)
	}
	if err != nil {
		return nil, err
	}
	return results, nil
}
```
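To make the top-K idea concrete, here is a minimal brute-force sketch of the kind of exact scan a flat index performs. It is an illustration under assumptions, not GoVector's FlatIndex: `scored`, `dot`, and `searchFlat` are hypothetical names, scoring uses the dot product only, and all vectors are assumed to have the same length as the query.

```go
package main

import (
	"fmt"
	"sort"
)

// scored pairs a point ID with its similarity score (hypothetical type).
type scored struct {
	ID    string
	Score float32
}

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float32) float32 {
	var sum float32
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// searchFlat scores every stored vector against the query, sorts by
// descending score, and keeps the first topK results.
func searchFlat(vectors map[string][]float32, query []float32, topK int) []scored {
	if topK <= 0 {
		return nil
	}
	results := make([]scored, 0, len(vectors))
	for id, v := range vectors {
		results = append(results, scored{ID: id, Score: dot(query, v)})
	}
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].ID < results[j].ID // break ties deterministically
	})
	if topK < len(results) {
		results = results[:topK]
	}
	return results
}

func main() {
	vectors := map[string][]float32{
		"a": {1, 0},
		"b": {0, 1},
		"c": {0.5, 0.5},
	}
	for _, r := range searchFlat(vectors, []float32{1, 0}, 2) {
		fmt.Println(r.ID, r.Score)
	}
}
```

A full scan costs time proportional to the number of stored vectors per query, which is why an approximate index such as HNSW becomes attractive as the collection grows.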
Delete Operation
Delete removes vectors by ID, by filter condition, or both, and returns the number of vectors removed.
```go
// Delete removes the points matching pointIDs or filter and returns how many
// were removed from the index.
func (c *Collection) Delete(pointIDs []string, filter *Filter) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.closed {
		return 0, errors.New("collection is closed")
	}

	deletedCount := 0

	// Delete from whichever index is configured.
	if c.flatIndex != nil {
		count, err := c.flatIndex.Delete(pointIDs, filter)
		if err != nil {
			return deletedCount, err
		}
		deletedCount += count
	}
	if c.hnswIndex != nil {
		count, err := c.hnswIndex.Delete(pointIDs, filter)
		if err != nil {
			return deletedCount, err
		}
		deletedCount += count
	}

	// Delete from storage.
	if err := c.storage.DeletePoints(c.name, pointIDs, filter); err != nil {
		return deletedCount, fmt.Errorf("failed to delete from storage: %w", err)
	}
	return deletedCount, nil
}
```
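The combination of ID-based and filter-based deletion can be sketched as follows. The structure of GoVector's Filter type is not shown in this document, so the sketch substitutes a hypothetical predicate function over a string payload; `deletePoints` is likewise an illustrative name.

```go
package main

import "fmt"

// deletePoints removes the listed IDs, then every remaining point whose
// payload satisfies match (a nil match means "no filter"). It returns the
// number of points actually removed, counting each point once.
func deletePoints(payloads map[string]map[string]string, ids []string,
	match func(map[string]string) bool) int {
	deleted := 0
	for _, id := range ids {
		if _, ok := payloads[id]; ok {
			delete(payloads, id)
			deleted++
		}
	}
	if match != nil {
		for id, payload := range payloads {
			if match(payload) {
				delete(payloads, id) // deleting while ranging over a map is safe in Go
				deleted++
			}
		}
	}
	return deleted
}

func main() {
	payloads := map[string]map[string]string{
		"a": {"lang": "en"},
		"b": {"lang": "de"},
		"c": {"lang": "de"},
	}
	isGerman := func(p map[string]string) bool { return p["lang"] == "de" }
	fmt.Println(deletePoints(payloads, []string{"a"}, isGerman)) // 3
}
```

Unknown or repeated IDs are skipped rather than treated as errors, so the returned count reflects only points that existed.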
Count Operation
Count returns the total number of vectors in the collection.
```go
func (c *Collection) Count() int {
	c.mu.RLock()
	defer c.mu.RUnlock()

	if c.flatIndex != nil {
		return c.flatIndex.Size()
	}
	if c.hnswIndex != nil {
		return c.hnswIndex.Size()
	}
	return 0
}
```
Collection Configuration
CollectionOptions
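```go
type CollectionOptions struct {
	Name            string      // required, must be non-empty
	VectorLen       int         // required, must be positive
	Metric          Distance    // Cosine, Euclidean, or Dot
	IndexType       IndexType   // IndexTypeFlat or IndexTypeHNSW
	StoragePath     string      // passed to NewStorage
	HnswParams      *HNSWParams // nil selects DefaultHNSWParams for HNSW
	UseQuantization bool        // enables the SQ8 quantizer
}
```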
Distance Metrics
GoVector supports three distance metrics:
| Metric | Description | Use Case |
|---|---|---|
| Cosine | Cosine Similarity | Text embeddings, semantic search |
| Euclidean | Euclidean Distance | Image features, general similarity |
| Dot | Dot Product | Unnormalized vectors, recommendation systems |
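
For reference, the three metrics in the table correspond to the standard formulas below. This is a self-contained sketch with hypothetical function names; how GoVector computes or normalizes scores internally is not shown in this document. Note that cosine similarity and dot product grow with similarity, while Euclidean distance shrinks.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float32) float64 {
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return sum
}

// euclidean returns the straight-line distance between a and b.
func euclidean(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

// cosine returns the cosine of the angle between a and b, or 0 when either
// vector has zero length (a convention chosen for this sketch).
func cosine(a, b []float32) float64 {
	denom := math.Sqrt(dot(a, a) * dot(b, b))
	if denom == 0 {
		return 0
	}
	return dot(a, b) / denom
}

func main() {
	a, b := []float32{1, 2, 3}, []float32{4, 5, 6}
	fmt.Println("dot:", dot(a, b))
	fmt.Println("euclidean:", euclidean(a, b))
	fmt.Println("cosine:", cosine(a, b))
}
```

For vectors normalized to unit length, the dot product equals cosine similarity, which is why the two are often interchangeable for normalized embeddings.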
Index Types
| Index Type | Description | Best For |
|---|---|---|
| Flat | Brute-force exact search | Small datasets, accuracy-critical scenarios |
| HNSW | Hierarchical Navigable Small World | Large datasets, latency-critical scenarios |
HNSW Parameter Configuration
HNSW index performance depends on the following key parameters:
```go
type HNSWParams struct {
	M              int // Maximum connections per node (default: 16)
	EfConstruction int // Candidate list size during construction (default: 200)
	EfSearch       int // Candidate list size during search (default: 64)
	K              int // Number of neighbors (default: 10)
}
```
Parameter tuning recommendations:
- M: Higher values improve accuracy but increase memory usage and construction time
- EfConstruction: Higher values improve index quality but increase construction time
- EfSearch: Higher values improve search accuracy but increase latency
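
When only some parameters need tuning, a helper that fills in the documented defaults for the rest can be convenient. The sketch below redeclares HNSWParams so that it runs on its own; `withDefaults` is a hypothetical helper, not part of GoVector, and it treats any non-positive field as "unset".

```go
package main

import "fmt"

// HNSWParams mirrors the struct documented above so the sketch is
// self-contained.
type HNSWParams struct {
	M              int
	EfConstruction int
	EfSearch       int
	K              int
}

// withDefaults returns a copy of p in which every non-positive field is
// replaced by its documented default.
func withDefaults(p HNSWParams) HNSWParams {
	if p.M <= 0 {
		p.M = 16
	}
	if p.EfConstruction <= 0 {
		p.EfConstruction = 200
	}
	if p.EfSearch <= 0 {
		p.EfSearch = 64
	}
	if p.K <= 0 {
		p.K = 10
	}
	return p
}

func main() {
	// Favor accuracy: raise M and EfSearch, keep the other defaults.
	p := withDefaults(HNSWParams{M: 32, EfSearch: 128})
	fmt.Printf("%+v\n", p)
}
```

Because the helper takes and returns the struct by value, callers can derive several tuned variants without affecting one another.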
Thread Safety
Collection is safe for concurrent use. A read-write lock (sync.RWMutex) lets read operations such as Search and Count run concurrently, while write operations such as Upsert and Delete take exclusive access.
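The locking pattern can be shown in isolation with a small sketch. `safeMap` and `runConcurrent` are hypothetical names used only for illustration; the point is the split between `RLock` for readers and `Lock` for writers provided by `sync.RWMutex`.

```go
package main

import (
	"fmt"
	"sync"
)

// safeMap guards a map with a read-write lock, in the same way Collection
// guards its indexes.
type safeMap struct {
	mu    sync.RWMutex
	items map[string]int
}

// set takes the exclusive lock: no reader or other writer runs meanwhile.
func (m *safeMap) set(key string, value int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.items[key] = value
}

// size takes the shared lock: any number of readers may hold it at once.
func (m *safeMap) size() int {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return len(m.items)
}

// runConcurrent starts n goroutines that each write one key and read the
// size, then returns the final size.
func runConcurrent(n int) int {
	m := &safeMap{items: make(map[string]int)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.set(fmt.Sprintf("key-%d", i), i)
			_ = m.size()
		}(i)
	}
	wg.Wait()
	return m.size()
}

func main() {
	fmt.Println(runConcurrent(8)) // 8
}
```

Without the lock, concurrent writes to a Go map are a data race and can crash the program, so the exclusive lock on the write path is required for correctness, not only for consistency.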
Error Handling
Common errors and handling strategies:
| Error | Cause | Handling |
|---|---|---|
| Empty collection name | Invalid parameter | Check before initialization |
| Vector length mismatch | Data inconsistency | Validate before upsert |
| Storage failure | Disk issues | Check disk space and permissions |
| Index failure | Memory issues | Reduce data volume or parameters |
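
Because the collection methods wrap underlying errors with `%w`, callers can inspect the cause with `errors.Is` instead of matching on message text. The sketch below is an assumption-laden illustration: `errDiskFull`, `persist`, `upsert`, and `classify` are hypothetical, and this document does not say which sentinel errors GoVector actually exposes.

```go
package main

import (
	"errors"
	"fmt"
)

// errDiskFull is a hypothetical sentinel error standing in for a storage
// failure.
var errDiskFull = errors.New("disk full")

// persist simulates a storage layer that always fails.
func persist() error {
	return errDiskFull
}

// upsert wraps the storage error with %w, as the Upsert method above does.
func upsert() error {
	if err := persist(); err != nil {
		return fmt.Errorf("failed to persist points: %w", err)
	}
	return nil
}

// classify maps an error to a handling hint by inspecting its wrapped cause.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, errDiskFull):
		return "check disk space and permissions"
	default:
		return "unexpected error"
	}
}

func main() {
	err := upsert()
	fmt.Println(err)
	fmt.Println(classify(err))
}
```

Wrapping keeps the human-readable context ("failed to persist points") while preserving the original error for programmatic checks.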
Best Practices
- Batch Operations: Pass many points to a single Upsert call instead of upserting one point at a time; each call takes the write lock and persists to storage
- Index Selection: As a rule of thumb, choose Flat for datasets under about 10,000 vectors and HNSW for larger datasets
- Memory Management: Enable quantization (UseQuantization) for large datasets to reduce memory usage
- Filter Optimization: Use early filtering to reduce search scope
- Regular Persistence: Important data should be persisted to disk regularly
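
The batching advice can be sketched with a generic helper that splits a large slice into fixed-size batches and hands each batch to an upsert function. `chunk` and `upsertInBatches` are hypothetical helpers (they require Go 1.18 or later for generics), and the batch size in the example is arbitrary rather than a GoVector recommendation.

```go
package main

import "fmt"

// chunk splits items into consecutive batches of at most size elements.
// It returns nil when size is not positive.
func chunk[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var batches [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		batches = append(batches, items[start:end])
	}
	return batches
}

// upsertInBatches calls upsert once per batch and stops at the first error.
func upsertInBatches[T any](items []T, size int, upsert func([]T) error) error {
	if size <= 0 {
		return fmt.Errorf("batch size must be positive, got %d", size)
	}
	for _, batch := range chunk(items, size) {
		if err := upsert(batch); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ids := []int{1, 2, 3, 4, 5}
	err := upsertInBatches(ids, 2, func(batch []int) error {
		fmt.Println("upserting", batch)
		return nil
	})
	fmt.Println("error:", err)
}
```

With a real collection, the callback would wrap a call to Upsert, so each batch takes the write lock and persists once instead of once per point.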