Data Model
GoVector's data model is designed to be compatible with Qdrant while providing efficient storage and retrieval of vector data. This document describes the core data structures, their relationships, and best practices for working with them.
Core Data Structures
Point
A Point represents a single vector record with metadata.
{
  "id": "doc_1",
  "vector": [0.1, 0.2, 0.3, /* ... */],
  "payload": {
    "category": "tech",
    "author": "John",
    "timestamp": 1620000000,
    "tags": ["AI", "vector"]
  }
}
Fields
- id: (required) Unique identifier for the point
- vector: (required) Floating-point array representing the vector embedding
- payload: (optional) Key-value metadata for filtering and additional context
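As a minimal sketch of how these three fields might map onto a Go type, the following uses a hypothetical `Point` struct and `Validate` helper. Neither name is taken from GoVector (the example later in this document uses `core.PointStruct`), and treating the ID as a string is an assumption based on the JSON sample above.

```go
package main

import (
	"errors"
	"fmt"
)

// Point is a hypothetical mirror of the JSON record shown above.
// It is not GoVector's own type.
type Point struct {
	ID      string
	Vector  []float32
	Payload map[string]any // optional; nil is allowed
}

// Validate enforces the two required fields: a non-empty ID and a
// non-empty vector. The payload is optional and is not checked.
func (p Point) Validate() error {
	if p.ID == "" {
		return errors.New("point: id is required")
	}
	if len(p.Vector) == 0 {
		return errors.New("point: vector is required")
	}
	return nil
}

func main() {
	ok := Point{ID: "doc_1", Vector: []float32{0.1, 0.2, 0.3}}
	fmt.Println(ok.Validate()) // <nil>

	missing := Point{ID: "doc_2"} // no vector
	fmt.Println(missing.Validate())
}
```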
Scored Point
A Scored Point is returned by search operations, including the similarity score.
Fields
- id: Unique identifier of the matched point
- score: Similarity score based on the distance metric (higher = more similar for Cosine/Dot, lower = more similar for Euclidean)
- payload: Metadata of the matched point
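Because the meaning of the score depends on the metric, client code that re-sorts or merges results has to know which direction is "better". The sketch below illustrates this with a hypothetical `ScoredPoint` struct and `SortBest` helper; both names are assumptions for illustration, not part of GoVector.

```go
package main

import (
	"fmt"
	"sort"
)

// ScoredPoint is a hypothetical mirror of the fields listed above.
type ScoredPoint struct {
	ID      string
	Score   float32
	Payload map[string]any
}

// SortBest orders results best-first. Pass higherIsBetter=true for
// Cosine and Dot, and false for Euclidean.
func SortBest(results []ScoredPoint, higherIsBetter bool) {
	sort.SliceStable(results, func(i, j int) bool {
		if higherIsBetter {
			return results[i].Score > results[j].Score
		}
		return results[i].Score < results[j].Score
	})
}

func main() {
	results := []ScoredPoint{
		{ID: "a", Score: 0.2},
		{ID: "b", Score: 0.9},
		{ID: "c", Score: 0.5},
	}

	SortBest(results, true)    // Cosine or Dot
	fmt.Println(results[0].ID) // b

	SortBest(results, false)   // Euclidean
	fmt.Println(results[0].ID) // a
}
```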
Filter
A Filter defines conditions for filtering points based on their payload.
{
  "must": [
    {
      "key": "category",
      "type": "exact",
      "match": {"value": "tech"}
    },
    {
      "key": "timestamp",
      "type": "range",
      "range": {"gte": 1620000000, "lte": 1630000000}
    }
  ],
  "must_not": [
    {
      "key": "tags",
      "type": "contains",
      "match": {"value": "deprecated"}
    }
  ]
}
Fields
- must: Conditions that must be satisfied
- must_not: Conditions that must not be satisfied
Condition
A Condition defines a single filtering condition.
Types
- exact: Exact value match
- range: Numeric range comparison (gt, gte, lt, lte)
- prefix: Prefix match for strings
- contains: Contains match for arrays or strings
- regex: Regular expression match for strings
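To make the semantics of the five condition types concrete, here is a simplified evaluator. It is a sketch under stated assumptions, not GoVector's implementation: the `Payload`, `Condition`, and `Filter` types are hypothetical, the range case only handles `gte` and `lte`, and a key that is missing from the payload is treated as "does not match" (so it fails `must` and passes `must_not`, consistent with the note under Common Issues).

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

// Payload, Condition, and Filter are simplified, hypothetical types.
type Payload map[string]any

type Condition struct {
	Key      string
	Type     string   // "exact", "range", "prefix", "contains", or "regex"
	Value    any      // operand for every type except "range"
	Gte, Lte *float64 // inclusive bounds for "range"; nil means unbounded
}

type Filter struct{ Must, MustNot []Condition }

// toFloat converts the numeric types this sketch understands.
func toFloat(v any) (float64, bool) {
	switch n := v.(type) {
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	case float64:
		return n, true
	}
	return 0, false
}

// Matches reports whether the payload satisfies the condition.
// A key that is absent from the payload never matches.
func (c Condition) Matches(p Payload) bool {
	v, ok := p[c.Key]
	if !ok {
		return false
	}
	want, isStr := c.Value.(string)
	switch c.Type {
	case "exact":
		return reflect.DeepEqual(v, c.Value)
	case "range":
		n, ok := toFloat(v)
		return ok && (c.Gte == nil || n >= *c.Gte) && (c.Lte == nil || n <= *c.Lte)
	case "prefix":
		s, ok := v.(string)
		return ok && isStr && strings.HasPrefix(s, want)
	case "contains":
		switch t := v.(type) {
		case string:
			return isStr && strings.Contains(t, want)
		case []string:
			for _, item := range t {
				if isStr && item == want {
					return true
				}
			}
		}
		return false
	case "regex":
		s, ok := v.(string)
		re, err := regexp.Compile(want)
		return ok && isStr && err == nil && re.MatchString(s)
	}
	return false
}

// Matches requires every Must condition to hold and no MustNot condition to hold.
func (f Filter) Matches(p Payload) bool {
	for _, c := range f.Must {
		if !c.Matches(p) {
			return false
		}
	}
	for _, c := range f.MustNot {
		if c.Matches(p) {
			return false
		}
	}
	return true
}

func main() {
	lo, hi := 1620000000.0, 1630000000.0
	f := Filter{
		Must: []Condition{
			{Key: "category", Type: "exact", Value: "tech"},
			{Key: "timestamp", Type: "range", Gte: &lo, Lte: &hi},
		},
		MustNot: []Condition{{Key: "tags", Type: "contains", Value: "deprecated"}},
	}
	p := Payload{"category": "tech", "timestamp": 1620000000, "tags": []string{"AI", "vector"}}
	fmt.Println(f.Matches(p)) // true
}
```

The sketch recompiles the regular expression on every call for brevity; a real evaluator would likely compile each pattern once per filter.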
Data Lifecycle
Upsert
- Validation: Check vector dimension matches collection configuration
- Versioning: Generate version number for concurrency control
- Persistence: Serialize point to Protocol Buffers and write to BoltDB
- Indexing: Update in-memory index for fast retrieval
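The upsert steps above can be sketched with an in-memory stand-in. Everything here is an assumption made for illustration: the `Collection` type, the global version counter, and the use of a single Go map in place of both the BoltDB write and the index update. Protocol Buffers serialization is omitted entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// Point carries the version assigned at upsert time.
type Point struct {
	ID      string
	Vector  []float32
	Version uint64
}

// Collection is a hypothetical in-memory stand-in. One map replaces both
// the persistent store and the vector index.
type Collection struct {
	mu      sync.Mutex
	dim     int
	version uint64
	points  map[string]Point
}

func NewCollection(dim int) *Collection {
	return &Collection{dim: dim, points: make(map[string]Point)}
}

// Upsert validates, versions, and stores the point, returning the
// version it was assigned.
func (c *Collection) Upsert(p Point) (uint64, error) {
	// Step 1, validation: the vector must match the configured dimension.
	if len(p.Vector) != c.dim {
		return 0, fmt.Errorf("dimension mismatch: got %d, want %d", len(p.Vector), c.dim)
	}

	c.mu.Lock()
	defer c.mu.Unlock()

	// Step 2, versioning: a monotonically increasing counter.
	c.version++
	p.Version = c.version

	// Steps 3 and 4, persistence and indexing, collapsed into one map write.
	// An existing point with the same ID is overwritten.
	c.points[p.ID] = p
	return p.Version, nil
}

func main() {
	c := NewCollection(3)

	v, err := c.Upsert(Point{ID: "doc_1", Vector: []float32{0.1, 0.2, 0.3}})
	fmt.Println(v, err) // 1 <nil>

	_, err = c.Upsert(Point{ID: "doc_2", Vector: []float32{0.1}})
	fmt.Println(err) // dimension mismatch: got 1, want 3
}
```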
Search
- Validation: Check query vector dimension
- Index Search: Use the HNSW index (approximate) or the Flat index (exact, exhaustive scan) to find nearest neighbors
- Filtering: Apply payload filters to search results
- Scoring: Recalculate exact scores for filtered results
- Result Return: Return top K results
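For intuition, the search flow can be approximated by a brute-force (Flat-style) scan. This is a sketch, not GoVector's code: the `Search` function and its types are hypothetical, the filter is a plain Go predicate, and it is applied during the scan rather than after an index lookup as described above. Scores are cosine similarities, so higher is better.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type Point struct {
	ID      string
	Vector  []float32
	Payload map[string]string
}

type ScoredPoint struct {
	ID    string
	Score float64
}

// cosine returns the cosine similarity of a and b (0 if either is all zeros).
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Search scans every point, keeps those accepted by keep (nil keeps all),
// scores them exactly, and returns at most topK results, best first.
func Search(points []Point, query []float32, keep func(Point) bool, topK int) ([]ScoredPoint, error) {
	var out []ScoredPoint
	for _, p := range points {
		if len(p.Vector) != len(query) {
			return nil, fmt.Errorf("dimension mismatch: got %d, want %d", len(query), len(p.Vector))
		}
		if keep != nil && !keep(p) {
			continue
		}
		out = append(out, ScoredPoint{ID: p.ID, Score: cosine(query, p.Vector)})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if topK >= 0 && len(out) > topK {
		out = out[:topK]
	}
	return out, nil
}

func main() {
	points := []Point{
		{ID: "a", Vector: []float32{1, 0}, Payload: map[string]string{"category": "tech"}},
		{ID: "b", Vector: []float32{0, 1}, Payload: map[string]string{"category": "tech"}},
		{ID: "c", Vector: []float32{1, 0}, Payload: map[string]string{"category": "news"}},
	}
	onlyTech := func(p Point) bool { return p.Payload["category"] == "tech" }

	results, err := Search(points, []float32{1, 0}, onlyTech, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(results[0].ID) // a
}
```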
Delete
- Target Identification: Determine points to delete by ID or filter
- Persistence: Remove points from BoltDB
- Indexing: Remove points from in-memory index
- Result Return: Return count of deleted points
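The two ways of identifying deletion targets, by ID or by filter, can be sketched as follows. The `Store` type and both method names are hypothetical, and a single map again stands in for both BoltDB and the in-memory index.

```go
package main

import "fmt"

type Point struct {
	ID      string
	Payload map[string]string
}

// Store is a hypothetical stand-in for the persisted data plus the index.
type Store map[string]Point

// DeleteByIDs removes the listed IDs and returns how many actually existed.
func (s Store) DeleteByIDs(ids ...string) int {
	n := 0
	for _, id := range ids {
		if _, ok := s[id]; ok {
			delete(s, id)
			n++
		}
	}
	return n
}

// DeleteWhere removes every point accepted by match and returns the count.
func (s Store) DeleteWhere(match func(Point) bool) int {
	n := 0
	for id, p := range s {
		if match(p) {
			delete(s, id) // deleting during a range loop is safe for Go maps
			n++
		}
	}
	return n
}

func main() {
	s := Store{
		"doc_1": {ID: "doc_1", Payload: map[string]string{"category": "tech"}},
		"doc_2": {ID: "doc_2", Payload: map[string]string{"category": "tech"}},
		"doc_3": {ID: "doc_3", Payload: map[string]string{"category": "news"}},
	}

	fmt.Println(s.DeleteByIDs("doc_1", "missing")) // 1

	isTech := func(p Point) bool { return p.Payload["category"] == "tech" }
	fmt.Println(s.DeleteWhere(isTech)) // 1
	fmt.Println(len(s))                // 1
}
```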
Serialization
GoVector uses Protocol Buffers for efficient serialization of vector data:
- PointStruct: Serialized to bytes and stored in BoltDB
- Value Types: Supports string, int64, double, bool, and bytes
- Backward Compatibility: Uses proto3 syntax, so new fields can be added to the schema without breaking readers of previously stored data
Distance Metrics
GoVector supports three distance metrics for similarity calculation:
| Metric | Description | Similarity Interpretation |
|---|---|---|
| Cosine | Measures the cosine of the angle between vectors | Higher value = more similar |
| Euclidean | Measures the straight-line distance between vectors | Lower value = more similar |
| Dot | Measures the dot product of vectors | Higher value = more similar |
Note: Cosine similarity is the default metric and is recommended for most use cases, especially with normalized vectors.
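The three metrics in the table follow their standard textbook definitions, which can be written directly in Go. The function names below are illustrative and are not GoVector's API; each function assumes both vectors have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// Dot returns the dot product of a and b. Higher means more similar.
func Dot(a, b []float32) float64 {
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return sum
}

// Euclidean returns the straight-line distance between a and b.
// Lower means more similar.
func Euclidean(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

// Cosine returns the cosine of the angle between a and b, in [-1, 1].
// Higher means more similar. It returns 0 if either vector is all zeros.
func Cosine(a, b []float32) float64 {
	na, nb := math.Sqrt(Dot(a, a)), math.Sqrt(Dot(b, b))
	if na == 0 || nb == 0 {
		return 0
	}
	return Dot(a, b) / (na * nb)
}

func main() {
	a := []float32{1, 0}
	b := []float32{0, 1}

	fmt.Println(Cosine(a, a), Cosine(a, b)) // 1 0
	fmt.Println(Dot(a, b))                  // 0
	fmt.Println(Euclidean(a, a))            // 0
	fmt.Println(Euclidean(a, b))            // 1.4142135623730951
}
```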
Best Practices
Point Design
- ID: Use stable, meaningful identifiers (e.g., business primary keys)
- Vector: Ensure dimension matches collection configuration; normalize vectors for Cosine similarity
- Payload: Keep metadata concise; use appropriate types for filtering
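Normalizing a vector means scaling it to unit length (L2 norm of 1). Once both vectors are normalized, their dot product equals their cosine similarity. A minimal sketch, with `Normalize` as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"math"
)

// Normalize returns a copy of v scaled to unit L2 norm.
// A zero vector has no direction, so it is returned as all zeros.
func Normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	norm := math.Sqrt(sum)

	out := make([]float32, len(v))
	if norm == 0 {
		return out
	}
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

func main() {
	fmt.Println(Normalize([]float32{3, 4})) // [0.6 0.8]
}
```

The function returns a new slice rather than modifying its input, so the original embedding stays available if it is needed elsewhere.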
Filter Usage
- Must vs MustNot: Prefer must to narrow results; use must_not sparingly
- Condition Types: Use exact for exact matches, range for numeric ranges, prefix for string prefixes, contains for arrays, and regex for complex string patterns
- Performance: Avoid overly complex filters, especially with regular expressions
Performance Optimization
- Index Selection: Use HNSW for large-scale data, Flat for small datasets
- Quantization: Enable SQ8 quantization for large-scale data to reduce storage usage
- Batch Operations: Use batch upserts for multiple points to improve performance
- TopK: Limit the number of results to reduce network overhead
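To show why 8-bit scalar quantization reduces storage, the sketch below maps each float32 component (4 bytes) to a uint8 code (1 byte) using a per-vector minimum and step size. This is a generic illustration of the technique, not GoVector's SQ8 implementation; the `Quantized` type and both function names are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// Quantized stores one byte per dimension plus the two parameters
// needed to decode it.
type Quantized struct {
	Min   float32 // smallest component of the original vector
	Scale float32 // step between adjacent codes
	Codes []uint8
}

// Quantize maps each component onto 256 evenly spaced levels between
// the vector's minimum and maximum.
func Quantize(v []float32) Quantized {
	if len(v) == 0 {
		return Quantized{}
	}
	lo, hi := v[0], v[0]
	for _, x := range v {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	q := Quantized{Min: lo, Scale: (hi - lo) / 255, Codes: make([]uint8, len(v))}
	if q.Scale == 0 {
		return q // constant vector: every code is 0 and decodes to Min
	}
	for i, x := range v {
		code := math.Round(float64((x - lo) / q.Scale))
		if code > 255 {
			code = 255 // guard against floating-point overshoot
		}
		q.Codes[i] = uint8(code)
	}
	return q
}

// Dequantize reconstructs an approximation of the original vector.
// Each component is off by at most about half a step (Scale / 2).
func (q Quantized) Dequantize() []float32 {
	out := make([]float32, len(q.Codes))
	for i, c := range q.Codes {
		out[i] = q.Min + float32(c)*q.Scale
	}
	return out
}

func main() {
	v := []float32{-1, 0, 0.5, 1}
	q := Quantize(v)
	fmt.Println("codes:", q.Codes)
	fmt.Println("approx:", q.Dequantize())
	fmt.Printf("bytes: %d -> %d (plus 8 for Min and Scale)\n", 4*len(v), len(q.Codes))
}
```

The trade-off is a small reconstruction error in exchange for roughly a 4x reduction in vector storage.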
Common Issues
Dimension Mismatch
- Error: "dimension mismatch"
- Solution: Ensure vector length matches the collection's configured dimension
Filter Not Working
- Possible Causes: Incorrect key name, type mismatch, or non-existent key
- Solution: Verify key names and types; remember that non-existent keys pass must_not conditions
Slow Queries
- Possible Causes: Large dataset with Flat index, complex filters, or too many results
- Solution: Use HNSW index, simplify filters, and reduce topK value
Example Usage
Creating a Point
point := core.PointStruct{
	ID:     "doc_1",
	Vector: []float32{0.1, 0.2, 0.3, /* ... */},
	Payload: core.Payload{
		"category": "tech",
		"author":   "John",
	},
}
Using a Filter
filter := &core.Filter{
	Must: []core.Condition{
		{
			Key:   "category",
			Type:  core.MatchTypeExact,
			Match: core.MatchValue{Value: "tech"},
		},
	},
	MustNot: []core.Condition{
		{
			Key:   "author",
			Type:  core.MatchTypeExact,
			Match: core.MatchValue{Value: "Jane"},
		},
	},
}
Searching with Filter
results, err := collection.Search(queryVector, filter, 10)
if err != nil {
	log.Fatalf("search failed: %v", err) // requires the standard "log" package
}
for _, result := range results {
	fmt.Printf("ID: %s, Score: %.4f, Category: %v\n",
		result.ID, result.Score, result.Payload["category"])
}
Conclusion
GoVector's data model provides a flexible and efficient way to store and retrieve vector data. By following best practices for point design, filter usage, and performance optimization, you can build high-performance vector search applications that scale with your needs.
Related Documentation
- API Reference - Complete API documentation
- Collection Management - Collection creation and management
- Indexing - HNSW and Flat index implementation
- Storage - Persistence and serialization details