Data Model¶

GoVector's data model is designed to be compatible with Qdrant while providing efficient storage and retrieval of vector data. This document describes the core data structures, their relationships, and best practices for working with them.

📊 Core Data Structures¶

Point¶

A Point represents a single vector record with metadata.

{
  "id": "doc_1",
  "vector": [0.1, 0.2, 0.3, /* ... */],
  "payload": {
    "category": "tech",
    "author": "John",
    "timestamp": 1620000000,
    "tags": ["AI", "vector"]
  }
}

Fields¶

id: (required) Unique identifier for the point
vector: (required) Floating-point array representing the vector embedding
payload: (optional) Key-value metadata for filtering and additional context

Scored Point¶

A Scored Point is returned by search operations, including the similarity score.

{
  "id": "doc_1",
  "score": 0.999,
  "payload": {
    "category": "tech",
    "author": "John"
  }
}

Fields¶

id: Unique identifier of the matched point
score: Similarity score based on the distance metric (higher = more similar for Cosine/Dot, lower = more similar for Euclidean)
payload: Metadata of the matched point

Filter¶

A Filter defines conditions for filtering points based on their payload.

{
  "must": [
    {
      "key": "category",
      "type": "exact",
      "match": {"value": "tech"}
    },
    {
      "key": "timestamp",
      "type": "range",
      "range": {"gte": 1620000000, "lte": 1630000000}
    }
  ],
  "must_not": [
    {
      "key": "tags",
      "type": "contains",
      "match": {"value": "deprecated"}
    }
  ]
}

Fields¶

must: Conditions that must be satisfied
must_not: Conditions that must not be satisfied

Condition¶

A Condition defines a single filtering condition.

Types¶

exact: Exact value match
range: Numeric range comparison (gt, gte, lt, lte)
prefix: Prefix match for strings
contains: Contains match for arrays or strings
regex: Regular expression match for strings
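
To make the condition types above concrete, here is a minimal sketch of how a single condition might be evaluated against a payload. The `Condition` struct and the `matches` function are hypothetical stand-ins written for this illustration, not GoVector's actual types, and the sketch assumes string operands for every type except `range`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Condition is a simplified, hypothetical stand-in for a filter condition.
type Condition struct {
	Key      string
	Type     string   // "exact", "range", "prefix", "contains", or "regex"
	Value    string   // operand for every type except "range"
	GTE, LTE *float64 // optional inclusive bounds for "range"
}

// matches reports whether payload satisfies c. A missing key never matches.
func matches(c Condition, payload map[string]interface{}) bool {
	field, ok := payload[c.Key]
	if !ok {
		return false
	}
	if c.Type == "range" {
		var n float64
		switch v := field.(type) {
		case int64:
			n = float64(v)
		case float64:
			n = v
		default:
			return false // range only applies to numeric fields
		}
		return (c.GTE == nil || n >= *c.GTE) && (c.LTE == nil || n <= *c.LTE)
	}
	if c.Type == "contains" {
		// For arrays, "contains" means one element equals the operand.
		if tags, ok := field.([]string); ok {
			for _, t := range tags {
				if t == c.Value {
					return true
				}
			}
			return false
		}
	}
	s, ok := field.(string)
	if !ok {
		return false // the remaining checks operate on strings only
	}
	switch c.Type {
	case "exact":
		return s == c.Value
	case "prefix":
		return strings.HasPrefix(s, c.Value)
	case "contains":
		return strings.Contains(s, c.Value)
	case "regex":
		re, err := regexp.Compile(c.Value)
		return err == nil && re.MatchString(s)
	}
	return false
}

func main() {
	payload := map[string]interface{}{
		"category":  "tech",
		"timestamp": int64(1620000000),
		"tags":      []string{"AI", "vector"},
	}
	lo, hi := 1.62e9, 1.63e9
	fmt.Println(matches(Condition{Key: "category", Type: "exact", Value: "tech"}, payload))        // true
	fmt.Println(matches(Condition{Key: "timestamp", Type: "range", GTE: &lo, LTE: &hi}, payload)) // true
	fmt.Println(matches(Condition{Key: "tags", Type: "contains", Value: "deprecated"}, payload))  // false
}
```

The sketch compiles the regular expression on every call to stay short; a real implementation would more likely compile each pattern once and reuse it.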

🚀 Data Lifecycle¶

Upsert¶

Validation: Check vector dimension matches collection configuration
Versioning: Generate version number for concurrency control
Persistence: Serialize point to Protocol Buffers and write to BoltDB
Indexing: Update in-memory index for fast retrieval
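
The upsert steps above can be sketched with an in-memory stand-in. Everything here (`memCollection`, `record`, `upsert`) is a hypothetical illustration rather than GoVector's API: the persistence step is only marked by a comment, and a plain map plays the role of the in-memory index.

```go
package main

import (
	"fmt"
	"sync"
)

// record is a stored vector plus the version assigned at write time.
type record struct {
	Vector  []float32
	Version uint64
}

// memCollection is a hypothetical in-memory stand-in for a collection.
type memCollection struct {
	mu      sync.Mutex
	dim     int
	version uint64
	points  map[string]record
}

func newMemCollection(dim int) *memCollection {
	return &memCollection{dim: dim, points: make(map[string]record)}
}

// upsert validates the vector, assigns a new version, and stores the point.
func (c *memCollection) upsert(id string, vec []float32) (uint64, error) {
	// 1. Validation: the vector must match the configured dimension.
	if len(vec) != c.dim {
		return 0, fmt.Errorf("dimension mismatch: got %d, want %d", len(vec), c.dim)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	// 2. Versioning: a monotonically increasing counter.
	c.version++
	// 3. Persistence would happen here (omitted in this sketch).
	// 4. Indexing: the map stands in for the in-memory index.
	c.points[id] = record{Vector: append([]float32(nil), vec...), Version: c.version}
	return c.version, nil
}

func main() {
	c := newMemCollection(3)
	v, err := c.upsert("doc_1", []float32{0.1, 0.2, 0.3})
	fmt.Println(v, err) // 1 <nil>
	_, err = c.upsert("doc_2", []float32{0.1, 0.2})
	fmt.Println(err) // dimension mismatch: got 2, want 3
}
```

Copying the vector before storing it keeps later changes to the caller's slice from silently altering the stored point.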

Search¶

Validation: Check query vector dimension
Index Search: Use the HNSW index to find approximate nearest neighbors, or the Flat index for an exact brute-force scan
Filtering: Apply payload filters to search results
Scoring: Recalculate exact scores for filtered results
Result Return: Return top K results
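
As a rough illustration of the search steps, the sketch below runs a brute-force (Flat-style) scan with cosine scoring. The names `point`, `scored`, and `flatSearch` are hypothetical and the code simplifies the pipeline: it filters before scoring and has no HNSW stage, so it shows the shape of the flow rather than GoVector's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"sort"
)

type point struct {
	ID      string
	Vector  []float32
	Payload map[string]string
}

type scored struct {
	ID    string
	Score float64
}

// cosine assumes both vectors have the same length.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// flatSearch validates the query, scans every point, applies the optional
// filter, scores the survivors, and returns the topK highest scores.
// Stored vectors are assumed to have been validated at upsert time.
func flatSearch(points []point, query []float32, dim int, keep func(point) bool, topK int) ([]scored, error) {
	if len(query) != dim {
		return nil, errors.New("dimension mismatch")
	}
	var out []scored
	for _, p := range points {
		if keep != nil && !keep(p) {
			continue
		}
		out = append(out, scored{ID: p.ID, Score: cosine(query, p.Vector)})
	}
	// Higher cosine similarity means a closer match.
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if topK >= 0 && len(out) > topK {
		out = out[:topK]
	}
	return out, nil
}

func main() {
	points := []point{
		{"a", []float32{1, 0}, map[string]string{"category": "tech"}},
		{"b", []float32{0, 1}, map[string]string{"category": "news"}},
		{"c", []float32{1, 1}, map[string]string{"category": "tech"}},
	}
	onlyTech := func(p point) bool { return p.Payload["category"] == "tech" }
	results, err := flatSearch(points, []float32{1, 0}, 2, onlyTech, 10)
	if err != nil {
		fmt.Println("search failed:", err)
		return
	}
	for _, r := range results {
		fmt.Printf("%s %.3f\n", r.ID, r.Score)
	}
}
```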

Delete¶

Target Identification: Determine points to delete by ID or filter
Persistence: Remove points from BoltDB
Indexing: Remove points from in-memory index
Result Return: Return count of deleted points
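
The two ways of identifying delete targets, by ID or by filter, can be sketched over a plain map. The `store` type and its methods are hypothetical names used only for this illustration, and the persistence step is left out.

```go
package main

import "fmt"

// store maps a point ID to its payload; a hypothetical stand-in for the index.
type store map[string]map[string]string

// deleteByIDs removes the given IDs and returns how many actually existed.
func (s store) deleteByIDs(ids ...string) int {
	n := 0
	for _, id := range ids {
		if _, ok := s[id]; ok {
			delete(s, id)
			n++
		}
	}
	return n
}

// deleteWhere removes every point whose payload satisfies match.
func (s store) deleteWhere(match func(payload map[string]string) bool) int {
	n := 0
	for id, payload := range s {
		if match(payload) {
			delete(s, id) // deleting during a range loop is safe in Go
			n++
		}
	}
	return n
}

func main() {
	s := store{
		"doc_1": {"category": "tech"},
		"doc_2": {"category": "news"},
		"doc_3": {"category": "tech"},
	}
	fmt.Println(s.deleteByIDs("doc_1", "missing")) // 1
	isTech := func(p map[string]string) bool { return p["category"] == "tech" }
	fmt.Println(s.deleteWhere(isTech)) // 1
	fmt.Println(len(s))                // 1
}
```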

🔧 Serialization¶

GoVector uses Protocol Buffers for efficient serialization of vector data:

PointStruct: Serialized to bytes and stored in BoltDB
Value Types: Supports string, int64, double, bool, and bytes
Backward Compatibility: Uses proto3 syntax for version compatibility

📈 Distance Metrics¶

GoVector supports three distance metrics for similarity calculation:

Metric	Description	Similarity Interpretation
Cosine	Measures the cosine of the angle between vectors	Higher value = more similar
Euclidean	Measures the straight-line distance between vectors	Lower value = more similar
Dot	Measures the dot product of vectors	Higher value = more similar

Note: Cosine similarity is the default metric and is recommended for most use cases, especially with normalized vectors.
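
For reference, the three metrics can be written out directly. This is a plain-Go sketch of the standard formulas, not GoVector's internal code; the function names are illustrative and both inputs are assumed to have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the dot product of a and b.
func dot(a, b []float32) float64 {
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return sum
}

// euclidean returns the straight-line distance between a and b.
func euclidean(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

// cosine returns the cosine of the angle between a and b,
// or 0 if either vector has zero length.
func cosine(a, b []float32) float64 {
	denom := math.Sqrt(dot(a, a)) * math.Sqrt(dot(b, b))
	if denom == 0 {
		return 0
	}
	return dot(a, b) / denom
}

func main() {
	a := []float32{1, 0}
	b := []float32{0, 1}
	fmt.Println(cosine(a, a), cosine(a, b))       // 1 0
	fmt.Println(euclidean(a, a), euclidean(a, b)) // 0 1.4142135623730951
	fmt.Println(dot(a, a), dot(a, b))             // 1 0
}
```

For unit-length vectors the denominator in `cosine` is 1, so cosine similarity and dot product give the same ranking.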

💡 Best Practices¶

Point Design¶

ID: Use stable, meaningful identifiers (e.g., business primary keys)
Vector: Ensure dimension matches collection configuration; normalize vectors for Cosine similarity
Payload: Keep metadata concise; use appropriate types for filtering
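
Normalizing a vector before upserting it is a small amount of code. The helper below is a sketch with an illustrative name (`normalize`); it is not part of GoVector's API.

```go
package main

import (
	"fmt"
	"math"
)

// normalize returns a unit-length copy of v.
// A zero vector has no direction, so it is returned as all zeros.
func normalize(v []float32) []float32 {
	var sumSq float64
	for _, x := range v {
		sumSq += float64(x) * float64(x)
	}
	out := make([]float32, len(v))
	if sumSq == 0 {
		return out
	}
	norm := math.Sqrt(sumSq)
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

func main() {
	fmt.Println(normalize([]float32{3, 4})) // [0.6 0.8]
}
```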

Filter Usage¶

Must vs MustNot: Prefer must to narrow results; use must_not sparingly
Condition Types: Use exact for exact matches, range for numeric ranges, prefix for string prefixes, contains for arrays, and regex for complex string patterns
Performance: Avoid overly complex filters, especially with regular expressions

Performance Optimization¶

Index Selection: Use HNSW for large-scale data, Flat for small datasets
Quantization: Enable SQ8 quantization for large-scale data to reduce storage usage
Batch Operations: Use batch upserts for multiple points to improve performance
TopK: Limit the number of results to reduce network overhead
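
To show where the storage saving from SQ8 comes from, here is a minimal sketch of 8-bit scalar quantization, assuming SQ8 means mapping each float32 component to one byte. The `sq8` type and the per-vector min/scale scheme are assumptions made for illustration; GoVector's actual quantizer may choose ranges differently.

```go
package main

import (
	"fmt"
	"math"
)

// sq8 holds a vector quantized to one byte per component.
type sq8 struct {
	Min   float32 // smallest component of the original vector
	Scale float32 // width of one quantization step
	Codes []uint8
}

// quantize maps each component linearly from [min, max] onto 0..255.
func quantize(v []float32) sq8 {
	if len(v) == 0 {
		return sq8{}
	}
	lo, hi := v[0], v[0]
	for _, x := range v {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	scale := (hi - lo) / 255
	codes := make([]uint8, len(v))
	if scale > 0 {
		for i, x := range v {
			r := math.Round(float64((x - lo) / scale))
			if r > 255 {
				r = 255 // guard against float rounding at the upper edge
			}
			codes[i] = uint8(r)
		}
	}
	return sq8{Min: lo, Scale: scale, Codes: codes}
}

// dequantize reconstructs an approximation of the original vector.
func (q sq8) dequantize() []float32 {
	out := make([]float32, len(q.Codes))
	for i, c := range q.Codes {
		out[i] = q.Min + float32(c)*q.Scale
	}
	return out
}

func main() {
	v := []float32{-1, 0.25, 1}
	q := quantize(v)
	fmt.Println("codes:", q.Codes)
	fmt.Println("approx:", q.dequantize())
	fmt.Printf("bytes: %d -> %d (plus 8 bytes for Min and Scale)\n", 4*len(v), len(q.Codes))
}
```

The reconstruction error per component is at most about half a quantization step, which is the accuracy traded for the roughly 4x smaller vector storage.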

🚩 Common Issues¶

Dimension Mismatch¶

Error: "dimension mismatch"
Solution: Ensure vector length matches the collection's configured dimension

Filter Not Working¶

Possible Causes: Incorrect key name, type mismatch, or non-existent key
Solution: Verify key names and types; remember that non-existent keys pass must_not conditions
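
The behavior of missing keys is easy to see in a small sketch. The `cond` and `filter` types below are hypothetical, exact-match-only simplifications written for this example: a condition on a missing key never matches, so the point fails `must` but passes `must_not`.

```go
package main

import "fmt"

// cond is a hypothetical exact-match condition.
type cond struct {
	Key   string
	Value string
}

// filter combines conditions that must and must not match.
type filter struct {
	Must    []cond
	MustNot []cond
}

// matches is false when the key is absent from the payload.
func (c cond) matches(payload map[string]string) bool {
	v, ok := payload[c.Key]
	return ok && v == c.Value
}

// accepts requires every Must condition to match and no MustNot condition to match.
func (f filter) accepts(payload map[string]string) bool {
	for _, c := range f.Must {
		if !c.matches(payload) {
			return false
		}
	}
	for _, c := range f.MustNot {
		if c.matches(payload) {
			return false
		}
	}
	return true
}

func main() {
	f := filter{
		Must:    []cond{{"category", "tech"}},
		MustNot: []cond{{"author", "Jane"}},
	}
	fmt.Println(f.accepts(map[string]string{"category": "tech", "author": "John"})) // true
	fmt.Println(f.accepts(map[string]string{"category": "tech"}))                   // true: no author key, so must_not passes
	fmt.Println(f.accepts(map[string]string{"category": "tech", "author": "Jane"})) // false
	fmt.Println(f.accepts(map[string]string{"author": "John"}))                     // false: no category key, so must fails
}
```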

Slow Queries¶

Possible Causes: Large dataset with Flat index, complex filters, or too many results
Solution: Use HNSW index, simplify filters, and reduce topK value

📖 Example Usage¶

Creating a Point¶

point := core.PointStruct{
    ID:     "doc_1",
    Vector: []float32{0.1, 0.2, 0.3, /* ... */},
    Payload: core.Payload{
        "category": "tech",
        "author":  "John",
    },
}

Using a Filter¶

filter := &core.Filter{
    Must: []core.Condition{
        {
            Key:   "category",
            Type:  core.MatchTypeExact,
            Match: core.MatchValue{Value: "tech"},
        },
    },
    MustNot: []core.Condition{
        {
            Key:   "author",
            Type:  core.MatchTypeExact,
            Match: core.MatchValue{Value: "Jane"},
        },
    },
}

Searching with Filter¶

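results, err := collection.Search(queryVector, filter, 10)
if err != nil {
    // Handle the error before touching results.
    log.Fatalf("search failed: %v", err)
}
for _, result := range results {
    fmt.Printf("ID: %s, Score: %.4f, Category: %v\n",
        result.ID, result.Score, result.Payload["category"])
}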

🎯 Conclusion¶

GoVector's data model centers on points, payload filters, and a choice of distance metric. Matching vector dimensions to the collection, keeping payloads concise and correctly typed, and choosing an index that fits the dataset size are the main steps toward fast, scalable vector search.

🔗 Related Documentation¶

API Reference - Complete API documentation
Collection Management - Collection creation and management
Indexing - HNSW and Flat index implementation
Storage - Persistence and serialization details
