Data Model

GoVector's data model is designed to be compatible with Qdrant while providing efficient storage and retrieval of vector data. This document describes the core data structures, their relationships, and best practices for working with them.

📊 Core Data Structures

Point

A Point represents a single vector record with metadata.

{
  "id": "doc_1",
  "vector": [0.1, 0.2, 0.3, /* ... */],
  "payload": {
    "category": "tech",
    "author": "John",
    "timestamp": 1620000000,
    "tags": ["AI", "vector"]
  }
}

Fields

  • id: (required) Unique identifier for the point
  • vector: (required) Floating-point array representing the vector embedding
  • payload: (optional) Key-value metadata for filtering and additional context

Scored Point

A Scored Point is a single search result: the matched point together with its similarity score.

{
  "id": "doc_1",
  "score": 0.999,
  "payload": {
    "category": "tech",
    "author": "John"
  }
}

Fields

  • id: Unique identifier of the matched point
  • score: Similarity score based on the distance metric (higher = more similar for Cosine/Dot, lower = more similar for Euclidean)
  • payload: Metadata of the matched point
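
Since the meaning of `score` depends on the metric, any code that ranks results has to know which direction is "better". The following is a minimal sketch of that rule; the `ScoredPoint` type, the `sortBestFirst` helper, and the metric names as plain strings are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// ScoredPoint is a trimmed, hypothetical version of the structure above.
type ScoredPoint struct {
	ID    string
	Score float32
}

// sortBestFirst orders results so the most similar point comes first.
// For Euclidean, a lower score is better; for Cosine and Dot, a higher one.
func sortBestFirst(results []ScoredPoint, metric string) {
	sort.SliceStable(results, func(i, j int) bool {
		if metric == "euclidean" {
			return results[i].Score < results[j].Score
		}
		return results[i].Score > results[j].Score
	})
}

func main() {
	results := []ScoredPoint{{"doc_1", 0.2}, {"doc_2", 0.9}, {"doc_3", 0.5}}
	sortBestFirst(results, "cosine")
	fmt.Println(results)
	sortBestFirst(results, "euclidean")
	fmt.Println(results)
}
```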

Filter

A Filter defines conditions for filtering points based on their payload.

{
  "must": [
    {
      "key": "category",
      "type": "exact",
      "match": {"value": "tech"}
    },
    {
      "key": "timestamp",
      "type": "range",
      "range": {"gte": 1620000000, "lte": 1630000000}
    }
  ],
  "must_not": [
    {
      "key": "tags",
      "type": "contains",
      "match": {"value": "deprecated"}
    }
  ]
}

Fields

  • must: Conditions that must be satisfied
  • must_not: Conditions that must not be satisfied

Condition

A Condition defines a single filtering condition.

Types

  • exact: Exact value match
  • range: Numeric range comparison (gt, gte, lt, lte)
  • prefix: Prefix match for strings
  • contains: Contains match for arrays or strings
  • regex: Regular expression match for strings
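
To make the semantics of the five condition types concrete, here is a rough sketch of how a filter could be evaluated against a payload. Everything in it is hypothetical: the `Condition` and `Filter` types are simplified stand-ins, type names are plain strings, only the inclusive range bounds are modeled, and numeric payload values are assumed to be `float64`. GoVector's real evaluation logic may differ.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

// Condition and Filter are simplified, hypothetical stand-ins for the
// structures described above; the field names are chosen for this sketch.
type Condition struct {
	Key      string
	Type     string      // "exact", "range", "prefix", "contains", "regex"
	Value    interface{} // operand for every type except "range"
	GTE, LTE *float64    // inclusive bounds only, for brevity
}

type Filter struct{ Must, MustNot []Condition }

// matches reports whether a payload satisfies one condition.
// A missing key never matches, whatever the condition type.
func (c Condition) matches(payload map[string]interface{}) bool {
	got, ok := payload[c.Key]
	if !ok {
		return false
	}
	switch c.Type {
	case "exact":
		return reflect.DeepEqual(got, c.Value)
	case "range":
		n, ok := got.(float64)
		if !ok {
			return false
		}
		return (c.GTE == nil || n >= *c.GTE) && (c.LTE == nil || n <= *c.LTE)
	case "prefix":
		s, ok1 := got.(string)
		p, ok2 := c.Value.(string)
		return ok1 && ok2 && strings.HasPrefix(s, p)
	case "contains":
		want, ok := c.Value.(string)
		if !ok {
			return false
		}
		switch v := got.(type) {
		case string:
			return strings.Contains(v, want)
		case []string:
			for _, item := range v {
				if item == want {
					return true
				}
			}
		}
		return false
	case "regex":
		s, ok1 := got.(string)
		pattern, ok2 := c.Value.(string)
		if !ok1 || !ok2 {
			return false
		}
		matched, err := regexp.MatchString(pattern, s)
		return err == nil && matched
	}
	return false
}

// Accepts applies must and must_not semantics to a payload.
func (f Filter) Accepts(payload map[string]interface{}) bool {
	for _, c := range f.Must {
		if !c.matches(payload) {
			return false
		}
	}
	for _, c := range f.MustNot {
		if c.matches(payload) {
			return false
		}
	}
	return true
}

func main() {
	lo, hi := 1620000000.0, 1630000000.0
	f := Filter{
		Must: []Condition{
			{Key: "category", Type: "exact", Value: "tech"},
			{Key: "timestamp", Type: "range", GTE: &lo, LTE: &hi},
		},
		MustNot: []Condition{{Key: "tags", Type: "contains", Value: "deprecated"}},
	}
	p := map[string]interface{}{"category": "tech", "timestamp": 1625000000.0, "tags": []string{"AI"}}
	fmt.Println(f.Accepts(p))
}
```

In this sketch a condition on a missing key evaluates to false, so a point without the key fails a `must` condition but passes a `must_not` condition.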

🚀 Data Lifecycle

Upsert

  1. Validation: Check vector dimension matches collection configuration
  2. Versioning: Generate version number for concurrency control
  3. Persistence: Serialize point to Protocol Buffers and write to BoltDB
  4. Indexing: Update in-memory index for fast retrieval
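
The upsert steps can be sketched with a toy in-memory collection. This is not GoVector's implementation: the `Collection` type is hypothetical, a plain map stands in for BoltDB, and JSON stands in for Protocol Buffers so the example runs with the standard library alone.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Point carries a version stamped at upsert time.
type Point struct {
	ID      string                 `json:"id"`
	Vector  []float32              `json:"vector"`
	Payload map[string]interface{} `json:"payload,omitempty"`
	Version uint64                 `json:"version"`
}

// Collection is a toy stand-in for illustration: the disk map plays the
// role of BoltDB and JSON plays the role of Protocol Buffers.
type Collection struct {
	mu      sync.Mutex
	dim     int
	version uint64
	disk    map[string][]byte // serialized points keyed by ID
	index   map[string]Point  // in-memory copy used by search
}

func NewCollection(dim int) *Collection {
	return &Collection{dim: dim, disk: map[string][]byte{}, index: map[string]Point{}}
}

// Upsert inserts or replaces a point, following the four steps above.
func (c *Collection) Upsert(p Point) error {
	// 1. Validation: reject vectors of the wrong length.
	if len(p.Vector) != c.dim {
		return fmt.Errorf("dimension mismatch: got %d, want %d", len(p.Vector), c.dim)
	}
	c.mu.Lock()
	defer c.mu.Unlock()

	// 2. Versioning: a monotonically increasing counter per collection.
	p.Version = c.version + 1

	// 3. Persistence: serialize and store before touching the index.
	data, err := json.Marshal(p)
	if err != nil {
		return fmt.Errorf("serialize point %q: %w", p.ID, err)
	}
	c.disk[p.ID] = data
	c.version = p.Version

	// 4. Indexing: make the point visible to searches.
	c.index[p.ID] = p
	return nil
}

func main() {
	c := NewCollection(3)
	fmt.Println(c.Upsert(Point{ID: "doc_1", Vector: []float32{0.1, 0.2, 0.3}}))
	fmt.Println(c.Upsert(Point{ID: "doc_2", Vector: []float32{0.1, 0.2}}))
}
```

Writing to the persistent layer before the index means a failed write never leaves a searchable point that would be lost on restart; that ordering is a design choice of this sketch.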

Search

  1. Validation: Check query vector dimension
  2. Index Search: Use the HNSW index to find approximate nearest neighbors, or the Flat index for an exact scan
  3. Filtering: Apply payload filters to search results
  4. Scoring: Recalculate exact scores for filtered results
  5. Result Return: Return top K results
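
As a rough illustration of the search steps, the sketch below performs an exhaustive (Flat-style) scan with cosine scoring. The `search` function, its signature, and the filter-as-callback design are assumptions for this example; HNSW traversal is out of scope here. It applies the filter before scoring, which gives the same results as filtering afterwards when every point is scanned.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type Point struct {
	ID      string
	Vector  []float32
	Payload map[string]interface{}
}

type ScoredPoint struct {
	ID      string
	Score   float32
	Payload map[string]interface{}
}

// cosine assumes both vectors have the same length.
func cosine(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

// search scans every point; stored vectors are assumed to have been
// validated against dim when they were upserted.
func search(points []Point, dim int, query []float32,
	keep func(map[string]interface{}) bool, topK int) ([]ScoredPoint, error) {
	// 1. Validation
	if len(query) != dim {
		return nil, fmt.Errorf("dimension mismatch: got %d, want %d", len(query), dim)
	}
	results := make([]ScoredPoint, 0, len(points))
	for _, p := range points {
		// 2-3. Exhaustive scan with payload filtering
		if keep != nil && !keep(p.Payload) {
			continue
		}
		// 4. Exact scoring
		results = append(results, ScoredPoint{ID: p.ID, Score: cosine(query, p.Vector), Payload: p.Payload})
	}
	// 5. Best first, truncated to topK
	sort.Slice(results, func(i, j int) bool { return results[i].Score > results[j].Score })
	if topK >= 0 && topK < len(results) {
		results = results[:topK]
	}
	return results, nil
}

func main() {
	points := []Point{
		{ID: "doc_1", Vector: []float32{1, 0}, Payload: map[string]interface{}{"category": "tech"}},
		{ID: "doc_2", Vector: []float32{0, 1}, Payload: map[string]interface{}{"category": "news"}},
	}
	results, err := search(points, 2, []float32{1, 0}, nil, 1)
	fmt.Println(results, err)
}
```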

Delete

  1. Target Identification: Determine points to delete by ID or filter
  2. Persistence: Remove points from BoltDB
  3. Indexing: Remove points from in-memory index
  4. Result Return: Return count of deleted points
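
The delete flow follows the same two-layer pattern. In this hedged sketch the `store` type and both method names are hypothetical, and plain maps again stand in for BoltDB and the in-memory index.

```go
package main

import "fmt"

type Payload map[string]interface{}

// store is a toy stand-in: disk imitates the persistent copy,
// index imitates the in-memory index.
type store struct {
	disk  map[string][]byte
	index map[string]Payload
}

// remove deletes one point from both layers, persistence first,
// and reports whether the point existed.
func (s *store) remove(id string) bool {
	if _, ok := s.index[id]; !ok {
		return false
	}
	delete(s.disk, id)
	delete(s.index, id)
	return true
}

// DeleteByIDs removes the listed points and returns how many existed.
func (s *store) DeleteByIDs(ids []string) int {
	n := 0
	for _, id := range ids {
		if s.remove(id) {
			n++
		}
	}
	return n
}

// DeleteByFilter removes every point whose payload satisfies match.
// Deleting map entries while ranging over the map is safe in Go.
func (s *store) DeleteByFilter(match func(Payload) bool) int {
	n := 0
	for id, payload := range s.index {
		if match(payload) && s.remove(id) {
			n++
		}
	}
	return n
}

func main() {
	s := &store{
		disk:  map[string][]byte{"doc_1": nil, "doc_2": nil},
		index: map[string]Payload{"doc_1": {"category": "tech"}, "doc_2": {"category": "news"}},
	}
	fmt.Println(s.DeleteByIDs([]string{"doc_1", "doc_9"}))
	fmt.Println(s.DeleteByFilter(func(p Payload) bool { return p["category"] == "news" }))
}
```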

🔧 Serialization

GoVector uses Protocol Buffers for efficient serialization of vector data:

  • PointStruct: Serialized to bytes and stored in BoltDB
  • Value Types: Supports string, int64, double, bool, and bytes
  • Backward Compatibility: Uses proto3 syntax, so fields can be added later without breaking readers of existing data

📈 Distance Metrics

GoVector supports three distance metrics for similarity calculation:

| Metric | Description | Similarity Interpretation |
|--------|-------------|---------------------------|
| Cosine | Measures the cosine of the angle between vectors | Higher value = more similar |
| Euclidean | Measures the straight-line distance between vectors | Lower value = more similar |
| Dot | Measures the dot product of vectors | Higher value = more similar |

Note: Cosine similarity is the default metric and is recommended for most use cases, especially with normalized vectors.
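
For reference, the three metrics can be written in a few lines each. These are the textbook definitions, shown as a sketch; the function names are chosen for this example, both vectors are assumed to have equal length, and GoVector's internal implementations may be optimized differently.

```go
package main

import (
	"fmt"
	"math"
)

// Dot returns the dot product; higher means more similar.
func Dot(a, b []float32) float64 {
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return sum
}

// Cosine returns the cosine of the angle between a and b, in [-1, 1].
func Cosine(a, b []float32) float64 {
	denom := math.Sqrt(Dot(a, a)) * math.Sqrt(Dot(b, b))
	if denom == 0 {
		return 0 // undefined for zero vectors; 0 is a convention of this sketch
	}
	return Dot(a, b) / denom
}

// Euclidean returns the straight-line distance; lower means more similar.
func Euclidean(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

func main() {
	a, b := []float32{1, 2, 3}, []float32{4, 5, 6}
	fmt.Printf("dot=%.3f cosine=%.3f euclidean=%.3f\n", Dot(a, b), Cosine(a, b), Euclidean(a, b))
}
```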

💡 Best Practices

Point Design

  • ID: Use stable, meaningful identifiers (e.g., business primary keys)
  • Vector: Ensure dimension matches collection configuration; normalize vectors for Cosine similarity
  • Payload: Keep metadata concise; use appropriate types for filtering

Filter Usage

  • Must vs MustNot: Prefer must to narrow results; use must_not sparingly
  • Condition Types: Use exact for exact matches, range for numeric ranges, prefix for string prefixes, contains for arrays, and regex for complex string patterns
  • Performance: Avoid overly complex filters, especially with regular expressions

Performance Optimization

  • Index Selection: Use HNSW for large-scale data, Flat for small datasets
  • Quantization: Enable SQ8 quantization for large-scale data to reduce storage usage
  • Batch Operations: Use batch upserts for multiple points to improve performance
  • TopK: Limit the number of results to reduce network overhead
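
To show where the storage savings of 8-bit scalar quantization come from, the sketch below maps each `float32` component (4 bytes) to a `uint8` code (1 byte) using a per-vector minimum and step size. This is a generic illustration of the technique under that assumption; the `Quantized` type and both functions are hypothetical, and GoVector's SQ8 implementation may choose its ranges and rounding differently.

```go
package main

import (
	"fmt"
	"math"
)

// Quantized stores one vector as 8-bit codes plus the two parameters
// needed to approximately reconstruct it.
type Quantized struct {
	Min   float32 // value represented by code 0
	Scale float32 // width of one quantization step
	Codes []uint8
}

// quantize maps each component onto one of 256 evenly spaced levels
// between the vector's smallest and largest value.
func quantize(v []float32) Quantized {
	if len(v) == 0 {
		return Quantized{}
	}
	lo, hi := v[0], v[0]
	for _, x := range v {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	q := Quantized{Min: lo, Scale: (hi - lo) / 255, Codes: make([]uint8, len(v))}
	if q.Scale == 0 {
		return q // constant vector: every code stays 0
	}
	for i, x := range v {
		code := math.Round(float64((x - lo) / q.Scale))
		if code > 255 {
			code = 255 // guard against floating-point overshoot
		}
		q.Codes[i] = uint8(code)
	}
	return q
}

// dequantize reconstructs an approximation of the original vector.
func (q Quantized) dequantize() []float32 {
	out := make([]float32, len(q.Codes))
	for i, c := range q.Codes {
		out[i] = q.Min + float32(c)*q.Scale
	}
	return out
}

func main() {
	v := []float32{-0.4, 0.1, 0.25, 0.9}
	q := quantize(v)
	fmt.Println("codes:", q.Codes)
	fmt.Println("approx:", q.dequantize())
	fmt.Printf("bytes: %d -> %d (plus 8 for Min and Scale)\n", 4*len(v), len(q.Codes))
}
```

The trade-off is precision: each reconstructed component can differ from the original by up to about half a step.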

🚩 Common Issues

Dimension Mismatch

  • Error: "dimension mismatch"
  • Solution: Ensure vector length matches the collection's configured dimension

Filter Not Working

  • Possible Causes: Incorrect key name, type mismatch, or non-existent key
  • Solution: Verify key names and types; remember that a point lacking the key always passes a must_not condition on it, so a misspelled key in must_not excludes nothing

Slow Queries

  • Possible Causes: Large dataset with Flat index, complex filters, or too many results
  • Solution: Use HNSW index, simplify filters, and reduce topK value

📖 Example Usage

Creating a Point

point := core.PointStruct{
    ID:     "doc_1",
    Vector: []float32{0.1, 0.2, 0.3, /* ... */},
    Payload: core.Payload{
        "category": "tech",
        "author":   "John",
    },
}

Using a Filter

filter := &core.Filter{
    Must: []core.Condition{
        {
            Key:   "category",
            Type:  core.MatchTypeExact,
            Match: core.MatchValue{Value: "tech"},
        },
    },
    MustNot: []core.Condition{
        {
            Key:   "author",
            Type:  core.MatchTypeExact,
            Match: core.MatchValue{Value: "Jane"},
        },
    },
}

Searching with Filter

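results, err := collection.Search(queryVector, filter, 10)
if err != nil {
    log.Fatalf("search failed: %v", err)
}
for _, result := range results {
    // Payload values are untyped, so print the category with %v.
    fmt.Printf("ID: %s, Score: %.4f, Category: %v\n",
        result.ID, result.Score, result.Payload["category"])
}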

🎯 Conclusion

GoVector's data model provides a flexible and efficient way to store and retrieve vector data. By following best practices for point design, filter usage, and performance optimization, you can build high-performance vector search applications that scale with your needs.