Data Model¶
This document describes the GoVector data model. It covers the core structures (PointStruct, ScoredPoint, Filter, Payload) and their mapping to Protocol Buffers; field semantics, data types, validation rules, and business constraints; Protocol Buffer message formats, serialization mechanisms, and version compatibility; and the combined usage of vectors, metadata, and filter conditions. It also gives guidance on data lifecycle management, caching strategies, and performance considerations, along with practical usage recommendations and troubleshooting.
Project Structure and Data Model Positioning¶
- Protocol Buffer definitions are located in core/proto/point.proto, with generated Go code at core/proto/point.pb.go.
- Core Go data models are located in core/models.go, containing Go structs corresponding to protobuf and filter logic.
- The storage layer is responsible for persistence and deserialization, involving Protobuf serialization, BoltDB access, and optional vector quantization.
- The collection layer (Collection) is responsible for index selection and orchestration of Upsert/Search/Delete operations.
- Math utilities provide distance metrics (Cosine/Euclid/Dot) for similarity calculation and ranking.
graph TB
subgraph "Protocol Buffers"
P1["point.proto
Message definitions"]
P2["point.pb.go
Generated Go code"]
end
subgraph "Core Models"
M1["models.go
PointStruct/ScoredPoint/Filter/Payload"]
M2["models_test.go
Filter unit tests"]
end
subgraph "Storage and Serialization"
S1["storage.go
PB serialization/deserialization
BoltDB persistence"]
end
subgraph "Collection and Index"
C1["collection.go
Collection orchestration"]
I1["index.go
VectorIndex interface"]
end
subgraph "Math Utilities"
D1["math.go
Distance metric implementations"]
end
P1 --> P2
P2 --> S1
M1 --> S1
M1 --> C1
C1 --> I1
C1 --> D1
S1 --> C1
Core Data Structures Overview¶
- PointStruct: A single vector record, containing a unique identifier, vector array, version number, and optional metadata.
- ScoredPoint: A retrieval result, containing the matched point's identifier, version, similarity score, and metadata.
- Filter: A filter condition container, supporting must/must_not condition lists.
- Payload: Key-value pair metadata with value type any, mapped to specific types in persistence via Protobuf oneof.
- Condition/MatchValue/RangeValue: Condition definitions for filters, supporting exact match, range, prefix, contains, regex, etc.
Protocol Buffers Definitions¶
- Message Structure
- PointStruct: Contains id, vector (float list), payload (map from string to Value).
- Value: a oneof over five basic types: string/int64/double/bool/bytes.
- ScoredPoint: Contains id, version, score, payload.
- Condition: Contains key and MatchValue.
- MatchValue: oneof contains string/int64/double/bool/bytes.
- Filter: Contains must and must_not two Condition lists.
- Serialization and Deserialization
- The storage layer uses google.golang.org/protobuf/proto for Marshal/Unmarshal.
- On read: first read bytes from BoltDB, then Unmarshal to PB message, then convert to Go struct.
- On write: first convert Go struct to PB message, then Marshal to bytes and write.
- Version Compatibility
- Uses proto3 syntax; new fields are added as optional fields with new field numbers, so data written in the existing binary format stays readable.
- oneof fields ensure only one type is set at a time, avoiding ambiguity.
- Generated point.pb.go provides standard serialization interfaces and field accessors.
Data Model Details¶
PointStruct (Vector Record)¶
- Fields and Types
- id: string, unique identifier (stable string or numeric string recommended).
- version: uint64, version number for consistency and concurrency control.
- vector: []float32, vector embedding, dimension must match collection configuration.
- payload: map[string]any, optional metadata, supports arbitrary JSON-compatible types.
- Business Constraints
- Vector length must match the collection's VectorLen, otherwise insertion fails.
- Version numbers are generated uniformly by the Upsert process, ensuring idempotency and ordering.
- Payload is converted to Protobuf Value during persistence; only oneof-supported types are preserved.
- Validation Rules
- Vector dimension is validated before insertion; unsupported types are ignored during persistence.
- Performance Impact
- Large vectors increase memory and disk usage; can be combined with SQ8 quantization to reduce storage volume.
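The fields and the dimension constraint above can be sketched in Go as follows. This is a minimal illustration, not the contents of core/models.go: the JSON tags, the `validateDimension` helper, and its error text are assumptions made for the example.

```go
package main

import "fmt"

// Payload is free-form metadata attached to a point.
type Payload map[string]any

// PointStruct mirrors the fields listed above. The JSON tags are assumed.
type PointStruct struct {
	ID      string    `json:"id"`
	Version uint64    `json:"version"`
	Vector  []float32 `json:"vector"`
	Payload Payload   `json:"payload,omitempty"`
}

// validateDimension is a hypothetical helper: it rejects a point whose
// vector length differs from the collection's configured VectorLen.
func validateDimension(p PointStruct, vectorLen int) error {
	if len(p.Vector) != vectorLen {
		return fmt.Errorf("dimension mismatch: point %q has %d, collection expects %d",
			p.ID, len(p.Vector), vectorLen)
	}
	return nil
}

func main() {
	p := PointStruct{
		ID:      "doc-1",
		Vector:  []float32{0.1, 0.2, 0.3},
		Payload: Payload{"category": "news"},
	}
	fmt.Println(validateDimension(p, 3)) // matching length: no error
	fmt.Println(validateDimension(p, 4)) // mismatching length: error
}
```

Validating before any write keeps a malformed point out of both the storage layer and the index.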
ScoredPoint (Retrieval Result)¶
- Fields and Types
- id: string, matched point's identifier.
- version: uint64, matched point's version.
- score: float32, similarity or distance score (determined by the selected metric; higher or lower means more similar depending on the metric).
- payload: map[string]any, optional metadata.
- Business Constraints
- Score is calculated by the underlying index based on Metric; for Cosine/Dot, higher is more similar; for Euclid, lower is more similar.
- Return count is limited by topK.
- Performance Impact
- Result set size directly affects memory and network transmission overhead.
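The ordering rule above (higher is better for Cosine/Dot, lower is better for Euclid) and the topK limit can be illustrated with a small ranking helper. The `Metric` type, its constants, and `rankTopK` are hypothetical names for this sketch; in GoVector the ranking is performed by the index.

```go
package main

import (
	"fmt"
	"sort"
)

// Metric identifies the distance metric (names assumed for this sketch).
type Metric int

const (
	Cosine Metric = iota
	Euclid
	Dot
)

// ScoredPoint mirrors the fields listed above.
type ScoredPoint struct {
	ID      string
	Version uint64
	Score   float32
	Payload map[string]any
}

// rankTopK returns a copy of results sorted best-first for the metric,
// truncated to at most topK entries.
func rankTopK(results []ScoredPoint, m Metric, topK int) []ScoredPoint {
	out := append([]ScoredPoint(nil), results...)
	sort.SliceStable(out, func(i, j int) bool {
		if m == Euclid {
			return out[i].Score < out[j].Score // smaller distance first
		}
		return out[i].Score > out[j].Score // larger similarity first
	})
	if topK >= 0 && len(out) > topK {
		out = out[:topK]
	}
	return out
}

func main() {
	hits := []ScoredPoint{{ID: "a", Score: 0.2}, {ID: "b", Score: 0.9}, {ID: "c", Score: 0.5}}
	for _, p := range rankTopK(hits, Cosine, 2) {
		fmt.Println("cosine:", p.ID, p.Score)
	}
	for _, p := range rankTopK(hits, Euclid, 2) {
		fmt.Println("euclid:", p.ID, p.Score)
	}
}
```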
Filter¶
- Composition
- must: []Condition, all conditions must be satisfied.
- must_not: []Condition, no condition is allowed to be satisfied.
- Condition
- key, key name in payload.
- value, value in payload.
- type: enum, supports exact, range, prefix, contains, regex.
- match: MatchValue, used for exact match or value for prefix/contains/regex.
- range: RangeValue, used for numeric range matching (gt/gte/lt/lte).
- Evaluation Rules
- If filter is empty, considered as matching all.
- If any condition in must list is not satisfied, the overall result is not matched.
- If any condition in must_not list is satisfied, the overall result is not matched.
- A key that is missing from the payload fails a must condition and passes a must_not condition (i.e., "non-existence passes" for must_not).
- Performance Impact
- Filters are executed at the index layer. must_not is usually more expensive, so prefer must conditions to narrow the candidate set.
Payload (Metadata)¶
- Types and Usage
- map[string]any, used for retrieval filtering and secondary processing.
- Converted to Protobuf Value during persistence; only oneof-supported types are saved.
- Common Key Suggestions
- Category, price, boolean status, tag arrays, timestamps, etc.
- Notes
- Unsupported types are ignored during persistence.
- Filter behavior for prefix/regex/contains matching on non-string values has specific characteristics (see following sections).
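The note that unsupported types are ignored during persistence can be made concrete with a type switch. The sketch below does not use the generated point.pb.go types: `value` is a hand-written stand-in for the Protobuf Value oneof, and the choice of which Go types map to which case (for example `int` to int64, `float32` to double) is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// value stands in for the generated Protobuf Value message. Exactly one
// field is meaningful, selected by kind (a stand-in for the oneof).
type value struct {
	kind string
	s    string
	i    int64
	d    float64
	b    bool
	raw  []byte
}

// toValue converts a Go value to the stand-in. ok is false for types that
// have no oneof case in this sketch.
func toValue(v any) (value, bool) {
	switch t := v.(type) {
	case string:
		return value{kind: "string", s: t}, true
	case int:
		return value{kind: "int64", i: int64(t)}, true
	case int64:
		return value{kind: "int64", i: t}, true
	case float32:
		return value{kind: "double", d: float64(t)}, true
	case float64:
		return value{kind: "double", d: t}, true
	case bool:
		return value{kind: "bool", b: t}, true
	case []byte:
		return value{kind: "bytes", raw: t}, true
	default:
		return value{}, false
	}
}

// encodePayload keeps the supported entries and reports the dropped keys,
// so that a caller can detect silent data loss.
func encodePayload(p map[string]any) (map[string]value, []string) {
	out := make(map[string]value, len(p))
	var dropped []string
	for k, v := range p {
		if pv, ok := toValue(v); ok {
			out[k] = pv
		} else {
			dropped = append(dropped, k)
		}
	}
	sort.Strings(dropped)
	return out, dropped
}

func main() {
	enc, dropped := encodePayload(map[string]any{
		"category": "news",
		"price":    9.5,
		"tags":     []string{"a", "b"}, // no oneof case in this sketch
	})
	fmt.Println(len(enc), dropped)
}
```

Reporting dropped keys is a design choice of the sketch; it makes the "ignored during persistence" behavior visible before data is written.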
Data Flow and Lifecycle¶
Upsert Flow¶
sequenceDiagram
participant App as "Application"
participant Col as "Collection"
participant St as "Storage"
participant Idx as "VectorIndex"
App->>Col : Upsert(points)
Col->>Col : Validate vector dimension/generate version
Col->>St : UpsertPoints (serialize PB by ID)
St-->>Col : Success/Failure
Col->>Idx : Upsert(points)
Idx-->>Col : Success/Failure
Col-->>App : Return result
Search Flow¶
sequenceDiagram
participant App as "Application"
participant Col as "Collection"
participant Idx as "VectorIndex"
App->>Col : Search(query, filter, topK)
Col->>Col : Validate query vector dimension
Col->>Idx : Search(query, filter, topK)
Idx-->>Col : []ScoredPoint
Col-->>App : Return result
Delete Flow¶
sequenceDiagram
participant App as "Application"
participant Col as "Collection"
participant Idx as "VectorIndex"
participant St as "Storage"
App->>Col : Delete(ids or filter)
Col->>Col : Parse target ID list
Col->>St : DeletePoints (delete by ID)
St-->>Col : Success/Failure
Col->>Idx : Delete(each id)
Idx-->>Col : Success/Failure
Col-->>App : Return deleted count
Load and Recovery¶
- Load collection metadata and point set on startup, validate dimension consistency, batch rebuild in-memory index.
- If quantization is enabled, automatically dequantize vectors on load.
Architecture Relationship Diagram¶
classDiagram
class PointStruct {
+string id
+uint64 version
+[]float32 vector
+Payload payload
}
class ScoredPoint {
+string id
+uint64 version
+float32 score
+Payload payload
}
class Filter {
+[]Condition must
+[]Condition must_not
}
class Condition {
+string key
+ConditionType type
+MatchValue match
+RangeValue range
}
class MatchValue {
+any value
}
class RangeValue {
+any gt
+any gte
+any lt
+any lte
}
class Payload {
}
Filter and Query Logic¶
Filter Evaluation Algorithm¶
flowchart TD
Start(["Start"]) --> NilCheck{"Is filter empty?"}
NilCheck --> |Yes| PassAll["Return true - match all"]
NilCheck --> |No| MustLoop["Iterate must list"]
MustLoop --> MustEval["Evaluate conditions one by one"]
MustEval --> MustFail{"Any unsatisfied?"}
MustFail --> |Yes| Fail["Return false"]
MustFail --> |No| MustNotLoop["Iterate must_not list"]
MustNotLoop --> MustNotEval["Evaluate conditions one by one"]
MustNotEval --> MustNotFail{"Any satisfied?"}
MustNotFail --> |Yes| Fail
MustNotFail --> |No| Pass["Return true"]
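The flowchart above translates almost directly into Go. The following is a minimal sketch that only implements exact matching; the `Condition` shape and the `matches` and `matchCondition` names are simplified assumptions, not the definitions in core/models.go.

```go
package main

import (
	"fmt"
	"reflect"
)

// Condition is reduced to an exact-match check for this sketch.
type Condition struct {
	Key   string
	Value any
}

// Filter holds the two condition lists.
type Filter struct {
	Must    []Condition
	MustNot []Condition
}

// matchCondition reports whether the payload satisfies the condition.
// A missing key never satisfies a condition.
func matchCondition(c Condition, payload map[string]any) bool {
	v, ok := payload[c.Key]
	if !ok {
		return false
	}
	return reflect.DeepEqual(v, c.Value)
}

// matches follows the flowchart: a nil or empty filter matches everything,
// every must condition has to hold, and no must_not condition may hold.
func matches(f *Filter, payload map[string]any) bool {
	if f == nil {
		return true
	}
	for _, c := range f.Must {
		if !matchCondition(c, payload) {
			return false
		}
	}
	for _, c := range f.MustNot {
		if matchCondition(c, payload) {
			return false
		}
	}
	return true
}

func main() {
	payload := map[string]any{"category": "news", "lang": "en"}
	f := &Filter{
		Must:    []Condition{{Key: "category", Value: "news"}},
		MustNot: []Condition{{Key: "archived", Value: true}},
	}
	// "archived" is absent, so the must_not condition is not satisfied
	// and the point passes.
	fmt.Println(matches(f, payload))
}
```

Because a missing key makes `matchCondition` return false, the same helper yields both behaviors described earlier: the point fails a must condition on that key and passes a must_not condition on it.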
Condition Types and Matching Rules¶
- exact: Strict equality comparison.
- range: Supports integer and floating-point numbers, accepts int/float64 mixed comparison; gt/gte/lt/lte can be combined.
- prefix: Only valid for strings; a prefix longer than the string never matches.
- contains: Supports []any, []string, []int, string; empty string does not match in string scenario.
- regex: Only valid for strings, invalid regex expressions are considered not matched.
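The matching rules above can be sketched as one small function per condition type. All function names are hypothetical, and two details are assumptions: numbers are compared after widening to float64, and "empty string does not match" is read as an empty search string never matching.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

// RangeValue holds optional bounds; a nil bound is not checked.
type RangeValue struct{ Gt, Gte, Lt, Lte any }

// toFloat widens the numeric types this sketch accepts to float64.
func toFloat(v any) (float64, bool) {
	switch n := v.(type) {
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	case float32:
		return float64(n), true
	case float64:
		return n, true
	}
	return 0, false
}

// matchRange allows mixed int/float comparison; all set bounds must hold.
func matchRange(v any, r RangeValue) bool {
	x, ok := toFloat(v)
	if !ok {
		return false
	}
	check := func(bound any, pass func(b float64) bool) bool {
		if bound == nil {
			return true
		}
		b, ok := toFloat(bound)
		return ok && pass(b)
	}
	return check(r.Gt, func(b float64) bool { return x > b }) &&
		check(r.Gte, func(b float64) bool { return x >= b }) &&
		check(r.Lt, func(b float64) bool { return x < b }) &&
		check(r.Lte, func(b float64) bool { return x <= b })
}

// matchPrefix is string-only; a prefix longer than the value cannot match.
func matchPrefix(v any, prefix string) bool {
	s, ok := v.(string)
	return ok && strings.HasPrefix(s, prefix)
}

// matchContains handles string, []string, []int and []any payload values.
func matchContains(v any, needle any) bool {
	switch c := v.(type) {
	case string:
		sub, ok := needle.(string)
		return ok && sub != "" && strings.Contains(c, sub)
	case []string:
		if sub, ok := needle.(string); ok {
			for _, e := range c {
				if e == sub {
					return true
				}
			}
		}
	case []int:
		if n, ok := needle.(int); ok {
			for _, e := range c {
				if e == n {
					return true
				}
			}
		}
	case []any:
		for _, e := range c {
			if reflect.DeepEqual(e, needle) {
				return true
			}
		}
	}
	return false
}

// matchRegex is string-only; an invalid pattern counts as "not matched".
func matchRegex(v any, pattern string) bool {
	s, ok := v.(string)
	if !ok {
		return false
	}
	re, err := regexp.Compile(pattern)
	return err == nil && re.MatchString(s)
}

func main() {
	fmt.Println(matchRange(5, RangeValue{Gte: 1.5, Lt: 10}))
	fmt.Println(matchPrefix("golang", "go"), matchContains([]string{"a", "b"}, "b"))
	fmt.Println(matchRegex("abc", "("))
}
```

Compiling the pattern on every call is simple but repeats work for each candidate point; a real implementation would likely compile once per query.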
Performance and Storage Optimization¶
Distance Metrics and Sorting¶
- Cosine/Dot: Higher values are more similar; suitable for normalized vectors and direction-sensitive tasks.
- Euclid: Lower values are more similar; suitable for absolute distance-sensitive tasks.
- Default metric is Cosine.
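For reference, the three metrics can be written out as below. This is a sketch rather than the project's math.go: the function names are illustrative, both vectors are assumed to have the same length (the collection validates this beforehand), and Euclid is assumed to be the plain, non-squared distance.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the inner product; higher means more similar.
func dot(a, b []float32) float32 {
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return float32(sum)
}

// cosine returns the cosine similarity; higher means more similar.
// It returns 0 when either vector has zero length.
func cosine(a, b []float32) float32 {
	var ab, aa, bb float64
	for i := range a {
		ab += float64(a[i]) * float64(b[i])
		aa += float64(a[i]) * float64(a[i])
		bb += float64(b[i]) * float64(b[i])
	}
	if aa == 0 || bb == 0 {
		return 0
	}
	return float32(ab / (math.Sqrt(aa) * math.Sqrt(bb)))
}

// euclid returns the Euclidean distance; lower means more similar.
func euclid(a, b []float32) float32 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return float32(math.Sqrt(sum))
}

func main() {
	a := []float32{1, 2, 3}
	b := []float32{2, 4, 6}
	fmt.Println("cosine:", cosine(a, b))
	fmt.Println("dot:   ", dot(a, b))
	fmt.Println("euclid:", euclid(a, b))
}
```

For unit-length vectors, Cosine and Dot give the same ranking, which is why pre-normalizing vectors pairs well with either metric.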
Index Strategy¶
- Flat: Brute force search, suitable for small-scale or low-latency requirements.
- HNSW: Approximate nearest neighbor search; suited to large-scale, high-dimensional vector sets, where it offers higher throughput and lower latency than brute-force search in exchange for approximate results.
Persistence and Serialization¶
- Uses Protobuf to serialize PointStruct to bytes, written to BoltDB.
- On read: deserialize to PB message, then convert to Go struct.
- Unsupported types are ignored during persistence; pay attention to data integrity.
Quantization (SQ8)¶
- Optionally enabled to compress vector storage, reducing disk usage.
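- Vectors are dequantized automatically on load, so search runs on float32 vectors as usual. Because 8-bit scalar quantization is lossy, the restored values are approximations of the originals, and small differences in scores are possible.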
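To show what 8-bit scalar quantization involves, here is a minimal round-trip sketch. It assumes one min/scale pair per vector and linear mapping to 0..255; GoVector's actual SQ8 layout may differ, and the `sq8`, `quantize`, and `dequantize` names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// sq8 stores one byte per dimension plus the parameters needed to
// reconstruct approximate float32 values.
type sq8 struct {
	min   float32
	scale float32
	codes []uint8
}

// quantize maps each component linearly from [min, max] to 0..255.
func quantize(v []float32) sq8 {
	if len(v) == 0 {
		return sq8{}
	}
	lo, hi := v[0], v[0]
	for _, x := range v {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	scale := (hi - lo) / 255
	codes := make([]uint8, len(v))
	if scale > 0 {
		for i, x := range v {
			c := math.Round(float64((x - lo) / scale))
			codes[i] = uint8(math.Min(255, math.Max(0, c)))
		}
	}
	return sq8{min: lo, scale: scale, codes: codes}
}

// dequantize reconstructs approximate float32 values from the codes.
func dequantize(q sq8) []float32 {
	out := make([]float32, len(q.codes))
	for i, c := range q.codes {
		out[i] = q.min + float32(c)*q.scale
	}
	return out
}

func main() {
	v := []float32{-1, 0, 0.5, 1}
	q := quantize(v)
	fmt.Println("bytes per vector:", len(q.codes), "instead of", 4*len(v))
	fmt.Println("restored:", dequantize(q))
}
```

Storage drops from four bytes to one byte per dimension, and each restored component can differ from the original by up to about half a quantization step.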
Concurrency and Consistency¶
- Collection uses read-write locks to protect internal state.
- Upsert persists first then updates index; attempts to rollback persistence on failure to maintain consistency.
- Version number is based on nanosecond timestamp, ensuring insertion order and idempotency.
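The locking, ordering, and rollback behavior described above can be sketched as follows. The `store` and `index` interfaces, the in-memory fakes, and the rollback-by-delete strategy are assumptions for illustration; in particular, deleting on rollback would also discard an older copy of an overwritten point, which a real implementation would need to handle.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type PointStruct struct {
	ID      string
	Version uint64
	Vector  []float32
}

// store and index are hypothetical stand-ins for the storage and index layers.
type store interface {
	UpsertPoints(points []PointStruct) error
	DeletePoints(ids []string) error
}

type index interface {
	Upsert(points []PointStruct) error
}

type Collection struct {
	mu        sync.RWMutex
	vectorLen int
	st        store
	idx       index
}

// Upsert validates dimensions, stamps a timestamp-based version, persists
// first, then updates the index, and rolls persistence back on failure.
func (c *Collection) Upsert(points []PointStruct) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	version := uint64(time.Now().UnixNano())
	ids := make([]string, len(points))
	for i := range points {
		if len(points[i].Vector) != c.vectorLen {
			return fmt.Errorf("dimension mismatch for point %q", points[i].ID)
		}
		points[i].Version = version
		ids[i] = points[i].ID
	}
	if err := c.st.UpsertPoints(points); err != nil {
		return err
	}
	if err := c.idx.Upsert(points); err != nil {
		// Best-effort rollback so storage and index do not diverge.
		if rbErr := c.st.DeletePoints(ids); rbErr != nil {
			return errors.Join(err, rbErr)
		}
		return err
	}
	return nil
}

// memStore is an in-memory fake used only for this example.
type memStore struct{ data map[string]PointStruct }

func (m *memStore) UpsertPoints(ps []PointStruct) error {
	for _, p := range ps {
		m.data[p.ID] = p
	}
	return nil
}

func (m *memStore) DeletePoints(ids []string) error {
	for _, id := range ids {
		delete(m.data, id)
	}
	return nil
}

// failingIndex always fails, to exercise the rollback path.
type failingIndex struct{}

func (failingIndex) Upsert([]PointStruct) error { return errors.New("index unavailable") }

func main() {
	st := &memStore{data: map[string]PointStruct{}}
	c := &Collection{vectorLen: 2, st: st, idx: failingIndex{}}
	err := c.Upsert([]PointStruct{{ID: "a", Vector: []float32{1, 2}}})
	fmt.Println("error:", err, "stored points:", len(st.data))
}
```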
Best Practices and Common Issues¶
Field Design Recommendations¶
- id: Recommend using stable strings (such as business primary key), avoid frequent changes.
- vector: Ensure dimension matches collection configuration; pre-normalized vectors improve Cosine effectiveness.
- payload: Keep key names concise and clear; use range for numeric ranges; prefer exact/prefix/contains/regex for text matching.
Filter Usage Recommendations¶
- Prefer must conditions to narrow the candidate set, and limit the use of must_not.
- Build indexes on high-frequency filter keys (if extended in the future).
- Regular expressions have higher complexity, use with caution.
Performance Tuning¶
- Prefer HNSW index for large-scale data.
- Enable SQ8 quantization to reduce storage and IO pressure.
- Control topK and filter complexity, avoid full table scans.
Common Issue Troubleshooting¶
- Insert error "dimension mismatch": Check if collection VectorLen matches vector length.
- Filter not working: Confirm key name is correct and type matches; must_not passes by default for non-existent keys.
- Query timeout: Check index type and parameters; appropriately narrow filter range or reduce topK.
Conclusion¶
GoVector's data model centers on PointStruct/ScoredPoint/Filter/Payload. It uses Protobuf for stable cross-language, cross-process serialization and BoltDB for persistence. The HNSW index and SQ8 quantization are the main levers for performance and resource usage in large-scale scenarios. Careful design of id, vector, metadata, and filters, together with the Upsert/Search/Delete lifecycle described above, provides a stable and reliable retrieval experience.