Data Models¶
This document describes GoVector's data models. It covers the core data structures (PointStruct, ScoredPoint, Filter, Payload, CollectionMeta, and related types) with their field definitions, data types, constraints, and use cases; the Protocol Buffers binary format and its JSON mapping; and the vector storage format, serialization mechanism, and performance considerations. It also provides a relationship diagram for the data models, the conversion rules between representations, and an appendix of data examples.
Project Structure and Data Model Overview¶
- Core data models are located in the core package:
  - models.go: Defines Payload, PointStruct, ScoredPoint, Filter, RangeValue, MatchValue, CollectionMeta, and the filter matching logic
  - proto/point.proto: Defines the Protobuf message structures (PointStruct, Value, ScoredPoint, Condition, MatchValue, Filter)
  - proto/point.pb.go: Protobuf Go code generated by protoc-gen-go
  - storage.go: bbolt-based persistence layer, responsible for serializing PointStruct to Protobuf and writing it to disk
  - collection.go: Collection abstraction that encapsulates Upsert/Search/Delete and coordinates storage and index
  - math.go: Distance metric enumeration and calculation functions (Cosine, Euclid, Dot)
  - hnsw_index.go: HNSW parameters and index implementation
  - models_test.go: Unit tests for the filter and matching logic, demonstrating typical usage
- README.md: Project features and usage instructions
graph TB
subgraph "Core Data Models"
A["Payload
String key to arbitrary value"]
B["PointStruct
ID/Version/Vector/Payload"]
C["ScoredPoint
ID/Version/Score/Payload"]
D["Filter
Must/MustNot conditions"]
E["Condition
Key/Type/Match/Range"]
F["MatchValue
oneof string/int/float/bool/bytes"]
G["RangeValue
GT/GTE/LT/LTE"]
H["CollectionMeta
Name/Dimension/Metric/HNSW params"]
end
subgraph "Serialization and Storage"
P["Protobuf Messages
PointStruct/Value/ScoredPoint/Condition/MatchValue/Filter"]
S["Storage
bbolt + Protobuf"]
end
subgraph "Index and Collection"
I["Collection
Upsert/Search/Delete"]
J["HNSWIndex
Parameters and distance functions"]
end
B --> P
C --> P
D --> E
E --> F
E --> G
P --> S
I --> S
I --> J
Core Data Structures Overview¶
- Payload: String key to arbitrary value metadata mapping, used for filtering and retrieval
- PointStruct: Vector point entity containing unique identifier, version number, vector array, and optional metadata
- ScoredPoint: Query result containing similarity/distance score and point information
- Filter: Query filter supporting Must and MustNot condition sets
- Condition: Single filter condition specifying key, type, and match value or range
- MatchValue: Match value using oneof to support multiple basic types
- RangeValue: Range value supporting greater than/greater than or equal/less than/less than or equal
- CollectionMeta: Collection metadata recording collection name, vector dimension, distance metric, whether HNSW is used, and parameters
Architecture and Data Flow Overview¶
- Write flow: Application calls Collection.Upsert → Validate dimension and set version → Storage layer writes to bbolt first (Protobuf serialization) → Update in-memory index (Flat or HNSW)
- Query flow: Collection.Search → Lock for reading → Call index Search → Return ScoredPoint list
- Delete flow: Collection.Delete → Determine target IDs (explicit ID or by filter matching) → Delete from storage first then from index
- Filter matching: MatchFilter iterates through Must/MustNot conditions, calling matchCondition for each → Branch processing for exact/range/prefix/contains/regex
sequenceDiagram
participant App as "Application"
participant Col as "Collection"
participant St as "Storage"
participant Idx as "VectorIndex(HNSW/Flat)"
App->>Col : Upsert(points)
Col->>Col : Validate dimension/set version
Col->>St : UpsertPoints(Protobuf serialization)
St-->>Col : Success/failure
Col->>Idx : Upsert(points)
Idx-->>Col : Success/failure
Col-->>App : Return
App->>Col : Search(query, filter, topK)
Col->>Idx : Search(query, filter, topK)
Idx-->>Col : []ScoredPoint
Col-->>App : []ScoredPoint
Detailed Data Model Analysis¶
Payload (Metadata)¶
- Type: map[string]interface{}
- Constraints: Keys must be strings; values can be strings, integers, floats, booleans, byte slices, or arrays of strings/integers (recognized in filters)
- Use cases: Holds the metadata that Filter conditions are evaluated against; multiple conditions can be combined
- JSON mapping: Maps directly to a JSON object
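To make the JSON mapping concrete, here is a minimal sketch that round-trips a Payload through encoding/json. The `roundTrip` helper and the sample keys are illustrative assumptions, not part of GoVector. It also shows a standard-library behavior worth knowing when writing filters: numbers decoded into `interface{}` come back as float64, and arrays come back as `[]interface{}`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Payload mirrors the map[string]interface{} type described above.
type Payload map[string]interface{}

// roundTrip is a hypothetical helper that encodes a Payload to JSON
// and decodes it back into a fresh Payload.
func roundTrip(p Payload) (Payload, error) {
	data, err := json.Marshal(p)
	if err != nil {
		return nil, err
	}
	var out Payload
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	p := Payload{
		"category": "electronics",
		"price":    100,
		"in_stock": true,
		"tags":     []string{"new", "sale"},
	}
	out, err := roundTrip(p)
	if err != nil {
		panic(err)
	}
	// After decoding, price is a float64 and tags is a []interface{}.
	fmt.Printf("%T %T\n", out["price"], out["tags"])
}
```

If a payload passes through JSON before it is filtered, numeric comparisons may need to account for this int to float64 change.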
PointStruct (Vector Point)¶
- Fields
  - ID: string, unique identifier
  - Version: uint64, version number (nanosecond-level timestamp)
  - Vector: []float32, vector embedding
  - Payload: Payload, optional metadata
- Constraints
  - Vector length must match the collection dimension
  - All points in a single Upsert batch receive the same Version
- Use cases: Insert/update vector points; target for Protobuf serialization
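The two constraints above can be sketched as a single pre-write step. This is an illustration under assumptions, not GoVector's code: `prepareBatch` is a hypothetical name, and the error text is only modeled on the "dimension mismatch" message mentioned in the troubleshooting section.

```go
package main

import (
	"fmt"
	"time"
)

type Payload map[string]interface{}

// PointStruct follows the field list above.
type PointStruct struct {
	ID      string
	Version uint64
	Vector  []float32
	Payload Payload
}

// prepareBatch (hypothetical) rejects the batch if any vector has the wrong
// length, then stamps every point with one shared nanosecond timestamp.
func prepareBatch(points []PointStruct, dim int) error {
	for _, p := range points {
		if len(p.Vector) != dim {
			return fmt.Errorf("dimension mismatch: point %q has %d, want %d",
				p.ID, len(p.Vector), dim)
		}
	}
	version := uint64(time.Now().UnixNano())
	for i := range points {
		points[i].Version = version
	}
	return nil
}

func main() {
	batch := []PointStruct{
		{ID: "a", Vector: []float32{0.1, 0.2, 0.3}},
		{ID: "b", Vector: []float32{0.4, 0.5, 0.6}},
	}
	if err := prepareBatch(batch, 3); err != nil {
		panic(err)
	}
	fmt.Println(batch[0].Version == batch[1].Version)
}
```

Validating the whole batch before stamping keeps a rejected batch unmodified.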
ScoredPoint (Scored Result)¶
- Fields
  - ID: string
  - Version: uint64
  - Score: float32, similarity or distance score; the collection metric determines whether a smaller or a larger value means more similar
  - Payload: Payload
- Constraints: The score is the value returned by the underlying index; result ordering follows the metric's convention
- Use cases: Results returned by Search
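Because the meaning of Score depends on the metric, any code that re-sorts results needs to know the direction. The sketch below is an assumption-laden illustration: `sortResults` and its `largerIsBetter` flag are hypothetical, and the ScoredPoint type is trimmed to the two fields the example needs.

```go
package main

import (
	"fmt"
	"sort"
)

// ScoredPoint is reduced to ID and Score for this sketch.
type ScoredPoint struct {
	ID    string
	Score float32
}

// sortResults (hypothetical) orders results best-first. Pass
// largerIsBetter=true for similarity scores such as a dot product,
// and false for distances such as Euclidean.
func sortResults(rs []ScoredPoint, largerIsBetter bool) {
	sort.SliceStable(rs, func(i, j int) bool {
		if largerIsBetter {
			return rs[i].Score > rs[j].Score
		}
		return rs[i].Score < rs[j].Score
	})
}

func main() {
	rs := []ScoredPoint{{"a", 0.2}, {"b", 0.9}, {"c", 0.5}}
	sortResults(rs, false) // distance-style: smallest score first
	fmt.Println(rs[0].ID, rs[1].ID, rs[2].ID)
}
```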
Filter¶
- Fields
  - Must: []Condition, every condition must match
  - MustNot: []Condition, no condition may match
- Constraints: When both Must and MustNot are present, a point is a hit only if it satisfies every Must condition and none of the MustNot conditions
- Use cases: Filter by metadata during Search/Delete
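The Must/MustNot semantics can be sketched as follows. This is a simplified stand-in rather than the project's MatchFilter: the Condition type here supports exact matching only, `matchExact` and `matchFilter` are hypothetical names, and treating a nil filter as "match everything" is an assumption.

```go
package main

import (
	"fmt"
	"reflect"
)

type Payload map[string]interface{}

// Condition is reduced to exact matching for this sketch.
type Condition struct {
	Key   string
	Match interface{}
}

type Filter struct {
	Must    []Condition
	MustNot []Condition
}

// matchExact reports whether the payload has the key with an equal value.
func matchExact(p Payload, c Condition) bool {
	v, ok := p[c.Key]
	return ok && reflect.DeepEqual(v, c.Match)
}

// matchFilter returns true only if every Must condition matches
// and no MustNot condition matches.
func matchFilter(p Payload, f *Filter) bool {
	if f == nil {
		return true // assumption: no filter means no restriction
	}
	for _, c := range f.Must {
		if !matchExact(p, c) {
			return false
		}
	}
	for _, c := range f.MustNot {
		if matchExact(p, c) {
			return false
		}
	}
	return true
}

func main() {
	p := Payload{"category": "electronics", "in_stock": true}
	f := &Filter{
		Must:    []Condition{{Key: "category", Match: "electronics"}},
		MustNot: []Condition{{Key: "in_stock", Match: false}},
	}
	fmt.Println(matchFilter(p, f))
}
```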
Condition¶
- Fields
  - Key: string, metadata key
  - Type: ConditionType, match type (exact/range/prefix/contains/regex)
  - Match: MatchValue, value to compare against for every type except range
  - Range: *RangeValue, bounds for range matching
- Constraints: Use Range when Type is range; otherwise use Match
- Use cases: Describes the filter rule for a single key
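A per-type dispatch for the string-based condition types might look like the sketch below. The constant names, the string-backed ConditionType, and the `matchString` function are assumptions for illustration; range matching is left out because it operates on numbers. The invalid-regex branch follows the behavior noted in the troubleshooting section, where an invalid pattern yields a non-match.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ConditionType and its constants are hypothetical stand-ins.
type ConditionType string

const (
	TypeExact    ConditionType = "exact"
	TypePrefix   ConditionType = "prefix"
	TypeContains ConditionType = "contains"
	TypeRegex    ConditionType = "regex"
)

// matchString applies one string-based condition type to a payload value.
func matchString(t ConditionType, value, pattern string) bool {
	switch t {
	case TypeExact:
		return value == pattern
	case TypePrefix:
		return strings.HasPrefix(value, pattern)
	case TypeContains:
		return strings.Contains(value, pattern)
	case TypeRegex:
		re, err := regexp.Compile(pattern)
		if err != nil {
			return false // an invalid pattern is treated as a non-match
		}
		return re.MatchString(value)
	default:
		return false
	}
}

func main() {
	fmt.Println(matchString(TypePrefix, "sku-42", "sku-"))
	fmt.Println(matchString(TypeRegex, "sku-42", `^sku-\d+$`))
	fmt.Println(matchString(TypeRegex, "sku-42", "(")) // invalid regex
}
```

Compiling the pattern on every call is simple but repeated work; caching compiled patterns is a common refinement when the same filter is applied to many points.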
MatchValue¶
- oneof support: string/int/float/bool/bytes
- Use cases: Value carrier for exact matching in Filter
RangeValue¶
- Fields: GT/GTE/LT/LTE, supports int and float64
- Use cases: Numeric range filtering
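A range check over GT/GTE/LT/LTE can be sketched as below. The `toFloat` and `inRange` helpers are hypothetical, and two behaviors are assumptions: bounds left as nil are ignored, and a payload value that is not an int or float64 never matches.

```go
package main

import "fmt"

// RangeValue follows the field list above; each bound is an int or float64.
type RangeValue struct {
	GT, GTE, LT, LTE interface{}
}

// toFloat converts the two supported numeric types to float64.
// It returns false for nil and for any other type.
func toFloat(v interface{}) (float64, bool) {
	switch n := v.(type) {
	case int:
		return float64(n), true
	case float64:
		return n, true
	default:
		return 0, false
	}
}

// inRange reports whether v satisfies every bound that is set.
func inRange(v interface{}, r RangeValue) bool {
	x, ok := toFloat(v)
	if !ok {
		return false
	}
	if b, ok := toFloat(r.GT); ok && x <= b {
		return false
	}
	if b, ok := toFloat(r.GTE); ok && x < b {
		return false
	}
	if b, ok := toFloat(r.LT); ok && x >= b {
		return false
	}
	if b, ok := toFloat(r.LTE); ok && x > b {
		return false
	}
	return true
}

func main() {
	r := RangeValue{GTE: 10, LT: 100}
	fmt.Println(inRange(50, r), inRange(100, r), inRange(9.5, r))
}
```

Converting both sides to float64 lets an int bound compare against a float64 payload value, which is the mismatch the troubleshooting section warns about.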
CollectionMeta (Collection Metadata)¶
- Fields
  - Name: string
  - VectorLen: int
  - Metric: Distance
  - UseHNSW: bool
  - HNSWParams: HNSWParams
- Use cases: Persist collection configuration, recover after restart
Protocol Buffers and JSON Mapping¶
Protobuf Message Definition and Field Mapping¶
- PointStruct
  - id: string
  - vector: repeated float
  - payload: map<string, Value>
- Value (oneof)
  - string_value, int_value, double_value, bool_value, bytes_value
- ScoredPoint
  - id: string
  - version: uint64
  - score: float
  - payload: map<string, Value>
- Condition
  - key: string
  - match: MatchValue
- MatchValue (oneof)
  - string_value, int_value, double_value, bool_value, bytes_value
- Filter
  - must: repeated Condition
  - must_not: repeated Condition
JSON Mapping Rules¶
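- Single-word Protobuf field names are identical in JSON (e.g., id, version, score, must). Under the default proto3 JSON mapping, snake_case names are emitted in lowerCamelCase (must_not becomes mustNot, string_value becomes stringValue) unless the marshaler is configured to keep the original proto names
- map<string, Value> appears in JSON as an object with string keys and Value objects as values
- For oneof fields, the JSON output contains only the field that is set (e.g., stringValue, intValue)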
Go Struct and Protobuf Conversion¶
- toProtoPoint/fromProtoPoint: Map core.Payload values to the pb.Value oneof and back
- Note: Each interface{} value in a Go Payload is converted by choosing the oneof field that matches its concrete type; values of unsupported types are skipped
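The type-driven conversion can be sketched with a type switch. To stay self-contained, the example does not import the generated pb package. The `Value` struct below is a hypothetical stand-in for the generated oneof, and `toValue` and `toProtoPayload` are illustrative names.

```go
package main

import "fmt"

// Value is a hypothetical stand-in for the generated pb.Value oneof:
// Kind names which variant is set, Data holds the value.
type Value struct {
	Kind string
	Data interface{}
}

// toValue picks a variant from the concrete Go type.
// The second result is false for unsupported types.
func toValue(v interface{}) (Value, bool) {
	switch x := v.(type) {
	case string:
		return Value{Kind: "string", Data: x}, true
	case int:
		return Value{Kind: "int", Data: int64(x)}, true
	case int64:
		return Value{Kind: "int", Data: x}, true
	case float32:
		return Value{Kind: "double", Data: float64(x)}, true
	case float64:
		return Value{Kind: "double", Data: x}, true
	case bool:
		return Value{Kind: "bool", Data: x}, true
	case []byte:
		return Value{Kind: "bytes", Data: x}, true
	default:
		return Value{}, false
	}
}

// toProtoPayload converts a payload map and skips unsupported values.
func toProtoPayload(p map[string]interface{}) map[string]Value {
	out := make(map[string]Value, len(p))
	for k, v := range p {
		if val, ok := toValue(v); ok {
			out[k] = val
		}
	}
	return out
}

func main() {
	out := toProtoPayload(map[string]interface{}{
		"name":  "widget",
		"price": 100,
		"extra": map[string]int{"a": 1}, // unsupported here, skipped
	})
	fmt.Println(len(out), out["price"].Kind)
}
```

Skipping is silent, so a value of an unsupported type disappears after a storage round trip; callers that depend on such values may want to validate payloads before Upsert.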
Vector Storage and Serialization Mechanism¶
Storage Engine and Container¶
- bbolt (BoltDB): One bucket per collection, key is point ID, value is Protobuf-serialized bytes
- Metadata bucket: collections_meta, stores CollectionMeta (JSON)
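As an illustration of the JSON-encoded metadata, the sketch below round-trips a CollectionMeta through encoding/json. Several details are assumptions: the JSON tags are taken from the field names listed in the appendix (name, vector_size, distance, hnsw, parameters), the integer-backed Distance type and its constants are hypothetical, and the HNSWParams fields are inferred from the parameter list in the performance section. The bbolt write itself is omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Distance and its constants are hypothetical stand-ins.
type Distance int

const (
	Cosine Distance = iota
	Euclid
	Dot
)

// HNSWParams fields are inferred from the documented parameter names.
type HNSWParams struct {
	M              int
	EfConstruction int
	EfSearch       int
	K              int
}

// CollectionMeta uses assumed JSON tags.
type CollectionMeta struct {
	Name       string     `json:"name"`
	VectorLen  int        `json:"vector_size"`
	Metric     Distance   `json:"distance"`
	UseHNSW    bool       `json:"hnsw"`
	HNSWParams HNSWParams `json:"parameters"`
}

// encodeMeta produces the bytes that would be stored under the collection name.
func encodeMeta(m CollectionMeta) ([]byte, error) {
	return json.Marshal(m)
}

// decodeMeta restores the metadata, for example after a restart.
func decodeMeta(data []byte) (CollectionMeta, error) {
	var m CollectionMeta
	err := json.Unmarshal(data, &m)
	return m, err
}

func main() {
	meta := CollectionMeta{
		Name: "docs", VectorLen: 384, Metric: Cosine, UseHNSW: true,
		HNSWParams: HNSWParams{M: 16, EfConstruction: 200, EfSearch: 64, K: 10},
	}
	data, err := encodeMeta(meta)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```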
Serialization and Deserialization¶
- Write: PointStruct → toProtoPoint → proto.Marshal → bbolt.Put
- Read: bbolt.Get → proto.Unmarshal → fromProtoPoint
- Quantization: When the optional SQ8 quantization is enabled, vectors are compressed on write and the compressed form is stored in the Payload; on read they are decompressed to restore the vector
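For readers unfamiliar with SQ8, the sketch below shows one common form of 8-bit scalar quantization: each component is mapped linearly from the vector's own [min, max] range onto 0..255. GoVector's exact scheme (per-vector or global ranges, how the result is laid out in the Payload) is not described in this document, so the `SQ8` struct and both functions are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// SQ8 holds a quantized vector plus the parameters needed to restore it.
type SQ8 struct {
	Min   float32
	Scale float32 // (max - min) / 255
	Codes []uint8
}

// quantize maps each component linearly from [min, max] onto 0..255.
func quantize(v []float32) SQ8 {
	if len(v) == 0 {
		return SQ8{}
	}
	lo, hi := v[0], v[0]
	for _, x := range v[1:] {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	q := SQ8{Min: lo, Scale: (hi - lo) / 255, Codes: make([]uint8, len(v))}
	if q.Scale == 0 {
		return q // constant vector: every code is 0 and decodes to Min
	}
	for i, x := range v {
		r := math.Round(float64((x - lo) / q.Scale))
		// Clamp to guard against floating-point overshoot.
		r = math.Max(0, math.Min(255, r))
		q.Codes[i] = uint8(r)
	}
	return q
}

// dequantize restores an approximation of the original vector.
func dequantize(q SQ8) []float32 {
	out := make([]float32, len(q.Codes))
	for i, c := range q.Codes {
		out[i] = q.Min + float32(c)*q.Scale
	}
	return out
}

func main() {
	v := []float32{-1, 0, 0.5, 1}
	q := quantize(v)
	fmt.Println(q.Codes, dequantize(q))
}
```

Each float32 component shrinks from four bytes to one, at the cost of a reconstruction error of at most about half of Scale per component.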
Version Control and Consistency¶
- Upsert batch write uniformly sets Version (nanosecond-level timestamp)
- Write order: Storage first, then index; if index update fails, attempt rollback of storage (best effort)
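The storage-first ordering with best-effort rollback can be sketched with two small interfaces. None of these names come from GoVector: `Store`, `Index`, `memStore`, `fakeIndex`, and `upsert` are hypothetical, and the in-memory store only tracks IDs.

```go
package main

import (
	"errors"
	"fmt"
)

// Store and Index are hypothetical interfaces for this sketch.
type Store interface {
	Put(ids []string) error
	Delete(ids []string) error
}

type Index interface {
	Upsert(ids []string) error
}

// upsert writes to storage first, then to the index. If the index update
// fails, it tries to undo the storage write and ignores any rollback error.
func upsert(s Store, idx Index, ids []string) error {
	if err := s.Put(ids); err != nil {
		return fmt.Errorf("storage write: %w", err)
	}
	if err := idx.Upsert(ids); err != nil {
		_ = s.Delete(ids) // best effort
		return fmt.Errorf("index update: %w", err)
	}
	return nil
}

// memStore is an in-memory stand-in for the persistence layer.
type memStore map[string]bool

func (m memStore) Put(ids []string) error {
	for _, id := range ids {
		m[id] = true
	}
	return nil
}

func (m memStore) Delete(ids []string) error {
	for _, id := range ids {
		delete(m, id)
	}
	return nil
}

// fakeIndex fails on demand to exercise the rollback path.
type fakeIndex struct{ fail bool }

func (f fakeIndex) Upsert(ids []string) error {
	if f.fail {
		return errors.New("index unavailable")
	}
	return nil
}

func main() {
	s := memStore{}
	err := upsert(s, fakeIndex{fail: true}, []string{"a", "b"})
	fmt.Println(err != nil, len(s))
}
```

Rolling back by deletion is only exact for newly inserted points. For an update, a faithful rollback would have to restore the previous value, which is one reason the rollback is described as best effort.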
Data Model Relationship Diagram and Conversion Rules¶
classDiagram
class Payload {
+map[string]interface{}
}
class PointStruct {
+string ID
+uint64 Version
+[]float32 Vector
+Payload Payload
}
class ScoredPoint {
+string ID
+uint64 Version
+float32 Score
+Payload Payload
}
class Filter {
+[]Condition Must
+[]Condition MustNot
}
class Condition {
+string Key
+ConditionType Type
+MatchValue Match
+RangeValue Range
}
class MatchValue {
+oneof Value
}
class RangeValue {
+interface{} GT
+interface{} GTE
+interface{} LT
+interface{} LTE
}
class CollectionMeta {
+string Name
+int VectorLen
+Distance Metric
+bool UseHNSW
+HNSWParams HNSWParams
}
PointStruct --> Payload : "contains"
ScoredPoint --> Payload : "contains"
Filter --> Condition : "contains"
Condition --> MatchValue : "uses"
Condition --> RangeValue : "uses"
Conversion Rules¶
- Go Payload ↔ Protobuf Value.oneof
  - string/int/int64/float32/float64/bool/[]byte map to the corresponding oneof fields
  - Unsupported types are ignored
- CollectionMeta is stored as JSON in the collections_meta bucket
Performance Considerations and Optimization Suggestions¶
- Distance Metrics
  - Cosine: Normalized, suitable for vectors of different scales; by default HNSW uses 1 - cos as the distance
  - Euclid: L2 distance, smaller is more similar; HNSW uses the standard Euclidean distance
  - Dot: Inner product, larger is more similar; HNSW negates it so that it fits a minimization strategy
- HNSW Parameters
  - M: Maximum connections per node
  - EfConstruction/EfSearch: Construction/search candidate list size
  - K: Number of neighbors to return
- Quantization
  - SQ8 quantization can significantly reduce disk usage; vectors are decompressed automatically on load; suitable for large-scale data
- Write Order
  - Storage first, then index, so storage takes priority for consistency; best-effort rollback when the index update fails
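The three distance conventions listed above can be written so that a smaller value always means "closer", which is what a minimizing index needs. The function names below are illustrative rather than those in math.go, and two choices are assumptions: vectors are taken to have equal length, and a zero vector is given a cosine distance of 1.

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cos(a, b).
func cosineDistance(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 1 // assumption: treat a zero vector as unrelated (cos = 0)
	}
	return float32(1 - dot/math.Sqrt(na*nb))
}

// euclidDistance returns the L2 distance between a and b.
func euclidDistance(a, b []float32) float32 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return float32(math.Sqrt(sum))
}

// negDotDistance negates the inner product so that larger similarity
// becomes smaller distance.
func negDotDistance(a, b []float32) float32 {
	var dot float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
	}
	return float32(-dot)
}

func main() {
	a, b := []float32{1, 2}, []float32{3, 4}
	fmt.Println(cosineDistance(a, b), euclidDistance(a, b), negDotDistance(a, b))
}
```

Note that the negated dot product is not a true metric (it can be negative and does not satisfy the triangle inequality); it only preserves the ranking.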
Troubleshooting Guide¶
- Dimension mismatch
  - Symptom: Upsert/Search returns a "dimension mismatch" error
  - Solution: Ensure the PointStruct.Vector length matches the collection's VectorLen
- Filter not working
  - Symptom: The filter has no effect or matches the wrong points
  - Diagnosis: Confirm that the Key exists in the Payload and that the RangeValue bound type matches the value type (int/float64)
- Regular expression error
  - Symptom: matchRegex returns false
  - Solution: Check the regex syntax; an invalid regex is treated as a non-match
- Storage closed or unavailable
  - Symptom: UpsertPoints/LoadCollection returns a "storage is closed" error
  - Solution: Ensure the Storage has been opened and has not been closed
Conclusion¶
GoVector's data models revolve around PointStruct, ScoredPoint, Filter, Payload, and CollectionMeta. Protobuf supplies strong typing and efficient serialization, and bbolt supplies lightweight persistence; together they form a high-performance, scalable embedded vector database. With clearly defined fields, explicit constraints, and complete filter matching logic, users can quickly build vector retrieval and filtering on local or edge devices.
Appendix: Data Examples¶
The following entries list the fields of each data structure as they appear in JSON and point to the source locations where concrete examples can be found:
- PointStruct Example
  - Fields: id, version, vector, payload
  - Payload example key-values: category (string), price (int), in_stock (bool), tags ([]string), name (string)
  - Reference path: models_test.go:8-15
- ScoredPoint Example
  - Fields: id, version, score, payload
  - Reference path: collection.go:135-147
- Filter/Condition/MatchValue/RangeValue Example
  - Filter: must/must_not arrays, each element is a Condition
  - Condition: key, type, match or range
  - MatchValue: oneof fields (string_value/int_value/double_value/bool_value/bytes_value)
  - RangeValue: gt/gte/lt/lte
  - Reference path: models_test.go:132-218, point.proto:30-49
- CollectionMeta Example
  - Fields: name, vector_size, distance, hnsw, parameters
  - Reference path: collection.go:56-66