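Collection Management

A collection is the primary unit of organization in GoVector, analogous to a table in a traditional database. This document explains how to create, configure, and manage collections effectively.

📋 Collection Overview

A collection is a container of vectors with its own configuration. Each collection has its own:

Vector configuration (dimension, distance metric)
Index strategy (HNSW or Flat)
Quantization setting (SQ8 compression)
Storage location (local directory)

🚀 Creating a Collection

Embedded Mode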

import (
    "log"

    "github.com/yourusername/govector/core"
)

// Create a new collection with a custom configuration
collection, err := core.NewCollection(core.CollectionConfig{
    Name:           "my-collection",
    VectorLen:      768,         // vector dimension
    Metric:         core.Cosine, // distance metric (Cosine, Euclidean, Dot)
    IndexType:      core.HNSW,   // index type (HNSW or Flat)
    Quantize:       false,       // enable/disable SQ8 quantization
    M:              16,          // HNSW M parameter (connections per node)
    EfConstruction: 200,         // HNSW construction parameter
    EfSearch:       10,          // HNSW search parameter
})
if err != nil {
    log.Fatalf("failed to create collection: %v", err)
}

Microservice Mode

curl -X POST http://localhost:6333/collections \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-collection",
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    },
    "hnsw_config": {
      "m": 16,
      "ef_construction": 200,
      "ef": 10
    },
    "quantization_config": {
      "enabled": false
    }
  }'
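
The same request can be issued from Go. The following is a minimal sketch, assuming the server accepts exactly the JSON body shown in the curl example; the type and function names (`createCollectionRequest`, `newCreateRequest`) are illustrative and not part of GoVector.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// The types below mirror the JSON body of the curl example.
// They are hypothetical helpers, not GoVector types.
type vectorsConfig struct {
	Size     int    `json:"size"`
	Distance string `json:"distance"`
}

type hnswConfig struct {
	M              int `json:"m"`
	EfConstruction int `json:"ef_construction"`
	Ef             int `json:"ef"`
}

type quantizationConfig struct {
	Enabled bool `json:"enabled"`
}

type createCollectionRequest struct {
	Name         string             `json:"name"`
	Vectors      vectorsConfig      `json:"vectors"`
	HNSW         hnswConfig         `json:"hnsw_config"`
	Quantization quantizationConfig `json:"quantization_config"`
}

// newCreateRequest builds a POST request to <baseURL>/collections.
func newCreateRequest(baseURL string, body createCollectionRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("encode request body: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/collections", bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newCreateRequest("http://localhost:6333", createCollectionRequest{
		Name:    "my-collection",
		Vectors: vectorsConfig{Size: 768, Distance: "Cosine"},
		HNSW:    hnswConfig{M: 16, EfConstruction: 200, Ef: 10},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(req.Method, req.URL)
	// Send it with http.DefaultClient.Do(req) once a server is listening.
}
```

Building the request separately from sending it keeps the body easy to inspect before it reaches the server.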

🔧 Collection Configuration

Core Parameters

Parameter	Description	Default	Range
`VectorLen`	Vector dimension	-	≥ 1
`Metric`	Distance metric	`Cosine`	`Cosine`, `Euclidean`, `Dot`
`IndexType`	Index type	`HNSW`	`HNSW`, `Flat`
`Quantize`	Enable SQ8 quantization	`false`	`true`, `false`
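
To make the three `Metric` options concrete, the sketch below computes each one with the textbook formulas. It is an illustration only and makes no claim about how GoVector implements or optimizes these metrics; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the inner product. Both slices are assumed to have equal length.
func dot(a, b []float32) float64 {
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return sum
}

// euclidean returns the straight-line distance between a and b.
func euclidean(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

// cosine returns the cosine similarity, or 0 if either vector has zero length.
func cosine(a, b []float32) float64 {
	na, nb := math.Sqrt(dot(a, a)), math.Sqrt(dot(b, b))
	if na == 0 || nb == 0 {
		return 0
	}
	return dot(a, b) / (na * nb)
}

func main() {
	a := []float32{1, 0}
	b := []float32{2, 0} // same direction as a, different magnitude
	fmt.Printf("cosine=%.2f dot=%.2f euclidean=%.2f\n", cosine(a, b), dot(a, b), euclidean(a, b))
	// prints "cosine=1.00 dot=2.00 euclidean=1.00"
}
```

The example shows why the choice matters: cosine similarity treats the two vectors as identical, while the dot product and Euclidean distance both react to the difference in magnitude.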

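HNSW Index Parameters

Parameter	Description	Default	Range
`M`	Connections per node	16	4-64
`EfConstruction`	Size of the dynamic candidate list during index construction	200	≥ 10
`EfSearch`	Size of the dynamic candidate list during search	10	≥ 10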
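
The ranges in the table can be checked before a collection is created. This is a small sketch under the assumption that the documented ranges are the only constraints; `hnswParams` and `validateHNSW` are hypothetical names, and GoVector may perform its own validation.

```go
package main

import "fmt"

// hnswParams is a hypothetical holder for the three HNSW settings.
type hnswParams struct {
	M              int
	EfConstruction int
	EfSearch       int
}

// validateHNSW reports the first value that falls outside the documented range.
func validateHNSW(p hnswParams) error {
	if p.M < 4 || p.M > 64 {
		return fmt.Errorf("M=%d is outside the range 4-64", p.M)
	}
	if p.EfConstruction < 10 {
		return fmt.Errorf("EfConstruction=%d is below the minimum of 10", p.EfConstruction)
	}
	if p.EfSearch < 10 {
		return fmt.Errorf("EfSearch=%d is below the minimum of 10", p.EfSearch)
	}
	return nil
}

func main() {
	fmt.Println(validateHNSW(hnswParams{M: 16, EfConstruction: 200, EfSearch: 10}))
	fmt.Println(validateHNSW(hnswParams{M: 128, EfConstruction: 200, EfSearch: 10}))
}
```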

📁 Persistence

Saving a Collection

// Save the collection to disk
if err := collection.Save("/path/to/collection"); err != nil {
    log.Fatalf("failed to save collection: %v", err)
}

Loading a Collection

// Load the collection from disk
loadedCollection, err := core.LoadCollection("/path/to/collection")
if err != nil {
    log.Fatalf("failed to load collection: %v", err)
}

📊 Collection Operations

Upserting Points

// Upsert a single point
point := core.PointStruct{
    ID:     "doc_1",
    Vector: []float32{0.1, 0.2, 0.3, /* ... */},
    Payload: core.Payload{
        "category": "tech",
    },
}

if err := collection.Upsert([]core.PointStruct{point}); err != nil {
    log.Fatalf("failed to upsert point: %v", err)
}

// Upsert multiple points
points := []core.PointStruct{
    // ... multiple points ...
}

if err := collection.Upsert(points); err != nil {
    log.Fatalf("failed to upsert points: %v", err)
}
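
When the input is very large, one option is to split it into bounded batches and upsert each batch in turn, so that no single call has to handle everything at once. The sketch below shows only the splitting step; the `point` type is a stand-in for `core.PointStruct`, and the batch size is an arbitrary value to tune, not a GoVector recommendation.

```go
package main

import "fmt"

// point is a hypothetical stand-in for core.PointStruct.
type point struct {
	ID     string
	Vector []float32
}

// chunk splits points into consecutive batches of at most size elements.
// The batches share the backing array of the input slice.
func chunk(points []point, size int) [][]point {
	if size <= 0 {
		return nil
	}
	var batches [][]point
	for start := 0; start < len(points); start += size {
		end := start + size
		if end > len(points) {
			end = len(points)
		}
		batches = append(batches, points[start:end])
	}
	return batches
}

func main() {
	points := make([]point, 5)
	for i := range points {
		points[i].ID = fmt.Sprintf("doc_%d", i+1)
	}
	for i, batch := range chunk(points, 2) {
		// In real code each batch would be passed to collection.Upsert.
		fmt.Printf("batch %d: %d points\n", i, len(batch))
	}
}
```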

Searching Points

// Search with a query vector
queryVector := []float32{0.1, 0.2, 0.3, /* ... */}
results, err := collection.Search(queryVector, nil, 10)
if err != nil {
    log.Fatalf("search failed: %v", err)
}

// Search with a filter
filter := &core.Filter{
    Must: []core.Condition{
        {
            Key:   "category",
            Type:  core.MatchTypeExact,
            Match: core.MatchValue{Value: "tech"},
        },
    },
}

results, err = collection.Search(queryVector, filter, 10)
if err != nil {
    log.Fatalf("filtered search failed: %v", err)
}

Deleting Points

// Delete by ID
ids := []string{"doc_1", "doc_2"}
deleted, err := collection.Delete(ids, nil)
if err != nil {
    log.Fatalf("failed to delete points: %v", err)
}
fmt.Printf("deleted %d points\n", deleted)

// Delete by filter
deleted, err = collection.Delete(nil, filter)
if err != nil {
    log.Fatalf("failed to delete points by filter: %v", err)
}
fmt.Printf("deleted %d points by filter\n", deleted)

Getting Points

// Get points by ID
ids := []string{"doc_1", "doc_2"}
points, err := collection.Get(ids)
if err != nil {
    log.Fatalf("failed to get points: %v", err)
}

for _, point := range points {
    fmt.Printf("ID: %s, Vector: %v\n", point.ID, point.Vector)
}

📈 Collection Statistics

Getting Collection Info

// Get collection info
info, err := collection.Info()
if err != nil {
    log.Fatalf("failed to get collection info: %v", err)
}

fmt.Printf("Collection: %s\n", info.Name)
fmt.Printf("Vector length: %d\n", info.VectorLen)
fmt.Printf("Metric: %s\n", info.Metric)
fmt.Printf("Index type: %s\n", info.IndexType)
fmt.Printf("Point count: %d\n", info.PointCount)

Microservice Mode

# Get collection info
curl http://localhost:6333/collections/my-collection

# List all collections
curl http://localhost:6333/collections

🗑️ Deleting a Collection

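Embedded Mode

// Deleting a collection removes it from memory only; its files stay on disk.
// To remove it completely, delete its directory after closing the collection.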
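
Removing the directory can be done with the standard library. The following is a minimal sketch, assuming the collection has already been closed and that all of its data lives under one directory; `removeCollectionDir` is a hypothetical helper, and the demo uses a throwaway temporary directory rather than a real collection.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// removeCollectionDir deletes dir and everything under it.
// It refuses empty, current-directory, and root paths as a basic safety check.
func removeCollectionDir(dir string) error {
	clean := filepath.Clean(dir)
	if clean == "." || clean == string(filepath.Separator) {
		return fmt.Errorf("refusing to remove %q", dir)
	}
	return os.RemoveAll(clean)
}

// demo creates a temporary directory, removes it, and confirms it is gone.
func demo() error {
	dir, err := os.MkdirTemp("", "govector-demo-")
	if err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(dir, "data.bin"), []byte("demo"), 0o644); err != nil {
		return err
	}
	if err := removeCollectionDir(dir); err != nil {
		return err
	}
	if _, err := os.Stat(dir); !os.IsNotExist(err) {
		return fmt.Errorf("directory %s still exists", dir)
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("directory removed")
}
```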

Microservice Mode

# Delete the collection
curl -X DELETE http://localhost:6333/collections/my-collection

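💡 Best Practices

Collection Design

Vector dimension: match the output dimension of your embedding model (for example, BERT-base models produce 768-dimensional vectors)
Distance metric: use Cosine when only direction matters (on normalized vectors Cosine and Dot produce the same ranking); use Euclidean when vector magnitude carries meaning
Index choice: use HNSW for large datasets and Flat for small ones (< 10,000 points)
Quantization: enable SQ8 on large datasets to reduce memory and storage use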
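
To show where the savings from SQ8 come from, here is a generic 8-bit scalar quantization sketch: each `float32` component (4 bytes) is mapped to one `uint8` code (1 byte) using a per-vector minimum and scale. This is the common textbook scheme, offered as an assumption about what SQ8 means here; GoVector's actual encoding may differ, and the `sq8`, `quantize`, and `dequantize` names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// sq8 holds one quantized vector plus the values needed to decode it.
type sq8 struct {
	Codes []uint8
	Min   float32
	Scale float32
}

// quantize maps each component linearly from [min, max] onto 0..255.
func quantize(v []float32) sq8 {
	if len(v) == 0 {
		return sq8{}
	}
	lo, hi := v[0], v[0]
	for _, x := range v[1:] {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	q := sq8{Codes: make([]uint8, len(v)), Min: lo, Scale: (hi - lo) / 255}
	if q.Scale == 0 {
		return q // constant vector: every code stays 0
	}
	for i, x := range v {
		c := math.Round(float64((x - lo) / q.Scale))
		if c > 255 {
			c = 255 // guard against floating-point overshoot
		}
		q.Codes[i] = uint8(c)
	}
	return q
}

// dequantize reconstructs an approximation of the original vector.
func dequantize(q sq8) []float32 {
	out := make([]float32, len(q.Codes))
	for i, c := range q.Codes {
		out[i] = q.Min + float32(c)*q.Scale
	}
	return out
}

func main() {
	v := []float32{-1, 0, 0.5, 1}
	q := quantize(v)
	fmt.Println("original:", v)
	fmt.Println("codes:   ", q.Codes)
	fmt.Println("restored:", dequantize(q))
}
```

The restored values differ from the originals by at most about one quantization step, which is the accuracy traded for the roughly fourfold reduction in vector storage.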

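Performance Optimization

Batch operations: write multiple points in a single batch call to improve throughput
Index parameters: tune the HNSW parameters to the dataset size:
Small datasets (10,000-100,000 points): M=12, EfConstruction=100
Medium datasets (100,000-1,000,000 points): M=16, EfConstruction=200
Large datasets (1,000,000+ points): M=24, EfConstruction=400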
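
The sizing guidance above can be captured in a small helper. This sketch simply encodes the listed tiers plus the Flat recommendation for fewer than 10,000 points; `indexSettings` and `suggestIndex` are hypothetical names, and the handling of the exact boundary values (100,000 and 1,000,000) is a choice made here, since the listed ranges overlap at those points.

```go
package main

import "fmt"

// indexSettings is a hypothetical container for the suggested values.
type indexSettings struct {
	IndexType      string
	M              int
	EfConstruction int
}

// suggestIndex maps an expected point count to the documented tiers.
// Boundary values are assigned to the larger tier.
func suggestIndex(pointCount int) indexSettings {
	switch {
	case pointCount < 10_000:
		return indexSettings{IndexType: "Flat"} // HNSW parameters do not apply
	case pointCount < 100_000:
		return indexSettings{IndexType: "HNSW", M: 12, EfConstruction: 100}
	case pointCount < 1_000_000:
		return indexSettings{IndexType: "HNSW", M: 16, EfConstruction: 200}
	default:
		return indexSettings{IndexType: "HNSW", M: 24, EfConstruction: 400}
	}
}

func main() {
	for _, n := range []int{5_000, 50_000, 500_000, 5_000_000} {
		fmt.Printf("%d points: %+v\n", n, suggestIndex(n))
	}
}
```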

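Storage Management

Backups: back up the collection directory regularly
Disk space: monitor disk usage, especially for large collections
Load time: large collections may take a while to load because the index has to be rebuilt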
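
A directory backup can be as simple as a recursive copy. The sketch below uses only the standard library and assumes the collection is not being written to while the copy runs; `copyDir` and `copyFile` are hypothetical helpers, and the file names in the demo (`segments/data.bin`) are invented for illustration, not GoVector's on-disk layout.

```go
package main

import (
	"fmt"
	"io"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

// copyFile copies a single regular file from src to dst.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// copyDir recursively copies the tree rooted at src into dst.
func copyDir(src, dst string) error {
	return filepath.WalkDir(src, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		return copyFile(path, target)
	})
}

// demo backs up a throwaway directory and verifies the copied content.
func demo() error {
	src, err := os.MkdirTemp("", "govector-src-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(src)
	dst, err := os.MkdirTemp("", "govector-backup-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dst)

	if err := os.MkdirAll(filepath.Join(src, "segments"), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(src, "segments", "data.bin"), []byte("demo"), 0o644); err != nil {
		return err
	}
	if err := copyDir(src, dst); err != nil {
		return err
	}
	got, err := os.ReadFile(filepath.Join(dst, "segments", "data.bin"))
	if err != nil {
		return err
	}
	if string(got) != "demo" {
		return fmt.Errorf("backup content mismatch: %q", got)
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("backup verified")
}
```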

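🚩 Common Issues

Dimension Mismatch

Error: "dimension mismatch"
Solution: make sure every vector has the same dimension as the collection's configured `VectorLen`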
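
One way to catch this early is to check vector lengths before calling `Upsert`. The helper below is a sketch with a hypothetical name (`checkDimensions`); it works on plain slices so it does not depend on any GoVector type.

```go
package main

import "fmt"

// checkDimensions returns an error naming the first vector whose length
// differs from the collection's configured dimension.
func checkDimensions(vectorLen int, vectors [][]float32) error {
	for i, v := range vectors {
		if len(v) != vectorLen {
			return fmt.Errorf("vector %d has dimension %d, want %d", i, len(v), vectorLen)
		}
	}
	return nil
}

func main() {
	vectors := [][]float32{
		{0.1, 0.2, 0.3},
		{0.4, 0.5}, // one component short
	}
	fmt.Println(checkDimensions(3, vectors))
}
```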

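Out of Memory

Error: "out of memory"
Solution: enable SQ8 quantization, use the Flat index for small datasets, or reduce the size of the collection

Slow Search Performance

Cause: unsuitable HNSW parameters or an overly large result set
Solution: lower `EfSearch` (a larger value improves recall but slows each query), reduce topK, or use more selective filters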

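🔗 Related Documentation

Data Model - core data structures
HNSW Index - HNSW index implementation
Storage Engine - persistence and serialization
Usage Patterns - different ways to use GoVector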