Elasticsearch倒排索引原理

约 1813 字大约 6 分钟

elasticsearchinverted-index

2025-05-15

倒排索引(Inverted Index)是 Elasticsearch 实现全文搜索的核心数据结构。与传统数据库的 B-tree 索引从文档找词不同，倒排索引从词找文档，这使得全文搜索的效率极高。

倒排索引基本结构

核心组件

Term Dictionary (词项字典)

Term Dictionary 存储所有唯一的词项(Term)，并维护到 Postings List 的映射。

为了高效查找 Term，Elasticsearch 使用了两层索引结构：

Term Index (FST)：内存中的前缀树，快速定位 Term 在 Term Dictionary 中的大致位置
Term Dictionary：磁盘上的有序 Term 块，通过 Term Index 定位后顺序扫描

FST (Finite State Transducer)

FST 是一种高效的有限状态转换器，用作 Term Index。它在内存中压缩存储所有 Term 的前缀，占用空间极小。

FST 特点：

共享前缀和后缀，极度压缩
常驻内存，查询速度极快
输出的是 Term Dictionary 的 Block 偏移量

Postings List (倒排列表)

每个 Term 对应一个 Postings List，包含：

组件	说明	用途
Doc IDs	包含该 Term 的文档 ID 列表	文档匹配
Term Frequency (TF)	该 Term 在每个文档中的出现次数	相关性评分
Position	Term 在文档中的位置	短语查询/高亮
Offset	Term 在文档中的字符偏移量	高亮

// Term "redis" 的 Postings List
{
  "doc_ids": [1, 3, 7, 15, 23],
  "term_frequencies": [2, 1, 3, 1, 2],
  "positions": {
    "1": [0, 15],
    "3": [0],
    "7": [3, 8, 20],
    "15": [5],
    "23": [0, 12]
  }
}

Postings List 压缩

Elasticsearch 使用多种压缩算法减少 Postings List 的存储开销：

Roaring Bitmap：当 Postings List 非常长时，Elasticsearch 使用 Roaring Bitmap 进行高效的集合运算(AND/OR)。

Doc Values

Doc Values 是倒排索引的补充，提供列式存储用于排序和聚合。

// 映射配置
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"  // 自动启用 doc_values
          }
        }
      },
      "description": {
        "type": "text",
        "doc_values": false  // text 类型默认不启用 doc_values
      },
      "price": {
        "type": "float"  // 数值类型默认启用 doc_values
      }
    }
  }
}

Segment 结构

Elasticsearch 的索引由多个 Segment 组成，每个 Segment 是一个完整的倒排索引，且一旦写入不可修改(immutable)。

Segment 不可变性

优势	说明
无需锁	读取不需要加锁
缓存友好	可以被 OS 文件系统缓存
压缩率高	已知数据不变，可以深度压缩

代价	说明
删除是标记	在 .del 文件标记，空间未释放
更新 = 删除 + 新增	写放大
Segment 数量增长	需要定期合并

Segment Merge (合并)

// 合并策略配置
{
  "settings": {
    "index": {
      "merge": {
        "policy": {
          "max_merge_at_once": 10,
          "max_merged_segment": "5gb",
          "segments_per_tier": 10
        },
        "scheduler": {
          "max_thread_count": 1
        }
      }
    }
  }
}

强制合并

# 强制合并到指定数量的 Segment（用于只读索引）
POST /my-index/_forcemerge?max_num_segments=1

# 查看 Segment 信息
GET /my-index/_segments

近实时搜索 (Near Real-Time)

Elasticsearch 不是严格的实时搜索，而是"近实时"(NRT)。文档写入后需要经过 Refresh 才能被搜索到。

// Refresh 间隔配置
{
  "settings": {
    "index": {
      "refresh_interval": "1s"  // 默认 1 秒
    }
  }
}

// 大量写入时临时关闭 Refresh 提升性能
// PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}

// 手动触发 Refresh
// POST /my-index/_refresh

分析器 (Analyzer) 与索引过程

文档在建立倒排索引前需要经过分析器处理：

// 自定义分析器
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  }
}

# 测试分析器效果
POST /_analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Fox jumps over the Lazy Dog"
}
# 输出: [the, quick, brown, fox, jumps, over, the, lazy, dog]

性能优化

索引时优化

{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "index_options": "docs"  // 仅存储 doc_id，不存储 TF/Position
        // docs: 仅 doc_id
        // freqs: doc_id + TF
        // positions: doc_id + TF + Position (默认)
        // offsets: 全部
      },
      "log_message": {
        "type": "text",
        "norms": false  // 不需要评分的字段关闭 norms
      },
      "status_code": {
        "type": "keyword",
        "doc_values": true,
        "index": true
      }
    }
  }
}

查询时优化

// 避免 wildcard 前缀匹配
// 不推荐
{ "wildcard": { "name": "*smith" } }

// 推荐：使用 reverse token filter + 前缀匹配
// 或者使用 ngram tokenizer

总结

倒排索引由 Term Dictionary + Postings List 组成，是全文搜索的基础
FST 作为 Term Index 常驻内存，实现快速的 Term 查找
Doc Values 提供列式存储，用于排序和聚合
Segment 不可变性带来了并发安全和缓存友好的优势，代价是需要定期合并
Refresh 机制导致搜索是"近实时"的(默认 1 秒延迟)
合理配置分析器和字段映射是优化搜索性能的关键

贡献者

withesse

更新日志

2026/3/14 13:09

查看所有更新日志

9f6c2-feat: organize wiki content and refresh site setup于 2026/3/14