Configuring and Optimizing a Chinese Tokenizer for PostgreSQL Full-Text Search

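This article describes how to configure Chinese full-text search in PostgreSQL to improve retrieval quality in a RAG system. It installs the pg_textsearch and zhparser extensions, creates a Chinese text search configuration, and builds a BM25-based full-text index. Example queries show that Chinese text is segmented and matched, and EXPLAIN is used to check whether the index is actually used. Detailed SQL configuration and verification steps are included as a workable starting point for Chinese full-text search in RAG systems.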

MarsBighead · Published 2026-01-20 16:10:15

Introduction

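When building a RAG (retrieval-augmented generation) system, improving retrieval efficiency and accuracy is an ongoing optimization task. Beyond the usual embedding-based vector retrieval, adding full-text search can further improve results. Using PostgreSQL, this article walks through configuring a Chinese tokenizer for full-text search, creating the index, and querying it, and records the problems we ran into in a real deployment along with their solutions.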

I. Background

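To improve retrieval in our RAG system, we explored a hybrid approach that combines full-text search with vector retrieval. PostgreSQL ships with capable full-text search and supports extensions that add tokenization for other languages. For Chinese, we chose the zhparser parser extension and paired it with the pg_textsearch extension to build a full-text index based on the BM25 algorithm.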

II. Environment Setup: Installing the Extensions

Two extensions need to be installed first:

CREATE EXTENSION IF NOT EXISTS pg_textsearch;
CREATE EXTENSION IF NOT EXISTS zhparser;

pg_textsearch: provides BM25-based full-text search
zhparser: a Chinese parser that segments Chinese text into words
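
As a quick sanity check after installation, the installed extensions and their versions can be read from the pg_extension catalog. This is a minimal sketch; it assumes both extensions were installed under the names used above.

```sql
-- List the two extensions if they are installed in the current database.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pg_textsearch', 'zhparser')
ORDER BY extname;
```

If either row is missing, the corresponding CREATE EXTENSION did not take effect in this database.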

III. Configuring the Chinese Tokenizer

1. Create the text search configuration

Create the Chinese text search configuration in the pg_catalog schema:

CREATE TEXT SEARCH CONFIGURATION pg_catalog.chinese (PARSER = zhparser);

2. Add token mappings

Map zhparser's part-of-speech token types to the simple dictionary:

ALTER TEXT SEARCH CONFIGURATION pg_catalog.chinese 
ADD MAPPING FOR a, b, c, d, e, f, g, h, i, j, k, l, m, 
                n, o, p, q, r, s, t, u, v, w, x, y, z 
WITH simple;
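
To see how the configuration actually segments text, a sample string can be passed through to_tsvector and ts_debug. The sketch below assumes the chinese configuration created above; the sample sentence is only an illustration, and the exact tokens depend on the zhparser dictionary in use.

```sql
-- Tokens and positions produced by the 'chinese' configuration.
SELECT to_tsvector('chinese', '什么是RAG检索增强生成');

-- Per-token detail: token type alias, the token text, and the resulting lexemes.
SELECT alias, token, lexemes
FROM ts_debug('chinese', '什么是RAG检索增强生成');
```

A token whose lexemes column is NULL was not handled by any dictionary, which usually means its token type has no mapping and the token will not be searchable.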

3. Verify the configuration

List all text search configurations and confirm that the Chinese one uses the zhparser parser:

SELECT                                                  
    n.nspname as schema_name, 
    c.cfgname as config_name, 
    p.prsname as parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser;

Result:

scorpio=# SELECT                                                  
    n.nspname as schema_name, 
    c.cfgname as config_name, 
    p.prsname as parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser;
 schema_name | config_name | parser_name 
-------------+-------------+-------------
 pg_catalog  | simple      | default
 pg_catalog  | arabic      | default
 pg_catalog  | armenian    | default
 pg_catalog  | basque      | default
 pg_catalog  | catalan     | default
 pg_catalog  | danish      | default
 pg_catalog  | dutch       | default
 pg_catalog  | english     | default
 pg_catalog  | finnish     | default
 pg_catalog  | french      | default
 pg_catalog  | german      | default
 pg_catalog  | greek       | default
 pg_catalog  | hindi       | default
 pg_catalog  | hungarian   | default
 pg_catalog  | indonesian  | default
 pg_catalog  | irish       | default
 pg_catalog  | italian     | default
 pg_catalog  | lithuanian  | default
 pg_catalog  | nepali      | default
 pg_catalog  | norwegian   | default
 pg_catalog  | portuguese  | default
 pg_catalog  | romanian    | default
 pg_catalog  | russian     | default
 pg_catalog  | serbian     | default
 pg_catalog  | spanish     | default
 pg_catalog  | swedish     | default
 pg_catalog  | tamil       | default
 pg_catalog  | turkish     | default
 pg_catalog  | yiddish     | default
 pg_catalog  | chinese     | zhparser
(30 rows)

The output should include the chinese configuration with zhparser as its parser.

IV. Creating the Full-Text Index

1. Create a BM25 index with the Chinese configuration

CREATE INDEX idx_chunks_content_bm25_zh 
ON alpha.chunks 
USING bm25 (content) 
WITH (text_config = 'chinese');

Result:

scorpio=# CREATE INDEX idx_chunks_content_bm25_zh ON alpha.chunks 
USING bm25 (content) 
WITH (text_config = 'chinese');
NOTICE:  BM25 index build started for relation idx_chunks_content_bm25_zh
NOTICE:  Using text search configuration: chinese
NOTICE:  Using index options: k1=1.20, b=0.75
NOTICE:  BM25 index build completed: 64 documents, avg_length=194.86, text_config='chinese' (k1=1.20, b=0.75)
CREATE INDEX

The build log reports the text search configuration in use, the number of documents, the average document length, and the BM25 parameters (k1=1.20, b=0.75).
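
For readers unfamiliar with the two parameters in the log, the standard Okapi BM25 scoring function is shown below. This is the textbook form; whether pg_textsearch uses exactly this variant (for instance, the same IDF definition) is not confirmed by the output above.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length, and avgdl is the average document length (194.86 in the build log). k1 controls how quickly term frequency saturates, and b controls how strongly scores are normalized by document length.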

2. Create an index with the English configuration as well (optional, for comparison)

CREATE INDEX idx_chunks_content_bm25_en 
ON alpha.chunks 
USING bm25(content) 
WITH (text_config='english');

The english configuration is built into PostgreSQL, so no extra setup is needed and the index builds without issues.

V. Verifying Full-Text Search

An example Chinese full-text query:

SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ 
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;

Result:

scorpio=# SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ 
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;
 id  |                                       left                                        |   score    
-----+-----------------------------------------------------------------------------------+------------
 216 | # RAG系统介绍                                                                    +| 0.51396555
     |                                                                                  +| 
     | ## 什么是RAG？                                                                   +| 
     |                                                                                  +| 
     | RAG（Retrieval-Augmented Generation，检索增强生成）是一种结合了信息检索和文本生成 | 
(1 row)

scorpio=#

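The query returns the chunks that match the phrase “什么是RAG” ("What is RAG"), ordered by their ts_rank score.

EXPLAIN shows the execution plan and whether an index scan is used: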

scorpio=# EXPLAIN (ANALYZE) 
SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ 
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=33.06..33.07 rows=1 width=44) (actual time=13.349..13.351 rows=1 loops=1)
   Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on chunks  (cost=0.00..33.05 rows=1 width=44) (actual time=13.211..13.315 rows=1 loops=1)
         Filter: (to_tsvector('chinese'::regconfig, content) @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
         Rows Removed by Filter: 63
 Planning Time: 0.482 ms
 Execution Time: 13.391 ms
(8 rows)
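
The filter in this plan uses the built-in @@ operator on to_tsvector('chinese', content). In stock PostgreSQL, that expression is normally served by a GIN expression index rather than a bm25 index. A minimal sketch follows, assuming the same alpha.chunks table; the index name idx_chunks_content_gin_zh is made up for illustration.

```sql
-- Expression index matching the to_tsvector('chinese', content) call in the query.
CREATE INDEX idx_chunks_content_gin_zh
ON alpha.chunks
USING gin (to_tsvector('chinese', content));

-- Refresh planner statistics, then re-check the plan.
ANALYZE alpha.chunks;

EXPLAIN
SELECT id
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@
      phraseto_tsquery('chinese', '什么是RAG');
```

On a table of 64 rows the planner may still choose a sequential scan, because reading a handful of pages is cheaper than going through any index.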

-- Attempt to make the planner use the bm25 index by matching on content directly

scorpio=# EXPLAIN (ANALYZE)
SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE content @@ phraseto_tsquery('chinese', '什么是RAG')  -- match on content directly
ORDER BY score DESC;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=32.91..32.91 rows=1 width=44) (actual time=13.940..13.941 rows=0 loops=1)
   Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on chunks  (cost=0.00..32.90 rows=1 width=44) (actual time=13.723..13.723 rows=0 loops=1)
         Filter: (content @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
         Rows Removed by Filter: 64
 Planning Time: 65.847 ms
 Execution Time: 14.656 ms
(8 rows)

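Both plans show a sequential scan, so neither query used the bm25 index. With only 64 rows the planner would likely prefer a sequential scan in any case, but the index also does not apply here: @@ is the operator of PostgreSQL's built-in full-text search, and as far as we can tell a bm25 index does not serve it. The second query also returns 0 rows, most likely because content @@ tsquery on a text column implicitly calls to_tsvector(content) with default_text_search_config rather than with the chinese configuration.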
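
For reference, pg_textsearch exposes BM25 ranking through its own <@> operator rather than through @@. The sketch below follows the usage as we understand it from the pg_textsearch documentation (the to_bm25query function and ordering by <@>); treat the exact syntax as an assumption and check it against the installed version.

```sql
-- Rank chunks by BM25 using the Chinese index created earlier.
-- Assumption: <@> yields a value where smaller means more relevant,
-- so an ascending ORDER BY puts the best matches first.
SELECT
    id,
    LEFT(content, 80) AS preview,
    content <@> to_bm25query('什么是RAG', 'idx_chunks_content_bm25_zh') AS score
FROM alpha.chunks
ORDER BY content <@> to_bm25query('什么是RAG', 'idx_chunks_content_bm25_zh')
LIMIT 5;
```

Running the same statement under EXPLAIN is the way to confirm whether idx_chunks_content_bm25_zh is actually scanned.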

VI. Key Issues and Solutions

🔧 The text search configuration must live in `pg_catalog`

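In our setup, creating the text search configuration in any other schema could cause index creation to fail. The TEXT SEARCH CONFIGURATION had to be created in the pg_catalog schema; otherwise the pg_textsearch extension did not recognize it.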

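🔧 Dropping a misconfigured configuration

If a configuration was created incorrectly (for example, chinese_zh ended up in the public schema), it can be removed with: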

DROP TEXT SEARCH CONFIGURATION IF EXISTS chinese_zh CASCADE;
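
Before dropping anything, it can help to confirm which schema each configuration lives in. The statements below are a small sketch using the pg_ts_config catalog; the LIKE pattern is just an example filter.

```sql
-- Show every configuration whose name starts with 'chinese', with its schema.
SELECT cfgnamespace::regnamespace AS schema_name,
       cfgname AS config_name
FROM pg_ts_config
WHERE cfgname LIKE 'chinese%';

-- Schema-qualify the name so only the intended configuration is dropped.
DROP TEXT SEARCH CONFIGURATION IF EXISTS public.chinese_zh CASCADE;
```

Note that CASCADE also drops objects that depend on the configuration, so the list is worth checking first.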

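🔧 Where the configuration lives

To check this, we compared the Chinese configuration in two databases on the same PostgreSQL instance.

Chinese configuration in database scorpio: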

scorpio=# \dF
               List of text search configurations
   Schema   |    Name    |              Description              
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | armenian   | configuration for armenian language
 pg_catalog | basque     | configuration for basque language
 pg_catalog | catalan    | configuration for catalan language
 pg_catalog | chinese    | 
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | greek      | configuration for greek language
 pg_catalog | hindi      | configuration for hindi language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | serbian    | configuration for serbian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
 pg_catalog | yiddish    | configuration for yiddish language
(30 rows)

scorpio=# \dF+ chinese
Text search configuration "pg_catalog.chinese"
Parser: "public.zhparser"
 Token | Dictionaries 
-------+--------------
 a     | simple
 b     | simple
 c     | simple
 d     | simple
 e     | simple
 f     | simple
 g     | simple
 h     | simple
 i     | simple
 j     | simple
 k     | simple
 l     | simple
 m     | simple
 n     | simple
 o     | simple
 p     | simple
 q     | simple
 r     | simple
 s     | simple
 t     | simple
 u     | simple
 v     | simple
 w     | simple
 x     | simple
 y     | simple
 z     | simple

Chinese configuration in database hbu:

scorpio=# \c hbu
You are now connected to database "hbu" as user "hbu".
hbu=# \dF
               List of text search configurations
   Schema   |    Name    |              Description              
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | armenian   | configuration for armenian language
 pg_catalog | basque     | configuration for basque language
 pg_catalog | catalan    | configuration for catalan language
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | greek      | configuration for greek language
 pg_catalog | hindi      | configuration for hindi language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | serbian    | configuration for serbian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
 pg_catalog | yiddish    | configuration for yiddish language
 public     | chinese    | 
(30 rows)

hbu=# \dF+ chinese
Text search configuration "public.chinese"
Parser: "public.zhparser"
 Token | Dictionaries 
-------+--------------
 a     | simple
 e     | simple
 i     | simple
 j     | simple
 l     | simple
 m     | simple
 n     | simple
 t     | simple
 v     | simple
 x     | simple

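A text search configuration such as chinese belongs to a single database (a database in the PostgreSQL sense), so it cannot be used from another database on the same instance; each database needs its own copy. Within one database, however, it can be shared across schemas.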
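
Because the configuration is per-database, the setup has to be repeated in every database that needs it. The statements below sketch that for a second database such as hbu; they are meant to be run while connected to that database, and they assume the zhparser extension is available on the server and that the role has sufficient privileges to create objects in pg_catalog.

```sql
-- Run while connected to the target database (for example, after \c hbu in psql).
CREATE EXTENSION IF NOT EXISTS zhparser;

-- Raises an error if pg_catalog.chinese already exists in this database.
CREATE TEXT SEARCH CONFIGURATION pg_catalog.chinese (PARSER = zhparser);

-- ALTER MAPPING replaces any existing mapping for these token types,
-- so this statement can be re-run without error.
ALTER TEXT SEARCH CONFIGURATION pg_catalog.chinese
ALTER MAPPING FOR a, b, c, d, e, f, g, h, i, j, k, l, m,
                  n, o, p, q, r, s, t, u, v, w, x, y, z
WITH simple;
```

The hbu output above maps only a subset of token types in public.chinese, so mapping all of them here keeps tokenization consistent with the scorpio database.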