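Introduction

When building a RAG (retrieval-augmented generation) system, improving retrieval speed and accuracy is an ongoing task. Beyond the usual embedding-based vector search, adding full-text search can further improve results. This article uses PostgreSQL to walk through configuring a Chinese full-text search tokenizer, creating indexes, and querying them, and records the problems we hit in a real deployment along with their fixes.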


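I. Background

To improve retrieval quality in our RAG system, we explored a hybrid approach that combines full-text search with vector search. PostgreSQL ships with built-in full-text search and supports extensions that add tokenizers for other languages. For Chinese, we chose the zhparser tokenizer extension and paired it with the pg_textsearch extension, which provides BM25-based full-text indexes.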


II. Environment Setup: Installing the Extensions

Two extensions are required:

CREATE EXTENSION IF NOT EXISTS pg_textsearch;
CREATE EXTENSION IF NOT EXISTS zhparser;
  • pg_textsearch: provides BM25-based full-text search indexes
  • zhparser: a Chinese parser that splits Chinese text into words
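Before moving on, it can help to confirm that both extensions are actually installed in the current database. The query below is a minimal check against the standard pg_extension catalog; it assumes nothing beyond the two extension names used above.

```sql
-- List the two extensions and their installed versions in the current database.
-- Two rows are expected; a missing row means CREATE EXTENSION did not run here.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pg_textsearch', 'zhparser')
ORDER BY extname;
```

Extensions are installed per database, so this check has to be repeated in every database that will use them.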

III. Configuring the Chinese Tokenizer

1. Create the text search configuration

Create the Chinese configuration in the pg_catalog schema:

CREATE TEXT SEARCH CONFIGURATION pg_catalog.chinese (PARSER = zhparser);

2. Add token mappings

Map zhparser's token types (the part-of-speech tags a through z) to the simple dictionary:

ALTER TEXT SEARCH CONFIGURATION pg_catalog.chinese 
ADD MAPPING FOR a, b, c, d, e, f, g, h, i, j, k, l, m, 
                n, o, p, q, r, s, t, u, v, w, x, y, z 
WITH simple;
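To see what the mapping does in practice, the built-in text search helper functions can be pointed at the new parser and configuration. This is a small sketch; the sample sentence is an arbitrary illustration, and the exact tokens depend on the zhparser dictionary installed.

```sql
-- Token types exposed by the zhparser parser (the a..z tags mapped above).
SELECT tokid, alias, description
FROM ts_token_type('zhparser');

-- Raw tokens the parser produces for a sample sentence.
SELECT tokid, token
FROM ts_parse('zhparser', '检索增强生成是一种结合信息检索和文本生成的技术');

-- The tsvector built under the chinese configuration:
-- only tokens whose type has a dictionary mapping are kept.
SELECT to_tsvector('chinese', '检索增强生成是一种结合信息检索和文本生成的技术');
```

If the last query returns an empty tsvector, the most likely cause is that the ADD MAPPING step has not been applied to the configuration being used.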

3. Verify the configuration

List all text search configurations to confirm that the Chinese one uses the zhparser parser:

SELECT                                                  
    n.nspname as schema_name, 
    c.cfgname as config_name, 
    p.prsname as parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser;

Output:

scorpio=# SELECT                                                  
    n.nspname as schema_name, 
    c.cfgname as config_name, 
    p.prsname as parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser;
 schema_name | config_name | parser_name 
-------------+-------------+-------------
 pg_catalog  | simple      | default
 pg_catalog  | arabic      | default
 pg_catalog  | armenian    | default
 pg_catalog  | basque      | default
 pg_catalog  | catalan     | default
 pg_catalog  | danish      | default
 pg_catalog  | dutch       | default
 pg_catalog  | english     | default
 pg_catalog  | finnish     | default
 pg_catalog  | french      | default
 pg_catalog  | german      | default
 pg_catalog  | greek       | default
 pg_catalog  | hindi       | default
 pg_catalog  | hungarian   | default
 pg_catalog  | indonesian  | default
 pg_catalog  | irish       | default
 pg_catalog  | italian     | default
 pg_catalog  | lithuanian  | default
 pg_catalog  | nepali      | default
 pg_catalog  | norwegian   | default
 pg_catalog  | portuguese  | default
 pg_catalog  | romanian    | default
 pg_catalog  | russian     | default
 pg_catalog  | serbian     | default
 pg_catalog  | spanish     | default
 pg_catalog  | swedish     | default
 pg_catalog  | tamil       | default
 pg_catalog  | turkish     | default
 pg_catalog  | yiddish     | default
 pg_catalog  | chinese     | zhparser
(30 rows)

The output should include the chinese configuration with zhparser as its parser.
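Scanning 30 rows for one entry is easy to get wrong, so a narrower check may be more convenient. The sketch below filters the same catalogs down to the chinese configuration and also lists its token mappings from pg_ts_config_map.

```sql
-- Show only the chinese configuration, with its schema and parser.
SELECT n.nspname AS schema_name,
       c.cfgname AS config_name,
       p.prsname AS parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser
WHERE c.cfgname = 'chinese';

-- Show which dictionary each token type is mapped to
-- (maptokentype is the parser's numeric token id).
SELECT m.maptokentype, m.mapseqno, d.dictname
FROM pg_ts_config_map m
JOIN pg_ts_config c ON c.oid = m.mapcfg
JOIN pg_ts_dict d ON d.oid = m.mapdict
WHERE c.cfgname = 'chinese'
ORDER BY m.maptokentype, m.mapseqno;
```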


IV. Creating Full-Text Search Indexes

1. Create a BM25 index with the Chinese configuration

CREATE INDEX idx_chunks_content_bm25_zh 
ON alpha.chunks 
USING bm25 (content) 
WITH (text_config = 'chinese');

Output:

scorpio=# CREATE INDEX idx_chunks_content_bm25_zh ON alpha.chunks 
USING bm25 (content) 
WITH (text_config = 'chinese');
NOTICE:  BM25 index build started for relation idx_chunks_content_bm25_zh
NOTICE:  Using text search configuration: chinese
NOTICE:  Using index options: k1=1.20, b=0.75
NOTICE:  BM25 index build completed: 64 documents, avg_length=194.86, text_config='chinese' (k1=1.20, b=0.75)
CREATE INDEX

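The build log reports the text search configuration in use, the number of documents indexed (64), the average document length (194.86), and the BM25 parameters (k1=1.20, b=0.75).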
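The NOTICE lines suggest that k1 and b are index options with defaults of 1.20 and 0.75. The sketch below shows how they might be overridden at build time. It assumes pg_textsearch accepts k1 and b in the WITH clause alongside text_config, which should be checked against the documentation of the installed version; the index name and the parameter values are hypothetical.

```sql
-- Assumption: k1 and b are accepted as storage options by the bm25 access method.
-- idx_chunks_content_bm25_zh_tuned is a hypothetical name used only for illustration.
CREATE INDEX idx_chunks_content_bm25_zh_tuned
ON alpha.chunks
USING bm25 (content)
WITH (text_config = 'chinese', k1 = 1.5, b = 0.8);
```

In the standard BM25 formulation, a larger k1 lets repeated terms keep adding to the score for longer, and a larger b penalizes long documents more strongly, so chunked corpora with fairly uniform lengths usually gain little from changing b.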

2. Create an index with the English configuration as well (optional, for comparison)

CREATE INDEX idx_chunks_content_bm25_en 
ON alpha.chunks 
USING bm25(content) 
WITH (text_config='english');

The english configuration is built into PostgreSQL, so it needs no extra setup and the index builds without any further steps.
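With two indexes on the same column, it is worth confirming what was actually created. The query below reads the standard pg_indexes view and relies only on the schema and table names used above.

```sql
-- List every index on alpha.chunks together with its full definition,
-- including the access method and the WITH (...) options.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'alpha'
  AND tablename = 'chunks'
ORDER BY indexname;
```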


V. Verifying Full-Text Search

An example Chinese full-text query:

SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ 
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;

Output:

scorpio=# SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ 
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;
 id  |                                       left                                        |   score    
-----+-----------------------------------------------------------------------------------+------------
 216 | # RAG系统介绍                                                                    +| 0.51396555
     |                                                                                  +| 
     | ## 什么是RAG?                                                                   +| 
     |                                                                                  +| 
     | RAG(Retrieval-Augmented Generation,检索增强生成)是一种结合了信息检索和文本生成 | 
(1 row)

scorpio=# 

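The query returns the chunks that contain the phrase "什么是RAG" and orders them by relevance. Note that the score here comes from ts_rank, PostgreSQL's built-in ranking function, not from BM25.

EXPLAIN shows the execution plan and whether an index scan is used: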

scorpio=# EXPLAIN (ANALYZE) 
SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ 
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=33.06..33.07 rows=1 width=44) (actual time=13.349..13.351 rows=1 loops=1)
   Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on chunks  (cost=0.00..33.05 rows=1 width=44) (actual time=13.211..13.315 rows=1 loops=1)
         Filter: (to_tsvector('chinese'::regconfig, content) @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
         Rows Removed by Filter: 63
 Planning Time: 0.482 ms
 Execution Time: 13.391 ms
(8 rows)
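The Seq Scan in this plan is expected: a filter of the form to_tsvector(...) @@ tsquery is served by a GIN (or GiST) index on the same expression, and no such index exists yet. The sketch below adds one; the index name is hypothetical. With only 64 rows the planner may still prefer a sequential scan, so the example disables sequential scans for the session purely to check that the index is usable.

```sql
-- Hypothetical index name; the expression must match the one in the WHERE clause exactly.
CREATE INDEX idx_chunks_content_tsv_zh
ON alpha.chunks
USING gin (to_tsvector('chinese', content));

ANALYZE alpha.chunks;

-- For testing only: discourage sequential scans so the plan shows whether the index can be used.
SET enable_seqscan = off;

EXPLAIN (ANALYZE)
SELECT id
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ phraseto_tsquery('chinese', '什么是RAG');

RESET enable_seqscan;
```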

-- Attempt to reach the bm25 index by applying @@ directly to the content column

scorpio=# EXPLAIN (ANALYZE)
SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE content @@ phraseto_tsquery('chinese', '什么是RAG')  -- apply @@ directly to content
ORDER BY score DESC;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=32.91..32.91 rows=1 width=44) (actual time=13.940..13.941 rows=0 loops=1)
   Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on chunks  (cost=0.00..32.90 rows=1 width=44) (actual time=13.723..13.723 rows=0 loops=1)
         Filter: (content @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
         Rows Removed by Filter: 64
 Planning Time: 65.847 ms
 Execution Time: 14.656 ms
(8 rows)

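Both plans use a sequential scan. With only 64 rows that is the cheapest plan in any case, but the more important point is that the bm25 index does not apply to these queries: @@ is PostgreSQL's built-in tsvector/tsquery match operator, which is served by GIN or GiST indexes. As far as we can tell from the pg_textsearch documentation, a bm25 index is used through the extension's own <@> scoring operator instead. The second query also returns 0 rows, because applying @@ to a text column is shorthand for to_tsvector(content) with default_text_search_config, which is presumably not chinese in this database.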
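To exercise the bm25 index itself, the query has to go through pg_textsearch rather than the built-in @@ operator. The sketch below assumes the <@> operator and the to_bm25query(query, index_name) helper described in the pg_textsearch README; both should be verified against the installed version, and the index name may need schema qualification depending on search_path.

```sql
-- Assumption: pg_textsearch provides <@> and to_bm25query(); verify before relying on this.
-- Naming the index explicitly matters here because alpha.chunks has two bm25 indexes
-- (chinese and english) on the same column.
EXPLAIN (ANALYZE)
SELECT
    id,
    LEFT(content, 80) AS preview,
    content <@> to_bm25query('什么是RAG', 'idx_chunks_content_bm25_zh') AS score
FROM alpha.chunks
ORDER BY score
LIMIT 5;
```

As we understand it, the operator returns negated BM25 scores so that an ascending ORDER BY puts the best matches first, and the ORDER BY ... LIMIT shape is what allows the planner to choose an index scan.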


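VI. Key Problems and Solutions

🔧 The text search configuration must live in pg_catalog

In our setup, index creation failed when the text search configuration had been created in another schema. Creating the TEXT SEARCH CONFIGURATION in the pg_catalog schema fixed this; otherwise pg_textsearch did not recognize the configuration.

🔧 Removing a wrong configuration

If a configuration was created incorrectly (for example, chinese_zh in the public schema), remove it with: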

DROP TEXT SEARCH CONFIGURATION IF EXISTS chinese_zh CASCADE;
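The unqualified name in the DROP statement is resolved through search_path, so it can be safer to look up where the stray configuration lives and then drop it by its schema-qualified name. This is a minimal sketch; public.chinese_zh is the example name from the text above.

```sql
-- Find every configuration whose name starts with "chinese", with its schema.
SELECT c.cfgnamespace::regnamespace AS schema_name,
       c.cfgname AS config_name
FROM pg_ts_config c
WHERE c.cfgname LIKE 'chinese%';

-- Drop the unwanted one explicitly by schema.
-- CASCADE also drops dependent objects, so review the lookup result first.
DROP TEXT SEARCH CONFIGURATION IF EXISTS public.chinese_zh CASCADE;
```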

🔧 Where the configuration lives

Compare the Chinese configuration across two databases in the same PostgreSQL instance.

  • Chinese configuration in database scorpio
scorpio=# \dF
               List of text search configurations
   Schema   |    Name    |              Description              
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | armenian   | configuration for armenian language
 pg_catalog | basque     | configuration for basque language
 pg_catalog | catalan    | configuration for catalan language
 pg_catalog | chinese    | 
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | greek      | configuration for greek language
 pg_catalog | hindi      | configuration for hindi language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | serbian    | configuration for serbian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
 pg_catalog | yiddish    | configuration for yiddish language
(30 rows)

scorpio=# \dF+ chinese
Text search configuration "pg_catalog.chinese"
Parser: "public.zhparser"
 Token | Dictionaries 
-------+--------------
 a     | simple
 b     | simple
 c     | simple
 d     | simple
 e     | simple
 f     | simple
 g     | simple
 h     | simple
 i     | simple
 j     | simple
 k     | simple
 l     | simple
 m     | simple
 n     | simple
 o     | simple
 p     | simple
 q     | simple
 r     | simple
 s     | simple
 t     | simple
 u     | simple
 v     | simple
 w     | simple
 x     | simple
 y     | simple
 z     | simple

  • Chinese configuration in database hbu
scorpio=# \c hbu
You are now connected to database "hbu" as user "hbu".
hbu=# \dF
               List of text search configurations
   Schema   |    Name    |              Description              
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | armenian   | configuration for armenian language
 pg_catalog | basque     | configuration for basque language
 pg_catalog | catalan    | configuration for catalan language
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | greek      | configuration for greek language
 pg_catalog | hindi      | configuration for hindi language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | serbian    | configuration for serbian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
 pg_catalog | yiddish    | configuration for yiddish language
 public     | chinese    | 
(30 rows)

hbu=# \dF+ chinese
Text search configuration "public.chinese"
Parser: "public.zhparser"
 Token | Dictionaries 
-------+--------------
 a     | simple
 e     | simple
 i     | simple
 j     | simple
 l     | simple
 m     | simple
 n     | simple
 t     | simple
 v     | simple
 x     | simple

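A text search configuration belongs to a single database (in the PostgreSQL sense of the word). The pg_catalog.chinese configuration created in scorpio is therefore not available in hbu, which has its own public.chinese with a smaller set of token mappings. Within one database, however, the configuration can be used from any schema.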
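A quick way to see this per-database scoping is to run the same lookup after connecting to each database. The sketch below uses only standard catalogs; the sample string is an arbitrary illustration.

```sql
-- Run once per database: reports where a "chinese" configuration exists, if anywhere.
SELECT current_database() AS db,
       c.cfgnamespace::regnamespace AS schema_name,
       c.cfgname AS config_name
FROM pg_ts_config c
WHERE c.cfgname = 'chinese';

-- In hbu the configuration lives in public, so it can be referenced explicitly:
SELECT to_tsvector('public.chinese', '中文全文检索');
```

Schema-qualifying the configuration name avoids depending on search_path when the same name exists in different schemas across databases.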


VII. Summary

With this setup we have Chinese full-text search working in PostgreSQL on top of zhparser, with BM25 indexes built by pg_textsearch. The main takeaways:

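  1. The configuration must live in the system schema: in our setup, the chinese text search configuration had to be created in pg_catalog, otherwise index creation failed.
  2. Chinese and English configurations can coexist: the same column can carry indexes built with different language configurations, which suits multilingual content.
  3. BM25 has tunable parameters: the index build reports k1 and b, which can be adjusted to fit the document collection.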

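This gives the RAG system a working full-text retrieval path alongside vector search, and it is particularly useful for exact-term recall over Chinese documents.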


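This article is based on a real configuration session on PostgreSQL 17.7 with the pg_textsearch and zhparser extensions. In a production deployment, index parameters and query structure should be tuned further to match the data volume and query patterns.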
