C++之基于正倒排索引的Boost搜索引擎项目searcher部分代码及详解

该代码通过Searcher类实现轻量级搜索引擎后端核心功能：以单例模式构建倒排 / 正向索引，处理用户查询时先分词、转小写查索引，用InvertedElemPrint配合哈希表去重并累加权重，结果按权重排序后取文档信息、生成摘要，最终序列化为 JSON 返回，实现 “关键词→搜索结果” 的完整链路。

啊吧怪不啊吧

2407人浏览 · 2025-10-06 14:44:58

啊吧怪不啊吧 · 2025-10-06 14:44:58 发布

这个searcher.hpp的本质是一种使用其他文件，然后实现自己功能的一种更上层的封装。

它主要实现的是就是他用户的搜索词进行处理，接着根据这个处理结果来返回网页给用户。

1. 单例模式

这边的话我们使用的是单例模式来进行实例化。同时我们建立正倒排索引。

 private:
            ns_index::Index* index;
        public:
            Searcher(){};
            ~Searcher(){};
        public:
            void InitSearcher(const std::string& input)
            {
                //1 创建（获取）一个index对象
                //在这里我们用的是单例模式
                index=ns_index::Index::Getinstance();
                //2根据对象建立索引
                index->BuildIndex(input);
                //std::cout<<"建立索引成功"<<std::endl;
                LOG1(NORMAL,"建立索引成功...");
            }

2. Search

这个函数的主要流程包括分词、触发、合并排序和构建 JSON 结果四个步骤。

2.1 分词

分词的话我们就是先创建一个words数组，然后使用CutSreing函数把用户提供的关键字分词并交给words。

2.2 触发

这一部分的话获取在单例模式中的倒排索引，通过index->GetInvertedList(w)获取关键词w对应的倒排索引列表，然后把他交给tokens_map，又因为tokens_map是哈希结构，所以会自动实现去重。那个to_lower是用来实现小写化的，因为我们不可以说把用户输入的大写和小写区分成两个关键字。

for(const auto &elem : *inverted_list)里面的item是加了&的，所以实际上加的是tokens_map。

2.3 合并

这边就是把处理好的倒排拉链全部交给inverted_list_all，这边之所以要交给它是因为vector访问更方便，接着把他们根据自身的权重来进行从大到小排序。

2.4 构建 JSON

这一步就是序列化，简单来说就是把我们的内容转变成标准的、线性的、可存储 / 可传输的格式。当然最后也需要通过反序列化来读取正确的内容。这边序列化的代码我们不自己写了，而是通过Json的非标准库来实现。把单个序列化的结果交给总的root。

Json::StyledWriter writer; 这行代码的作用是创建一个 “格式化的 JSON 写入器”，用于将内存中的 JSON 对象（如代码中的root）转换为带缩进、换行的可读性强的 JSON 字符串，最后通过write交给json_string这个字符串。

PS:在这份代码里面一会是获取正排索引，一会是获取倒排索引，但是各位有没有发现，他们都和inverted_list_all有关系，inverted_list_all的本质是对多个倒排拉链进行 “去重、合并、排序” 后的候选文档集合，是连接 “索引查询” 和 “结果返回” 的中间数据结构。

//queue是要搜索的关键字
            //json_string是返回给用户的搜索结果
            void Search(const std::string& query,std::string* json_string)
            {
                //1. 分词
                std::vector<std::string>words;
                ns_util::JiebaUsutl::CutString(query,&words);//所以jieba里面的函数来进行分词
                //2. 触发
                //ns_index::InvertedList inverted_list_all;

                std::vector<InvertedElemPrint> inverted_list_all;//用来存放去重后的关键字
                std::unordered_map<uint64_t,InvertedElemPrint> tokens_map;
                //这个哈希的作用是用来去重
                for(std::string w:words)
                {
                    boost::to_lower(w);//这一步的作用是忽略大小写

                    ns_index::InvertedList* inverted_list=index->GetInvertedList(w);

                    if(inverted_list==nullptr)
                        continue;
                    //inverted_list_all.insert(inverted_list_all.end(),inverted_list->begin(),inverted_list->end());
                
                    for(const auto &elem : *inverted_list){
                    //这个for循环就相当于把这个inverted_list里面的值交给这个哈希
                        auto &item = tokens_map[elem.doc_id]; //[]:如果存在直接获取，如果不存在新建
                        //注意：这个item是&，所以实际上加的是tokens_map的[elem.doc_id]，这样写是为了避免重复查找，提高效率
                        //item一定是doc_id相同的print节点
                        item.doc_id = elem.doc_id;
                        item.weight += elem.weight;
                        item.words.push_back(elem.word);
                    }
                }
                   //3. 合并
                 for(const auto &item : tokens_map)//这边就是把通过哈希去重分好后的词交给inverted_list_all
                    inverted_list_all.push_back(std::move(item.second));
               /* std::sort(inverted_list_all.begin(),inverted_list_all.end(),
                [](const ns_index::InvertedElem& e1,const ns_index::InvertedElem& e2){//根据weight来从大到小排序
                    return e1.weight>e2.weight;
                    
                });*/
                std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                    [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
                        return e1.weight > e2.weight;
                });
                //4. 构建
                Json::Value root;
                for(auto& item:inverted_list_all)
                {
                    ns_index::DocInfo* doc=index->GetForwardIndex(item.doc_id);
                    if(doc==nullptr)
                        continue;
                    Json::Value elem;//通过json来进行序列化
                    elem["title"]=doc->title;
                    elem["desc"]=GetDesc(doc->content,item.words[0]);
                    elem["url"]=doc->url;//这是通过json然后可以将root序列化为JSON格式的字符串接着搜索结果以JSON形式返回。

                    root.append(elem);
                }
                //
                Json::StyledWriter writer;
                *json_string=writer.write(root);//完成序列化
            }

3. GetDesc

这段代码就是寻找摘要，简单来说就是在html_content里面查找是否出现了关键字word。就相当于我们搜索一个词时，浏览器帮我们选出和这个词相关的网页信息的过程。

下面那几个if条件判断就是在确认这条网页信息的程度够不够，如果不够的话就直接全部拿过来，因为如果说没有怎么长我们还前50后100，那就会越界访问。

 //这个函数的作用就是找摘要，比如说关键词K,那么我们就把K前面和后面的一部分内容作为摘要
            std::string GetDesc(const std::string& html_content,const std::string& word)
            {
                int prev_step=50;
                int next_step=100;

                auto iter=std::search(html_content.begin(),html_content.end(),word.begin(),word.end(),[](int x,int y){
                    return (std::tolower(x)==std::tolower(y));
                });//在 html_content 这个字符串中查找 word 这个子序列
                //return (std::tolower(x)==std::tolower(y))用来忽略大小写
                if(iter==html_content.end())
                    return "None1";
                int pos=std::distance(html_content.begin(),iter);//计算找到的子序列在 html_content 中的起始位置 pos。
                if(pos==std::string::npos)
                    return  "None1";
                int start=0;
                int end=html_content.size()-1;
                if(pos-prev_step>start)
                    start=pos-prev_step;
                if(pos+next_step<end)
                    end=pos+next_step;
                
                if(start>=end) return "None2";
                std::string desc=html_content.substr(start,end-start);
                desc+="...";
                return desc;

            }

4. InvertedElemPrint

这个结构体的出现就是为了解决重复文档的问题。作为 “去重与信息聚合” 的载体，解决 “同一文档因匹配多个关键词而重复出现” 的问题。

通过 doc_id（文档唯一标识）作为核心依据，配合哈希表（tokens_map）实现 “同一文档只被记录一次”。当用户查询包含多个关键词时，同一文档可能匹配多个关键词并出现在多个倒排拉链中，InvertedElemPrint 借助哈希表的键唯一性，避免文档在结果中重复出现。

struct InvertedElemPrint{//这个结构体的作用就是用来解决重复文档的问题
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;

        InvertedElemPrint():
        doc_id(0),
        weight(0)
        {}
    };

5. 总结

通过 “索引构建” 提前做好数据准备，通过 “查询处理” 高效匹配并排序文档，通过 “结果格式化” 衔接前端展示，最终实现 “用户输入关键词→得到清晰搜索结果” 的完整功能，以下是它的完整代码：

#pragma once
#include"index.hpp"
#include"usuallytool.hpp"
#include<algorithm>
#include<jsoncpp/json/json.h>
#include"log.hpp"

namespace ns_searcher{

    struct InvertedElemPrint{//这个结构体的作用就是用来解决重复文档的问题
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;

        InvertedElemPrint():
        doc_id(0),
        weight(0)
        {}
    };
    class Searcher{
        private:
            ns_index::Index* index;
        public:
            Searcher(){};
            ~Searcher(){};
        public:
            void InitSearcher(const std::string& input)
            {
                //1 创建（获取）一个index对象
                //在这里我们用的是单例模式
                index=ns_index::Index::Getinstance();
                //2根据对象建立索引
                index->BuildIndex(input);
                //std::cout<<"建立索引成功"<<std::endl;
                LOG1(NORMAL,"建立索引成功...");
            }
            //queue是要搜索的关键字
            //json_string是返回给用户的搜索结果
            void Search(const std::string& query,std::string* json_string)
            {
                //1. 分词
                std::vector<std::string>words;
                ns_util::JiebaUsutl::CutString(query,&words);//所以jieba里面的函数来进行分词
                //2. 触发
                //ns_index::InvertedList inverted_list_all;

                std::vector<InvertedElemPrint> inverted_list_all;//用来存放去重后的关键字
                std::unordered_map<uint64_t,InvertedElemPrint> tokens_map;
                //这个哈希的作用是用来去重
                for(std::string w:words)
                {
                    boost::to_lower(w);//这一步的作用是忽略大小写

                    ns_index::InvertedList* inverted_list=index->GetInvertedList(w);

                    if(inverted_list==nullptr)
                        continue;
                    //inverted_list_all.insert(inverted_list_all.end(),inverted_list->begin(),inverted_list->end());
                
                    for(const auto &elem : *inverted_list){
                    //这个for循环就相当于把这个inverted_list里面的值交给这个哈希
                        auto &item = tokens_map[elem.doc_id]; //[]:如果存在直接获取，如果不存在新建
                        //注意：这个item是&，所以实际上加的是tokens_map的[elem.doc_id]，这样写是为了避免重复查找，提高效率
                        //item一定是doc_id相同的print节点
                        item.doc_id = elem.doc_id;
                        item.weight += elem.weight;
                        item.words.push_back(elem.word);
                    }
                }

                //3. 合并
                for(const auto &item : tokens_map)//这边就是把通过哈希去重分好后的词交给inverted_list_all
                    inverted_list_all.push_back(std::move(item.second));
               /* std::sort(inverted_list_all.begin(),inverted_list_all.end(),
                [](const ns_index::InvertedElem& e1,const ns_index::InvertedElem& e2){//根据weight来从大到小排序
                    return e1.weight>e2.weight;
                    
                });*/
                std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                    [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
                        return e1.weight > e2.weight;
                });
                //4. 构建
                Json::Value root;
                for(auto& item:inverted_list_all)
                {
                    ns_index::DocInfo* doc=index->GetForwardIndex(item.doc_id);
                    if(doc==nullptr)
                        continue;
                    Json::Value elem;//通过json来进行序列化
                    elem["title"]=doc->title;
                    elem["desc"]=GetDesc(doc->content,item.words[0]);
                    elem["url"]=doc->url;//这是通过json然后可以将root序列化为JSON格式的字符串接着搜索结果以JSON形式返回。

                    root.append(elem);
                }
                //
                Json::StyledWriter writer;
                *json_string=writer.write(root);//完成序列化
            }
            //这个函数的作用就是找摘要，比如说关键词K,那么我们就把K前面和后面的一部分内容作为摘要
            std::string GetDesc(const std::string& html_content,const std::string& word)
            {
                int prev_step=50;
                int next_step=100;

                auto iter=std::search(html_content.begin(),html_content.end(),word.begin(),word.end(),[](int x,int y){
                    return (std::tolower(x)==std::tolower(y));
                });//在 html_content 这个字符串中查找 word 这个子序列
                //return (std::tolower(x)==std::tolower(y))用来忽略大小写
                if(iter==html_content.end())
                    return "None1";
                int pos=std::distance(html_content.begin(),iter);//计算找到的子序列在 html_content 中的起始位置 pos。
                if(pos==std::string::npos)
                    return  "None1";
                int start=0;
                int end=html_content.size()-1;
                if(pos-prev_step>start)
                    start=pos-prev_step;
                if(pos+next_step<end)
                    end=pos+next_step;
                
                if(start>=end) return "None2";
                std::string desc=html_content.substr(start,end-start);
                desc+="...";
                return desc;

            }
    };

}