
feat: Introduce SuperFilter (超级滤镜), a highly customizable, high-performance pipeline candidate-processing engine #1148

Open
amzxyz wants to merge 2 commits into rime:master from amzxyz:master

Conversation

@amzxyz

@amzxyz amzxyz commented Mar 8, 2026

💡 Motivation

Long experience customizing and using the Rime input method shows enormous user demand for "post-processing" of candidates (e.g., attaching Emoji, Chinese-English mixed-input hints, Simplified/Traditional and dialect conversion, forced insertion of specific abbreviations).

To serve this demand, this project abstracts and implements SuperFilter: a highly customizable filter engine written in pure native C++. With a fast low-level architecture and flexible configuration logic, it provides lightweight, high-performance text filtering and secondary processing, makes custom data and ongoing maintenance dramatically simpler, and aims to be a powerful member of Rime's filter toolchain.

The feature has been tested extensively in the Wanxiang Pinyin schema with excellent performance; the hope is that it can become a major asset in the librime ecosystem:
https://github.com/amzxyz/rime_wanxiang/blob/wanxiang/lua/wanxiang/super_replacer.lua


⚙️ Architecture & Pipeline

SuperFilter builds a highly efficient data flow on native compilation and a single-file memory-mapped (mmap) engine:

  • Native pipeline, flat design: following the Unix philosophy that one component does one thing, independent namespaces are mounted in sequence under Rime's engine/filters, folding the data flow into Rime's native engine. Configuration is fully flattened, with no deep nesting.
  • Single-file compilation: the source data format stays minimal, k\tv1|v2|v3 (multiple lines with the same key are merged automatically). Behind the scenes the engine compiles it into a compact single binary .bin file, keeping the data small and tightly laid out.
  • True streaming: a multi-queue dispatcher loads candidates on demand (lazy evaluation). It never allocates a global array or performs a full re-sort, so latency stays low no matter how large the dictionary is.
  • Self-consistent dynamic state: each namespace can bind to a Rime option switch; the engine queries the switch state on every keystroke and instantly attaches or detaches the corresponding processing branch.
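As a standalone sketch of the k\tv1|v2|v3 source format described above (the helper names here are illustrative, not the PR's actual parser), a line splits into a tab-separated key and pipe-separated values, with repeated keys merged:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Split "v1|v2|v3" on the configured delimiter.
static std::vector<std::string> SplitValues(const std::string& s, char delim) {
  std::vector<std::string> out;
  size_t start = 0;
  while (start <= s.size()) {
    size_t pos = s.find(delim, start);
    if (pos == std::string::npos) pos = s.size();
    if (pos > start) out.push_back(s.substr(start, pos - start));
    start = pos + 1;
  }
  return out;
}

// Parse one "key\tv1|v2|v3" line and merge values under the same key,
// mirroring the "same key on multiple lines is auto-merged" behavior.
static void AddLine(std::unordered_map<std::string, std::vector<std::string>>& dict,
                    const std::string& line) {
  size_t tab = line.find('\t');
  if (tab == std::string::npos) return;  // malformed line: skip
  std::string key = line.substr(0, tab);
  for (auto& v : SplitValues(line.substr(tab + 1), '|'))
    dict[key].push_back(std::move(v));
}
```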

🧰 Modules & Modes

The engine ships with four core modes and a rich set of sub-mode combinations, enough to cover the vast majority of text-processing needs:

1. mode: replace

  • Function: replaces the current candidate text with the Value from the dictionary.
  • Typical uses: Simplified/Traditional conversion (s2t, s2hk, s2tw), automatic typo correction.
  • Sub-mode sentence: true: enables forward-maximum-matching (FMM) segmentation, so that besides single characters, long unsegmented sentences also receive precise partial replacement.
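The sentence: true behavior can be illustrated with a minimal forward-maximum-matching pass over an in-memory map (a sketch for illustration only; the PR's implementation runs against its compiled database). It walks the text by UTF-8 character boundaries and greedily tries the longest window first:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Byte offsets of each UTF-8 character boundary, plus the final length.
static std::vector<size_t> Utf8Offsets(const std::string& text) {
  std::vector<size_t> offsets;
  size_t i = 0;
  while (i < text.size()) {
    offsets.push_back(i);
    unsigned char c = text[i];
    i += (c < 0x80) ? 1 : (c < 0xE0) ? 2 : (c < 0xF0) ? 3 : 4;
  }
  offsets.push_back(text.size());
  return offsets;
}

// Forward maximum matching: at each position, try the longest window
// (up to 6 characters, as in the PR's SegmentConvert) then shrink.
static std::string FmmConvert(
    const std::string& text,
    const std::unordered_map<std::string, std::string>& dict) {
  std::vector<size_t> off = Utf8Offsets(text);
  size_t n = off.size() - 1, i = 0;
  std::string result;
  while (i < n) {
    bool matched = false;
    for (size_t j = std::min(i + 6, n); j > i; --j) {
      auto it = dict.find(text.substr(off[i], off[j] - off[i]));
      if (it != dict.end()) {
        result += it->second;  // replace the matched span
        i = j;
        matched = true;
        break;
      }
    }
    if (!matched) {  // no dictionary hit: copy the single character through
      result += text.substr(off[i], off[i + 1] - off[i]);
      ++i;
    }
  }
  return result;
}
```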

2. mode: append

  • Function: keeps the original candidate and appends new candidate branches after it.
  • Typical uses: typing "哈哈" yields "1.哈哈 2.😄 3.🐸"; typing an English word appends Chinese translation candidates.

3. mode: comment

  • Function: leaves the candidate text untouched and operates only on its comment.
  • Sub-mode comment_mode:
    • none: clears the comment.
    • append: inherits the native pinyin comment.
    • text: replaces the original comment entirely; combined with comment_format (e.g. 〔%s〕) it produces clean UI layout (for example, apple -> 苹果〔apple〕).
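The comment_format mechanism is plain placeholder substitution, as in this small sketch (the helper name is ours; the substitution logic mirrors what the PR's code does with "%s"):

```cpp
#include <cassert>
#include <string>

// Substitute the payload into comment_format at the "%s" placeholder,
// e.g. "〔%s〕" + "apple" -> "〔apple〕".
static std::string FormatComment(std::string fmt, const std::string& payload) {
  size_t pos = fmt.find("%s");
  if (pos == std::string::npos) return payload;  // no placeholder: use payload as-is
  fmt.replace(pos, 2, payload);
  return fmt;
}
```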

4. mode: abbrev

  • Function: uses dedicated weight and queue logic to force specified entries into particular positions in the candidate list.
  • Parameter scheme (format: type,position-or-score,count):
    • order: "index,2,1": absolute positioning. Native frequency is ignored and the first matched abbreviation is forced into candidate slot 2.
    • order: "quality,110,6": dynamic scoring. The first 6 abbreviations receive a quality of 110 and sort naturally within the ranking system.
    • Fallback entries: abbreviations beyond the configured count are assigned quality=0 and sink to the bottom; they stay out of the way during normal typing yet still serve as a fallback when nothing else matches.
    • Key conversion, t9_mode: true: in this mode, letters in nine-key (T9) input are converted to digit codes when the database is built (e.g. lss\t老实说 becomes 577\t老实说==lss), and the preedit is restored by splitting on the marker, solving the maintenance and preedit-display pain points of nine-key abbreviations.
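The order parameter can be parsed in three comma-separated fields, consistent with the semantics above (struct and function names here are illustrative, not the PR's):

```cpp
#include <cassert>
#include <string>

// Parsed form of e.g. order: "index,2,1" or "quality,110,6".
struct OrderSpec {
  std::string type;  // "index" (absolute slot) or "quality" (score)
  int value = 0;     // target slot or quality score
  int count = 1;     // number of entries the rule applies to
};

// Returns false on malformed input instead of throwing.
static bool ParseOrder(const std::string& spec, OrderSpec* out) {
  size_t c1 = spec.find(',');
  if (c1 == std::string::npos) return false;
  size_t c2 = spec.find(',', c1 + 1);
  try {
    out->type = spec.substr(0, c1);
    out->value = std::stoi(spec.substr(
        c1 + 1, (c2 == std::string::npos ? spec.size() : c2) - c1 - 1));
    if (c2 != std::string::npos) out->count = std::stoi(spec.substr(c2 + 1));
  } catch (...) {
    return false;  // std::stoi failed on a non-numeric field
  }
  return (out->type == "index" || out->type == "quality");
}
```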

In addition, global tag control builds dedicated data per tag (e.g. comment decoration for reverse-lookup databases), and the customizable cand_type on produced candidates simplifies downstream processing (e.g. letting specific candidates pass through), opening the door to many conveniences.

🛡️ Safeguards

To reach production quality, the underlying C++ code applies strict defensive programming and careful architectural design:

  1. Zero-copy mmap engine: the on-disk .bin file is mapped directly into the operating system's virtual memory. Combined with pointer-offset access and pure in-memory binary search, read-only typing gets extremely fast lookups without heap growth or leaks.
  2. Single-file build output: the compressed build artifact is written straight into the build/ directory. The layout is clean, matches Rime's static build-cache design, and avoids file conflicts during Sync.
  3. Isolated hot reload and multi-instance caching: namespaces are isolated as separate physical files, so editing one dictionary triggers a millisecond-level hot reload of only its .bin file and never disturbs other pipelines. A built-in multi-instance cache absorbs the context-rebuild cost of frequent window switching.
  4. System-level crash resistance: multiple low-level defenses are built in, including four-byte alignment protection (against Bus Errors), safe truncation of incomplete UTF-8 sequences, and mmap boundary checks. Even if an external dictionary file is corrupted, the engine silently passes native candidates through and the host process stays stable.
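The "safe truncation of incomplete UTF-8" defense can be sketched as follows (a standalone illustration under our own naming, not the PR's exact routine): when a cut point would land inside a multi-byte sequence, it is pulled back to the last complete character boundary:

```cpp
#include <cassert>
#include <string>

// Back the cut point off until it lands on a UTF-8 lead byte (or 0),
// so truncation never leaves a dangling continuation byte.
static size_t SafeUtf8Cut(const std::string& bytes, size_t cut) {
  if (cut >= bytes.size()) return bytes.size();
  while (cut > 0 && (static_cast<unsigned char>(bytes[cut]) & 0xC0) == 0x80)
    --cut;  // 0b10xxxxxx marks a continuation byte: keep stepping back
  return cut;
}
```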

🌟 Summary

SuperFilter's real strength is the combination of high-level abstraction in its external configuration and its single-file memory-mapped core. It gives the input method a high degree of flexibility: by categorizing data, decoupling modes, and arranging namespaces at the macro level, it raises text customization to a new level.
For concrete configuration ideas, see the parameter descriptions below and the real-world data under lua/data in the Wanxiang project.

📝 Configuration Example

    - super_filter@emoji     # append Emoji
    - super_filter@en_cn     # append Chinese-English translations
    - super_filter@others    # always-on custom replace/append
    - super_filter@abbrev
    - super_filter@s2t       # first convert all Simplified passing through into standard Traditional; could later replace the OpenCC preset data
    - super_filter@s2hk      # takes standard Traditional; in s2hk mode, converts to Hong Kong variants
    - super_filter@s2tw      # takes standard Traditional; in s2tw mode, converts to Taiwan variants

# Scenario 1: typing '哈哈' -> becomes '1.哈哈 2.😄'
emoji:
  db_name: emoji_db
  delimiter: "|"
  option: emoji
  cand_type: emoji
  mode: append            # add new candidates
  comment_mode: none      # no comment formatting needed
  tags: [abc]
  files:
    - lua/data/emoji.txt

# Scenario 2: typing 'hello' -> shows 'hello 〔你好 | 哈喽〕'
en_cn:
  db_name: en_cn_db
  delimiter: "|"
  comment_format: "〔%s〕" # keep comment formatting
  option: chinese_english
  mode: append
  comment_mode: text
  tags: [abc]
  files:
    - lua/data/english_chinese.txt
    - lua/data/chinese_english.txt

# Scenario 3: always-on direct replacement
others:
  db_name: others_db
  delimiter: "|"
  option: true            # always on
  mode: append
  comment_mode: none
  tags: [abc]
  files:
    - lua/data/others.txt

# 2. Abbreviation & idiom matching (queue-jump injection)

abbrev:
  db_name: abbrev_db
  delimiter: "|"
  option: abbrev
  mode: abbrev
  tags: [abc]
  cand_type: abbrev
  order: "quality,2,3"  # quality score 2, applied to the first 3 entries
  files:
    - lua/data/abbrev.txt
    - lua/data/chengyu.txt

# 3. Traditional/variant conversion (relay pipeline in the Rime engine)

# Step A: Simplified to Traditional
s2t:
  db_name: s2t_db
  delimiter: "|"
  comment_format: "〔%s〕"
  option: [ s2t, s2hk, s2tw ] # active whenever any of these three switches is on
  mode: replace           # replace the original candidate
  cand_type: abbrev
  comment_mode: append
  sentence: true          # sentence-level replacement (FMM algorithm)
  tags: [abc]
  files:
    - lua/data/STCharacters.txt
    - lua/data/STPhrases.txt

# Step B: Traditional to Hong Kong variants (relays the output above)
s2hk:
  db_name: s2hk_db
  delimiter: "|"
  comment_format: "〔%s〕"
  option: s2hk
  mode: replace
  comment_mode: append
  sentence: true
  tags: [abc]
  files:
    - lua/data/HKVariants.txt

# Step C: Traditional to Taiwan variants (relays the output above)
s2tw:
  db_name: s2tw_db
  delimiter: "|"
  comment_format: "〔%s〕"
  option: s2tw
  mode: replace
  comment_mode: append
  sentence: true
  tags: [abc]
  files:
    - lua/data/TWVariants.txt

🚀 Future Roadmap

As the tool matures, further abstraction and improvements in performance and usability are planned:

  1. Bundled resources with implicit abstraction: common resources such as the OpenCC databases could ship built in. Instead of writing rules by hand, users would reference a single global configuration name (mirroring the existing OpenCC JSON), which the engine would automatically expand into multiple invisible pipeline stages, greatly lowering the barrier to entry.
  2. Externalized configuration: allow the large rules configuration to live in a dedicated external file, possibly a JSON preset, to unburden a bloated schema.yaml.


Copilot AI left a comment


Pull request overview

This PR introduces a new built-in Rime gear filter, SuperFilter (超级滤镜), aiming to provide a high-performance, highly configurable, native C++ pipeline for candidate post-processing (replace/append/comment/abbrev), backed by a LevelDB cache and configurable rule sets.

Changes:

  • Add SuperFilter / SuperFilterTranslation implementing streaming candidate processing plus abbrev injection logic.
  • Add rule parsing + LevelDB build/rebuild with file-signature-based invalidation and a process-level DB cache.
  • Register super_filter as a gears module component.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

  • src/rime/gear/super_filter.h: Declares the SuperFilter rule model, translation wrapper, and filter component interface.
  • src/rime/gear/super_filter.cc: Implements config parsing, the streaming translation pipeline, abbrev injection, and LevelDB signature/rebuild logic.
  • src/rime/gear/gears_module.cc: Registers the super_filter component in the gears module.


@amzxyz amzxyz force-pushed the master branch 2 times, most recently from 80fa10a to c068390 Compare March 9, 2026 05:46
@lotem
Member

lotem commented Mar 10, 2026

Rather than one all-encompassing Filter, how about splitting it into single-purpose Filters?
Flatten rules to remove one nesting level and configure them through the existing engine/filters; that way each specific function can also be combined with other Filters.

If generality is not the goal and this only implements existing needs, it could be made a librime plugin instead.

@amzxyz
Author

amzxyz commented Mar 10, 2026

Rather than one all-encompassing Filter, how about splitting it into single-purpose Filters? Flatten rules to remove one nesting level and configure them through the existing engine/filters; that way each specific function can also be combined with other Filters.

If generality is not the goal and this only implements existing needs, it could be made a librime plugin instead.

The original intent was to fit the system's existing flow, but unified data management became a problem, and so did coupling between stages. This implementation is strong precisely because it pulls out a branch and performs many combinations there; I also doubted that database usage would be supported otherwise. For inclusion here, generality certainly needs discussion. One observation: built-in features such as the charset filter or single-character priority simply cannot be coupled with this kind of business logic, and in my view their generality is poor. So I worry that genuinely tiny components would not actually deliver generality, but would instead flatten away the valuable properties here, namely coupling control and file separation. Of course, this needs deeper thought.

@amzxyz
Author

amzxyz commented Mar 10, 2026

Rather than one all-encompassing Filter, how about splitting it into single-purpose Filters?
Flatten rules to remove one nesting level and configure them through the existing engine/filters; that way each specific function can also be combined with other Filters.

If generality is not the goal and this only implements existing needs, it could be made a librime plugin instead.

I need advice on whether userdb is suitable here, and whether separating namespaces would cause multi-threaded database access problems.

@amzxyz
Author

amzxyz commented Mar 10, 2026

🔽 userdb version (kept for reference):

```cpp

// librime/src/rime/gear/super_filter.cc
#include <rime/gear/super_filter.h>
#include <rime/context.h>
#include <rime/engine.h>
#include <rime/schema.h>
#include <rime/candidate.h>
#include <rime/config.h>
#include <rime/dict/db.h>
#include <rime_api.h>
#include <algorithm>
#include <cctype>
#include <chrono>
#include <unordered_map>
#include <filesystem>
#include <fstream>
#include <mutex>
#include <vector>

namespace rime {

// Process-level global cache for the LevelDb instance.
// Prevents high I/O latency and stuttering when Rime frequently recreates sessions (e.g., switching windows).
struct SuperDbCache {
  an<Db> db;
  std::string db_name;
  std::string files_sig;
};

static SuperDbCache& GetGlobalDbCache() {
static SuperDbCache cache;
return cache;
}

static std::vector<size_t> GetUtf8Offsets(const std::string& text) {
std::vector<size_t> offsets;
size_t i = 0;
while (i < text.length()) {
offsets.push_back(i);
unsigned char c = text[i];
if (c < 0x80) i += 1;
else if (c < 0xE0) i += 2;
else if (c < 0xF0) i += 3;
else i += 4;
}
offsets.push_back(text.length());
return offsets;
}

static std::vector<std::string> Split(const std::string& str, const std::string& delim) {
std::vector<std::string> tokens;
if (str.empty()) return tokens;
if (delim.empty()) {
tokens.push_back(str);
return tokens;
}
size_t prev = 0, pos = 0;
do {
pos = str.find(delim, prev);
if (pos == std::string::npos) pos = str.length();
std::string token = str.substr(prev, pos - prev);
if (!token.empty()) tokens.push_back(token);
prev = pos + delim.length();
} while (pos <= str.length() && prev < str.length());
return tokens;
}

SuperFilterTranslation::SuperFilterTranslation(
    an<Translation> inner,
    const std::vector<SuperRule>& rules,
    an<Db> db,
    Context* ctx,
    const std::string& delimiter,
    const std::string& comment_format,
    bool is_chain)
    : inner_(inner), rules_(rules), db_(db), ctx_(ctx),
      delimiter_(delimiter), comment_format_(comment_format), is_chain_(is_chain) {

size_t start = 0;
size_t end = ctx_->input().length();
if (inner_ && !inner_->exhausted()) {
    auto first_cand = inner_->Peek();
    if (first_cand) {
        start = first_cand->start();
        end = first_cand->end();
    }
}

// Pre-calculate abbreviation matches upon input change.
std::string seg_input = ctx_->input().substr(start, end - start);
GenerateAbbrevCandidates(seg_input, start, end);
UpdateExhausted();

}

void SuperFilterTranslation::UpdateExhausted() {
// Lazy evaluation: fetch only the required amount of candidates from the inner translation.
while (pending_candidates_.empty() && !inner_->exhausted()) {
ProcessNextInner();
}
set_exhausted(index_cands_.empty() && quality_cands_.empty() &&
pending_candidates_.empty() && lazy_cands_.empty() && inner_->exhausted());
}

an<Candidate> SuperFilterTranslation::Peek() {
if (exhausted()) return nullptr;

// Dispatch priority:
// 1. Exact index matches (forced insertion)
if (!index_cands_.empty() && (yield_count_ + 1) >= index_cands_.front().value) {
    return index_cands_.front().cand;
}

// 2. Quality threshold matches (dynamic insertion)
if (!quality_cands_.empty()) {
    if (pending_candidates_.empty() || pending_candidates_.front()->quality() < quality_cands_.front().value) {
        return quality_cands_.front().cand;
    }
}

// 3. Regular pipeline candidates
if (!pending_candidates_.empty()) {
    return pending_candidates_.front();
}

// 4. Flush remaining priority queues if the main pipeline is exhausted
if (!index_cands_.empty()) return index_cands_.front().cand;
if (!quality_cands_.empty()) return quality_cands_.front().cand;

// 5. Fallback candidates (quality=0)
if (!lazy_cands_.empty()) {
    return lazy_cands_.front();
}

return nullptr;

}

bool SuperFilterTranslation::Next() {
if (exhausted()) return false;

an<Candidate> p = Peek();
if (!index_cands_.empty() && p == index_cands_.front().cand) index_cands_.pop_front();
else if (!quality_cands_.empty() && p == quality_cands_.front().cand) quality_cands_.pop_front();
else if (!pending_candidates_.empty() && p == pending_candidates_.front()) pending_candidates_.pop_front();
else if (!lazy_cands_.empty() && p == lazy_cands_.front()) lazy_cands_.pop_front();

yield_count_++;
UpdateExhausted();
return !exhausted();

}

// Forward Maximum Matching algorithm for segmenting and replacing long phrases.
std::string SuperFilterTranslation::SegmentConvert(const std::string& text, const std::string& prefix, bool sentence) {
if (!db_) return text;

if (!sentence) {
    std::string val;
    if (db_->Fetch(prefix + text, &val)) {
        auto parts = Split(val, delimiter_);
        return parts.empty() ? text : parts[0];
    }
    return text;
}

std::vector<size_t> offsets = GetUtf8Offsets(text);
if (offsets.size() <= 1) return text;
size_t char_count = offsets.size() - 1;
std::string result;
size_t i = 0;
const size_t MAX_LOOKAHEAD = 6;

while (i < char_count) {
    bool matched = false;
    size_t max_j = std::min(i + MAX_LOOKAHEAD, char_count);
    for (size_t j = max_j; j > i; --j) {
        std::string sub_text = text.substr(offsets[i], offsets[j] - offsets[i]);
        std::string val;
        if (db_->Fetch(prefix + sub_text, &val)) {
            auto parts = Split(val, delimiter_);
            result += parts.empty() ? sub_text : parts[0];
            i = j;
            matched = true;
            break;
        }
    }
    if (!matched) {
        std::string single = text.substr(offsets[i], offsets[i+1] - offsets[i]);
        std::string val;
        if (db_->Fetch(prefix + single, &val)) {
            auto parts = Split(val, delimiter_);
            result += parts.empty() ? single : parts[0];
        } else {
            result += single;
        }
        i++;
    }
}
return result;

}

void SuperFilterTranslation::GenerateAbbrevCandidates(const std::string& input_code, size_t start, size_t end) {
if (!db_) return;
abbrev_yielded_.clear();

for (const auto& r : rules_) {
    if (r.mode == "abbrev") {
        // Tag filtering and prefix truncation protection
        if (!r.tags.empty()) {
            bool is_tag_match = false;
            size_t seg_start = 0;
            
            if (!ctx_->composition().empty()) {
                const auto& seg = ctx_->composition().back();
                seg_start = seg.start;
                for (const auto& req_tag : r.tags) {
                    if (seg.HasTag(req_tag)) { 
                        is_tag_match = true; 
                        break; 
                    }
                }
            }
            
            bool is_pure_chars = std::all_of(input_code.begin(), input_code.end(), 
                                             [](unsigned char c){ return std::isalnum(c); });

            if (!is_tag_match || seg_start != 0 || !is_pure_chars) {
                continue; 
            }
        }

        bool is_active = r.always_on;
        if (!is_active) {
            // Dynamically check Rime options state
            for (const auto& opt : r.options) {
                if (ctx_->get_option(opt)) { is_active = true; break; }
            }
        }
        if (!is_active) continue;

        std::string val;
        if (!db_->Fetch(r.prefix + input_code, &val)) {
            std::string upper_code = input_code;
            for (auto& c : upper_code) {
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            }
            db_->Fetch(r.prefix + upper_code, &val);
        }

        if (!val.empty()) {
            auto parts = Split(val, delimiter_);
            int count = 0;
            for (const auto& p : parts) {
                std::string item_text = p;
                std::string item_preedit = input_code;

                if (abbrev_yielded_.count(item_text)) continue;
                abbrev_yielded_.insert(item_text);
                count++;

                auto cand = New<SimpleCandidate>(r.cand_type, start, end, item_text, "");
                cand->set_preedit(item_preedit);

                if (count <= r.always_qty) {
                    if (r.order_type == "index") {
                        cand->set_quality(999);
                        index_cands_.push_back({cand, r.order_value + (count - 1)});
                    } else if (r.order_type == "quality") {
                        cand->set_quality(r.order_value);
                        quality_cands_.push_back({cand, r.order_value});
                    }
                } else {
                    // Fallback candidates sink to the bottom with quality 0
                    cand->set_quality(0);
                    lazy_cands_.push_back(cand);
                }
            }
        }
    }
}

std::sort(index_cands_.begin(), index_cands_.end(), [](const InjectCand& a, const InjectCand& b) {
    return a.value < b.value;
});
std::sort(quality_cands_.begin(), quality_cands_.end(), [](const InjectCand& a, const InjectCand& b) {
    return a.value > b.value;
});

}

struct CandData {
std::string text;
std::string comment;
std::string cand_type;
bool is_original;
};

void SuperFilterTranslation::ProcessNextInner() {
if (inner_->exhausted()) return;
auto cand = inner_->Peek();
inner_->Next();

if (!cand) return;

std::vector<CandData> current_items;
current_items.push_back({cand->text(), cand->comment(), cand->type(), true});

if (db_) {
    for (const auto& r : rules_) {
        if (r.mode == "abbrev") continue;

        // Strict tag filtering for regular replacement/append/comment modes
        if (!r.tags.empty()) {
            bool is_tag_match = false;
            if (!ctx_->composition().empty()) {
                const auto& seg = ctx_->composition().back();
                for (const auto& req_tag : r.tags) {
                    if (seg.HasTag(req_tag)) { 
                        is_tag_match = true; 
                        break; 
                    }
                }
            }
            if (!is_tag_match) continue; 
        }

        bool is_active = r.always_on;
        if (!is_active) {
            for (const auto& opt : r.options) {
                if (ctx_->get_option(opt)) { is_active = true; break; }
            }
        }
        if (!is_active) continue;

        std::vector<CandData> next_items;

        for (const auto& item : current_items) {
            std::string val;
            if (r.sentence) {
                std::string fmm_res = SegmentConvert(item.text, r.prefix, true);
                if (fmm_res != item.text) val = fmm_res;
            } else {
                db_->Fetch(r.prefix + item.text, &val);
            }

            if (!val.empty()) {
                auto parts = Split(val, delimiter_);
                if (r.t9_mode) {
                    for (auto& p : parts) {
                        size_t delim_pos = p.find("==");
                        if (delim_pos != std::string::npos) {
                            p = p.substr(0, delim_pos);
                        }
                    }
                }
                
                std::string rule_comment = "";
                if (r.comment_mode == "text" && !item.text.empty()) {
                    std::string cfmt = comment_format_;
                    size_t pos = cfmt.find("%s");
                    if (pos != std::string::npos) {
                        cfmt.replace(pos, 2, item.text);
                        rule_comment = cfmt;
                    } else {
                        rule_comment = item.text;
                    }
                } else if (r.comment_mode == "append") {
                    rule_comment = item.comment;
                }

                if (r.mode == "replace") {
                    for (size_t i = 0; i < parts.size(); ++i) {
                        std::string final_comment = (i == 0 && r.comment_mode == "none") ? "" : rule_comment;
                        std::string ctype = (i == 0 && item.is_original) ? item.cand_type : r.cand_type;
                        next_items.push_back({parts[i], final_comment, ctype, false});
                    }
                } else if (r.mode == "append") {
                    next_items.push_back(item);
                    for (const auto& p : parts) {
                        std::string final_comment = (r.comment_mode == "none") ? "" : rule_comment;
                        next_items.push_back({p, final_comment, r.cand_type, false});
                    }
                } else if (r.mode == "comment") {
                    std::string joined;
                    for(size_t i = 0; i < parts.size(); ++i) { 
                        joined += parts[i] + (i < parts.size() - 1 ? " " : ""); 
                    }
                    
                    std::string cfmt = comment_format_;
                    size_t pos = cfmt.find("%s");
                    if (pos != std::string::npos) {
                        cfmt.replace(pos, 2, joined);
                    } else {
                        cfmt = joined;
                    }
                    
                    std::string new_comment;
                    if (r.comment_mode == "none") {
                        new_comment = ""; 
                    } else if (r.comment_mode == "text") {
                        new_comment = cfmt; 
                    } else {
                        new_comment = item.comment + cfmt; 
                    }
                    
                    next_items.push_back({item.text, new_comment, item.cand_type, item.is_original});
                }
            } else {
                next_items.push_back(item);
            }
        }

        // Pipeline flow control: Hand over the payload to the next rule if chain mode is enabled.
        if (is_chain_) {
            current_items = std::move(next_items);
        } else {
            std::vector<CandData> parallel_merged;
            for (const auto& og : current_items) parallel_merged.push_back(og);
            for (const auto& nx : next_items) {
                if (!nx.is_original) parallel_merged.push_back(nx);
            }
            current_items = std::move(parallel_merged);
        }
    }
}

for (const auto& result : current_items) {
    auto nc = New<SimpleCandidate>(result.cand_type, cand->start(), cand->end(), result.text, result.comment);
    nc->set_quality(cand->quality());
    nc->set_preedit(cand->preedit());
    pending_candidates_.push_back(nc);
}

}

SuperFilter::SuperFilter(const Ticket& ticket) : Filter(ticket) {
if (ticket.schema) {
LoadConfig(ticket.schema->config());
InitializeDb();
}
}

SuperFilter::~SuperFilter() {
// The globally cached LevelDb instance remains open across session lifetimes.
}

void SuperFilter::LoadConfig(Config* config) {
config->GetString("super_filter/db_name", &db_name_);

if (!db_name_.empty()) {
    db_name_ = std::filesystem::path(db_name_).filename().string();
}

if (db_name_.empty() || db_name_ == "." || db_name_ == "..") {
    db_name_ = "super_filter";
}
db_name_ = "data/" + db_name_;

config->GetString("super_filter/delimiter", &delimiter_);
if (delimiter_.empty()) delimiter_ = "|";

config->GetString("super_filter/comment_format", &comment_format_);
if (comment_format_.empty()) comment_format_ = "〔%s〕";

config->GetBool("super_filter/chain", &chain_);

auto root = config->GetItem("super_filter/rules");
if (auto rule_list = As<ConfigList>(root)) {
    for (size_t i = 0; i < rule_list->size(); ++i) {
        auto item = As<ConfigMap>(rule_list->GetAt(i));
        if (!item) continue;

        SuperRule rule;

        if (auto name_val = As<ConfigValue>(item->Get("name"))) {
            rule.name = name_val->str();
        } else {
            rule.name = "Rule_" + std::to_string(i + 1);
        }

        auto opt_node = item->Get("option");
        if (auto opt_val = As<ConfigValue>(opt_node)) {
            if (opt_val->str() == "true") {
                rule.always_on = true;
            } else if (opt_val->str() == "false") {
                // Explicitly frozen rule, option vector remains empty.
            } else {
                rule.options.push_back(opt_val->str());
            }
        } else if (auto opt_list = As<ConfigList>(opt_node)) {
            for (size_t j = 0; j < opt_list->size(); ++j) {
                if (auto v = As<ConfigValue>(opt_list->GetAt(j))) {
                    rule.options.push_back(v->str());
                }
            }
        }

        // Discard disabled or misconfigured rules during the parse phase to save CPU cycles.
        if (!rule.always_on && rule.options.empty()) {
            LOG(INFO) << "super_filter: [" << rule.name << "] frozen or missing option, safely ignored.";
            continue;
        }
        
        auto tag_node = item->Get("tags");
        if (!tag_node) tag_node = item->Get("tag");

        if (auto tag_val = As<ConfigValue>(tag_node)) {
            rule.tags.push_back(tag_val->str());
        } else if (auto tag_list = As<ConfigList>(tag_node)) {
            for (size_t j = 0; j < tag_list->size(); ++j) {
                if (auto v = As<ConfigValue>(tag_list->GetAt(j))) {
                    rule.tags.push_back(v->str());
                }
            }
        }
        
        if (auto mode_val = As<ConfigValue>(item->Get("mode"))) rule.mode = mode_val->str();
        else rule.mode = "append";

        if (rule.mode != "append" && rule.mode != "replace" && rule.mode != "comment" && rule.mode != "abbrev") {
            LOG(WARNING) << "super_filter: [" << rule.name << "] unsupported mode '" << rule.mode << "', skipping.";
            continue;
        }

        if (auto sent_val = As<ConfigValue>(item->Get("sentence"))) {
            if (sent_val->str() == "true") rule.sentence = true;
        }

        if (auto pre_val = As<ConfigValue>(item->Get("prefix"))) {
            rule.prefix = pre_val->str();
        } else {
            rule.prefix = "";
        }

        if (auto cmod_val = As<ConfigValue>(item->Get("comment_mode"))) rule.comment_mode = cmod_val->str();
        else rule.comment_mode = "none";

        if (auto ctype_val = As<ConfigValue>(item->Get("cand_type"))) rule.cand_type = ctype_val->str();
        else rule.cand_type = "derived";

        if (auto t9_val = As<ConfigValue>(item->Get("t9_mode"))) {
            rule.t9_mode = (t9_val->str() == "true");
            if (rule.t9_mode && rule.mode != "abbrev") {
                LOG(WARNING) << "super_filter: [" << rule.name << "] t9_mode restricted to abbrev mode. Ignored.";
                rule.t9_mode = false;
            }
        }

        if (rule.mode == "abbrev") {
            auto ord_val = As<ConfigValue>(item->Get("order"));
            if (!ord_val) {
                LOG(WARNING) << "super_filter: [" << rule.name << "] missing 'order' parameter in abbrev mode.";
                continue; 
            }
            auto parts = Split(ord_val->str(), ",");
            if (parts.size() < 2) {
                LOG(WARNING) << "super_filter: [" << rule.name << "] malformed 'order' format.";
                continue;
            }
            try {
                rule.order_type = parts[0];
                rule.order_value = std::stoi(parts[1]);
                if (parts.size() >= 3) rule.always_qty = std::stoi(parts[2]);
            } catch (...) {
                LOG(WARNING) << "super_filter: [" << rule.name << "] parse exception in 'order'.";
                continue; 
            }
        } 

        auto files_node = item->Get("files");
        if (!files_node || (!As<ConfigList>(files_node) && !As<ConfigValue>(files_node))) {
            LOG(WARNING) << "super_filter: [" << rule.name << "] missing 'files' dependency.";
            continue;
        }

        if (auto files_list = As<ConfigList>(item->Get("files"))) {
            for(size_t j = 0; j < files_list->size(); ++j) {
                if (auto f = As<ConfigValue>(files_list->GetAt(j))) {
                    std::string filepath = f->str();
                    if (!filepath.empty() && 
                        filepath.front() != '/' && filepath.front() != '\\' && 
                        filepath.find("..") == std::string::npos) {
                        rule.files.push_back(filepath);
                    } else {
                        LOG(WARNING) << "super_filter: [" << rule.name << "] Invalid file path ignored: " << filepath;
                    }
                }
            }
        }
        if (rule.files.empty()) {
            LOG(WARNING) << "super_filter: [" << rule.name << "] No valid files available, skipping rule.";
            continue;
        }
        
        rules_.push_back(rule);
    }
}

}

// Generates a stringent signature combining prefixes, file paths, and system attributes
// to accurately trigger database rebuilds only when necessary.
std::string SuperFilter::GenerateFilesSignature() {
std::string sig = "delim:" + delimiter_ + "||";
std::string user_dir = string(rime_get_api()->get_user_data_dir());
std::error_code ec_exist;

for (const auto& rule : rules_) {
    sig += "t9:" + std::to_string(rule.t9_mode) + "@";
    for (const auto& path : rule.files) {
        sig += "prefix:" + rule.prefix + "@path:" + path + "=";
        std::filesystem::path full_path = user_dir + "/" + path;
        
        if (std::filesystem::exists(full_path, ec_exist) && !ec_exist) {
            std::error_code ec_time;
            auto ftime = std::filesystem::last_write_time(full_path, ec_time);
            
            std::error_code ec_size;
            auto fsize = std::filesystem::file_size(full_path, ec_size);
            
            if (!ec_time && !ec_size) {
                auto time_sec = std::chrono::duration_cast<std::chrono::seconds>(ftime.time_since_epoch()).count();
                sig += std::to_string(fsize) + "_" + std::to_string(time_sec) + "|";
            }
        }
    }
}
return sig;

}

static std::mutex g_db_cache_mutex;

void SuperFilter::InitializeDb() {
std::lock_guard<std::mutex> lock(g_db_cache_mutex);
auto& cache = GetGlobalDbCache();
std::string current_sig = GenerateFilesSignature();

// Cache Hit: Instantly mount the pre-opened LevelDb to eliminate I/O lag.
if (cache.db && cache.db_name == db_name_ && cache.files_sig == current_sig) {
    db_ = cache.db;
    return;
}

std::string user_dir = string(rime_get_api()->get_user_data_dir());
std::error_code ec_dir;
std::filesystem::create_directories(user_dir + "/data", ec_dir);
if (ec_dir) {
    LOG(ERROR) << "super_filter: Failed to create data directory '" << (user_dir + "/data") 
               << "': " << ec_dir.message();
    return;
}

auto* db_component = Db::Require("userdb");
if (!db_component) return;

an<Db> new_db = an<Db>(db_component->Create(db_name_));
if (!new_db) return;

bool need_rebuild = false;

if (new_db->OpenReadOnly()) {
    std::string db_sig;
    new_db->MetaFetch("_files_sig", &db_sig);
    if (db_sig != current_sig) need_rebuild = true;
    new_db->Close();
} else {
    need_rebuild = true;
}

if (need_rebuild) {
    if (new_db->Open()) {
        LOG(INFO) << "super_filter: Database schema updated, initiating LevelDb rebuild...";
        db_ = new_db;
        RebuildDb();
        new_db->MetaUpdate("_files_sig", current_sig);
        LOG(INFO) << "super_filter: LevelDb rebuild complete.";
        new_db->Close();
        db_.reset();
    }
}

if (new_db->OpenReadOnly()) {
    cache.db = new_db;
    cache.db_name = db_name_;
    cache.files_sig = current_sig;
    db_ = new_db;
}

}

// Data structure for in-memory sorting before writing to LevelDb
struct DictItem {
std::string value;
double weight;
int order;
};

void SuperFilter::RebuildDb() {
if (db_) {
auto accessor = db_->Query("");
if (accessor) {
std::string key, value;
while (!accessor->exhausted()) {
if (accessor->GetNextRecord(&key, &value)) {
db_->Erase(key);
}
}
}
}

std::string user_dir = string(rime_get_api()->get_user_data_dir());
for (const auto& rule : rules_) {
    // Build a temporary in-memory map to aggregate keys across multiple lines/files
    std::unordered_map<std::string, std::vector<DictItem>> merged_data;
    int line_counter = 0;

    for (const auto& path : rule.files) {
        std::string full_path = user_dir + "/" + path;
        std::ifstream file(full_path);
        if (!file.is_open()) {
            LOG(WARNING) << "super_filter: Could not open file (missing or locked): " << full_path;
            continue;
        }
        
        std::string line;
        while (std::getline(file, line)) {
            if (line.empty() || line[0] == '#') continue;

            size_t sep1 = line.find_first_of(" \t");
            if (sep1 != std::string::npos) {
                std::string key = line.substr(0, sep1);
                std::string orig_key = key;

                static const char t9_map[26] = {
                    '2','2','2', '3','3','3', '4','4','4', '5','5','5', '6','6','6', 
                    '7','7','7','7', '8','8','8', '9','9','9','9'
                };

                if (rule.t9_mode) {
                    for (char& c : key) {
                        if (c >= 'a' && c <= 'z') c = t9_map[c - 'a'];
                        else if (c >= 'A' && c <= 'Z') c = t9_map[c - 'A'];
                    }
                }

                size_t val_start = line.find_first_not_of(" \t", sep1);
                if (val_start != std::string::npos) {
                    std::string rest = line.substr(val_start);
                    rest.erase(rest.find_last_not_of("\r\n \t") + 1); // trim right

                    std::string val = rest;
                    double weight = 0.0;

                    // Try to extract weight from a potential 3rd column
                    size_t last_delim = rest.find_last_of(" \t");
                    if (last_delim != std::string::npos) {
                        size_t weight_start = rest.find_first_not_of(" \t", last_delim);
                        if (weight_start != std::string::npos) {
                            std::string weight_str = rest.substr(weight_start);
                            try {
                                size_t parsed_len;
                                weight = std::stod(weight_str, &parsed_len);
                                // Accept the token as a weight only if it parses as a
                                // number in its entirety
                                if (parsed_len == weight_str.length()) {
                                    val = rest.substr(0, last_delim);
                                    val.erase(val.find_last_not_of(" \t") + 1);
                                } else {
                                    weight = 0.0;
                                }
                            } catch (...) {
                                weight = 0.0;
                            }
                        }
                    }

                    // Append the original key after weight extraction, so the
                    // "==orig_key" suffix is not discarded along with the 3rd column.
                    if (rule.t9_mode && val.find("==") == std::string::npos) {
                        val = val + "==" + orig_key;
                    }
                    
                    // Push into the map (grouped by prefix + key)
                    merged_data[rule.prefix + key].push_back({val, weight, line_counter++});
                }
            }
        }
    }

    // Sort items by weight and merge them into a single string for DB insertion
    for (auto& kv : merged_data) {
        auto& items = kv.second;
        
        // Sort logic: Descending by weight, Ascending by original read order
        std::sort(items.begin(), items.end(), [](const DictItem& a, const DictItem& b) {
            if (a.weight != b.weight) return a.weight > b.weight;
            return a.order < b.order;
        });

        std::string final_val;
        for (size_t i = 0; i < items.size(); ++i) {
            final_val += items[i].value;
            if (i < items.size() - 1) final_val += delimiter_;
        }

        db_->Update(kv.first, final_val);
    }
}

}

an<Translation> SuperFilter::Apply(
    an<Translation> translation,
    CandidateList* candidates) {

if (!translation) return nullptr;
Context* ctx = engine_->context();

if (!ctx->IsComposing() || ctx->input().empty()) {
    return translation;
}

return New<SuperFilterTranslation>(translation, rules_, db_, ctx, delimiter_, comment_format_, chain_);

}

} // namespace rime

// librime/src/rime/gear/super_filter.h
#ifndef RIME_SUPER_FILTER_H_
#define RIME_SUPER_FILTER_H_

#include <string>
#include <vector>
#include <deque>
#include <unordered_set>
#include <rime/common.h>
#include <rime/component.h>
#include <rime/filter.h>
#include <rime/translation.h>
#include <rime/dict/db.h>
#include <rime/context.h>
#include <rime/config.h>

namespace rime {

// Representation of a single filter rule configured in YAML.
struct SuperRule {
std::string name;
bool always_on = false;
std::vector<std::string> options;
std::vector<std::string> tags;

std::string mode;         // Supported modes: append, replace, comment, abbrev
bool sentence = false;    // Enable FMM (Forward Maximum Matching) for long phrases

std::string prefix;
std::vector<std::string> files;
bool t9_mode = false;

std::string cand_type = "derived";
std::string comment_mode; // Supported modes: none, text, append
std::string order_type = "index"; // 'index' (absolute position) or 'quality' (score threshold)
int order_value = 1;
int always_qty = 1;

};

// Wrapper for candidates that require forced injection at specific positions or quality thresholds.
struct InjectCand {
an<Candidate> cand;
int value;
};

// The core translation class implementing lazy evaluation and stream processing.
class SuperFilterTranslation : public Translation {
public:
SuperFilterTranslation(an<Translation> inner,
const std::vector<SuperRule>& rules,
an<Db> db,
Context* ctx,
const std::string& delimiter,
const std::string& comment_format,
bool is_chain);

an<Candidate> Peek() override;
bool Next() override;

private:
void GenerateAbbrevCandidates(const std::string& input_code, size_t start, size_t end);
void ProcessNextInner();
void UpdateExhausted();

std::string SegmentConvert(const std::string& text, const std::string& prefix, bool sentence);

an<Translation> inner_;
std::vector<SuperRule> rules_;
an<Db> db_;
Context* ctx_;
std::string delimiter_;
std::string comment_format_;
bool is_chain_;

int yield_count_ = 0;

// Priority queues for candidate distribution
std::deque<InjectCand> index_cands_;
std::deque<InjectCand> quality_cands_;
std::deque<an<Candidate>> lazy_cands_;
std::deque<an<Candidate>> pending_candidates_;
std::unordered_set<std::string> abbrev_yielded_;

};

// Filter component responsible for parsing configurations and managing the global LevelDb connection.
class SuperFilter : public Filter {
public:
explicit SuperFilter(const Ticket& ticket);
virtual ~SuperFilter();

an<Translation> Apply(an<Translation> translation,
                      CandidateList* candidates) override;

private:
void LoadConfig(Config* config);
void InitializeDb();
std::string GenerateFilesSignature();
void RebuildDb();

std::vector<SuperRule> rules_;
an<Db> db_;
std::string db_name_;
std::string delimiter_;
std::string comment_format_;
bool chain_ = false;

};

} // namespace rime

#endif // RIME_SUPER_FILTER_H_

</details>

@amzxyz (Author) commented Mar 10, 2026

@lotem The new structure is done: it has been broken up as requested, consistent with how Rime is normally used, and is now packaged with a simple bin compilation approach instead of userdb.
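For reference, here is a minimal sketch of how such a flattened, per-namespace configuration might look. The key names are taken from the `SuperRule` fields in `super_filter.h` and from the PR description; this is illustrative, not the authoritative merged syntax.

```yaml
# Hypothetical schema fragment; field names follow SuperRule in super_filter.h.
engine:
  filters:
    - super_filter@emoji        # one namespace per task, mounted in sequence
    - super_filter@s2t

emoji:
  mode: append                  # keep the original candidate, append new ones
  options: [ emoji ]            # bound to a Rime option switch
  files: [ emoji.txt ]          # k\tv1|v2|v3 source, compiled to a single .bin

s2t:
  mode: replace                 # simplified-to-traditional conversion
  sentence: true                # FMM segmentation for long phrases
  always_on: true
```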
