@@ -0,0 +1,477 @@
+# Epic 4: Term Extraction and Replacement (P0 Priority)
+
+**Priority**: **P0** (Phase 0 validation confirmed that the glossary is critical to translation quality)
+**Estimate**: 26 story points (Phase 1 scope)
+**Dependencies**: None
+
+---
+
+## Epic Goal
+
+Implement a glossary feature that keeps character names and proper terms consistent throughout translation, so that the translated output is usable.
+
+---
+
+## Why P0?
+
+**Findings from the Phase 0 technical validation**:
+
+| Scenario | Source | Without glossary | With glossary |
+|-----|------|---------|---------|
+| Character name | 林风 | Lin wind ❌ | Lin Feng ✅ |
+| Proper noun | BMAD | BMAd ❌ | BMAD ✅ |
+| Skill name | 火球术 | fire ball ❌ | Fireball ✅ |
+
+**Conclusion**: Without the glossary feature, the translated content is **unusable**. The glossary is a core feature for ensuring translation quality.
+
+---
+
+## User Value
+
+**As a** translation user,
+**I want** to define and use a glossary,
+**So that** character names and proper terms stay consistent in the translated content.
+
+---
+
+## Tech Stack
+
+- **Data structure**: `Dict[str, str]` (term → translation)
+- **Matching algorithm**: longest match (terms sorted by length, descending)
+- **Placeholder**: `__en__` prefix marker
+- **Test framework**: `pytest==7.4.0`
+
+---
+
+## Phase 1 Story List (Core Features)
+
+### Story 4.1: Design the Glossary Data Structure
+
+**Estimate**: 4 SP
+
+**Description**: Design the glossary data structure so that it can store terms and their translations.
+
+**Acceptance criteria**:
+
+```python
+from typing import Dict, List, Optional
+from dataclasses import dataclass
+
+@dataclass
+class GlossaryEntry:
+    """A glossary entry."""
+    source: str  # Source term, e.g. "林风"
+    target: str  # Target translation, e.g. "Lin Feng"
+    category: str  # Term type: CHARACTER, SKILL, LOCATION, ITEM, OTHER
+    context: str = ""  # Context notes
+
+class Glossary:
+    """The glossary."""
+
+    def __init__(self):
+        self._terms: Dict[str, GlossaryEntry] = {}
+
+    def add(self, entry: GlossaryEntry) -> None:
+        """Add a term."""
+        pass
+
+    def get(self, source: str) -> Optional[GlossaryEntry]:
+        """Look up the entry for a source term."""
+        pass
+
+    def remove(self, source: str) -> bool:
+        """Remove a term."""
+        pass
+
+    def get_all(self) -> List[GlossaryEntry]:
+        """Return all terms."""
+        pass
+
+    def sort_by_length_desc(self) -> List[str]:
+        """Return the terms sorted by length, descending (used for matching)."""
+        pass
+```
+
+**Technical tasks**:
+1. Create `src/glossary/models.py`
+2. Define the `GlossaryEntry` dataclass
+3. Implement the `Glossary` class
+4. Write unit tests
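A minimal sketch of how the `Glossary` skeleton above could be filled in. Two behaviours are assumptions rather than requirements stated in this story: `add` overwrites an existing entry with the same source term, and `sort_by_length_desc` breaks length ties alphabetically so the order is deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GlossaryEntry:
    source: str
    target: str
    category: str
    context: str = ""


class Glossary:
    def __init__(self) -> None:
        self._terms: Dict[str, GlossaryEntry] = {}

    def add(self, entry: GlossaryEntry) -> None:
        # Assumption: adding the same source term again overwrites the entry.
        self._terms[entry.source] = entry

    def get(self, source: str) -> Optional[GlossaryEntry]:
        return self._terms.get(source)

    def remove(self, source: str) -> bool:
        # True only if the term was present and has been removed.
        return self._terms.pop(source, None) is not None

    def get_all(self) -> List[GlossaryEntry]:
        return list(self._terms.values())

    def sort_by_length_desc(self) -> List[str]:
        # Longest terms first; ties broken alphabetically (assumption).
        return sorted(self._terms, key=lambda term: (-len(term), term))


glossary = Glossary()
glossary.add(GlossaryEntry("魔法", "Magic", "OTHER"))
glossary.add(GlossaryEntry("魔法师", "Mage", "CHARACTER"))
print(glossary.sort_by_length_desc())  # ['魔法师', '魔法']
```

Keying the dictionary by `source` keeps `get` and `remove` at constant time; the sort only runs when a matcher is built.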
+
+---
+
+### Story 4.2: Implement the Term Matching Engine
+
+**Estimate**: 6 SP
+
+**Description**: Implement a longest-match algorithm so that longer terms are matched first (preventing "魔法" from overriding "魔法师").
+
+**Acceptance criteria**:
+
+```python
+from dataclasses import dataclass
+from typing import Dict, List, Tuple
+
+from .models import Glossary  # defined in src/glossary/models.py (Story 4.1)
+
+@dataclass
+class TermMatch:
+    """A term match result."""
+    source: str  # Source term
+    target: str  # Target translation
+    start: int  # Start position in the text
+    end: int  # End position in the text
+    placeholder: str  # Placeholder
+
+class GlossaryMatcher:
+    """Term matching engine."""
+
+    def __init__(self, glossary: Glossary):
+        self.glossary = glossary
+        # Sort by length, descending, so that longer terms are matched first
+        self._sorted_terms = glossary.sort_by_length_desc()
+
+    def find_matches(self, text: str) -> List[TermMatch]:
+        """Find all term matches in the text."""
+        pass
+
+    def replace_with_placeholder(self, text: str) -> Tuple[str, Dict[str, str]]:
+        """Replace terms with placeholders.
+
+        Returns: (text after replacement, placeholder mapping)
+        Placeholder format: __en__林风
+        """
+        pass
+
+    def restore_from_placeholder(self, text: str, mapping: Dict[str, str]) -> str:
+        """Replace placeholders with the term translations."""
+        pass
+```
+
+**Matching rules**:
+1. Match in descending order of term length (longer terms first)
+2. Matches do not overlap (positions that are already matched are not matched again)
+3. Matching is case-sensitive
+4. Multi-word terms are supported (e.g. "火球术", "三阶魔法师")
+
+**Example**:
+```python
+# Input
+text = "林风释放了火球术"
+glossary = {
+    "林风": "Lin Feng",
+    "火球术": "Fireball"
+}
+
+# Output
+processed = "__en__林风释放了__en__火球术"
+mapping = {
+    "__en__林风": "Lin Feng",
+    "__en__火球术": "Fireball"
+}
+```
+
+**Technical tasks**:
+1. Create `src/glossary/matcher.py`
+2. Implement the longest-match algorithm
+3. Implement placeholder replacement
+4. Write unit tests
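One way matching rules 1 and 2 could be realised is to walk the terms from longest to shortest and mark every character position a match has claimed, so shorter terms cannot match inside a longer one. The sketch below works on a plain `dict` rather than the `Glossary` class to stay self-contained; the free functions and the `PREFIX` constant are illustrative assumptions, not part of the specified API.

```python
from typing import Dict, List, Tuple

PREFIX = "__en__"  # placeholder prefix from the tech stack section


def find_matches(text: str, terms: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return non-overlapping (start, end, term) spans, longest terms first."""
    taken = [False] * len(text)  # positions already claimed by a match
    spans: List[Tuple[int, int, str]] = []
    for term in sorted(terms, key=lambda t: (-len(t), t)):
        if not term:
            continue  # an empty term would match everywhere
        start = text.find(term)  # str.find is case-sensitive (rule 3)
        while start != -1:
            end = start + len(term)
            if not any(taken[start:end]):
                taken[start:end] = [True] * len(term)
                spans.append((start, end, term))
            start = text.find(term, start + 1)
    return sorted(spans)  # back in reading order


def replace_with_placeholder(
    text: str, terms: Dict[str, str]
) -> Tuple[str, Dict[str, str]]:
    """Rebuild the text with each matched span replaced by PREFIX + term."""
    pieces: List[str] = []
    mapping: Dict[str, str] = {}
    cursor = 0
    for start, end, term in find_matches(text, terms):
        placeholder = PREFIX + term
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        mapping[placeholder] = terms[term]
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


processed, mapping = replace_with_placeholder(
    "林风释放了火球术", {"林风": "Lin Feng", "火球术": "Fireball"}
)
print(processed)  # __en__林风释放了__en__火球术
```

Working from recorded positions instead of chained `str.replace` calls matters here: a second `replace` pass for "魔法" would otherwise also rewrite the "魔法" inside an already inserted `__en__魔法师`.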
+
+---
+
+### Story 4.3: Implement the Term Preprocessing Pipeline
+
+**Estimate**: 5 SP
+
+**Description**: Process the text before translation, replacing terms with placeholders.
+
+**Acceptance criteria**:
+
+```python
+from dataclasses import dataclass
+from typing import Dict, List
+
+from .matcher import GlossaryMatcher  # Story 4.2
+from .models import Glossary  # Story 4.1
+
+@dataclass
+class PreprocessingResult:
+    """Preprocessing result."""
+    processed_text: str
+    placeholder_map: Dict[str, str]
+    terms_found: Dict[str, int]  # term → number of occurrences
+    retention_rate: float  # retention rate, as a percentage
+
+class GlossaryPreprocessor:
+    """Term preprocessing pipeline."""
+
+    def __init__(self, glossary: Glossary):
+        self.glossary = glossary
+        self.matcher = GlossaryMatcher(glossary)
+
+    def process(self, text: str) -> PreprocessingResult:
+        """Process the text, replacing terms with placeholders.
+
+        The result contains:
+        - processed_text: the processed text
+        - placeholder_map: the placeholder mapping
+        - terms_found: term statistics
+        - retention_rate: the term retention rate
+        """
+        pass
+
+    def process_batch(self, texts: List[str]) -> List[PreprocessingResult]:
+        """Process a batch of texts."""
+        pass
+
+    def calculate_retention_rate(self, original: str, processed: str) -> float:
+        """Calculate the term retention rate."""
+        pass
+```
+
+**Processing flow**:
+1. Load the glossary
+2. Initialize the matching engine
+3. Find all term matches
+4. Replace them with placeholders (`__en__` prefix)
+5. Build the placeholder mapping
+6. Calculate the retention rate
+
+**Technical tasks**:
+1. Create `src/glossary/preprocessor.py`
+2. Implement the preprocessing pipeline
+3. Implement batch processing
+4. Implement the retention rate calculation
+5. Write unit tests
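The processing flow above could be sketched as follows. This story does not define the retention rate, so the formula here is an assumption: placeholders present in the processed text divided by term occurrences in the original, as a percentage. For brevity the sketch matches with a regex alternation ordered longest first, which picks the leftmost match and is a simplification of the length-priority rules in Story 4.2; the free functions and the extra `terms` parameter are also illustrative.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

PREFIX = "__en__"


@dataclass
class PreprocessingResult:
    processed_text: str
    placeholder_map: Dict[str, str]
    terms_found: Dict[str, int]
    retention_rate: float


def build_pattern(terms: Dict[str, str]) -> Optional[re.Pattern]:
    # Longest alternatives first, so "魔法师" wins over "魔法" at one position.
    ordered = sorted((t for t in terms if t), key=lambda t: (-len(t), t))
    if not ordered:
        return None
    return re.compile("|".join(re.escape(t) for t in ordered))


def calculate_retention_rate(
    original: str, processed: str, terms: Dict[str, str]
) -> float:
    # ASSUMED definition: placeholders in `processed` / term hits in `original`.
    pattern = build_pattern(terms)
    expected = len(pattern.findall(original)) if pattern else 0
    if expected == 0:
        return 100.0  # assumption: nothing to retain counts as fully retained
    return 100.0 * processed.count(PREFIX) / expected


def process(text: str, terms: Dict[str, str]) -> PreprocessingResult:
    pattern = build_pattern(terms)
    if pattern is None:
        return PreprocessingResult(text, {}, {}, 100.0)
    counts: Counter = Counter()

    def substitute(match: re.Match) -> str:
        counts[match.group(0)] += 1  # per-term occurrence statistics
        return PREFIX + match.group(0)

    processed = pattern.sub(substitute, text)
    mapping = {PREFIX + term: terms[term] for term in counts}
    rate = calculate_retention_rate(text, processed, terms)
    return PreprocessingResult(processed, mapping, dict(counts), rate)


def process_batch(texts: List[str], terms: Dict[str, str]) -> List[PreprocessingResult]:
    return [process(text, terms) for text in texts]


result = process("林风释放了火球术,林风赢了", {"林风": "Lin Feng", "火球术": "Fireball"})
print(result.terms_found)
```

Under this assumed definition the rate is always 100% immediately after preprocessing; it becomes informative when the same function is applied to translator output, where placeholders may have been dropped or altered.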
+
+---
+
+### Story 4.4: Implement the Postprocessing Module
+
+**Estimate**: 6 SP
+
+**Description**: Postprocess the translated text: strip the `__en__` prefix and restore the term translations.
+
+**Acceptance criteria**:
+
+```python
+from dataclasses import dataclass
+from typing import Dict, List
+
+@dataclass
+class ValidationResult:
+    """Validation result."""
+    is_valid: bool
+    missing_terms: List[str]  # Terms that are missing
+    extra_placeholders: List[str]  # Placeholders that were not replaced
+
+class GlossaryPostprocessor:
+    """Term postprocessing module."""
+
+    def __init__(self):
+        pass
+
+    def process(self, translated_text: str, placeholder_map: Dict[str, str]) -> str:
+        """Process the translated text.
+
+        Steps:
+        1. Find all placeholders with the __en__ prefix
+        2. Look up their translations in the mapping
+        3. Replace the placeholders with the translations
+        4. Fix any punctuation problems
+        """
+        pass
+
+    def fix_punctuation(self, text: str) -> str:
+        """Fix punctuation.
+
+        Handles punctuation problems that translation may introduce:
+        - __en__林风 . → Lin Feng. (remove the extra space)
+        - __en__林风, → Lin Feng, (convert Chinese punctuation)
+        """
+        pass
+
+    def validate_translation(self, original: str, translated: str,
+                             placeholder_map: Dict[str, str]) -> ValidationResult:
+        """Validate that the translation is complete.
+
+        Checks:
+        - Every placeholder has been replaced
+        - The translation contains every term
+        - No term has been dropped
+        """
+        pass
+```
+
+**Processing flow**:
+1. Find all `__en__` prefixes
+2. Look up the translations in the mapping
+3. Replace the placeholders
+4. Fix punctuation problems
+5. Validate completeness
+
+**Technical tasks**:
+1. Create `src/glossary/postprocessor.py`
+2. Implement placeholder restoration
+3. Implement punctuation fixing
+4. Implement translation validation
+5. Write unit tests
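A minimal sketch of the postprocessing flow above, written as free functions for brevity. Three details are assumptions: the full-width to ASCII punctuation table, reporting missing terms by their target translation, and treating any remaining run of non-space characters after `__en__` as a leftover placeholder.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

PREFIX = "__en__"
# Assumed table of full-width punctuation and its ASCII counterpart.
PUNCT = str.maketrans({",": ",", "。": ".", "!": "!", "?": "?", ":": ":", ";": ";"})


@dataclass
class ValidationResult:
    is_valid: bool
    missing_terms: List[str]
    extra_placeholders: List[str]


def restore(text: str, mapping: Dict[str, str]) -> str:
    # Longest placeholder first, so "__en__魔法" cannot clobber "__en__魔法师".
    for placeholder in sorted(mapping, key=len, reverse=True):
        text = text.replace(placeholder, mapping[placeholder])
    return text


def fix_punctuation(text: str) -> str:
    text = text.translate(PUNCT)  # full-width → ASCII punctuation
    return re.sub(r"\s+([,.!?:;])", r"\1", text)  # drop space before punctuation


def postprocess(translated: str, mapping: Dict[str, str]) -> str:
    return fix_punctuation(restore(translated, mapping))


def validate(final_text: str, mapping: Dict[str, str]) -> ValidationResult:
    # Target translations that never made it into the final text.
    missing = [target for target in mapping.values() if target not in final_text]
    # Anything still carrying the prefix was not restored.
    leftover = re.findall(re.escape(PREFIX) + r"\S+", final_text)
    return ValidationResult(not missing and not leftover, missing, leftover)


mapping = {"__en__林风": "Lin Feng", "__en__火球术": "Fireball"}
final = postprocess("__en__林风 released __en__火球术 .", mapping)
print(final)  # Lin Feng released Fireball.
print(validate(final, mapping).is_valid)  # True
```

Because the placeholder has a prefix but no terminator, restoration order matters whenever one term is a prefix of another, which is why `restore` sorts by placeholder length.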
+
+---
+
+### Story 4.6: Unit Tests + Integration Tests
+
+**Estimate**: 5 SP
+
+**Description**: Full test coverage, including unit tests and end-to-end integration tests.
+
+**Acceptance criteria**:
+
+- Code coverage >= 90%
+- Tests for all boundary conditions
+- End-to-end integration tests
+
+**Test cases**:
+
+```python
+# Import paths follow the File Structure section; adjust if the package root differs.
+from src.glossary.models import Glossary, GlossaryEntry
+from src.glossary.preprocessor import GlossaryPreprocessor
+from src.glossary.postprocessor import GlossaryPostprocessor
+
+class TestGlossary:
+    def test_add_and_retrieve_term(self):
+        pass
+
+    def test_remove_term(self):
+        pass
+
+    def test_sort_by_length_desc(self):
+        """Longer terms must come first."""
+        pass
+
+class TestGlossaryMatcher:
+    def test_find_single_term(self):
+        pass
+
+    def test_longest_term_priority(self):
+        """Longer terms must be matched first."""
+        text = "魔法师使用了魔法"
+        glossary = {"魔法": "Magic", "魔法师": "Mage"}
+        # The leading "魔法师" should match as a whole term, not as "魔法"
+        pass
+
+    def test_non_overlapping_matches(self):
+        pass
+
+    def test_placeholder_generation(self):
+        pass
+
+class TestGlossaryPreprocessor:
+    def test_process_text_with_terms(self):
+        pass
+
+    def test_retention_rate_calculation(self):
+        pass
+
+    def test_batch_processing(self):
+        pass
+
+class TestGlossaryPostprocessor:
+    def test_restore_from_placeholder(self):
+        pass
+
+    def test_fix_punctuation(self):
+        pass
+
+    def test_validate_translation_success(self):
+        pass
+
+    def test_validate_translation_missing_terms(self):
+        pass
+
+class TestGlossaryIntegration:
+    """End-to-end integration tests."""
+
+    def test_full_pipeline(self):
+        """Test the full pipeline."""
+        # 1. Create the glossary
+        # 2. Preprocess the text
+        # 3. Simulate translation
+        # 4. Postprocess the text
+        # 5. Verify the result
+        original = "林风释放了火球术"
+        glossary = Glossary()
+        glossary.add(GlossaryEntry("林风", "Lin Feng", "CHARACTER"))
+        glossary.add(GlossaryEntry("火球术", "Fireball", "SKILL"))
+
+        preprocessor = GlossaryPreprocessor(glossary)
+        result = preprocessor.process(original)
+
+        # Simulated translation (placeholders preserved)
+        mock_translated = "__en__林风 released __en__火球术"
+
+        postprocessor = GlossaryPostprocessor()
+        final = postprocessor.process(mock_translated, result.placeholder_map)
+
+        assert final == "Lin Feng released Fireball"
+
+    def test_phase_0_validation_scenario(self):
+        """Test the Phase 0 validation scenario."""
+        # Without glossary: "林风" → "Lin wind"
+        # With glossary: "林风" → "Lin Feng"
+        pass
+```
+
+**Technical tasks**:
+1. Create `tests/test_glossary.py`
+2. Implement all unit tests
+3. Implement the integration tests
+4. Run the coverage report
+5. Ensure coverage >= 90%
+
+---
+
+## Phase 2 Story (Deferred)
+
+### Story 4.5: Implement Context Annotation
+
+**Estimate**: 5 SP
+**Status**: Deferred to Phase 2
+
+**Description**: Annotate terms with context to help users choose an appropriate translation.
+
+---
+
+## File Structure
+
+```
+src/
+└── glossary/
+    ├── __init__.py
+    ├── models.py           # GlossaryEntry and Glossary classes
+    ├── matcher.py          # GlossaryMatcher class
+    ├── preprocessor.py     # GlossaryPreprocessor class
+    └── postprocessor.py    # GlossaryPostprocessor class
+
+tests/
+└── test_glossary.py        # All glossary tests
+```
+
+---
+
+## Phase 0 Validation Data
+
+| Test scenario | Source | Result without glossary | Result with glossary |
+|---------|------|------------|------------|
+| Character name translation | 林风 | Lin wind ❌ | Lin Feng ✅ |
+| Product name | BMAD | BMAd ❌ | BMAD ✅ |
+| Skill name | 火球术 | fire ball ❌ | Fireball ✅ |
+| Retention rate test | 14 terms | 0% | 93.4% ✅ |
+
+**Conclusion**: The glossary feature is **required**; without it the translated content is unusable.
+
+---
+
+## Dependencies
+
+- Epic 4 has no external dependencies and can be developed independently
+- Epic 5 (translation module) will use Epic 4's preprocessing and postprocessing
+- Can be developed partly in parallel with Epic 1
+
+---
+
+## Definition of Done
+
+- [x] All 5 core Stories completed
+- [x] Unit test coverage >= 90%
+- [x] Integration tests pass
+- [x] Phase 0 validation scenario tests pass
+- [x] Code review passed
+
+---
+
+## Next Steps
+
+Once the Epic 4 core features are complete, integrate with Epic 1 and begin end-to-end testing.