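---
stepsCompleted: [1, 2, 3, 4, 5]
inputDocuments: ['prd.md']
workflowType: 'architecture'
project_name: '223-236-template-6'
user_name: 'User'
date: '2026-03-13'
status: 'complete'
version: '1.0'
---

# Architecture Decision Document

_This document builds collaboratively through step-by-step discovery. Sections are appended as we work through each architectural decision together._

## Project Context Analysis

### Requirements Overview

**Functional Requirements:**

The project has 52 functional requirements, grouped into six core modules plus a system-integration group:

1. **Fingerprint module (FR1-FR8)**: chapter fingerprint duplicate detection, with batch checking and manual review
2. **Cleaning module (FR9-FR16)**: regex replacement rule engine, format normalization, HTML/Markdown handling
3. **Terminology module (FR17-FR24)**: term library management, smart extraction, lock markers (§Ti§)
4. **Translation module (FR25-FR33)**: M2M100 model inference, GPU/CPU adaptation, batch optimization
5. **Upload module (FR34-FR40)**: platform API integration, retry on failure, CU billing
6. **Task scheduler (FR41-FR47)**: pipeline orchestration, concurrency control, resume from checkpoint
7. **System integration (FR48-FR52)**: configuration management, logging, version detection

**Non-Functional Requirements:**

| Category | Key requirement | Architectural impact |
|----------|-----------------|----------------------|
| Performance | 3000-5000 words/minute (RTX 3060) | Requires batch optimization and GPU memory management |
| Reliability | Crash-Safe atomic writes | Every persistence operation must use the .tmp + fsync + rename pattern |
| Security | Zero data leakage | All processing stays local; no data leaves the machine |
| Compatibility | NVIDIA GTX 1650+ (4GB+ VRAM) | Requires a graceful GPU fallback strategy |
| Licensing | No license fees for dependencies | Every dependency must be standard library or MIT-licensed |

**Scale & Complexity:**

- Primary domain: Desktop Application + AI Inference
- Complexity level: Medium
- Estimated architectural components: 7 major components

### Technical Constraints & Dependencies

**Hard constraints:**

- Must use MIT-licensed libraries (no GPL contamination)
- Must process 100% locally (no cloud API calls)
- Must support Crash-Safe atomic writes

**External dependencies:**

- CTranslate2 (MIT): model inference engine
- facebook/m2m100_418M: translation model
- PyQt6: GUI framework
- PyTorch (CUDA): GPU acceleration

**Integration interfaces:**

- Fingerprint check API: POST /api/fingerprint/check
- Platform upload API: chapter submission endpoint
- CU billing API: charged by word count

### Cross-Cutting Concerns Identified

1. **Crash-Safe persistence**: affects every write (progress, cleaning results, translation results)
2. **GPU resource management**: the translation module holds the GPU exclusively, so concurrency with other modules must be coordinated
3. **Terminology consistency**: the term-lock mechanism must carry through the cleaning → translation flow
4. **Progress visibility**: progress for the six stages must be shown in one unified view
5. **Error recovery**: failure handling and resume from checkpoint for every module
6. **License compliance**: the license of every new dependency must be verified

## Starter Template Evaluation

### Primary Technology Domain

**Python Desktop Application** (PyQt6 + CTranslate2 GPU Inference)

Based on the requirements analysis, this is a local desktop application that needs:

- GUI framework: PyQt6
- AI inference engine: CTranslate2 (GPU-accelerated)
- System integration: file I/O, network API calls

### Starter Options Considered

Python desktop applications have no unified "starter template" ecosystem, so we evaluated the following options:

| Option | Pros | Cons | Fit |
|--------|------|------|-----|
| **Build from scratch** | Full control, no technical debt | Every tool must be configured by hand | ✅ Recommended: many project-specific requirements |
| **Python Boilerplate** | Standard structure, includes testing and code quality | Optimized for web/server projects | ⚠️ Partial fit |
| **Cookiecutter template** | Fast start, best practices | Needs customization | ⚠️ Partial fit |

### Selected Approach: Custom Project Structure (based on 2025 best practices)

**Rationale for Selection:**

The project has constraints that standard templates do not cover:

1. **Crash-Safe atomic writes**: must be implemented at every persistence point
2. **GPU resource management**: CTranslate2 needs specific configuration
3. **No-license-fee constraint**: the license of every dependency must be strictly verified
4. **Six-module pipeline architecture**: requires a specific module breakdown

**Project initialization commands:**

```bash
# 1. Create the project directory structure
mkdir -p xling-matrix-assistant/src/xling_matrix/{core,modules,ui,infrastructure}
mkdir -p xling-matrix-assistant/tests/{unit,integration}
mkdir -p xling-matrix-assistant/{data,models,logs,docs}

# 2. Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install core dependencies
pip install PyQt6 ctranslate2 torch numpy requests pyyaml

# 4. Install development tools
pip install pytest pytest-qt pytest-cov black ruff mypy
```

**Architectural Decisions Established:**

**Language & Runtime:**

- Python 3.11+ (type annotation support, performance improvements)
- Type checking: mypy (strict mode)
- Code formatting: black
- Linting: ruff

**Project structure (src layout):**

```
xling-matrix-assistant/
├── src/
│   └── xling_matrix/
│       ├── __init__.py
│       ├── __main__.py          # Application entry point
│       ├── core/                # Core domain models
│       │   ├── __init__.py
│       │   ├── models.py        # Data models
│       │   ├── state.py         # State machine
│       │   └── pipeline.py      # Pipeline orchestration
│       ├── modules/             # Core business modules
│       │   ├── __init__.py
│       │   ├── fingerprint/     # FR1-FR8
│       │   ├── cleaning/        # FR9-FR16
│       │   ├── terminology/     # FR17-FR24
│       │   ├── translation/     # FR25-FR33
│       │   └── upload/          # FR34-FR40
│       ├── ui/                  # PyQt6 GUI
│       │   ├── __init__.py
│       │   ├── main_window.py
│       │   ├── widgets/         # Custom widgets
│       │   └── dialogs/         # Dialogs
│       └── infrastructure/      # Infrastructure layer
│           ├── __init__.py
│           ├── storage.py       # Crash-Safe persistence
│           ├── gpu_manager.py   # GPU resource management
│           ├── api_client.py    # External API client
│           └── logger.py        # Logging
├── tests/
│   ├── unit/
│   └── integration/
├── models/                      # Translation model storage
├── data/                        # User data directory
├── logs/                        # Log directory
├── pyproject.toml               # Project and packaging configuration
└── README.md
```

**Build Tooling & Packaging:**

```toml
# pyproject.toml
[project]
name = "xling-matrix-assistant"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "PyQt6>=6.6.0",
    "ctranslate2>=4.0.0",
    "torch>=2.1.0",
    "numpy>=1.24.0",
    "requests>=2.31.0",
    "pyyaml>=6.0.0",
]

[project.optional-dependencies]
dev = ["pytest>=7.4.0", "pytest-qt>=4.2.0", "pytest-cov>=4.1.0", "black>=23.12.0", "ruff>=0.1.0", "mypy>=1.7.0"]

[project.scripts]
xling-matrix = "xling_matrix.__main__:main"

[tool.black]
line-length = 100
target-version = ["py311"]

[tool.ruff]
line-length = 100
select = ["E", "F", "I", "N", "W"]

[tool.mypy]
python_version = "3.11"
strict = true
```

**Testing Framework:**

- pytest (unit tests)
- pytest-qt (PyQt6 test utilities)
- pytest-cov (coverage reports)

**Development Experience:**

- Isolated virtual environment
- Type checking (mypy strict)
- Hot reload (development mode)
- Debug configurations (VS Code / PyCharm)

**GPU Inference Configuration (CTranslate2):**

```python
# Recommended configuration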
import ctranslate2

translator = ctranslate2.Translator(
    "models/m2m100_418m_ct2/",
    device="cuda",           # GPU acceleration
    device_index=0,          # Primary GPU
    compute_type="float16",  # Tensor Core optimization
    inter_threads=4,         # Concurrent batch processing
)

# Batch optimization
batch_size = 16  # Tune to available VRAM (RTX 3060: 16-32)
```

**Note:** Project initialization should be executed as the first implementation story.

## Core Architectural Decisions

### Decision Priority Analysis

**Critical Decisions (Block Implementation):**

1. **Crash-Safe atomic writes**: use the .tmp + fsync + rename pattern; every persistence operation must follow it
2. **Data file format**: JSON for progress, cleaning results, translation results, and the term library
3. **GPU inference configuration**: CTranslate2 + float16 + batch optimization
4. **Six-module pipeline architecture**: Fingerprint → Cleaning → Terminology → Translation → Upload stages, orchestrated by the task scheduler

**Important Decisions (Shape Architecture):**

1. **PyQt6 Model/View architecture**: separate Qt models from views so that data changes drive UI updates
2. **Repository pattern**: abstract the persistence layer and apply the Crash-Safe mechanism in one place
3. **Observer pattern**: progress event notifications that decouple business logic from the UI
4. **Packaging strategy**: bundle as an executable with PyInstaller

**Deferred Decisions (Post-MVP):**

1. **Auto-update mechanism**: Growth-phase feature, using a third-party library (such as PyUpdater)
2. **Plugin system**: Vision-phase feature that allows custom extension modules
3. **Cloud sync**: Vision-phase feature, optional cloud backup

### Data Architecture

**Data storage strategy:**

| Data file | Format | Access pattern | Crash-Safe implementation |
|-----------|--------|----------------|---------------------------|
| progress.json | JSON | Frequent reads and writes | Atomic replace + locking |
| novel_cleaned.json | JSON | Written once | Atomic write |
| terms_temp.json | JSON | Frequent reads and writes | Atomic replace + locking |
| novel_translated.json | JSON | Written once | Atomic write |
| upload_failed.jsonl | JSONL | Append-only | Atomic append + locking |
| terms_library.json | JSON | Frequent reads and writes | Atomic replace + locking |

**Data validation strategy:**

- **Pydantic models**: define type constraints for the data models
- **Runtime validation**: all external input must be validated
- **Schema migration**: versioned data formats with automatic upgrades

**Crash-Safe persistence implementation:**

```python
# infrastructure/storage.py
import json
import os


class AtomicWriter:
    """Crash-Safe atomic write helper"""

    @staticmethod
    def write(filepath: str, data: dict | str) -> None:
        tmp_path = f"{filepath}.tmp"

        # Write to a temporary file next to the target
        with open(tmp_path, 'w', encoding='utf-8') as f:
            if isinstance(data, dict):
                json.dump(data, f, ensure_ascii=False, indent=2)
            else:
                f.write(data)
            f.flush()             # Push Python's buffer to the OS
            os.fsync(f.fileno())  # Force the OS to write it to disk

        # Atomic rename over the target
        os.replace(tmp_path, filepath)
```

### Authentication & Security

**Not applicable**: this is a local desktop application, so no authentication/authorization mechanism is needed.

**Data security:**

- All data is stored 100% locally
- No data is sent over the network, except uploads through the platform API
- The GPU model runs inference locally, with no cloud API calls

### API & Communication Patterns

**External API integration:**

```python
# infrastructure/api_client.py
import requests
from typing import Dict


class PlatformAPIClient:
    """Platform API client"""

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.api_key = api_key
        self.timeout = 30  # 30-second timeout

    def check_fingerprint(self, text: str) -> Dict:
        """Fingerprint check API"""
        response = requests.post(
            f"{self.base_url}/api/fingerprint/check",
            json={"text": text},
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=self.timeout
        )
        response.raise_for_status()
        return response.json()

    def upload_chapter(self, chapter_data: Dict) -> Dict:
        """Chapter upload API"""
        response = requests.post(
            f"{self.base_url}/api/chapters",
            json=chapter_data,
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=self.timeout
        )
        response.raise_for_status()
        return response.json()

    def deduct_cu(self, word_count: int) -> Dict:
        """CU billing API"""
        response = requests.post(
            f"{self.base_url}/api/cu/deduct",
            json={"words": word_count},
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=self.timeout
        )
        response.raise_for_status()
        return response.json()
```

**Retry strategy:**

- Exponential backoff
- Maximum retries: 3
- Timeout: 30 seconds

### Frontend Architecture

**PyQt6 Model/View architecture:**

```python
# ui/models/task_model.py
from PyQt6.QtCore import QAbstractTableModel, Qt


class TaskModel(QAbstractTableModel):
    """Task data model"""

    # Each task is a dict keyed by these column names
    COLUMNS = ("work_id", "status", "progress", "start_time", "end_time")

    def __init__(self):
        super().__init__()
        self._tasks = []

    def rowCount(self, parent=None):
        return len(self._tasks)

    def columnCount(self, parent=None):
        return len(self.COLUMNS)

    def data(self, index, role=Qt.ItemDataRole.DisplayRole):
        if not index.isValid() or role != Qt.ItemDataRole.DisplayRole:
            return None
        return self._tasks[index.row()].get(self.COLUMNS[index.column()])

    def _find_row(self, work_id: str) -> int | None:
        """Return the row index of the task with this work_id, if any"""
        for row, task in enumerate(self._tasks):
            if task.get("work_id") == work_id:
                return row
        return None

    def update_task(self, work_id: str, status: str, progress: int):
        """Update the task status"""
        row = self._find_row(work_id)
        if row is not None:
            self._tasks[row]['status'] = status
            self._tasks[row]['progress'] = progress
            self.dataChanged.emit(self.index(row, 0), self.index(row, 4))


# ui/main_window.py
from PyQt6.QtWidgets import QMainWindow, QTableView
from ui.models.task_model import TaskModel


class MainWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.task_model = TaskModel()
        self.task_table = QTableView()
        self.task_table.setModel(self.task_model)
```

**Progress notification mechanism (Observer pattern):**

```python
# core/events.py
from PyQt6.QtCore import QObject, pyqtSignal


class ProgressEmitter(QObject):
    """Progress event emitter"""

    stage_progress = pyqtSignal(str, int)     # (work_id, percentage)
    stage_completed = pyqtSignal(str, str)    # (work_id, stage_name)
    stage_failed = pyqtSignal(str, str, str)  # (work_id, stage_name, error)
    task_finished = pyqtSignal(str)           # (work_id)


# modules/cleaning/cleaner.py
from core.events import ProgressEmitter


class TextCleaner:
    def __init__(self, emitter: ProgressEmitter):
        self.emitter = emitter

    def clean(self, text: str, work_id: str) -> str:
        # Run the cleaning rules
        cleaned = self._apply_rules(text)

        # Emit progress notifications
        self.emitter.stage_progress.emit(work_id, 100)
        self.emitter.stage_completed.emit(work_id, "cleaning")
        return cleaned
```

### Infrastructure & Deployment

**Packaging strategy:**

```toml
# pyproject.toml
[build-system]
requires = ["setuptools>=68.0", "wheel", "pyinstaller>=6.0"]
build-backend = "setuptools.build_meta"

# Note: PyInstaller takes its options from the command line or a .spec file.
# The table below records the intended settings; a build script has to pass them on.
[tool.pyinstaller]
name = "序灵Matrix助手"
console = true
onefile = true
icon = "assets/icon.ico"
add-data = [
    ("models/*", "models/"),
    ("assets/*", "assets/")
]
hiddenimports = [
    "PyQt6.sip",
    "ctranslate2"
]
```

**Environment configuration:**

| Item | Location | Description |
|------|----------|-------------|
| Config file | `~/.config/xling-matrix/config.yaml` | API key, GPU settings |
| Data directory | `~/Documents/xling-matrix/` | Input/output files |
| Log directory | `~/Documents/xling-matrix/logs/` | Runtime logs |
| Model directory | `~/.local/share/xling-matrix/models/` | Translation models |

**Logging:**

```python
# infrastructure/logger.py
import logging
from pathlib import Path


def setup_logger(name: str, log_dir: Path) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Already configured: avoid attaching duplicate handlers
    if logger.handlers:
        return logger

    # File handler
    file_handler = logging.FileHandler(
        log_dir / f"{name}.log",
        encoding='utf-8'
    )
    file_handler.setFormatter(
        logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    )
    logger.addHandler(file_handler)

    # Console handler
    console_handler = logging.StreamHandler()
    console_handler.setFormatter(
        logging.Formatter('%(levelname)s: %(message)s')
    )
    logger.addHandler(console_handler)

    return logger
```

### Decision Impact Analysis

**Implementation Sequence:**

1. Crash-Safe persistence layer → foundation for every module
2. PyQt6 Model/View architecture → foundation for the UI layer
3. Six core modules → business logic
4. GPU inference optimization → performance
5. API integration and upload → external integration

**Cross-Component Dependencies:**

```
                    ┌─────────────┐
                    │   GUI UI    │
                    └──────┬──────┘
                           │ Observer
                    ┌──────▼──────┐
                    │  Scheduler  │
                    └──────┬──────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
  ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
  │Fingerprint│      │ Cleaning  │      │Translation│
  └─────┬─────┘      └─────┬─────┘      └─────┬─────┘
        │                  │                  │
        └──────────────────┼──────────────────┘
                           │
                  ┌────────▼────────┐
                  │  Storage Layer  │
                  │  (Crash-Safe)   │
                  └─────────────────┘
```

## Implementation Patterns & Consistency Rules

### Pattern Categories Defined

**Critical Conflict Points Identified:** 8 areas need consistency rules so that code written by different AI agents stays compatible.

### Core Design Patterns

**1. Pipeline pattern (translation pipeline)**

Every translation task must run through the unified Pipeline:

```python
# core/pipeline.py
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class PipelineContext:
    """Pipeline context"""
    work_id: str
    input_file: str
    output_dir: str
    current_stage: str
    error: str | None = None
    metadata: dict = field(default_factory=dict)


class PipelineStage(Protocol):
    """Pipeline stage protocol"""

    def name(self) -> str:
        """Return the stage name"""
        ...

    def execute(self, context: PipelineContext) -> PipelineContext:
        """Run the stage logic"""
        ...
class TranslationPipeline:
    """Translation pipeline"""

    def __init__(self):
        self.stages: list[PipelineStage] = []

    def add_stage(self, stage: PipelineStage) -> None:
        self.stages.append(stage)

    def execute(self, context: PipelineContext) -> PipelineContext:
        # Run stages in order and stop at the first failure
        for stage in self.stages:
            context.current_stage = stage.name()
            try:
                context = stage.execute(context)
                if context.error:
                    context.error = f"{stage.name()}: {context.error}"
                    return context
            except Exception as e:
                context.error = f"{stage.name()}: {str(e)}"
                return context
        return context
```

**2. State Machine (task state)**

Task state transitions must follow the state machine rules:

```python
# core/state.py
from enum import Enum
from dataclasses import dataclass


class TaskState(Enum):
    """Task state enum"""
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class TaskTransition:
    """State transition"""
    from_state: TaskState
    to_state: TaskState
    is_valid: bool
    error: str | None = None


class TaskStateMachine:
    """Task state machine"""

    # Allowed state transitions
    VALID_TRANSITIONS = {
        TaskState.PENDING: [TaskState.RUNNING],
        TaskState.RUNNING: [TaskState.PAUSED, TaskState.SUCCESS, TaskState.FAILED],
        TaskState.PAUSED: [TaskState.RUNNING, TaskState.FAILED],
        TaskState.SUCCESS: [],                  # Terminal state
        TaskState.FAILED: [TaskState.PENDING],  # Can be retried
    }

    def can_transition(self, from_state: TaskState, to_state: TaskState) -> bool:
        return to_state in self.VALID_TRANSITIONS.get(from_state, [])

    def transition(self, current: TaskState, target: TaskState) -> TaskTransition:
        if not self.can_transition(current, target):
            valid_targets = ", ".join(s.value for s in self.VALID_TRANSITIONS.get(current, []))
            return TaskTransition(current, target, False,
                f"Invalid transition: {current.value} -> {target.value}. Valid targets: {valid_targets}")
        return TaskTransition(current, target, True)
```

**3. Repository pattern (data persistence)**

All data access must go through the Repository interface:

```python
# core/repository.py
from abc import ABC, abstractmethod
from typing import TypeVar, Generic

T = TypeVar('T')


class Repository(ABC, Generic[T]):
    """Repository interface"""

    @abstractmethod
    def save(self, entity: T) -> None:
        """Save an entity"""
        pass

    @abstractmethod
    def load(self, id: str) -> T | None:
        """Load an entity"""
        pass


# infrastructure/repositories/progress_repository.py
import json
import os

from core.models import Progress
from core.repository import Repository
from infrastructure.storage import AtomicWriter


class CrashSafeProgressRepository(Repository[Progress]):
    """Crash-Safe progress repository (one progress record per file)"""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def save(self, progress: Progress) -> None:
        AtomicWriter.write(self.file_path, progress.to_dict())

    def load(self, work_id: str) -> Progress | None:
        if not os.path.exists(self.file_path):
            return None
        with open(self.file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        # save() writes the record itself, so match on its work_id
        if data.get("work_id") != work_id:
            return None
        return Progress.from_dict(data)
```

**4. Observer pattern (progress notification)**

Progress notifications use the PyQt6 signal/slot mechanism:

```python
# core/events.py
from PyQt6.QtCore import QObject, pyqtSignal
from typing import Protocol


class ProgressObserver(Protocol):
    """Progress observer protocol"""

    def on_stage_start(self, work_id: str, stage: str) -> None:
        """Stage started"""
        ...

    def on_stage_progress(self, work_id: str, stage: str, percent: int) -> None:
        """Stage progress"""
        ...

    def on_stage_complete(self, work_id: str, stage: str) -> None:
        """Stage completed"""
        ...

    def on_stage_error(self, work_id: str, stage: str, error: str) -> None:
        """Stage error"""
        ...
class ProgressEmitter(QObject):
    """Progress event emitter"""

    # Signal definitions
    stage_started = pyqtSignal(str, str)        # (work_id, stage)
    stage_progress = pyqtSignal(str, str, int)  # (work_id, stage, percent)
    stage_completed = pyqtSignal(str, str)      # (work_id, stage)
    stage_failed = pyqtSignal(str, str, str)    # (work_id, stage, error)
    task_finished = pyqtSignal(str, str)        # (work_id, final_state)


# Usage example
class TranslationStage:
    def __init__(self, emitter: ProgressEmitter):
        self.emitter = emitter

    def execute(self, context: PipelineContext) -> PipelineContext:
        self.emitter.stage_started.emit(context.work_id, "translation")
        try:
            # _load_batches is a placeholder helper; it is not defined in this document
            batches = self._load_batches(context)
            for i, batch in enumerate(batches):
                # Translate one batch
                self._translate_batch(batch)
                progress = int((i + 1) / len(batches) * 100)
                self.emitter.stage_progress.emit(context.work_id, "translation", progress)
            self.emitter.stage_completed.emit(context.work_id, "translation")
            return context
        except Exception as e:
            self.emitter.stage_failed.emit(context.work_id, "translation", str(e))
            context.error = str(e)
            return context
```

### Naming Patterns

**Code naming conventions:**

| Category | Convention | Examples |
|----------|------------|----------|
| Classes | PascalCase | `TranslationPipeline`, `TaskStateMachine` |
| Functions | snake_case | `execute_pipeline()`, `load_progress()` |
| Variables | snake_case | `work_id`, `batch_size` |
| Constants | UPPER_SNAKE_CASE | `MAX_BATCH_SIZE`, `DEFAULT_TIMEOUT` |
| Private members | Leading underscore | `_internal_state`, `_helper()` |
| Protocols/interfaces | PascalCase role name | `ProgressObserver`, `Repository` |

**File naming conventions:**

| Type | Naming | Examples |
|------|--------|----------|
| Module files | snake_case.py | `translation_stage.py`, `progress_repository.py` |
| Test files | test_<module>.py | `test_pipeline.py`, `test_translation.py` |
| Package directories | snake_case | `translation/`, `cleaning/` |

### Structure Patterns

**Project organization principles:**

```
src/xling_matrix/
├── core/                  # Core domain models (no dependencies)
│   ├── models.py          # Data models
│   ├── state.py           # State machine
│   ├── pipeline.py        # Pipeline
│   ├── events.py          # Event system
│   └── repository.py      # Repository interface
│
├── modules/               # Business modules (depend on core)
│   └── <module>/
│       ├── __init__.py
│       ├── <module>_stage.py     # Stage implementation
│       ├── <module>_service.py   # Service logic
│       └── models.py             # Module-specific models
│
├── ui/                    # UI layer (depends on core)
│   ├── main_window.py
│   ├── widgets/
│   └── dialogs/
│
└── infrastructure/        # Infrastructure (may depend on any layer)
    ├── storage/
    ├── gpu/
    ├── network/
    └── logging/
```

**Test organization principles:**

```
tests/
├── unit/                  # Unit tests
│   ├── test_core/
│   │   ├── test_pipeline.py
│   │   ├── test_state.py
│   │   └── test_events.py
│   └── test_modules/
│       ├── test_translation.py
│       └── test_cleaning.py
│
├── integration/           # Integration tests
│   ├── test_workflow_integration.py
│   └── test_api_integration.py
│
└── fixtures/              # Test data
    ├── sample_novels/
    └── expected_outputs/
```

### Format Patterns

**Data file format:**

Every JSON file must follow this structure:

```python
# Common JSON structure
{
    "version": "1.0",         # Data version
    "work_id": "uuid",        # Work ID
    "timestamp": "ISO-8601",  # Timestamp
    "data": { ... }           # Actual payload
}
```

**Progress file format (progress.json):**

```json
{
  "version": "1.0",
  "work_id": "abc123",
  "state": "running",
  "current_stage": "translation",
  "stages": {
    "fingerprint": {"status": "success", "progress": 100},
    "cleaning": {"status": "success", "progress": 100},
    "terminology": {"status": "success", "progress": 100},
    "translation": {"status": "running", "progress": 45},
    "upload": {"status": "pending", "progress": 0}
  },
  "created_at": "2026-03-13T12:00:00Z",
  "updated_at": "2026-03-13T12:30:00Z"
}
```

**Error response format:**

```python
from dataclasses import dataclass


# Unified error format
@dataclass
class ErrorInfo:
    code: str           # Error code (e.g. "STAGE_FAILED", "GPU_OOM")
    message: str        # User-friendly error message
    detail: str | None  # Detailed error information (log level)
    stage: str | None   # Stage that failed


# Error code conventions
class ErrorCode:
    # General errors
    UNKNOWN_ERROR = "UNKNOWN_ERROR"
    INVALID_INPUT = "INVALID_INPUT"
    FILE_NOT_FOUND = "FILE_NOT_FOUND"

    # Stage errors
    FINGERPRINT_FAILED = "FINGERPRINT_FAILED"
    CLEANING_FAILED = "CLEANING_FAILED"
    TERMINOLOGY_FAILED = "TERMINOLOGY_FAILED"
    TRANSLATION_FAILED = "TRANSLATION_FAILED"
    UPLOAD_FAILED = "UPLOAD_FAILED"

    # GPU errors
    GPU_NOT_AVAILABLE = "GPU_NOT_AVAILABLE"
    GPU_OOM = "GPU_OOM"

    # Network errors
    API_CONNECTION_FAILED = "API_CONNECTION_FAILED"
    API_TIMEOUT = "API_TIMEOUT"
```

### Communication Patterns

**Event naming conventions:**

```python
# Event naming format: <ENTITY>_<ACTION> constants with "<entity>.<action>" values
class Events:
    # Stage events
    STAGE_STARTED = "stage.started"
    STAGE_PROGRESS = "stage.progress"
    STAGE_COMPLETED = "stage.completed"
    STAGE_FAILED = "stage.failed"

    # Task events
    TASK_CREATED = "task.created"
    TASK_STARTED = "task.started"
    TASK_PAUSED = "task.paused"
    TASK_RESUMED = "task.resumed"
    TASK_FINISHED = "task.finished"
```

**Log level usage:**

| Level | Purpose | Example |
|-------|---------|---------|
| DEBUG | Detailed debugging information | `"Batch size: 16, GPU memory: 3.2GB"` |
| INFO | Normal operation flow | `"Stage 'translation' started for work_id: abc123"` |
| WARNING | Recoverable problems | `"GPU memory low, reducing batch size to 8"` |
| ERROR | Operation failed but is recoverable | `"API request failed, retrying (1/3)"` |
| CRITICAL | Severe errors that need manual intervention | `"GPU OOM, cannot continue"` |

### Process Patterns

**Crash-Safe write pattern (mandatory):**

```python
# Every persistence operation must use this pattern
import json

from infrastructure.storage import AtomicWriter


# Correct
def save_progress(progress: Progress) -> None:
    AtomicWriter.write("progress.json", progress.to_dict())


# Wrong: direct writes are forbidden
def save_progress_WRONG(progress: Progress) -> None:
    with open("progress.json", "w") as f:
        json.dump(progress.to_dict(), f)  # ❌ Not Crash-Safe
```

**Error handling pattern:**

```python
# Unified error handling flow.
# do_work, handle_gpu_oom, retry_with_backoff and the exception types are
# placeholders for module-specific code.
def execute_stage(context: PipelineContext) -> PipelineContext:
    try:
        # Business logic
        result = do_work(context)
        return context
    except GPUOutOfMemoryError as e:
        # Specific error handling
        return handle_gpu_oom(context, e)
    except APIError as e:
        # Retry logic
        return retry_with_backoff(context, e)
    except Exception as e:
        # Generic error handling
        context.error = str(e)
        logger.error(f"Stage failed: {e}", exc_info=True)
        return context
```

**GPU resource management pattern:**

```python
# infrastructure/gpu/manager.py
import ctranslate2


class GPUManager:
    """GPU resource manager"""

    _instance = None
    _translator = None

    @classmethod
    def get_instance(cls) -> 'GPUManager':
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def initialize(self, model_path: str) -> None:
        """Initialize the translator on the best available device"""
        if self._translator is None:
            device = self._detect_device()
            self._translator = ctranslate2.Translator(
                model_path,
                device=device,
                device_index=0,
                # float16 targets the GPU; int8 is used for the CPU fallback
                compute_type="float16" if device == "cuda" else "int8",
                inter_threads=4
            )

    def _detect_device(self) -> str:
        """Detect the available device"""
        try:
            import torch
            if torch.cuda.is_available():
                return "cuda"
        except ImportError:
            pass
        return "cpu"  # Fall back to CPU

    def translate_batch(self, tokens: list[list[str]]) -> list[list[str]]:
        """Translate a batch and return the best hypothesis for each input"""
        results = self._translator.translate_batch(tokens)
        return [result.hypotheses[0] for result in results]
```

### Enforcement Guidelines

**All AI Agents MUST:**

1. **Use Crash-Safe writes**: every persistence operation must go through `AtomicWriter`
2. **Follow the state machine rules**: state transitions must be validated by `TaskStateMachine`
3. **Use the Repository interface**: data access must implement the `Repository` interface
4. **Report progress through signals**: use `ProgressEmitter` to send progress events
5. **Follow the naming conventions**: code naming must match the conventions defined above
6. **Return the unified error format**: every error must be returned as an `ErrorInfo` structure

## Complete Project Structure

### Directory Layout

```
xling-matrix-assistant/
├── src/
│   └── xling_matrix/
│       ├── __init__.py
│       ├── __main__.py
│       │
│       ├── core/                          # Core domain layer
│       │   ├── __init__.py
│       │   ├── models.py                  # Data model definitions
│       │   ├── state.py                   # State machine implementation
│       │   ├── pipeline.py                # Pipeline orchestration
│       │   ├── events.py                  # Event system
│       │   └── repository.py              # Repository interface
│       │
│       ├── modules/                       # Business module layer
│       │   ├── __init__.py
│       │   │
│       │   ├── fingerprint/               # Fingerprint module (FR1-FR8)
│       │   │   ├── __init__.py
│       │   │   ├── fingerprint_stage.py   # Fingerprint check stage
│       │   │   ├── fingerprint_service.py # Fingerprint service
│       │   │   └── models.py              # Fingerprint data models
│       │   │
│       │   ├── cleaning/                  # Cleaning module (FR9-FR16)
│       │   │   ├── __init__.py
│       │   │   ├── cleaning_stage.py
│       │   │   ├── rule_engine.py         # Regex replacement engine
│       │   │   ├── formatter.py           # Format normalization
│       │   │   └── models.py
│       │   │
│       │   ├── terminology/               # Terminology module (FR17-FR24)
│       │   │   ├── __init__.py
│       │   │   ├── terminology_stage.py
│       │   │   ├── extractor.py           # Term extractor
│       │   │   ├── library.py             # Term library management
│       │   │   └── models.py
│       │   │
│       │   ├── translation/               # Translation module (FR25-FR33)
│       │   │   ├── __init__.py
│       │   │   ├── translation_stage.py
│       │   │   ├── translator.py          # CTranslate2 wrapper
│       │   │   ├── batch_processor.py     # Batch optimization
│       │   │   └── models.py
│       │   │
│       │   └── upload/                    # Upload module (FR34-FR40)
│       │       ├── __init__.py
│       │       ├── upload_stage.py
│       │       ├── uploader.py            # Platform upload
│       │       └── models.py
│       │
│       ├── ui/                            # Presentation layer
│       │   ├── __init__.py
│       │   ├── main_window.py             # Main window
│       │   ├── widgets/
│       │   │   ├── __init__.py
│       │   │   ├── task_list_widget.py    # Task list
│       │   │   ├── progress_widget.py     # Progress display
│       │   │   └── log_widget.py          # Log display
│       │   ├── dialogs/
│       │   │   ├── __init__.py
│       │   │   ├── new_task_dialog.py     # New task dialog
│       │   │   ├── settings_dialog.py     # Settings dialog
│       │   │   └── fingerprint_dialog.py  # Fingerprint review dialog
│       │   └── models/
│       │       ├── __init__.py
│       │       └── task_model.py          # Task data model
│       │
│       └── infrastructure/                # Infrastructure layer
│           ├── __init__.py
│           ├── storage/
│           │   ├── __init__.py
│           │   ├── atomic_writer.py       # Crash-Safe writes
│           │   └── file_lock.py           # File locking
│           ├── gpu/
│           │   ├── __init__.py
│           │   └── manager.py             # GPU resource management
│           ├── network/
│           │   ├── __init__.py
│           │   ├── api_client.py          # Platform API client
│           │   └── retry.py               # Retry strategy
│           └── logging/
│               ├── __init__.py
│               └── logger.py              # Logging configuration
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py                        # pytest configuration
│   │
│   ├── unit/
│   │   ├── test_core/
│   │   │   ├── __init__.py
│   │   │   ├── test_pipeline.py
│   │   │   ├── test_state.py
│   │   │   └── test_events.py
│   │   ├── test_modules/
│   │   │   ├── test_fingerprint.py
│   │   │   ├── test_cleaning.py
│   │   │   ├── test_terminology.py
│   │   │   ├── test_translation.py
│   │   │   └── test_upload.py
│   │   └── test_infrastructure/
│   │       ├── test_storage.py
│   │       ├── test_gpu_manager.py
│   │       └── test_api_client.py
│   │
│   ├── integration/
│   │   ├── __init__.py
│   │   ├── test_workflow.py               # Full workflow test
│   │   └── test_api_integration.py
│   │
│   └── fixtures/
│       ├── novels/
│       │   └── sample_chinese.txt
│       └── expected/
│           └── sample_translated.json
│
├── models/                                # Translation model files
│   └── m2m100_418m_ct2/
│
├── assets/                                # Resource files
│   ├── icons/
│   │   └── app_icon.ico
│   └── config/
│       └── default_config.yaml
│
├── docs/
│   ├── architecture.md                    # Architecture document
│   ├── api.md                             # API documentation
│   └── user_guide.md                      # User guide
│
├── pyproject.toml                         # Project configuration
├── README.md
├── LICENSE
└── .gitignore
```

### Module Dependencies

```
┌─────────────────────────────────────────────────────────────┐
│                          UI Layer                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  MainWindow  │  │   Widgets    │  │   Dialogs    │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
└─────────┼─────────────────┼─────────────────┼───────────────┘
          │                 │                 │
          ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────┐
│                      Application Layer                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Scheduler   │  │  Workflows   │  │State Machine │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
└─────────┼─────────────────┼─────────────────┼───────────────┘
          │                 │                 │
          ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────┐
│                        Domain Layer                         │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌───────────┐  │
│ │Fingerprint │ │  Cleaning  │ │Terminology │ │Translation│  │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬─────┘  │
│       └──────────────┴───────┬──────┴──────────────┘        │
│                        ┌─────▼─────┐                        │
│                        │   Core    │                        │
│                        │(Pipeline, │                        │
│                        │  Events,  │                        │
│                        │  Models)  │                        │
│                        └───────────┘                        │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    Infrastructure Layer                     │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌───────────┐  │
│ │  Storage   │ │    GPU     │ │  Network   │ │  Logging  │  │
│ │(Crash-Safe)│ │  Manager   │ │ API Client │ │           │  │
│ └────────────┘ └────────────┘ └────────────┘ └───────────┘  │
└─────────────────────────────────────────────────────────────┘
```

## API Interface Design

### Internal Module APIs

**Pipeline Stage interface:**

```python
# core/pipeline.py
from typing import Protocol


class PipelineStage(Protocol):
    """Pipeline stage protocol - every stage must implement it"""

    def name(self) -> str:
        """Return the unique stage identifier"""
        ...

    def execute(self, context: PipelineContext) -> PipelineContext:
        """Run the stage logic

        Args:
            context: pipeline context

        Returns:
            The updated context. On failure, context.error is set.
        """
        ...

    def estimate_progress(self, context: PipelineContext) -> int:
        """Estimate the current progress percentage"""
        ...
``` **Repository 接口:** ```python # core/repository.py class Repository(ABC, Generic[T]): """数据仓储接口 - 所有数据访问必须实现""" @abstractmethod def save(self, entity: T) -> None: """保存实体(使用 Crash-Safe 写入)""" pass @abstractmethod def load(self, id: str) -> T | None: """加载实体""" pass @abstractmethod def delete(self, id: str) -> bool: """删除实体""" pass ``` ### External API Integration **平台 API 客户端接口:** ```python # infrastructure/network/platform_api.py class PlatformAPIClient: """平台 API 客户端 - 对接外部平台""" BASE_URL: str API_KEY: str TIMEOUT: int = 30 # 指纹查重 API def check_fingerprint(self, text: str) -> FingerprintResult: """检查文本指纹 Args: text: 待检查的文本 Returns: FingerprintResult: 包含相似度、匹配章节等 Raises: APIConnectionError: 网络连接失败 APITimeoutError: 请求超时 APIError: API 返回错误 """ ... # 章节上传 API def upload_chapter(self, chapter: ChapterData) -> UploadResult: """上传翻译章节 Args: chapter: 章节数据(标题、内容、字数等) Returns: UploadResult: 包含章节 ID、URL 等 Raises: APIConnectionError, APITimeoutError, APIError """ ... # CU 扣费 API def deduct_cu(self, word_count: int) -> DeductResult: """扣除 CU Args: word_count: 字数 Returns: DeductResult: 包含剩余 CU """ ... # 健康检查 def health_check(self) -> bool: """检查 API 连接状态""" ... ``` **API 请求/响应格式:** **1. 指纹查重 API** ```http POST /api/fingerprint/check Content-Type: application/json Authorization: Bearer {api_key} # Request { "fingerprint": "md5hash", "sample": "第一章样本文本...", "work_id": "uuid" } # Response { "exists": false, "work_id": "uuid", "similarity": 0.0, "matches": [] } ``` **2. 上传章节 API** ```http POST /api/chapters Content-Type: application/json Authorization: Bearer {api_key} # Request { "work_id": "uuid", "chapter_id": "Chapter 0001", "title": "第一章 开始", "content_en": "Chapter 1 The Beginning...", "word_count": 1234, "source_language": "zh", "target_language": "en" } # Response { "success": true, "chapter_id": "Chapter 0001", "chapter_url": "https://platform.com/novels/uuid/chapters/Chapter%200001", "uploaded_at": "2026-03-13T12:00:00Z" } ``` **3. 
CU 扣费 API** ```http POST /api/cu/deduct Content-Type: application/json Authorization: Bearer {api_key} # Request { "work_id": "uuid", "words": 1234, "chapter_id": "Chapter 0001" } # Response { "success": true, "deducted": 12.34, "balance": 987.66, "transaction_id": "txn_abc123" } ``` **错误响应格式:** ```http # Error Response { "error": { "code": "INVALID_API_KEY", "message": "API密钥无效或已过期", "detail": "请联系客服获取新的API密钥" } } ``` ## Data Model Design ### Core Data Models ```python # core/models.py from dataclasses import dataclass, field from datetime import datetime from typing import Literal from enum import Enum class TaskState(Enum): PENDING = "pending" RUNNING = "running" PAUSED = "paused" SUCCESS = "success" FAILED = "failed" class StageStatus(Enum): PENDING = "pending" RUNNING = "running" SUCCESS = "success" FAILED = "failed" SKIPPED = "skipped" @dataclass class StageProgress: """阶段进度""" status: StageStatus progress: int # 0-100 error: str | None = None started_at: datetime | None = None completed_at: datetime | None = None @dataclass class Progress: """任务进度""" work_id: str state: TaskState current_stage: str stages: dict[str, StageProgress] = field(default_factory=dict) input_file: str = "" output_dir: str = "" created_at: datetime = field(default_factory=datetime.now) updated_at: datetime = field(default_factory=datetime.now) def to_dict(self) -> dict: """序列化为字典""" return { "version": "1.0", "work_id": self.work_id, "state": self.state.value, "current_stage": self.current_stage, "stages": { name: { "status": stage.status.value, "progress": stage.progress, "error": stage.error, "started_at": stage.started_at.isoformat() if stage.started_at else None, "completed_at": stage.completed_at.isoformat() if stage.completed_at else None, } for name, stage in self.stages.items() }, "input_file": self.input_file, "output_dir": self.output_dir, "created_at": self.created_at.isoformat(), "updated_at": self.updated_at.isoformat(), } @classmethod def from_dict(cls, data: dict) -> 
'Progress':
        """Deserialize from a dict."""
        stages = {
            name: StageProgress(
                status=StageStatus(stage["status"]),
                progress=stage["progress"],
                error=stage.get("error"),
                started_at=(
                    datetime.fromisoformat(stage["started_at"])
                    if stage.get("started_at") else None
                ),
                completed_at=(
                    datetime.fromisoformat(stage["completed_at"])
                    if stage.get("completed_at") else None
                ),
            )
            for name, stage in data.get("stages", {}).items()
        }
        return cls(
            work_id=data["work_id"],
            state=TaskState(data["state"]),
            current_stage=data["current_stage"],
            stages=stages,
            input_file=data.get("input_file", ""),
            output_dir=data.get("output_dir", ""),
            created_at=datetime.fromisoformat(data["created_at"]),
            updated_at=datetime.fromisoformat(data["updated_at"]),
        )


@dataclass
class Term:
    """Terminology entry."""
    source: str            # Source text
    target: str            # Translation
    category: str = ""     # Category
    locked: bool = False   # Whether the term is locked


@dataclass
class TerminologyLibrary:
    """Terminology library."""
    version: str = "1.0"
    terms: list[Term] = field(default_factory=list)


@dataclass
class ChapterData:
    """Chapter data."""
    title: str
    content: str
    word_count: int
    source_language: str = "zh"
    target_language: str = "en"


@dataclass
class Chapter:
    """Chapter entity."""
    chapter_id: str                        # e.g. "Chapter 0001"
    part_index: int                        # Volume index
    title_src: str                         # Source-language title
    content: str                           # Source-language content
    content_en: str | None = None          # Translated content
    word_count: int = 0
    translated_at: datetime | None = None


# NOTE: this second Term definition replaces the one above if both live
# in the same module.
@dataclass
class Term:
    """Terminology entry."""
    source: str                # Source text
    translation: str | None    # Translation
    count: int = 0             # Number of occurrences
    chapters: int = 0          # Number of chapters the term appears in
    locked: bool = False       # Whether the term is locked

    def to_dict(self) -> dict:
        return {
            "source": self.source,
            "translation": self.translation,
            "count": self.count,
            "chapters": self.chapters,
            "locked": self.locked
        }

    @classmethod
    def from_dict(cls, data: dict) -> 'Term':
        return cls(
            source=data["source"],
            translation=data.get("translation"),
            count=data.get("count", 0),
            chapters=data.get("chapters", 0),
            locked=data.get("locked", False)
        )
```

### Extended Data Models

**Fingerprint data models:**

```python
from dataclasses import dataclass, field


@dataclass
class FingerprintData:
    """Fingerprint duplicate-check data."""
    work_id: str
    fingerprint: str           # MD5 hash
    sample: str                # Sample text (first 1,000 characters)
    exists: bool = False
    similarity: float = 0.0
    matches: list[str] = field(default_factory=list)  # Matching work_id values


@dataclass
class FingerprintResult:
    """Fingerprint duplicate-check result."""
    exists: bool
    work_id: str
    similarity: float
    matches: list[dict] = field(default_factory=list)
    # matches format: [{"work_id": "uuid", "similarity": 0.95, "chapter": "Chapter 0001"}]
```

**Upload queue models:**

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class UploadQueueItem:
    """Upload queue item."""
    work_id: str
    chapter_id: str
    title: str
    content_en: str
    word_count: int
    retry_count: int = 0
    max_retries: int = 3
    created_at: datetime = field(default_factory=datetime.now)


@dataclass
class UploadFailedItem:
    """Failed upload item (stored as JSONL)."""
    work_id: str
    chapter_id: str
    error_code: str
    error_message: str
    failed_at: datetime = field(default_factory=datetime.now)
    retry_count: int = 0
```

## Security Design

### Data Protection

**1. Local data storage policy:**

- All user data is stored locally.
- Source text is never uploaded to the cloud. The only exception to local-only processing is uploading translation results through the platform API.
- Secrets that would otherwise sit in configuration files (API keys) are stored in the system keyring.

**2. API key management:**

```python
# infrastructure/security/secret_manager.py
import keyring
from typing import Optional


class SecretManager:
    """Secret manager backed by the system keyring."""

    SERVICE_NAME = "xling-matrix-assistant"

    def set_api_key(self, key: str) -> None:
        """Store the API key."""
        keyring.set_password(self.SERVICE_NAME, "platform_api", key)

    def get_api_key(self) -> Optional[str]:
        """Return the API key, or None if no key is stored."""
        return keyring.get_password(self.SERVICE_NAME, "platform_api")

    def delete_api_key(self) -> None:
        """Delete the API key.

        May raise keyring.errors.PasswordDeleteError if no key is stored.
        """
        keyring.delete_password(self.SERVICE_NAME, "platform_api")
```

**3.
File permission control:**

```python
# infrastructure/storage/permissions.py
import os
import stat


def set_secure_permissions(filepath: str) -> None:
    """Restrict the file to owner read/write only.

    On Windows, os.chmod can only toggle the read-only flag, so this
    does not restrict access for other users there.
    """
    os.chmod(filepath, stat.S_IRUSR | stat.S_IWUSR)
```

### License Compliance

**Dependency license verification:**

```python
# tools/license_checker.py
from importlib import metadata

ALLOWED_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause", "PSF-2.0"}
BANNED_LICENSES = {"GPL", "AGPL", "LGPL", "SSPL", "CPAL"}


def check_dependency_licenses() -> dict:
    """Check the declared license of every installed distribution."""
    # `pip show` has no JSON output mode, so read the installed package
    # metadata through the standard library instead.
    issues = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name", "UNKNOWN")
        # The License field is free-form text, so the whitelist comparison
        # below is an exact match and may need normalization in practice.
        license_ = dist.metadata.get("License") or "UNKNOWN"

        if any(banned in license_ for banned in BANNED_LICENSES):
            issues.append({
                "package": name,
                "license": license_,
                "severity": "BLOCKING",
                "reason": "Banned license family (copyleft contamination risk)"
            })
        elif license_ not in ALLOWED_LICENSES and license_ != "UNKNOWN":
            issues.append({
                "package": name,
                "license": license_,
                "severity": "WARNING",
                "reason": "License not in whitelist"
            })

    return {"valid": len(issues) == 0, "issues": issues}
```

**4.
License management (Growth phase):**

```python
# infrastructure/security/license_manager.py
import hashlib
import platform
import uuid

import requests


class LicenseManager:
    """License manager: hardware binding and online activation checks."""

    def generate_fingerprint(self) -> str:
        """Generate a hardware fingerprint used to bind the software to a machine."""
        # Collect hardware information
        machine_id = platform.node()
        cpu_info = platform.processor()
        mac_address = self._get_mac_address()

        # Combine the values into a fingerprint
        fingerprint_data = f"{machine_id}:{cpu_info}:{mac_address}"
        return hashlib.md5(fingerprint_data.encode()).hexdigest()

    def _get_mac_address(self) -> str:
        """Return this machine's MAC address as aa:bb:cc:dd:ee:ff."""
        try:
            # uuid.getnode() falls back to a random 48-bit number when no
            # hardware address can be found, so the result may not be stable.
            node = uuid.getnode()
            # Walk all six bytes (48 bits), then reverse so the most
            # significant byte comes first.
            return ':'.join(
                ['{:02x}'.format((node >> shift) & 0xff)
                 for shift in range(0, 8 * 6, 8)][::-1]
            )
        except Exception:
            return "unknown"

    def verify_activation(self, activation_key: str) -> bool:
        """Verify an activation key online.

        Args:
            activation_key: The activation key entered by the user.

        Returns:
            bool: Whether the activation is valid.
        """
        try:
            fingerprint = self.generate_fingerprint()
            response = requests.post(
                "https://license.xling-matrix.com/verify",
                json={
                    "activation_key": activation_key,
                    "fingerprint": fingerprint,
                    "version": "0.1.0"
                },
                timeout=10
            )
            response.raise_for_status()
            return response.json().get("valid", False)
        except Exception:
            return False

    def check_expiration(self, activation_key: str) -> dict | None:
        """Check whether the activation has expired."""
        try:
            response = requests.post(
                "https://license.xling-matrix.com/check",
                json={"activation_key": activation_key},
                timeout=10
            )
            response.raise_for_status()
            data = response.json()
            return {
                "expired": data.get("expired", False),
                "expires_at": data.get("expires_at"),
                "days_remaining": data.get("days_remaining")
            }
        except Exception:
            return None

    def activate_offline(self, activation_key: str, max_credits: int = 1000) -> bool:
        """Offline activation (local signature verification)."""
        # TODO: implement offline activation (requires a server-generated
        # signing key pair). Fail closed until verification exists, so an
        # arbitrary key cannot activate the product.
        return False
```

**Activation state storage:**

```python
# infrastructure/storage/license_storage.py
import json
from datetime import datetime
from pathlib import Path

# AtomicWriter and LicenseManager are the project's own classes defined
# elsewhere in this document; their import paths are omitted here.


class LicenseStorage:
    """Activation state storage."""

    ACTIVATION_FILE = Path.home() / ".config" / "xling-matrix" / "activation.json"

    def save_activation(self, activation_key: str, fingerprint: str) -> None:
        """Save the activation record."""
        data = {
            "activation_key": activation_key,
            "fingerprint": fingerprint,
            "activated_at": datetime.now().isoformat(),
            "version": "0.1.0"
        }
        self.ACTIVATION_FILE.parent.mkdir(parents=True, exist_ok=True)
        AtomicWriter.write(str(self.ACTIVATION_FILE), data)

    def load_activation(self) -> dict | None:
        """Load the activation record, or None if the file does not exist."""
        if not self.ACTIVATION_FILE.exists():
            return None
        with open(self.ACTIVATION_FILE, 'r', encoding='utf-8') as f:
            return json.load(f)

    def is_activated(self) -> bool:
        """Check whether this machine is activated."""
        activation = self.load_activation()
        if not activation:
            return False

        # Check that the stored hardware fingerprint matches this machine
        current_fingerprint = LicenseManager().generate_fingerprint()
        return activation.get("fingerprint") == current_fingerprint
```

**License verification flow:**

```
Start → Check local activation → [none] Show activation dialog
              ↓ [found]
        Check hardware fingerprint → [mismatch] Re-activate
              ↓ [match]
        Check expiration → [expired] Prompt to renew
              ↓ [valid]
        Verify CU balance → [insufficient] Prompt to top up
              ↓ [sufficient]
        Allow use
```

## Performance Optimization

### GPU Optimization

**1. Batching strategy:**

```python
# modules/translation/batch_processor.py
class BatchProcessor:
    """Batching optimizer."""

    def __init__(self, max_tokens: int = 4096):
        self.max_tokens = max_tokens

    def create_batches(self, texts: list[str]) -> list[list[str]]:
        """Split texts into batches sized by estimated token count.

        Strategy:
        1. Group texts by token count.
        2. Fill each batch as close to max_tokens as possible without exceeding it.
        3. Keep adjacent texts in the same batch where possible (preserves context).

        A single text longer than max_tokens still gets a batch of its own.
        """
        batches = []
        current_batch = []
        current_tokens = 0

        for text in texts:
            tokens = self._count_tokens(text)

            if current_tokens + tokens > self.max_tokens and current_batch:
                # The current batch is full: close it and start a new one
                batches.append(current_batch)
                current_batch = [text]
                current_tokens = tokens
            else:
                current_batch.append(text)
                current_tokens += tokens

        if current_batch:
            batches.append(current_batch)

        return batches

    def _count_tokens(self, text: str) -> int:
        """Estimate the token count (roughly 1.5 Chinese characters per token)."""
        return int(len(text) / 1.5)
```

**2.
Dynamic batch size adjustment:**

```python
# infrastructure/gpu/batch_optimizer.py
import torch


class BatchSizeOptimizer:
    """Dynamic batch size optimizer."""

    def __init__(self, initial_size: int = 16):
        self.current_size = initial_size
        self.min_size = 4
        self.max_size = 32

    def adjust_for_memory(self, oom_occurred: bool) -> int:
        """Adjust the batch size based on GPU memory pressure."""
        if oom_occurred:
            self.current_size = max(self.min_size, self.current_size // 2)
        else:
            # Grow gradually to find the best size. Always step up by at
            # least 1: int(4 * 1.2) == 4, so pure scaling would never
            # leave min_size.
            grown = max(self.current_size + 1, int(self.current_size * 1.2))
            self.current_size = min(self.max_size, grown)
        return self.current_size

    def get_memory_info(self) -> dict:
        """Return GPU memory information."""
        if not torch.cuda.is_available():
            return {"available": False}

        # memory_allocated() only counts memory held by PyTorch tensors in
        # this process, so memory used by other libraries is not reflected
        # in allocated_gb or free_gb.
        total = torch.cuda.get_device_properties(0).total_memory
        allocated = torch.cuda.memory_allocated(0)
        return {
            "available": True,
            "total_gb": total / 1e9,
            "allocated_gb": allocated / 1e9,
            "free_gb": (total - allocated) / 1e9,
        }
```

### I/O Optimization

**1. Incremental progress saving:**

```python
# infrastructure/storage/incremental.py
class IncrementalProgressSaver:
    """Incremental progress saver that reduces disk writes."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold  # Save only after a change of at least 5 percentage points
        self.last_saved = 0

    def should_save(self, current_progress: int) -> bool:
        # Always persist completion, even when the final step is smaller
        # than the threshold.
        if current_progress >= 100 and self.last_saved < 100:
            return True
        return abs(current_progress - self.last_saved) >= self.threshold

    def mark_saved(self, progress: int) -> None:
        self.last_saved = progress
```

**2. File reading optimization:**

```python
# infrastructure/storage/chunked_reader.py
class ChunkedFileReader:
    """Chunked file reader intended for large files."""

    def __init__(self, chunk_size: int = 8192):
        self.chunk_size = chunk_size

    def read_by_paragraphs(self, filepath: str) -> list[str]:
        """Read a file as a list of paragraphs (a better fit for novels).

        This method loads the whole file into memory and does not use
        chunk_size.
        """
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()

        # Split paragraphs on blank lines (double newline)
        paragraphs = [p.strip() for p in content.split('\n\n') if p.strip()]
        return paragraphs
```

## Deployment Architecture

### Application Packaging

**1.
PyInstaller configuration:**

```python
# build/pyinstaller_spec.py
# Analysis, PYZ and EXE are provided by PyInstaller when it executes a spec file.
import sys
from PyInstaller.utils.hooks import collect_data_files, collect_submodules

block_cipher = None

datas = [
    ('models', 'models'),
    ('assets', 'assets'),
]

hiddenimports = [
    'PyQt6.sip',
    'ctranslate2',
    'torch',
]

# Bytecode encryption (the cipher argument) was removed in PyInstaller 6.0,
# so check these arguments against the installed PyInstaller version.
a = Analysis(
    ['src/xling_matrix/__main__.py'],
    pathex=[],
    binaries=[],
    datas=datas,
    hiddenimports=hiddenimports,
    hookspath=[],
    hooksconfig={},
    runtime_hooks=[],
    excludes=[],
    win_no_prefer_redirects=False,
    win_private_assemblies=False,
    cipher=block_cipher,
    noarchive=False,
)

pyz = PYZ(a.pure, a.zipped_data, cipher=block_cipher)

exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.zipfiles,
    a.datas,
    [],
    name='序灵Matrix助手',
    debug=False,
    bootloader_ignore_signals=False,
    strip=False,
    upx=True,
    upx_exclude=[],
    runtime_tmpdir=None,
    console=True,  # Keeps a console window open; set to False for a windowed GUI build
    disable_windowed_traceback=False,
    argv_emulation=False,
    target_arch=None,
    codesign_identity=None,
    entitlements_file=None,
    icon='assets/icons/app_icon.ico',
)
```

**2. Installer configuration:**

```
Installation layout:
序灵Matrix助手/
├── 序灵Matrix助手.exe        # Main executable
├── models/                   # Translation model (downloaded on first run)
│   └── m2m100_418m_ct2/
├── configs/
│   └── default_config.yaml
└── README.txt

User data directory:
~/Documents/xling-matrix/     # Windows
~/Documents/xling-matrix/     # macOS
~/xling-matrix/               # Linux
```

### Distribution Strategy

**1. Version management:**

```python
# core/version.py
__version__ = "0.1.0"
__build__ = "20260313"


def get_version() -> str:
    return f"{__version__}+{__build__}"
```

**2.
Update checking (Growth phase):**

```python
# infrastructure/update/update_checker.py
import requests
from packaging import version


class UpdateChecker:
    """Update checker."""

    UPDATE_URL = "https://updates.xling-matrix.com/version.json"

    def check_for_updates(self, current_version: str) -> dict | None:
        """Return update details if a newer version exists, otherwise None."""
        try:
            response = requests.get(self.UPDATE_URL, timeout=5)
            response.raise_for_status()
            data = response.json()

            if self._is_newer(current_version, data["latest_version"]):
                return {
                    "has_update": True,
                    "latest_version": data["latest_version"],
                    "download_url": data["download_url"],
                    "release_notes": data.get("release_notes", ""),
                }
        except Exception:
            # Network or parsing failures are treated as "no update available"
            pass
        return None

    def _is_newer(self, current: str, latest: str) -> bool:
        """Compare version numbers."""
        return version.parse(latest) > version.parse(current)
```

---

## Architecture Summary

This architecture document defines the complete technical architecture of 序灵 Matrix 助手, including:

- **Layered architecture**: Presentation / Application / Domain / Infrastructure
- **Core design patterns**: Pipeline, State Machine, Repository, Observer
- **Crash-safe mechanism**: atomic writes keep persisted data safe
- **GPU acceleration**: CTranslate2 with dynamic batch optimization
- **Six-module pipeline**: Fingerprint → Cleaning → Terminology → Translation → Upload, orchestrated by the Task Scheduler
- **Local first**: 100% local processing, zero data leakage
- **Zero license fees**: MIT-licensed dependencies only

This architecture gives all AI agents a shared basis for working together and writing consistent, compatible code.

---

**Document version**: 1.0
**Last updated**: 2026-03-13
**Status**: Complete ✅
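
The stage order listed in the Architecture Summary, together with the checkpoint-resume requirement, can be illustrated with a small lookup that returns the next stage to run. This is a minimal sketch, not the project's scheduler: `Stage` and `next_stage` are hypothetical names, and it assumes stages always run strictly in the listed order.

```python
from __future__ import annotations

from enum import Enum


class Stage(Enum):
    """Pipeline stages, declared in the order given in the summary (hypothetical type)."""
    FINGERPRINT = "fingerprint"
    CLEANING = "cleaning"
    TERMINOLOGY = "terminology"
    TRANSLATION = "translation"
    UPLOAD = "upload"


def next_stage(completed: set[Stage]) -> Stage | None:
    """Return the first stage that has not completed, or None when all are done."""
    # Iterating an Enum yields members in definition order, which is
    # the pipeline order assumed here.
    for stage in Stage:
        if stage not in completed:
            return stage
    return None
```

On resume, the set of completed stages would come from the persisted progress record, so a run interrupted during translation would pick up at `Stage.TRANSLATION` rather than starting over.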