first commit

2026-03-09 22:03:09 +08:00
commit 3a6a12eeb6
8 changed files with 2168 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,513 @@
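+# Policy and Regulation Retrieval and Organization System
+
+An automated system for retrieving and organizing Chinese tax policies and regulations, with scheduled runs, keyword-based filtering, automatic downloads, deduplication, classification, and email reports.
+
+## 🎯 Features
+
+### Core features
+- **Scheduled retrieval** - Runs retrieval jobs automatically on a configurable schedule (for example, weekdays at 09:00)
+- **Multi-site crawling** - Collects information from several official websites at once, including the State Taxation Administration (国家税务总局), the Ministry of Finance (财政部), and the Ministry of Science and Technology (科技部)
+- **Content filtering** - Uses keyword matching to pick out the latest policies, notices, announcements, and similar items
+- **Automatic file download** - Downloads files in PDF, Word, Excel, TXT, and other formats
+- **Deduplication** - Combines title similarity (Jaccard, Levenshtein) with content hashing to remove duplicates
+- **Automatic classification** - Sorts items into categories such as tax policy, notices and announcements, and regulatory documents
+- **Email reports** - Generates an Excel summary report and sends it by email
+
+### Advanced features
+- **Anti-crawler countermeasures** - User-Agent rotation, request interval control, automatic retries
+- **Proxy pool support** - Configurable proxy list with automatic IP rotation
+- **Disk space check** - Checks the remaining disk space before downloading
+- **File integrity check** - Verifies the integrity of downloaded files
+- **Structured logging** - JSON-format logs with log rotation
+- **Failure alerts** - Sends an alert when a job fails
+- **Multi-channel notifications** - Supports email, DingTalk, Webhook, and other notification channels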
+
+## 🚀 Quick start
+
+### 1. Install dependencies
+
+```bash
+cd .trae/skills/policy-regulations-retrieval
+pip install -r requirements.txt
+```
+
+### 2. Initialize the configuration
+
+```bash
+python policy_retrieval.py init
+```
+
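+### 3. Run a retrieval job
+
+```bash
+# Run one retrieval now (the email report is sent by default)
+python policy_retrieval.py run
+
+# Run one retrieval without sending email
+python policy_retrieval.py run --no-email
+
+# Run one retrieval and send the report to specific recipients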
+python policy_retrieval.py run -e user@example.com -e another@example.com
+```
+
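+### 4. Start the scheduler
+
+```bash
+# Start the scheduler (uses the time from the config file)
+python policy_retrieval.py schedule --enable
+
+# Set the run time explicitly (for example, 09:00 every day)
+python policy_retrieval.py schedule --enable --time "09:00"
+
+# Disable the scheduler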
+python policy_retrieval.py schedule --disable
+```
+
+### 5. View the report
+
+```bash
+# Show the most recently generated report
+python policy_retrieval.py report
+```
+
+### 6. Show help
+
+```bash
+python policy_retrieval.py help
+```
+
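+## 📋 Configuration
+
+Edit `config.yaml` to customize how the system behaves:
+
+### Scheduler configuration
+
+```yaml
+scheduler:
+  enabled: true           # enable the scheduler
+  time: "09:00"          # run time on each scheduled day
+  days:                  # days of the week to run on
+    - mon
+    - tue
+    - wed
+    - thu
+    - fri
+  max_instances: 3       # maximum number of concurrent instances
+  coalesce: true         # merge missed runs into a single run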
+```
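
As a rough illustration of how the `scheduler` block above could be mapped onto a cron-style trigger, the sketch below converts `time` and `days` into the `day_of_week`, `hour`, and `minute` fields that APScheduler's cron trigger accepts. The helper name `cron_fields` is hypothetical and is not taken from the project code.

```python
def cron_fields(scheduler_cfg: dict) -> dict:
    """Translate the `scheduler` config block into cron-style fields."""
    hour_str, minute_str = scheduler_cfg["time"].split(":")
    hour, minute = int(hour_str), int(minute_str)
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"invalid time: {scheduler_cfg['time']!r}")
    return {
        "day_of_week": ",".join(scheduler_cfg["days"]),
        "hour": hour,
        "minute": minute,
    }


# Values copied from the config example above.
cfg = {"time": "09:00", "days": ["mon", "tue", "wed", "thu", "fri"]}
fields = cron_fields(cfg)
print(fields)
```

With APScheduler installed, these fields could then be passed along as `scheduler.add_job(job, "cron", **fields)`, together with the `max_instances` and `coalesce` values from the same block.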
+
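+### Target website configuration
+
+```yaml
+targets:
+  - name: "国家税务总局"              # State Taxation Administration
+    url: "https://www.chinatax.gov.cn/"
+    list_paths:
+      - "/npsite/chinatax/zcwj/"    # policy documents
+      - "/npsite/chinatax/tzgg/"    # notices and announcements
+    keywords:
+      - "最新"                       # latest
+      - "通知"                       # notice
+      - "公告"                       # announcement
+      - "政策"                       # policy
+      - "法规"                       # regulation
+    enabled: true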
+```
+
+### Download configuration
+
+```yaml
+download:
+  path: "./downloads"              # download directory
+  formats:                         # accepted file formats
+    - pdf
+    - doc
+    - docx
+    - txt
+    - xlsx
+  max_size: 52428800               # maximum file size in bytes (50 MB)
+  timeout: 60                      # download timeout in seconds
+  retry: 3                         # number of retries
+  user_agent: "Mozilla/5.0..."     # User-Agent
+```
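
The following is a minimal sketch of the kind of pre-download checks these settings suggest: an extension allow-list, a size limit, and a free-disk-space check. The function names `is_allowed` and `has_free_space` are illustrative assumptions, not the project's actual API.

```python
import shutil
from pathlib import PurePosixPath
from urllib.parse import urlparse

ALLOWED_FORMATS = {"pdf", "doc", "docx", "txt", "xlsx"}  # `download.formats`
MAX_SIZE = 52428800                                      # `download.max_size` (50 MB)


def is_allowed(url: str, size: int) -> bool:
    """Accept a file only if its extension and size pass the config limits."""
    ext = PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()
    return ext in ALLOWED_FORMATS and 0 < size <= MAX_SIZE


def has_free_space(directory: str, needed: int) -> bool:
    """Check that the volume holding `directory` has at least `needed` bytes free."""
    return shutil.disk_usage(directory).free >= needed


print(is_allowed("https://example.com/files/notice.PDF", 1024))
print(is_allowed("https://example.com/files/tool.exe", 1024))
print(has_free_space(".", 1024))
```

The size passed to `is_allowed` would typically come from a `Content-Length` response header, which servers do not always send, so a real downloader would also need to enforce the limit while streaming.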
+
+### Deduplication configuration
+
+```yaml
+deduplication:
+  title_similarity: 0.8            # title similarity threshold
+  content_similarity: 0.9          # content similarity threshold
+  hash_algorithm: "simhash"        # hashing algorithm
+```
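
To make the `title_similarity` threshold concrete, here is a small sketch of the two title measures named in the feature list, Jaccard similarity and Levenshtein distance. It assumes character bigrams for Jaccard (Chinese titles have no word spacing) and treats a pair as duplicate when either measure reaches the threshold; the project's own `Deduplicator` may combine them differently.

```python
def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over character n-grams."""
    grams_a = {a[i:i + n] for i in range(max(len(a) - n + 1, 1))}
    grams_b = {b[i:i + n] for i in range(max(len(b) - n + 1, 1))}
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    # Assumption: either measure reaching the threshold marks a duplicate.
    return max(jaccard(a, b), levenshtein_ratio(a, b)) >= threshold


t1 = "关于增值税优惠政策的通知"
t2 = "关于增值税优惠政策的通知。"      # same title with trailing punctuation
t3 = "关于个人所得税汇算清缴的公告"    # a different document
print(is_duplicate(t1, t2), is_duplicate(t1, t3))
```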
+
+### Category configuration
+
+```yaml
+categories:
+  - name: "税收政策"                 # tax policy
+    keywords:
+      - "税收"
+      - "税务"
+      - "纳税"
+      - "税费"
+      - "增值税"
+      - "所得税"
+    priority: 1                    # priority (lower number = higher priority)
+
+  - name: "通知公告"                 # notices and announcements
+    keywords:
+      - "通知"
+      - "公告"
+      - "通告"
+    priority: 2
+
+  - name: "法规文件"                 # regulatory documents
+    keywords:
+      - "法规"
+      - "条例"
+      - "规章"
+      - "办法"
+      - "细则"
+    priority: 3
+
+  - name: "其他政策"                 # other policies
+    keywords: []                   # an empty keyword list marks the default category
+    priority: 99
+```
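
One simple way to read this configuration is "first keyword hit in priority order wins, otherwise fall back to the category with no keywords". The sketch below implements that simplified rule; the function `classify` is hypothetical, and the project's `CategoryClassifier` is described as using multi-category scoring, which this does not attempt to reproduce.

```python
CATEGORIES = [
    {"name": "税收政策", "keywords": ["税收", "税务", "纳税", "税费", "增值税", "所得税"], "priority": 1},
    {"name": "通知公告", "keywords": ["通知", "公告", "通告"], "priority": 2},
    {"name": "法规文件", "keywords": ["法规", "条例", "规章", "办法", "细则"], "priority": 3},
    {"name": "其他政策", "keywords": [], "priority": 99},
]


def classify(title: str, categories=CATEGORIES) -> str:
    """Return the highest-priority category with a keyword hit, else the default."""
    for cat in sorted(categories, key=lambda c: c["priority"]):
        # An empty keyword list marks the fallback category.
        if not cat["keywords"] or any(kw in title for kw in cat["keywords"]):
            return cat["name"]
    raise ValueError("no default category configured")


print(classify("关于增值税优惠政策的通知"))  # matches both 税收政策 and 通知公告
print(classify("系统升级公告"))
print(classify("年度工作总结"))
```

Under this rule a title that matches several categories lands in the one with the smallest `priority` number, which is why the first example is filed under 税收政策 rather than 通知公告.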
+
+### Logging configuration
+
+```yaml
+logging:
+  level: "INFO"                    # log level
+  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+  file: "./logs/policy_retrieval.log"
+  max_bytes: 10485760              # maximum size of a single log file (10 MB)
+  backup_count: 5                  # number of rotated log files to keep
+```
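
These keys line up closely with the standard library's `RotatingFileHandler`. The sketch below shows one plausible way to wire them together; `build_logger` is an illustrative name, and the example writes to a temporary directory instead of `./logs` so that it can be run anywhere.

```python
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(cfg: dict) -> logging.Logger:
    """Create a size-rotated file logger from the `logging` config block."""
    log_path = Path(cfg["file"])
    log_path.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_path,
        maxBytes=cfg["max_bytes"],        # rotate once the file reaches this size
        backupCount=cfg["backup_count"],  # keep this many rotated files
        encoding="utf-8",
    )
    handler.setFormatter(logging.Formatter(cfg["format"]))
    logger = logging.getLogger("policy_retrieval")
    logger.setLevel(cfg["level"])
    logger.addHandler(handler)
    return logger


cfg = {
    "level": "INFO",
    "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    "file": str(Path(tempfile.mkdtemp()) / "logs" / "policy_retrieval.log"),
    "max_bytes": 10485760,
    "backup_count": 5,
}
logger = build_logger(cfg)
logger.info("retrieval started")
print(Path(cfg["file"]).read_text(encoding="utf-8"))
```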
+
+### Notification configuration (optional)
+
+```yaml
+notification:
+  enabled: true
+  on_failure: true                 # notify on failure
+  on_success: true                 # notify on success
+  email:
+    enabled: true
+    smtp_host: "smtp.qq.com"
+    smtp_port: 587
+    smtp_user: "your_email@qq.com"
+    smtp_password: "your_auth_code"  # use an authorization code, not the account password
+    from_addr: "your_email@qq.com"
+    to_addrs:
+      - "user@example.com"
+      - "admin@example.com"
+```
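
For orientation, this is roughly how the `email` settings could be used with the standard library's `smtplib` and `email` modules. The helper names are hypothetical, and the use of STARTTLS is an assumption based on port 587; only the message-building part runs in the example, since sending needs real credentials.

```python
import smtplib
from email.message import EmailMessage


def build_report_email(cfg: dict, subject: str, body: str) -> EmailMessage:
    """Build a plain-text message addressed to every configured recipient."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = cfg["from_addr"]
    msg["To"] = ", ".join(cfg["to_addrs"])
    msg.set_content(body)
    return msg


def send(cfg: dict, msg: EmailMessage) -> None:
    """Send over STARTTLS (assumed here because the config uses port 587)."""
    with smtplib.SMTP(cfg["smtp_host"], cfg["smtp_port"], timeout=30) as smtp:
        smtp.starttls()
        smtp.login(cfg["smtp_user"], cfg["smtp_password"])
        smtp.send_message(msg)


email_cfg = {
    "smtp_host": "smtp.qq.com",
    "smtp_port": 587,
    "smtp_user": "your_email@qq.com",
    "smtp_password": "your_auth_code",
    "from_addr": "your_email@qq.com",
    "to_addrs": ["user@example.com", "admin@example.com"],
}
msg = build_report_email(email_cfg, "Policy retrieval report", "3 new items found.")
print(msg["To"])
# send(email_cfg, msg)  # requires a real account and authorization code
```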
+
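+## 📁 Project structure
+
+```
+policy-regulations-retrieval/
+├── policy_retrieval.py      # main entry point
+├── scraper.py               # web scraping module
+├── processor.py             # data processing module (deduplication, classification)
+├── notifier.py              # notification module (email, DingTalk, etc.)
+├── config.yaml              # configuration file
+├── requirements.txt         # Python dependencies
+├── README.md                # project documentation
+├── SKILL.md                 # skill description
+├── logs/                    # log directory
+│   ├── policy_retrieval.log
+│   └── execution_*.json
+├── downloads/               # downloaded files
+│   ├── 税收政策/              # tax policy
+│   ├── 通知公告/              # notices and announcements
+│   └── 法规文件/              # regulatory documents
+└── output/                  # generated reports
+    ├── summary_YYYYMMDD.xlsx
+    └── deduplicated_data_YYYYMMDD.json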
+```
+
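+## 🔧 Core modules
+
+### 1. Main program (policy_retrieval.py)
+
+The system's entry point, which coordinates the other modules:
+- Loads the configuration file
+- Initializes logging
+- Runs the retrieval pipeline
+- Manages the scheduler
+- Generates the summary report
+
+**Main methods:**
+- `run()` - runs one complete retrieval pipeline
+- `fetch_articles()` - fetches article lists from the target websites
+- `filter_content()` - filters for relevant content
+- `deduplicate()` - removes duplicates
+- `categorize()` - sorts items into categories
+- `download_files()` - downloads files
+- `generate_report()` - generates the Excel report
+
+### 2. Web scraping module (scraper.py)
+
+The web scraper, made up of:
+- **ProxyManager** - manages proxy IPs, with rotation
+- **RateLimiter** - limits the request rate
+- **WebScraper** - general-purpose scraper base class
+- **TaxPolicyScraper** - scraper specialized for tax policy sites
+
+**Features:**
+- Automatic retries (exponential backoff)
+- Request interval control
+- Parsing of multiple date formats
+- Extraction with CSS selectors
+- File URL detection
+
+### 3. Data processing module (processor.py)
+
+Data processing utilities:
+
+**TextSimilarity** - text similarity measures
+- Jaccard similarity
+- Levenshtein edit distance
+- Cosine similarity
+
+**Deduplicator** - deduplication
+- Title similarity detection
+- Content hash deduplication
+- Keeps the most recent record
+
+**CategoryClassifier** - classification
+- Keyword index
+- Multi-category scoring
+- Batch classification
+
+**DataExporter** - data export
+- Excel export
+- JSON export
+- CSV export
+
+### 4. Notification module (notifier.py)
+
+Email notification system:
+- HTML email
+- Attachments
+- Policy retrieval report template
+- Error alert template
+- Multiple recipients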
+
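+## 📊 Sample output
+
+### Sample Excel report
+
+| Title | Published | Source | Category | Summary | Keywords | Download link |
+|-------|-----------|--------|----------|---------|----------|---------------|
+| Notice on Implementing the New Combined Tax and Fee Support Policies | 2024-01-15 | State Taxation Administration | Tax policy | To further reduce the burden on businesses... | latest, notice, policy | /downloads/税收政策/xxx.pdf |
+| State Taxation Administration Announcement No. 1 of 2024 | 2024-01-10 | State Taxation Administration | Notices and announcements | Announcement on ... | announcement | /downloads/通知公告/xxx.pdf |
+
+### Sample directory layout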
+
+```
+downloads/
+├── 税收政策/
+│   ├── 2024-01-15_国家税务总局_关于实施新的组合式税费支持政策的通知.pdf
+│   └── 2024-01-10_国家税务总局_增值税优惠政策.pdf
+├── 通知公告/
+│   └── 2024-01-12_国家税务总局_系统升级公告.pdf
+└── 法规文件/
+    └── 2024-01-08_财政部_税收征管办法.docx
+```
+
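+## 🔍 Command-line options
+
+```bash
+python policy_retrieval.py <command> [options]
+
+Commands:
+  init              Initialize the configuration file
+  run               Run one retrieval immediately
+  schedule          Start the scheduler
+  report            Show the latest report
+  help              Show help
+
+Options:
+  --config, -c      Path to the configuration file
+  --time, -t        Run time for the scheduler
+  --enable          Enable the scheduler
+  --disable         Disable the scheduler
+  --no-email        Do not send the email report
+  --email-to, -e    Recipient email address (may be given multiple times)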
+```
+
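+## 🛠️ Dependencies
+
+### Core dependencies
+
+```
+requests>=2.28.0       # HTTP client
+beautifulsoup4>=4.11.0 # HTML parsing
+pyyaml>=6.0            # YAML configuration parsing
+apscheduler>=3.10.0    # job scheduling
+pandas>=1.5.0          # data processing
+openpyxl>=3.0.0        # Excel file read/write
+lxml>=4.9.0            # XML/HTML parser
+```
+
+### Standard-library modules
+
+```
+# Part of the Python standard library; no installation needed
+smtplib               # sending email
+email                 # building email messages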
+```
+
+## ⚙️ Advanced configuration
+
+### Proxy pool configuration
+
+```yaml
+proxy:
+  enabled: true
+  pool:
+    - "http://user:pass@proxy1.example.com:8080"
+    - "http://user:pass@proxy2.example.com:8080"
+  rotate: true         # rotate proxies automatically
+```
+
+### Anti-crawler configuration
+
+```yaml
+anti_crawler:
+  enabled: true
+  user_agents:         # User-Agent pool
+    - "Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."
+    - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
+  request_interval: 3  # interval between requests (seconds)
+  timeout: 30          # request timeout (seconds)
+  retry_times: 3       # number of retries
+  retry_delay: 5       # delay between retries (seconds)
+```
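
The retry and User-Agent settings above can be pictured with a short sketch. It assumes `retry_delay` is the base of an exponential backoff (5 s, 10 s, 20 s, ...), which is one reading of the "exponential backoff" feature and may not match the project's scraper. The `fetch` callable and the name `fetch_with_retry` are placeholders for illustration, and the sleep function is injectable so the example runs instantly.

```python
import itertools
import time


def fetch_with_retry(fetch, url, user_agents, retry_times=3, retry_delay=5,
                     sleep=time.sleep):
    """Call `fetch(url, headers)` until it succeeds or the retries run out.

    `fetch` is a caller-supplied function that raises an exception on failure.
    """
    agents = itertools.cycle(user_agents)  # rotate through the User-Agent pool
    for attempt in range(retry_times + 1):
        headers = {"User-Agent": next(agents)}
        try:
            return fetch(url, headers)
        except Exception:
            if attempt == retry_times:
                raise
            # Exponential backoff: retry_delay, then 2x, 4x, ...
            sleep(retry_delay * 2 ** attempt)


calls, waits = [], []


def flaky_fetch(url, headers):
    """Stand-in for a real HTTP call: fails twice, then succeeds."""
    calls.append(headers["User-Agent"])
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = fetch_with_retry(flaky_fetch, "https://example.com/", ["UA-1", "UA-2"],
                          sleep=waits.append)
print(result, calls, waits)
```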
+
+### Multiple notification channels
+
+```yaml
+notification:
+  enabled: true
+  email:
+    enabled: true
+    # ... email settings
+  dingtalk:
+    enabled: false
+    webhook: "https://oapi.dingtalk.com/robot/send?access_token=xxx"
+  webhook:
+    enabled: false
+    url: "https://your-webhook-url.com/notify"
+```
+
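+## 📝 Usage scenarios
+
+### Scenario 1: Daily automatic retrieval
+
+Retrieve the latest policies automatically at 09:00 every weekday:
+
+```bash
+# In config.yaml:
+scheduler:
+  enabled: true
+  time: "09:00"
+  days: [mon, tue, wed, thu, fri]
+
+# Then start the scheduler:
+python policy_retrieval.py schedule --enable
+```
+
+### Scenario 2: One-off retrieval
+
+Run a single retrieval without sending email:
+
+```bash
+python policy_retrieval.py run --no-email
+```
+
+### Scenario 3: Monitoring several ministries
+
+Monitor several ministry websites at once and send the report to several mailboxes:
+
+```bash
+# In config.yaml, configure several target websites:
+targets:
+  - name: "国家税务总局"              # State Taxation Administration
+    url: "https://www.chinatax.gov.cn/"
+    enabled: true
+  - name: "财政部"                   # Ministry of Finance
+    url: "https://www.mof.gov.cn/"
+    enabled: true
+  - name: "科技部"                   # Ministry of Science and Technology
+    url: "https://www.most.gov.cn/"
+    enabled: true
+
+# Then run and send the report to several recipients:
+python policy_retrieval.py run -e user1@example.com -e user2@example.com
+```
+
+### Scenario 4: Custom classification rules
+
+Define categories to suit your own needs:
+
+```yaml
+categories:
+  - name: "增值税政策"                # VAT policy
+    keywords: ["增值税", "进项税", "销项税"]
+    priority: 1
+  - name: "所得税政策"                # income tax policy
+    keywords: ["所得税", "企业所得税", "个人所得税"]
+    priority: 2
+  - name: "税收优惠"                 # tax incentives
+    keywords: ["优惠", "减免", "退税"]
+    priority: 3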
+```
+
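+## 🔐 Security recommendations
+
+1. **Email settings** - use an authorization code rather than the account password
+2. **Proxies** - use a reputable proxy provider
+3. **Request rate** - set a reasonable request interval to avoid putting load on the target websites
+4. **Log protection** - clean up log files regularly to avoid leaking sensitive information
+
+## ❓ FAQ
+
+### Q: How do I change how often retrieval runs?
+A: Edit `scheduler.time` and `scheduler.days` in `config.yaml`.
+
+### Q: Where are the downloaded files?
+A: By default under `./downloads/`, in one subdirectory per category.
+
+### Q: How do I view the run logs?
+A: The log file is at `./logs/policy_retrieval.log`.
+
+### Q: What if sending email fails?
+A: Check the SMTP settings, the mailbox authorization code, and the network connection, then look at the log for details.
+
+### Q: How do I add a new target website?
+A: Add a new site entry to the `targets` list in `config.yaml`.
+
+### Q: How do I stop the scheduler?
+A: Press Ctrl+C to stop the running scheduler, or disable it with `python policy_retrieval.py schedule --disable`.
+
+## 📄 License
+
+This project is for learning and research use only.
+
+## 🤝 Contributing
+
+Issues and pull requests that improve the project are welcome.
+
+## 📧 Contact
+
+For questions or suggestions, please get in touch by email.
+
+---
+
+**Last updated**: 2024-01
+**Version**: 1.0.0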