deep-research
Deep-research is a multi-source investigation system that combines arXiv paper retrieval, ToolUniverse scientific databases (UniProt, OpenTargets, PubMed, FAERS, and 1000+ tools), and web search to collect and cross-analyze data, then generate professional PDF or DOCX research reports. Use it when users request research, literature reviews, technical reports, comparative analysis of solutions, or questions involving specialized domains like biotechnology, proteins, genes, and drug targets that require professional database support and multi-dimensional analysis.
git clone --depth 1 https://github.com/AgentTeam-TaichuAI/ScienceClaw /tmp/deep-research && cp -r /tmp/deep-research/Skills/deep-research ~/.claude/skills/deep-research
SKILL.md
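# Deep Research: Multi-Source In-Depth Research System
An end-to-end deep research system: starting from the user's question, it selects a suitable combination of data sources (arXiv papers + ToolUniverse scientific tools + web search), collects data along multiple dimensions, interprets each source in depth, synthesizes the findings across sources around the user's question, and finally generates a professional research report.
## Workflow Overview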
```
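Phase 1: Question decomposition and data source planning
→ Phase 2: Multi-source data collection
    2A: arXiv paper search + filtering + PDF download
    2B: ToolUniverse scientific tool data collection
    2C: Web search for supplementary data
→ Phase 3: Data extraction and organization
→ Phase 4: In-depth interpretation of each source
→ Phase 5: Cross-analysis around the user's question and report writing
→ Phase 6: Generate the PDF/DOCX research report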
```
**Workspace path**: all files are written under the `/home/scienceclaw/sessionid/` directory.
---
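## Phase 1: Question Decomposition and Data Source Planning
After receiving the user's question, **do not execute immediately**. First analyze the question, break it down into several research dimensions, and plan which data sources are needed.
### 1.1 Question Decomposition
Expand the user's question along the following dimensions: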
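| Dimension | Description | Example |
|------|------|------|
| Core concepts | The subject terms of the question | BRCA1, data center cooling |
| Technical approaches | The different technical solutions | PARP inhibitors, liquid cooling |
| Application scenarios | The specific application setting | triple-negative breast cancer |
| Related fields | Closely related adjacent fields | DNA repair, homologous recombination |
| Optimization targets | The performance metrics of interest | survival rate, PUE |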
### 1.2 Data Source Decisions
**⚠️ Mandatory rule: for any research question, both ToolUniverse and literature search are required and must not be skipped.**
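| Data source | Status | Description | Typical scenarios |
|--------|------|------|----------|
| **Literature search** | **Required** | Every research question must search the academic literature. Choose sources by field: CS/AI/physics/mathematics → arXiv; biomedicine/chemistry → PubMed/PubTator/EuropePMC (via ToolUniverse); general academic → OpenAlex/Semantic Scholar (via ToolUniverse). **Multiple sources can be combined.** | All research questions |
| **ToolUniverse** | **Required** | Every research question must use ToolUniverse to obtain structured data from specialized databases. Even when the topic is not biomedical, search for applicable tools (astronomy, earth science, chemistry, statistics, and other fields are covered). | All research questions |
| **Web search** | Recommended | Supplies the latest news, industry reports, and non-academic data | Market size, recent developments, policies and regulations |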
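**Decision principles**:
- **Literature search + ToolUniverse are the two mandatory baselines for every research task**; web search is a recommended supplement
- Biomedical: ToolUniverse (structured data from UniProt/OpenTargets/PubMed/PubTator, etc.) + literature search (`PubTator_search_publications` or `EuropePMC_search` called through ToolUniverse, plus arXiv for cutting-edge preprints) + web search
- CS/AI/engineering: arXiv paper search + ToolUniverse (search for field-relevant tools such as HuggingFace/OpenML/DBLP) + web search
- Interdisciplinary/other fields: arXiv (where applicable) + ToolUniverse (search for field tools, e.g. SIMBAD/NASA for astronomy, USGS for earth science, COD for chemistry) + general literature search (OpenAlex/Semantic Scholar via ToolUniverse) + web search
- **Never** start writing the report after using only a single data source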
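The decision principles above can be sketched as a small lookup. This is only an illustration: the domain labels (`cs_ai`, `biomedical`, `other`) and the function names `plan_sources` and `is_ready_for_report` are hypothetical and not part of the skill.

```python
from typing import Dict, List, Set

# Hypothetical domain labels; the skill describes these fields in prose only.
LITERATURE_BY_DOMAIN: Dict[str, List[str]] = {
    "cs_ai": ["arXiv"],
    "biomedical": ["PubMed", "PubTator", "EuropePMC", "arXiv"],
    "other": ["OpenAlex", "Semantic Scholar", "arXiv"],
}


def plan_sources(domain: str) -> dict:
    """Map a domain label to the source combination described above."""
    literature = LITERATURE_BY_DOMAIN.get(domain, LITERATURE_BY_DOMAIN["other"])
    return {
        "literature": literature,  # required for every research task
        "tooluniverse": True,      # required for every research task
        "web_search": True,        # recommended supplement
    }


def is_ready_for_report(sources_used: Set[str]) -> bool:
    """True only when both required source types have actually been used."""
    return {"literature", "tooluniverse"} <= sources_used
```

A call such as `is_ready_for_report({"literature"})` returns `False`, which mirrors the rule that a single data source is never enough to start the report.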
### 1.3 Generate the Research Plan
Write the analysis above to a plan file:
```python
import json
plan = {
    "question": "the user's original question",
    "dimensions": ["dimension 1", "dimension 2", "..."],
    "data_sources": {
        "literature": {
            "enabled": True,  # ⚠️ required; must not be set to False
            "arxiv": {
                "enabled": True,  # must be enabled for CS/AI/physics/mathematics/engineering
                "queries": [
                    {"arxiv_query": "abs:%22keyword%22+AND+abs:topic", "label": "description"},
                    # ... 8-12 queries
                ],
                "relevance_phrases": ["phrase1", "phrase2"],
                "target_total": 50,
                "top_k": 15,
            },
            "pubmed_via_tooluniverse": {
                "enabled": True,  # must be enabled for biomedicine/health
                "tools": ["PubTator_search_publications", "EuropePMC_search"],
                "queries": ["search term 1", "search term 2"],
            },
            "general_academic": {
                "enabled": True,  # general academic literature search
                "tools": ["OpenAlex_search_works", "SemanticScholar_search_papers"],
                "queries": ["search term 1", "search term 2"],
            },
        },
        "tooluniverse": {
            "enabled": True,  # ⚠️ required; must not be set to False
            "tasks": [
                {"tool_query": "protein function", "purpose": "retrieve protein function information", "example_tool": "UniProt_get_function_by_accession"},
                {"tool_query": "disease targets", "purpose": "retrieve disease targets", "example_tool": "OpenTargets_get_associated_targets_by_disease_efoId"},
                # ... list 3-8 ToolUniverse tasks as needed
            ],
        },
        "web_search": {
            "enabled": True,  # recommended
            "queries": ["search term 1", "search term 2"],
        },
    },
    "output_dir": "/home/scienceclaw/sessionid",
}
# Write the plan as UTF-8 JSON so non-ASCII text stays readable.
with open("/home/scienceclaw/sessionid/research_plan.json", "w", encoding="utf-8") as f:
    json.dump(plan, f, ensure_ascii=False, indent=2)
```
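Before moving on, it may help to check a plan against the mandatory rules. The following is a minimal sketch that assumes the plan has the dictionary layout shown in the research plan; `validate_plan` is a hypothetical helper, not something the skill ships.

```python
def validate_plan(plan: dict) -> list:
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    sources = plan.get("data_sources", {})

    # Literature search and ToolUniverse are both mandatory.
    for key in ("literature", "tooluniverse"):
        if not sources.get(key, {}).get("enabled", False):
            problems.append(f"{key} must be enabled")

    # At least one literature channel (arxiv, pubmed_via_tooluniverse, ...) must be on.
    literature = sources.get("literature", {})
    channels = [
        name for name, cfg in literature.items()
        if isinstance(cfg, dict) and cfg.get("enabled")
    ]
    if not channels:
        problems.append("literature needs at least one enabled channel")

    # The plan template asks for 8-12 arXiv queries when arXiv is enabled.
    arxiv = literature.get("arxiv", {})
    if arxiv.get("enabled"):
        count = len(arxiv.get("queries", []))
        if not 8 <= count <= 12:
            problems.append(f"arxiv has {count} queries, expected 8-12")

    return problems
```

Running `validate_plan(plan)` on a plan with `"tooluniverse": {"enabled": False}` would report `tooluniverse must be enabled`.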
---
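## Phase 2: Multi-Source Data Collection
Following the Phase 1 plan, collect from each data source in parallel or in sequence. Save all collected results to the `research_data/` directory.
**⚠️ Mandatory checkpoint: Phase 2 must include both "literature search" and "ToolUniverse data collection". If you find that you have done only one of them and are about to enter Phase 3, stop and complete the other.**
### 2A: arXiv Paper Search (if enabled)
#### Constructing arXiv API Queries
Each query uses the arXiv API search syntax:
**Basic rules**:
- `abs:` searches the abstract field (most common)
- `ti:` searches the title field
- `+AND+` joins multiple conditions (intersection)
- `+OR+` joins multiple conditions (union)
- `%22` wraps an exact phrase (URL-encoded double quote)
- Join the words of a phrase with `+`
**Query construction examples**: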
```
abs:%22data+center%22+AND+abs:cooling
abs:%22liquid+cooling%22+AND+abs:%22data+center%22
abs:%22immersion+cooling%22
ti:%22exact+phrase%22+AND+abs:keyword
cat:cs.AI+AND+abs:%22large+language+model%22
```
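The query rules can be applied programmatically. The sketch below assembles query strings in the same style as the examples; `field_term`, `and_query`, and `or_query` are hypothetical helper names, and only the `abs`, `ti`, and `cat` prefixes that appear in this document are accepted.

```python
ALLOWED_FIELDS = {"abs", "ti", "cat"}  # only the prefixes used in this document


def field_term(field: str, text: str) -> str:
    """Build one term such as abs:%22data+center%22 or abs:cooling."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"unsupported field prefix: {field}")
    words = text.split()
    joined = "+".join(words)
    # Multi-word text is treated as an exact phrase and wrapped in %22.
    if len(words) > 1:
        joined = f"%22{joined}%22"
    return f"{field}:{joined}"


def and_query(*terms: str) -> str:
    """Intersection of several terms."""
    return "+AND+".join(terms)


def or_query(*terms: str) -> str:
    """Union of several terms."""
    return "+OR+".join(terms)


query = and_query(field_term("abs", "data center"), field_term("abs", "cooling"))
print(query)  # abs:%22data+center%22+AND+abs:cooling
```

Keeping phrase quoting in one place avoids hand-typed `%22` mismatches when a plan needs 8-12 queries.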
#### Generate search_config.json and Run
```python
import json
config = {
    "question": "the user's original question",
    "queries": [
        {"arxiv_query": "abs:%22keyword%22+AND+abs:topic", "label": "description"},
        # ... 8-12 queries
    ],
    "target_total": 50,
    "top_k": 15,
    "output_dir": "/home/scienceclaw/sessionid/research_papers",
    "relevance_phrases": ["phrase1", "phrase2"],
    "min_score": 4,
}
# Write the config as UTF-8 JSON for arxiv_paper_finder.py to read.
with open("/home/scienceclaw/sessionid/search_config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)
```
```bash
python3 /skills/deep-research/scripts/arxiv_paper_finder.py /home/scienceclaw/sessionid/search_config.json
```
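**Script workflow**: multi-query search → deduplication → relevance scoring → select top-K → download PDFs
**Scoring rules**: +5 points per phrase matched in the title, +2 points per phrase matched in the abstract, +3 points for each additional query that returns the same paper, +3 points for 2025 or later, +2 points for 2024, +1 point for 2023.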
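To make the scoring rules concrete, here is a minimal sketch of one way to compute such a score. It is not the implementation in `arxiv_paper_finder.py`; the function name, the case-insensitive substring matching, and the choice to count a phrase in both title and abstract are assumptions.

```python
def relevance_score(title, abstract, year, query_hits, phrases):
    """Score one paper; query_hits is how many queries returned it."""
    title_l = title.lower()
    abstract_l = abstract.lower()
    score = 0
    for phrase in phrases:
        needle = phrase.lower()
        if needle in title_l:
            score += 5  # title match
        if needle in abstract_l:
            score += 2  # abstract match
    # Each query beyond the first that returned the paper adds 3 points.
    score += 3 * max(query_hits - 1, 0)
    # Recency bonus.
    if year >= 2025:
        score += 3
    elif year == 2024:
        score += 2
    elif year == 2023:
        score += 1
    return score


example = relevance_score(
    title="Liquid cooling for data centers",
    abstract="We study liquid cooling in a data center.",
    year=2024,
    query_hits=2,
    phrases=["liquid cooling", "data center"],
)
print(example)  # 19 = 5 + 5 (title) + 2 + 2 (abstract) + 3 (second query) + 2 (2024)
```

With `min_score` set to 4 as in the config, this example paper would pass the threshold comfortably.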
**Post-run checks**: confirm that `all_candidates.json` holds 40-60 papers, review the quality of `selected_papers.json`, and verify the integrity of the PDF files.
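Part of the post-run check can be automated. The sketch below assumes that both JSON files are top-level lists and that the PDFs sit directly in the output directory; neither layout is confirmed by this document, and `check_outputs` is a hypothetical helper. PDF integrity is approximated by checking the `%PDF-` file signature.

```python
import json
from pathlib import Path


def check_outputs(output_dir, min_candidates=40, max_candidates=60):
    """Summarize the search outputs found in output_dir."""
    out = Path(output_dir)
    report = {
        "candidate_count": None,
        "count_in_range": False,
        "selected_count": None,
        "bad_pdfs": [],
    }

    candidates_file = out / "all_candidates.json"
    if candidates_file.exists():
        count = len(json.loads(candidates_file.read_text(encoding="utf-8")))
        report["candidate_count"] = count
        report["count_in_range"] = min_candidates <= count <= max_candidates

    selected_file = out / "selected_papers.json"
    if selected_file.exists():
        selected = json.loads(selected_file.read_text(encoding="utf-8"))
        report["selected_count"] = len(selected)

    # A valid PDF starts with the bytes %PDF-; HTML error pages do not.
    for pdf in sorted(out.glob("*.pdf")):
        with pdf.open("rb") as handle:
            if handle.read(5) != b"%PDF-":
                report["bad_pdfs"].append(pdf.name)
    return report
```

The quality of `selected_papers.json` still needs a manual read; the helper only reports how many entries it contains.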
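### 2B: ToolUniverse Data Collection (⚠️ required)
Collect specialized data with the three-step flow `tooluniverse_search` → `tooluniverse_info` → `tooluniverse_run`.
**This step has two subtasks, both of which must be completed:**
1. **Specialized database collection**: obtain field-specific structured data
2. **Academic literature search**: retrieve relevant papers with the literature search tools in ToolUniverse (this step **must be executed**; it complements the arXiv search in 2A)
**Workflow**:
1. **Search for tools**: for each task dimension, use `tooluniverse_search` to find a suitable tool
2. **Check the spec**: use `tooluniverse_info` to confirm the required parameters
3. **Run the collection**: use `tooluniverse_run` to fetch the data
4. **Save the results**: save the result of every call to the `research_data/` directory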
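The save step can be kept consistent with a small helper. This sketch covers only step 4; the `tooluniverse_*` calls are agent tools and are not reproduced here. `save_result` and its file-naming scheme are hypothetical, and `result` stands for whatever JSON-serializable value a tool call returned.

```python
import json
import re
from pathlib import Path


def save_result(data_dir, tool_name, arguments, result):
    """Write one tool result to data_dir and return the file path."""
    values = "_".join(str(value) for value in arguments.values())
    # Keep file names portable: anything unusual becomes an underscore.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", f"{tool_name}_{values}").strip("_")
    directory = Path(data_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{slug}.json"
    # Store the call alongside the result so Phase 3 can trace provenance.
    record = {"tool_name": tool_name, "arguments": arguments, "result": result}
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

For instance, `save_result("research_data", "UniProt_get_function_by_accession", {"accession": "P38398"}, result)` would write `research_data/UniProt_get_function_by_accession_P38398.json`.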
```
# ── Subtask 1: specialized database collection ──
# Example: protein function research
tooluniverse_search(query="protein function analysis", limit=5)
tooluniverse_info(tool_name="UniProt_get_function_by_accession")
tooluniverse_run(tool_name="UniProt_get_function_by_accession", arguments='{"accession": "P38398"}')
→ write_file("research_data/uniprot_P38398_function.json", result)
# Example: disease targets
tooluniverse_search(query="disease dru
```