code-science
code-science provides best practices for reproducible research computing, including project structure templates, environment management, random seed control, data versioning with DVC, configuration separation, experiment logging, and parallel computing patterns for CPU and I/O-bound tasks. Use this skill when organizing research code, setting up computational workflows, implementing reproducibility standards, configuring Jupyter notebooks, managing research data, or parallelizing scientific computations across multiple processors.
git clone --depth 1 https://github.com/beita6969/ScienceClaw /tmp/code-science && cp -r /tmp/code-science/skills/code-science ~/.claude/skills/code-scienceSKILL.md
# Scientific Programming
Best practices for research software and reproducible computation.
## Project Structure
```
project/
├── README.md # Project overview, how to reproduce
├── LICENSE # MIT, Apache 2.0, or GPL
├── requirements.txt # or environment.yml (conda)
├── setup.py / pyproject.toml
├── data/
│ ├── raw/ # Never modify raw data
│ ├── processed/ # Cleaned/transformed data
│ └── external/ # Third-party data
├── src/ or scripts/
│ ├── data_processing.py
│ ├── analysis.py
│ ├── models.py
│ └── visualization.py
├── notebooks/ # Exploratory analysis
│ ├── 01_eda.ipynb
│ ├── 02_modeling.ipynb
│ └── 03_figures.ipynb
├── results/
│ ├── figures/
│ └── tables/
├── tests/
└── docs/
```
## Reproducibility Checklist
1. **Environment**: Pin all dependencies with versions
```bash
pip freeze > requirements.txt
# or conda
conda env export > environment.yml
```
2. **Random seeds**: Set and document all random seeds
```python
import numpy as np
import random
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
# torch.manual_seed(SEED)
# tf.random.set_seed(SEED)
```
3. **Data versioning**: Use DVC or git-lfs for large data
```bash
dvc init
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc
```
4. **Configuration**: Separate config from code
```python
# config.yaml
# experiment:
# learning_rate: 0.001
# batch_size: 32
# epochs: 100
import yaml
with open('config.yaml') as f:
config = yaml.safe_load(f)
```
5. **Logging**: Record all experiments
```python
import logging
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(levelname)s: %(message)s',
filename='experiment.log')
```
## Parallel Computing
```python
# Multiprocessing (CPU-bound)
from multiprocessing import Pool
import numpy as np
def process_chunk(data):
return heavy_computation(data)
with Pool(processes=8) as pool:
results = pool.map(process_chunk, data_chunks)
# Concurrent futures (simpler API)
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
with ProcessPoolExecutor(max_workers=8) as executor:
results = list(executor.map(process_func, items))
# For I/O-bound tasks (API calls, file reading)
with ThreadPoolExecutor(max_workers=20) as executor:
results = list(executor.map(fetch_data, urls))
```
## Performance Optimization
```python
# Profiling
import cProfile
cProfile.run('my_function()', sort='cumulative')
# Line profiling
# pip install line_profiler
# @profile decorator, then: kernprof -l -v script.py
# NumPy vectorization (avoid loops)
# Bad:
result = [x**2 + 2*x + 1 for x in data]
# Good:
result = data**2 + 2*data + 1
# Memory profiling
# pip install memory_profiler
# @profile decorator, then: python -m memory_profiler script.py
```
## Data Management
### FAIR Principles
- **Findable**: Persistent identifiers (DOI), rich metadata
- **Accessible**: Open protocols, authentication when needed
- **Interoperable**: Standard formats (CSV, JSON, HDF5, NetCDF)
- **Reusable**: Clear license, provenance, community standards
### File Formats for Science
| Format | Best For | Size | Speed |
|--------|----------|------|-------|
| CSV | Small tabular, universal | Large | Slow |
| Parquet | Large tabular, columnar | Small | Fast |
| HDF5 | Multidimensional arrays | Small | Fast |
| NetCDF | Climate/geospatial | Small | Fast |
| FITS | Astronomy | Medium | Fast |
| Feather | DataFrame interchange | Small | Very fast |
```python
# Parquet (recommended for large datasets)
df.to_parquet('data.parquet', compression='snappy')
df = pd.read_parquet('data.parquet')
# HDF5 (for arrays)
import h5py
with h5py.File('data.h5', 'w') as f:
f.create_dataset('experiment1', data=array)
```
## Testing Scientific Code
```python
import numpy as np
import pytest
def test_conservation_law():
"""Physical quantities should be conserved"""
initial_energy = compute_energy(initial_state)
final_energy = compute_energy(simulate(initial_state))
np.testing.assert_allclose(initial_energy, final_energy, rtol=1e-6)
def test_known_solution():
"""Compare against analytical solution"""
numerical = solve_numerically(params)
analytical = analytical_solution(params)
np.testing.assert_allclose(numerical, analytical, atol=1e-4)
def test_symmetry():
"""Result should be symmetric under transformation"""
result1 = compute(data)
result2 = compute(transform(data))
np.testing.assert_array_equal(result1, result2)
```
## Tips
- Raw data is sacred — never modify it, only create processed copies
- Use version control (git) from day one
- Write README before writing code
- Automate the full pipeline (Makefile or Snakemake)
- Document assumptions and decisions in code comments
- Use type hints for clarity in scientific code
- Publish code alongside papers (GitHub + Zenodo for DOI)Route plain-language requests for Pi, Claude Code, Codex, OpenCode, Gemini CLI, or ACP harness work into either OpenClaw ACP runtime sessions or direct acpx-driven sessions ("telephone game" flow). For coding-agent thread requests, read this skill first, then use only `sessions_spawn` for thread creation.
Use the diffs tool to produce real, shareable diffs (viewer URL, file artifact, or both) instead of manual edit summaries.
|
|
|
|
OpenProse VM skill pack. Activate on any `prose` command, .prose files, or OpenProse mentions; orchestrates multi-agent workflows.