ClaudeWave
Skill · 125 repo stars · updated today

detecting-tips-zones

Text-prompted image zone detection using TIPSv2 B/14 on CPU. Produces `focus_targets` / `focus_edges` bbox lists from natural-language labels, ready to feed into `svg-portrait-mode`. Use when you want automatic foreground/background separation from prompts like "dog face" + "wooden floor" instead of hand-annotating bboxes.

Install in Claude Code
git clone --depth 1 https://github.com/oaustegard/claude-skills /tmp/detecting-tips-zones && cp -r /tmp/detecting-tips-zones/detecting-tips-zones ~/.claude/skills/detecting-tips-zones
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Detecting TIPS Zones

Zero-shot zone detection: text prompts → patch-grid cosine heatmaps → bboxes.
Companion to `svg-portrait-mode` — replaces manual `focus_targets` / `focus_edges`
annotation with a TIPSv2 B/14 forward pass.

## Quick Start

```python
from tips_zones import detect_zones
from portrait_mode import portrait_mode

focus_targets, focus_edges = detect_zones(
    "photo.jpg",
    targets=["dog face"],
    edges=["dog paws", "dog ears", "dog body"],
    distractors=["wooden floor", "carpet rug", "shoes", "wall"],
    ckpt_dir="/path/to/tips/checkpoints",
    tips_root="/path/to/tips",
)

svg, stats = portrait_mode(
    "photo.jpg",
    focus_targets=focus_targets,
    focus_edges=focus_edges,
    style_transforms={"background": "desaturate:0.7"},
)
```

Amortise model load across multiple images:

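```python
from tips_zones import load_models, detect_zones

ckpt_dir = "/path/to/tips/checkpoints"
tips_root = "/path/to/tips"

# Load the encoders and tokenizer once, then reuse them for every image.
models = load_models(ckpt_dir, tips_root, device="cpu")
for img in images:  # images: iterable of paths or PIL Images
    ft, fe = detect_zones(img, targets=[...], edges=[...], distractors=[...],
                          ckpt_dir=ckpt_dir, tips_root=tips_root, models=models)
    ...
```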

## How It Works

```
image → B/14 vision encoder (MaskCLIP values trick on last block)
     → (32×32 patch grid at 448, or 64×64 at 896) × 768-d patch features
text labels → prompt ensemble (9 TCL templates) → B/14 text encoder
     → per-label mean feature → L2-normalise
per-label heatmap = cos(patch feature, label feature)  # raw, no softmax
bbox = top-k% patches → largest connected component → scaled + padded to image coords
```
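
The last two lines of the pipeline can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the skill's actual code: the function names are hypothetical, 4-connectivity is assumed, and the connected-component search is hand-rolled so the block needs only NumPy (a library routine such as `scipy.ndimage.label` would be the more usual choice).

```python
from collections import deque

import numpy as np


def cosine_heatmap(patch_feats, label_feat):
    """Cosine similarity between every patch feature and one label feature.

    patch_feats: (H, W, D) array of patch features.
    label_feat:  (D,) array, e.g. the mean over prompt templates.
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = label_feat / np.linalg.norm(label_feat)
    return p @ t  # (H, W) raw cosines, no softmax


def largest_component(mask):
    """Boolean mask of the largest 4-connected True region (BFS flood fill)."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    best = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        seen[sy, sx] = True
        queue, comp = deque([(sy, sx)]), []
        while queue:
            y, x = queue.popleft()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(comp) > len(best):
            best = comp
    out = np.zeros_like(mask, dtype=bool)
    for y, x in best:
        out[y, x] = True
    return out


def heatmap_to_bbox(heat, image_size, top_frac=0.04, pad_frac=0.02):
    """Top-k% patches -> largest component -> padded bbox in pixel coords."""
    h, w = heat.shape
    img_w, img_h = image_size
    k = max(1, int(round(top_frac * heat.size)))  # rounding rule is assumed
    thresh = np.sort(heat, axis=None)[-k]
    ys, xs = np.nonzero(largest_component(heat >= thresh))
    # Patch indices -> pixel coords; each patch spans img/grid pixels.
    x1, x2 = xs.min() * img_w / w, (xs.max() + 1) * img_w / w
    y1, y2 = ys.min() * img_h / h, (ys.max() + 1) * img_h / h
    px, py = pad_frac * img_w, pad_frac * img_h
    return (int(max(0, x1 - px)), int(max(0, y1 - py)),
            int(min(img_w, x2 + px)), int(min(img_h, y2 + py)))
```

Keeping only the largest component is what discards stray high-scoring patches elsewhere in the frame, so the bbox hugs one contiguous region rather than spanning every outlier.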

### Why no softmax over labels

Naïve softmax assumes labels are mutually exclusive. `dog face`, `dog ears`,
and `dog body` are all true of the same pixels, so softmax collapses to
near-uniform and every heatmap covers the whole subject. Raw cosines +
per-label top-k threshold works much better — at the cost of requiring
**distractor labels** to anchor the relative scale. Always pass some
distractors (floor, wall, props — whatever is in the scene but not the
subject).
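
A toy illustration of the collapse, using made-up cosine values rather than real TIPSv2 output: one patch on the dog and one on the floor, scored against three overlapping labels. The helper name `label_softmax` is hypothetical.

```python
import numpy as np


def label_softmax(cos):
    """Softmax across the label axis (last axis) of a cosine matrix."""
    z = cos - cos.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


labels = ["dog face", "dog ears", "dog body"]

# Hypothetical cosines: rows are patches, columns are labels.
cos = np.array([
    [0.31, 0.30, 0.29],  # patch on the dog
    [0.12, 0.11, 0.13],  # patch on the floor
])

probs = label_softmax(cos)

# After softmax both rows sit near 1/3 per label, so the dog patch and the
# floor patch look alike. The raw cosines still separate them clearly.
for name, raw, p in zip(["dog patch", "floor patch"], cos, probs):
    print(name, "raw:", raw.round(2), "softmax:", p.round(3))
```

Softmax normalises each patch independently, which throws away the absolute similarity level that distinguishes subject from scene; the raw cosines keep it.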

## Parameters

```python
detect_zones(
    image,                     # path | PIL Image
    targets,                   # ["main subject label", ...]
    edges=(),                  # ["sub-region label", ...]
    distractors=(),            # scene elements to anchor against — pass these!
    *,
    ckpt_dir,                  # has tips_v2_oss_b14_{vision,text}.npz + tokenizer.model
    tips_root,                 # local clone of google-deepmind/tips
    input_size=448,            # 448 → 32×32 grid, 896 → 64×64 (~12× slower on CPU)
    target_top_frac=0.04,      # fraction of patches kept per target label
    edge_top_frac=0.06,        # fraction of patches kept per edge label
    pad_frac=0.02,             # bbox padding as fraction of image dim
    device="cpu",
    models=None,               # optional pre-loaded (img_model, text_model, tokenizer)
)
```

Returns `(focus_targets, focus_edges)` — both lists of `{'bbox': (x1,y1,x2,y2), 'label': str}`.
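
Because both return values are plain lists of dicts, they are easy to inspect or filter before handing them to `portrait_mode`. A small sketch of such a sanity check follows; the helper names and the `min_frac` threshold are illustrative assumptions, not part of the skill.

```python
def bbox_area_frac(zone, image_size):
    """Fraction of the image area covered by one zone's bbox."""
    x1, y1, x2, y2 = zone["bbox"]
    img_w, img_h = image_size
    return max(0, x2 - x1) * max(0, y2 - y1) / (img_w * img_h)


def drop_tiny_zones(zones, image_size, min_frac=0.005):
    """Keep zones whose bbox covers at least min_frac of the image.

    min_frac is an illustrative threshold, not a value from the skill.
    """
    return [z for z in zones if bbox_area_frac(z, image_size) >= min_frac]


# Example with hand-written zones in the documented return shape.
focus_targets = [
    {"bbox": (0, 0, 224, 224), "label": "dog face"},
    {"bbox": (10, 10, 12, 12), "label": "speck"},
]
kept = drop_tiny_zones(focus_targets, image_size=(448, 448))
print([z["label"] for z in kept])
```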

## Performance (CPU, 16 cores)

| Step | Time |
|------|------|
| `load_models` (warm) | ~3.5s |
| `load_models` (cold, checkpoints read over a 9p mount) | ~50s |
| Text encoding (9 templates × N labels) | ~0.1s |
| Vision forward @ 448 | 0.3–0.6s |
| Vision forward @ 896 | ~6–7s |

Inference is negligible next to `portrait_mode()` on large images.

## Capability Notes

**Subject / background split: strong.** B/14 separates subject from scene
reliably — typical split ~30/70 subject:background on single-subject photos.

**Sub-part discrimination: weak at B/14 + 448.** "dog face" vs "dog paws" vs
"dog ears" tend to fire on the same region. The 32×32 patch grid is not the
bottleneck (64×64 at 896 barely helps); B/14's patch features just don't
encode fine sub-part semantics strongly. If you need per-part zones:

1. Sharpen prompts — "close-up of dog's furry face" > "dog face" (try first)
2. L/14 or SO/14 model (richer features, larger download)
3. Sliding-window inference (tile crops, stitch heatmaps)

For coarse target/edge zoning (the `portrait_mode` use case), B/14 at 448 is
enough.
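
Option 3 is not implemented by the skill. As a rough sketch of what it could involve, the helpers below compute overlapping crop boxes and average per-tile heatmaps back into one image-sized map. The function names, the 25% overlap, and the assumption that each tile heatmap has already been upsampled to pixel resolution are all illustrative.

```python
import numpy as np


def tile_boxes(img_w, img_h, tile=448, overlap=0.25):
    """Return (x1, y1, x2, y2) crop boxes that cover the image with overlap."""
    step = max(1, int(tile * (1 - overlap)))

    def starts(length):
        if length <= tile:
            return [0]
        s = list(range(0, length - tile, step))
        s.append(length - tile)  # last tile sits flush with the far edge
        return s

    return [(x, y, min(x + tile, img_w), min(y + tile, img_h))
            for y in starts(img_h) for x in starts(img_w)]


def stitch_heatmaps(tiles, img_w, img_h):
    """Average per-tile heatmaps where tiles overlap.

    tiles: iterable of ((x1, y1, x2, y2), heat), where heat has shape
    (y2 - y1, x2 - x1), i.e. already upsampled from the patch grid.
    """
    acc = np.zeros((img_h, img_w), dtype=np.float64)
    cnt = np.zeros((img_h, img_w), dtype=np.float64)
    for (x1, y1, x2, y2), heat in tiles:
        acc[y1:y2, x1:x2] += heat
        cnt[y1:y2, x1:x2] += 1
    return acc / np.maximum(cnt, 1)  # avoid division by zero in uncovered areas
```

Each tile costs one vision forward pass, so the tile count multiplies the per-image inference time accordingly.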

## Requirements

Python deps:

```bash
pip install torch torchvision tensorflow tensorflow-text scipy pillow numpy --break-system-packages -q
```

Upstream TIPS repo (for the `tips.pytorch` image/text encoder modules):

```bash
git clone https://github.com/google-deepmind/tips /path/to/tips
```

B/14 checkpoints (~500MB total) go in a directory passed as `ckpt_dir`:

- `tips_v2_oss_b14_vision.npz`
- `tips_v2_oss_b14_text.npz`
- `tokenizer.model`

Download links are in the TIPS repo README.
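
A quick pre-flight check can catch a missing file before the slow model load starts. This is a minimal sketch using only the file names listed above; `missing_checkpoint_files` is a hypothetical helper, not part of `tips_zones`.

```python
from pathlib import Path

REQUIRED_FILES = (
    "tips_v2_oss_b14_vision.npz",
    "tips_v2_oss_b14_text.npz",
    "tokenizer.model",
)


def missing_checkpoint_files(ckpt_dir):
    """Return the required file names that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the directory is ready to pass as `ckpt_dir`.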

## Prompt Engineering Tips

- **Always include distractors.** Without them, top-k thresholding has no
  relative scale. 3–7 distractors covering scene elements (floor, wall,
  background objects) is the sweet spot.
- **Use concrete nouns over abstract ones.** "carpet rug" > "textured floor".
- **Tune `target_top_frac`.** If a target bbox is too small, raise it
  (0.04 → 0.08). If the bbox is too big or bleeds into the scene, lower it.
- **Pad modestly.** `pad_frac=0.02` works for most photos; raise to 0.05 for
  subjects near frame edges.
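
To get a feel for what these fractions mean in concrete terms, the arithmetic can be sketched as below. The helper names are hypothetical, and the rounding rule is an assumption; the skill may round differently.

```python
def patches_kept(input_size, top_frac, patch=14):
    """Approximate number of patches a top_frac threshold keeps."""
    grid = input_size // patch  # 448 -> 32, 896 -> 64
    return max(1, round(top_frac * grid * grid))


def pad_pixels(image_size, pad_frac):
    """Padding in pixels added on each side of a bbox, as (x, y)."""
    img_w, img_h = image_size
    return round(pad_frac * img_w), round(pad_frac * img_h)


print(patches_kept(448, 0.04))          # patches kept at the default
print(patches_kept(448, 0.08))          # after doubling target_top_frac
print(pad_pixels((4000, 3000), 0.02))   # padding on a 4000x3000 photo
```

At 448 the grid has 1024 patches, so the default `target_top_frac=0.04` keeps roughly 41 of them and 0.08 keeps roughly 82.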

## EXIF Caveat

`portrait_mode` (via OpenCV) honours EXIF rotation. PIL (this skill's
preprocessing) does not. For correctly-oriented source images they agree; for
EXIF-rotated phone photos the detected bboxes will be in the *raw pixel*
orientation. Either:

- Re-save the source with the EXIF rotation baked in: `ImageOps.exif_transpose(Image.open(p)).save(p)`
- Or open the image yourself, call `ImageOps.exif_transpose(pil)`, and pass the result to `detect_zones`.