semantic-search-cwicr
Semantic search in DDC CWICR construction database using vector embeddings. Find similar work items and resources for cost estimation.
git clone --depth 1 https://github.com/datadrivenconstruction/DDC_Skills_for_AI_Agents_in_Construction /tmp/semantic-search-cwicr && cp -r /tmp/semantic-search-cwicr/1_DDC_Toolkit/CWICR-Database/semantic-search-cwicr ~/.claude/skills/semantic-search-cwicrSKILL.md
# Semantic Search in DDC CWICR Database
## Business Case
### Problem Statement
Construction cost estimation requires finding relevant work items from large databases. Traditional keyword search fails when:
- Users describe work in natural language
- Terminology varies across regions and languages
- Similar work items have different naming conventions
### Solution
DDC CWICR database provides pre-computed embeddings (OpenAI text-embedding-3-large, 3072 dimensions) enabling semantic similarity search across 55,719 work items in 9 languages.
### Business Value
- **90% faster** work item lookup compared to manual search
- **Multi-language** support: Arabic, Chinese, German, English, Spanish, French, Hindi, Portuguese, Russian
- **Higher accuracy** by finding semantically similar items, not just keyword matches
## Technical Implementation
### Prerequisites
```bash
pip install qdrant-client openai pandas
```
### Database Setup
```bash
# Download Qdrant snapshot
wget https://github.com/datadrivenconstruction/OpenConstructionEstimate-DDC-CWICR/releases/download/v0.1.0/qdrant_snapshot_en.tar.gz
# Start Qdrant with Docker
docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
```
### Python Implementation
```python
import pandas as pd
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
import openai
class CWICRSemanticSearch:
def __init__(self, qdrant_host: str = "localhost", port: int = 6333):
self.client = QdrantClient(host=qdrant_host, port=port)
self.collection_name = "ddc_cwicr_en"
self.embedding_model = "text-embedding-3-large"
self.embedding_dim = 3072
def get_embedding(self, text: str) -> list:
"""Generate embedding for search query."""
response = openai.embeddings.create(
model=self.embedding_model,
input=text
)
return response.data[0].embedding
def search_work_items(self, query: str, limit: int = 10,
min_score: float = 0.7) -> pd.DataFrame:
"""Search for similar work items."""
query_vector = self.get_embedding(query)
results = self.client.search(
collection_name=self.collection_name,
query_vector=query_vector,
limit=limit,
score_threshold=min_score
)
items = []
for result in results:
item = result.payload
item['similarity_score'] = result.score
items.append(item)
return pd.DataFrame(items)
def search_by_category(self, query: str, category: str,
limit: int = 10) -> pd.DataFrame:
"""Search within specific category."""
query_vector = self.get_embedding(query)
results = self.client.search(
collection_name=self.collection_name,
query_vector=query_vector,
query_filter={
"must": [{"key": "category", "match": {"value": category}}]
},
limit=limit
)
return pd.DataFrame([{**r.payload, 'score': r.score} for r in results])
def estimate_cost(self, work_items: pd.DataFrame,
quantities: dict) -> dict:
"""Calculate cost from matched work items."""
total_cost = 0
breakdown = []
for _, item in work_items.iterrows():
if item['work_item_code'] in quantities:
qty = quantities[item['work_item_code']]
cost = qty * item.get('unit_price', 0)
total_cost += cost
breakdown.append({
'item': item['description'],
'quantity': qty,
'unit_price': item.get('unit_price', 0),
'total': cost
})
return {
'total_cost': total_cost,
'breakdown': breakdown,
'currency': 'Regional default'
}
```
## Usage Examples
### Basic Search
```python
search = CWICRSemanticSearch()
# Natural language query
results = search.search_work_items("brick masonry wall construction")
print(results[['description', 'unit', 'unit_price', 'similarity_score']])
```
### Cost Estimation
```python
# Find work items for foundation work
foundation_items = search.search_work_items(
"reinforced concrete foundation excavation and pouring",
limit=20
)
# Estimate with quantities
quantities = {
'CONC-001': 150, # cubic meters
'EXCV-002': 200, # cubic meters
}
estimate = search.estimate_cost(foundation_items, quantities)
print(f"Estimated Cost: ${estimate['total_cost']:,.2f}")
```
## Database Schema
| Field | Type | Description |
|-------|------|-------------|
| work_item_code | string | Unique identifier |
| description | string | Work item description |
| unit | string | Measurement unit |
| labor_norm | float | Labor hours per unit |
| material_cost | float | Material cost per unit |
| equipment_cost | float | Equipment cost per unit |
| unit_price | float | Total price per unit |
| category | string | Work category |
| embedding | vector[3072] | Pre-computed embedding |
## Best Practices
1. **Use specific queries** - "reinforced concrete slab 200mm" beats "concrete"
2. **Filter by category** - Narrow results to relevant work types
3. **Check similarity scores** - Scores below 0.7 may need manual verification
4. **Combine with QTO** - Use BIM quantities for automated estimation
## Resources
- **GitHub**: [OpenConstructionEstimate-DDC-CWICR](https://github.com/datadrivenconstruction/OpenConstructionEstimate-DDC-CWICR)
- **Releases**: [v0.1.0 Database Downloads](https://github.com/datadrivenconstruction/OpenConstructionEstimate-DDC-CWICR/releases)
- **Qdrant Docs**: https://qdrant.tech/documentation/Generate automated daily progress reports from site data. Track work completed, labor hours, equipment usage, and weather conditions.
Analyze labor productivity from site data. Compare planned vs actual, identify trends, benchmark against industry standards.
Create interactive KPI dashboards for construction projects. Track schedule, cost, quality, and safety metrics in real-time.
Detect and analyze geometric clashes in BIM models. Identify MEP, structural, and architectural conflicts before construction.
Classify BIM elements using AI and standard classification systems. Map elements to UniFormat, MasterFormat, OmniClass, and CWICR codes.
Generate comprehensive BIM model validation reports. Check data quality, completeness, and compliance with standards.
Calculate CO2 emissions and carbon footprint from BIM model data. Analyze embodied carbon by material, element, and building system.
Extract quantities from IFC/Revit models for quantity takeoff. Uses DDC converters to get element counts, areas, volumes, lengths with grouping and reporting.