Documentation
Use Subsets with MCP clients or integrate directly with our REST API
Introduction
Philosophy & Architecture
Subsets indexes datasets directly from pre-approved GitHub repositories. Instead of requiring publishers to upload data through our API, we scrape their repositories and ingest Parquet files automatically.
This approach has several key advantages:
- Version Control: All data changes are tracked in Git, providing a complete audit trail
- No Upload Overhead: Publishers simply commit to their repository; we handle the rest
- Open Infrastructure: The data lakehouse is built on open formats (Parquet, Apache Iceberg) and open protocols
- Free Compute: Publishers don't pay for data transformation or hosting
The lakehouse uses Apache Iceberg for table management, with Cloudflare R2 for object storage. All queries are executed through DuckDB, providing fast analytical performance without requiring publishers to manage infrastructure.
Open Infrastructure
The entire Subsets platform is built on open standards and formats:
Data Formats
- Apache Parquet for columnar storage
- Apache Iceberg for table management
- ACID transactions with versioning
Query Engine
- DuckDB for analytical queries
- Full SQL support with extensions
- Zero-copy data access where possible
This open architecture means your data remains portable. You're never locked into proprietary formats, and the underlying files can be read by any tool that supports Parquet.
MCP Server
Subsets provides a Model Context Protocol (MCP) server that enables AI assistants to search and query statistical data directly. Use it with any MCP client, such as Claude Desktop.
⚙️Installation & Setup
1. Install uv & Get API Key
Install uv (Python package manager):
curl -LsSf https://astral.sh/uv/install.sh | sh
Then sign up at subsets.io to get your API key.
2. Configure Your MCP Client
For Claude Desktop, add Subsets to your configuration file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "subsets": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/subsetsio/subsets-mcp-server.git",
        "mcp-server",
        "--api-key",
        "YOUR_KEY"
      ]
    }
  }
}
3. Restart Your MCP Client
Restart your MCP client (e.g., Claude Desktop). You should see the Subsets tools available in the tools menu.
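If you would rather script step 2 than edit the file by hand, the following is a minimal sketch of merging the Subsets entry into an existing configuration. The helper names `add_subsets_server` and `update_config_file` are hypothetical and not part of any Subsets tooling; the server entry itself is copied from the configuration shown above.

```python
import json
from pathlib import Path


def add_subsets_server(config: dict, api_key: str) -> dict:
    """Return a copy of an MCP client config with the Subsets entry added."""
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    servers = updated.setdefault("mcpServers", {})
    servers["subsets"] = {
        "command": "uvx",
        "args": [
            "--from",
            "git+https://github.com/subsetsio/subsets-mcp-server.git",
            "mcp-server",
            "--api-key",
            api_key,
        ],
    }
    return updated


def update_config_file(path: Path, api_key: str) -> None:
    """Merge the Subsets entry into the config file at `path` (hypothetical helper)."""
    # Start from an empty config when the file does not exist yet.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = add_subsets_server(config, api_key)
    path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
```

Other servers already present under `mcpServers` are left untouched, so the script can be re-run safely, for example after rotating your API key.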
🛠️Available MCP Tools
list_datasets: Search and list available datasets
GET /datasets
Input Schema
q: string - Search query for semantic search (optional)
limit: integer - Maximum results (1-100, default: 10)
min_score: number - Minimum relevance score (0-2, optional)
Example Output
{
  "total": 3,
  "datasets": [
    {
      "id": "eurostat_unemployment_2024",
      "title": "European Unemployment Rates 2024",
      "description": "Monthly unemployment rates for EU countries",
      "license": "CC-BY-4.0",
      "columns": [
        {"id": "country", "type": "string", "description": "Country code"},
        {"id": "rate", "type": "double", "description": "Unemployment rate"}
      ],
      "score": 1.85
    }
  ]
}
get_dataset_details: Get detailed information about a specific dataset
GET /datasets/{dataset_id}/summary
Input Schema
dataset_id: string - Dataset identifier (required)
Example Output
{
  "dataset_id": "eurostat_unemployment_2024",
  "title": "European Unemployment Rates 2024",
  "description": "Monthly unemployment rates for EU countries",
  "total_rows": 12450,
  "size_bytes": 245000,
  "columns": [
    {
      "name": "country",
      "type": "string",
      "description": "ISO 3166-1 alpha-2 country code"
    },
    {
      "name": "rate",
      "type": "double",
      "description": "Unemployment rate as percentage"
    }
  ],
  "preview": {
    "columns": ["country", "rate"],
    "rows": [["ES", 11.2], ["GR", 9.8]]
  },
  "executions7d": 42,
  "metadata": {
    "license": "CC-BY-4.0",
    "updated_at": "2024-11-24T10:00:00Z"
  }
}
execute_sql_query: Run SQL queries on available datasets
POST /sql/query
Input Schema
query: string - SQL query to execute (SELECT only, required)
Example Output
{
  "columns": ["country", "rate"],
  "rows": [
    ["ES", 11.2],
    ["GR", 9.8],
    ["IT", 7.5]
  ],
  "row_count": 3,
  "execution_time_ms": 45.2
}
API Reference
Authentication
Include your API key in the Authorization header:
import requests
# Set your API key
API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}
# Make a request
response = requests.get("https://api.subsets.io/datasets", headers=headers)
data = response.json()
Get your API key from the Settings page.
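To keep the key out of source files, one option is to read it from the environment. This is a minimal sketch: the variable name `SUBSETS_API_KEY` and the helper `auth_headers` are illustrative assumptions, not names defined by Subsets.

```python
import os


def auth_headers(env_var: str = "SUBSETS_API_KEY") -> dict:
    """Build the Authorization header from an environment variable.

    The variable name is an assumption made for this example.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your Subsets API key")
    return {"Authorization": f"Bearer {api_key}"}
```

The returned dictionary can be passed as the `headers` argument of the `requests` calls shown in this document.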
Query Endpoints
Access Subsets data directly through our REST API. Perfect for building applications, running analyses, or integrating with your existing data pipelines.
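For pipeline use, note that the query responses shown in this document pair a `columns` list with a `rows` list of lists. The sketch below turns that shape into one dictionary per row; the helper name `rows_to_records` is hypothetical, and the sample data is taken from the example output above.

```python
def rows_to_records(result: dict) -> list[dict]:
    """Pair each row with the column names from a query response."""
    columns = result["columns"]
    return [dict(zip(columns, row)) for row in result["rows"]]


# Sample response in the shape shown in the execute_sql_query example output.
sample = {
    "columns": ["country", "rate"],
    "rows": [["ES", 11.2], ["GR", 9.8], ["IT", 7.5]],
}

for record in rows_to_records(sample):
    print(record)
```

Records in this form are straightforward to feed into CSV writers, dataframes, or JSON Lines files.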
Base URL
https://api.subsets.io
Format
JSON (application/json)
🔍Query Endpoints
POST /sql/query
Used by: execute_sql_query
Execute read-only SQL queries on available datasets. Only SELECT statements are allowed.
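Because only SELECT statements are accepted, a client can reject obviously unsuitable input before making a request. The following is a heuristic sketch, not the validation the server performs; `looks_like_select` is a hypothetical helper, and whether the API accepts queries that begin with `WITH` is not stated here, so this check conservatively rejects them.

```python
import re


def looks_like_select(query: str) -> bool:
    """Heuristic client-side check that a query is a single SELECT statement."""
    # Drop SQL comments so they cannot hide the leading keyword.
    stripped = re.sub(r"--[^\n]*|/\*.*?\*/", " ", query, flags=re.DOTALL).strip()
    # Allow trailing semicolons, but reject multi-statement input.
    body = stripped.rstrip(";").strip()
    if not body or ";" in body:
        return False
    return re.match(r"select\b", body, flags=re.IGNORECASE) is not None
```

The check errs on the side of rejection: a semicolon or comment marker inside a string literal will also cause it to return `False`. Treat it as a convenience for catching mistakes early; the API remains the authority on what is allowed.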
Request Body
{
  "query": "SELECT * FROM eurostat_unemployment_2024 LIMIT 10"
}
Example Request
import requests
API_KEY = "YOUR_API_KEY"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}
query_data = {
    "query": "SELECT country, year, rate FROM eurostat_unemployment_2024 WHERE year = 2024 ORDER BY rate DESC LIMIT 5"
}
response = requests.post(
    "https://api.subsets.io/sql/query",
    headers=headers,
    json=query_data
)
result = response.json()
print(f"Columns: {result['columns']}")
print(f"Rows: {result['rows']}")
print(f"Query took {result['execution_time_ms']}ms")
Success Response (200)
{
  "columns": ["country", "year", "rate"],
  "rows": [
    ["ES", 2024, 11.2],
    ["GR", 2024, 9.8],
    ["IT", 2024, 7.5],
    ["FR", 2024, 7.1],
    ["SE", 2024, 6.8]
  ],
  "row_count": 5,
  "execution_time_ms": 42.5,
  "cache_hits": [],
  "cache_misses": []
}
GET /datasets
Used by: list_datasets
List datasets with pagination, filtering, and semantic search capabilities.
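Pagination relies on the `limit` and `offset` query parameters listed for this endpoint. The following is a minimal sketch of walking through all pages. It assumes that `total` in the response counts every match rather than only the current page, and `iter_datasets` and `fetch_page` are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterator


def iter_datasets(
    fetch_page: Callable[[dict], dict],
    query: str,
    page_size: int = 10,
) -> Iterator[dict]:
    """Yield datasets page by page using the limit and offset parameters."""
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page({"q": query, "limit": page_size, "offset": offset})
        datasets = page.get("datasets", [])
        yield from datasets
        offset += len(datasets)
        total = page.get("total")  # assumed to be the overall match count
        # Stop on an empty or short page, or once every reported match was seen.
        if not datasets or len(datasets) < page_size:
            break
        if total is not None and offset >= total:
            break


# fetch_page is any callable that sends the parameters to GET /datasets and
# returns the decoded JSON body, for example with requests:
#   lambda params: requests.get(
#       "https://api.subsets.io/datasets", headers=headers, params=params
#   ).json()
```

Passing the fetch function in as an argument keeps the paging logic independent of the HTTP library and easy to exercise with canned responses.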
Query Parameters
q: string - Search query for semantic search
limit: integer - Results per page (1-100, default: 10)
offset: integer - Skip results (default: 0)
min_score: number - Min relevance score (0-2)
license: string - Filter by license type
detailed: boolean - Include stats (default: false)
Example Request
import requests
API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}
params = {
    "q": "unemployment europe",
    "limit": 5,
    "min_score": 1.0
}
response = requests.get(
    "https://api.subsets.io/datasets",
    headers=headers,
    params=params
)
datasets = response.json()
print(f"Found {datasets['total']} datasets")
# Print one line per matching dataset
for ds in datasets['datasets']:
    print(f"- {ds['id']}: {ds['title']} (score: {ds['score']})")
GET /datasets/{dataset_id}
Used by: get_dataset_details
Get complete metadata and schema for a specific dataset.
Example Request
import requests
API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}
response = requests.get(
    "https://api.subsets.io/datasets/eurostat_unemployment_2024",
    headers=headers
)
dataset = response.json()
print(f"Dataset: {dataset['title']}")
print(f"Columns: {len(dataset['columns'])}")
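When the dataset identifier comes from user input or from an earlier search result, it helps to build the URL with explicit escaping instead of string concatenation. This is a minimal sketch; `dataset_url` is a hypothetical helper, and the path follows the `GET /datasets/{dataset_id}` endpoint described in this section.

```python
from urllib.parse import quote

BASE_URL = "https://api.subsets.io"


def dataset_url(dataset_id: str) -> str:
    """Build the metadata URL for a dataset, escaping the identifier."""
    if not dataset_id:
        raise ValueError("dataset_id is required")
    # safe="" also escapes "/", so an identifier cannot change the request path.
    return f"{BASE_URL}/datasets/{quote(dataset_id, safe='')}"


print(dataset_url("eurostat_unemployment_2024"))
```

The result can be passed directly to `requests.get` together with the authorization headers used in the example request.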