Documentation

Use Subsets with MCP clients or integrate directly with our REST API

Introduction

Philosophy & Architecture

Subsets indexes datasets directly from pre-approved GitHub repositories. Instead of requiring publishers to upload data through our API, we scrape their repositories and ingest Parquet files automatically.

This approach has several key advantages:

  • Version Control: All data changes are tracked in Git, providing a complete audit trail
  • No Upload Overhead: Publishers simply commit to their repository; we handle the rest
  • Open Infrastructure: The data lakehouse is built on open formats (Parquet, Apache Iceberg) and open protocols
  • Free Compute: Publishers don't pay for data transformation or hosting

The lakehouse uses Apache Iceberg for table management, with Cloudflare R2 for object storage. All queries are executed through DuckDB, providing fast analytical performance without requiring publishers to manage infrastructure.

Open Infrastructure

The entire Subsets platform is built on open standards and formats:

Data Formats

  • Apache Parquet for columnar storage
  • Apache Iceberg for table management
  • ACID transactions with versioning

Query Engine

  • DuckDB for analytical queries
  • Full SQL support with extensions
  • Zero-copy data access where possible

This open architecture means your data remains portable. You're never locked into proprietary formats, and the underlying files can be read by any tool that supports Parquet.

MCP Server

Subsets provides a Model Context Protocol (MCP) server that lets AI assistants search and query statistical data directly. It works with any MCP client, such as Claude Desktop.

⚙️ Installation & Setup

1. Install uv & Get API Key

Install uv (Python package manager):

curl -LsSf https://astral.sh/uv/install.sh | sh

Then sign up at subsets.io to get your API key.

2. Configure Your MCP Client

For Claude Desktop, add Subsets to your configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

Windows: %APPDATA%\Claude\claude_desktop_config.json

JSON
{
  "mcpServers": {
    "subsets": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/subsetsio/subsets-mcp-server.git",
        "mcp-server",
        "--api-key",
        "YOUR_KEY"
      ]
    }
  }
}
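
Editing the file by hand can accidentally drop MCP servers that are already configured. The sketch below merges a Subsets entry into an existing configuration instead of overwriting it. The helper names `add_subsets_server` and `update_config_file` are hypothetical, and the entry simply mirrors the JSON above.

```python
import json
from pathlib import Path


def add_subsets_server(config: dict, api_key: str) -> dict:
    """Return a copy of `config` with the Subsets entry added (hypothetical helper)."""
    merged = dict(config)
    # Copy the existing servers so entries for other tools are preserved.
    servers = dict(merged.get("mcpServers", {}))
    servers["subsets"] = {
        "command": "uvx",
        "args": [
            "--from",
            "git+https://github.com/subsetsio/subsets-mcp-server.git",
            "mcp-server",
            "--api-key",
            api_key,
        ],
    }
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path, api_key: str) -> None:
    """Read the config at `path` (if it exists), merge the entry, and write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_subsets_server(config, api_key), indent=2))
```

Call `update_config_file` with the path for your platform from the list above, and back up the file first if it already contains other servers.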

3. Restart Your MCP Client

Restart your MCP client (e.g., Claude Desktop). You should see the Subsets tools available in the tools menu.

🛠️ Available MCP Tools

list_datasets: Search and list available datasets
Uses: GET /datasets

Input Schema

q: string - Search query for semantic search (optional)
limit: integer - Maximum results (1-100, default: 10)
min_score: number - Minimum relevance score (0-2, optional)

Example Output

JSON
{
  "total": 3,
  "datasets": [
    {
      "id": "eurostat_unemployment_2024",
      "title": "European Unemployment Rates 2024",
      "description": "Monthly unemployment rates for EU countries",
      "license": "CC-BY-4.0",
      "columns": [
        {"id": "country", "type": "string", "description": "Country code"},
        {"id": "rate", "type": "double", "description": "Unemployment rate"}
      ],
      "score": 1.85
    }
  ]
}

get_dataset_details: Get detailed information about a specific dataset
Uses: GET /datasets/{dataset_id}/summary

Input Schema

dataset_id: string - Dataset identifier (required)

Example Output

JSON
{
  "dataset_id": "eurostat_unemployment_2024",
  "title": "European Unemployment Rates 2024",
  "description": "Monthly unemployment rates for EU countries",
  "total_rows": 12450,
  "size_bytes": 245000,
  "columns": [
    {
      "name": "country",
      "type": "string",
      "description": "ISO 3166-1 alpha-2 country code"
    },
    {
      "name": "rate",
      "type": "double",
      "description": "Unemployment rate as percentage"
    }
  ],
  "preview": {
    "columns": ["country", "rate"],
    "rows": [["ES", 11.2], ["GR", 9.8]]
  },
  "executions7d": 42,
  "metadata": {
    "license": "CC-BY-4.0",
    "updated_at": "2024-11-24T10:00:00Z"
  }
}
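
A response like the one above can be condensed into a short, readable schema summary. This is a minimal sketch that assumes the field names shown in the example output (`title`, `total_rows`, and `columns` entries keyed by `name`); note that the list_datasets example keys its columns by `id` instead. The function name `format_schema` is hypothetical.

```python
def format_schema(details: dict) -> str:
    """Render a dataset-details response as a title line plus one line per column."""
    lines = [f"{details['title']} ({details['total_rows']} rows)"]
    for col in details["columns"]:
        lines.append(f"  {col['name']}: {col['type']} - {col['description']}")
    return "\n".join(lines)
```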

execute_sql_query: Run SQL queries on available datasets
Uses: POST /sql/query

Input Schema

query: string - SQL query to execute (SELECT only, required)

Example Output

JSON
{
  "columns": ["country", "rate"],
  "rows": [
    ["ES", 11.2],
    ["GR", 9.8],
    ["IT", 7.5]
  ],
  "row_count": 3,
  "execution_time_ms": 45.2
}
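
Query results arrive as parallel `columns` and `rows` arrays, which is compact but awkward to work with by name. The sketch below pairs each row with the column names to produce one dictionary per row. It assumes the response shape shown above, and `rows_to_records` is a hypothetical helper name.

```python
def rows_to_records(result: dict) -> list[dict]:
    """Pair each row with the column names to get one dict per row."""
    return [dict(zip(result["columns"], row)) for row in result["rows"]]


# Sample values copied from the example output above.
result = {
    "columns": ["country", "rate"],
    "rows": [["ES", 11.2], ["GR", 9.8], ["IT", 7.5]],
}
records = rows_to_records(result)
# records[0] is {"country": "ES", "rate": 11.2}
```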

API Reference

Authentication

Include your API key as a Bearer token in the Authorization header:

Python
import requests

# Set your API key
API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Make a request
response = requests.get("https://api.subsets.io/datasets", headers=headers)
response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
data = response.json()

Get your API key from the Settings page.
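
If you call several endpoints, it can help to build authenticated requests in one place. The following is a minimal sketch using only the standard library rather than `requests`; `build_request` is a hypothetical helper, and it assumes every endpoint accepts the same Bearer header and that POST bodies are JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.subsets.io"


def build_request(path, api_key, params=None, body=None):
    """Build (but do not send) an authenticated request for the given path."""
    url = f"{BASE_URL}{path}"
    if params:
        url = f"{url}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if body is not None:
        # A body turns the request into a JSON POST.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "POST" if body is not None else "GET"
    return Request(url, data=data, headers=headers, method=method)
```

To send a request built this way, pass it to `urllib.request.urlopen` and decode the response body with `json.load`.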

Query Endpoints

Access Subsets data directly through our REST API. It is suited to building applications, running analyses, and integrating with your existing data pipelines.

Base URL

https://api.subsets.io

Format

JSON (application/json)

🔍 Query Endpoints

POST /sql/query

Used by: execute_sql_query

Execute read-only SQL queries on available datasets. Only SELECT statements are allowed.

Request Body
JSON
{
  "query": "SELECT * FROM eurostat_unemployment_2024 LIMIT 10"
}
Example Request
Python
import requests

API_KEY = "YOUR_API_KEY"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

query_data = {
    "query": "SELECT country, year, rate FROM eurostat_unemployment_2024 WHERE year = 2024 ORDER BY rate DESC LIMIT 5"
}

response = requests.post(
    "https://api.subsets.io/sql/query",
    headers=headers,
    json=query_data
)

response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
result = response.json()
print(f"Columns: {result['columns']}")
print(f"Rows: {result['rows']}")
print(f"Query took {result['execution_time_ms']}ms")
Success Response (200)
JSON
{
  "columns": ["country", "year", "rate"],
  "rows": [
    ["ES", 2024, 11.2],
    ["GR", 2024, 9.8],
    ["IT", 2024, 7.5],
    ["FR", 2024, 7.1],
    ["SE", 2024, 6.8]
  ],
  "row_count": 5,
  "execution_time_ms": 42.5,
  "cache_hits": [],
  "cache_misses": []
}

GET /datasets

Used by: list_datasets

List datasets with pagination, filtering, and semantic search.

Query Parameters
q: string - Search query for semantic search
limit: integer - Results per page (1-100, default: 10)
offset: integer - Skip results (default: 0)
min_score: number - Min relevance score (0-2)
license: string - Filter by license type
detailed: boolean - Include stats (default: false)
Example Request
Python
import requests

API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}

params = {
    "q": "unemployment europe",
    "limit": 5,
    "min_score": 1.0
}

response = requests.get(
    "https://api.subsets.io/datasets",
    headers=headers,
    params=params
)

response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
datasets = response.json()
print(f"Found {datasets['total']} datasets")
for ds in datasets['datasets']:
    print(f"- {ds['id']}: {ds['title']} (score: {ds['score']})")
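
Because `limit` is capped at 100, retrieving a longer result list means advancing `offset` page by page. The sketch below shows one way to do that. It assumes a page shorter than `limit` marks the end of the results, which the documentation does not state explicitly, and both `iter_datasets` and the `fetch_page` callback are hypothetical names.

```python
from typing import Callable, Iterator


def iter_datasets(fetch_page: Callable[[int, int], dict], limit: int = 10) -> Iterator[dict]:
    """Yield datasets page by page, advancing `offset` after each page.

    `fetch_page(offset, limit)` must return a parsed GET /datasets response.
    """
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        items = fetch_page(offset, limit)["datasets"]
        yield from items
        # Assumption: a short (or empty) page means there is nothing left.
        if len(items) < limit:
            break
        offset += len(items)
```

In practice `fetch_page` would wrap the `requests.get` call shown above, passing `offset` and `limit` in `params` alongside any search query.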

GET /datasets/{dataset_id}

Used by: get_dataset_details

Get complete metadata and schema for a specific dataset.

Example Request
Python
import requests

API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}

response = requests.get(
    "https://api.subsets.io/datasets/eurostat_unemployment_2024",
    headers=headers
)

response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
dataset = response.json()
print(f"Dataset: {dataset['title']}")
print(f"Columns: {len(dataset['columns'])}")