Documentation

Use Subsets with MCP clients or integrate directly with our REST API

Introduction

Philosophy & Architecture

Subsets is a data lakehouse that connects LLMs with curated statistical datasets. Query thousands of datasets directly through our MCP server or REST API.

The platform is built on these principles:

  • Open Infrastructure: Built on open formats (Parquet, Delta Lake) that you can use with any tool
  • Fast Queries: DuckDB-powered analytical queries return results in milliseconds
  • Authoritative Sources: Curated datasets from official statistical agencies and research institutions
  • MCP Native: First-class integration with Claude and other MCP-compatible AI assistants

Data is stored in Delta Lake format on Cloudflare R2, with all queries executed through DuckDB for fast analytical performance.

Open Infrastructure

The entire Subsets platform is built on open standards and formats:

Data Formats

  • Apache Parquet for columnar storage
  • Delta Lake for table management and versioning

Query Engine

  • DuckDB for analytical queries
  • Full SQL support with extensions
  • Zero-copy data access where possible

This open architecture means your data remains portable. You're never locked into proprietary formats, and the underlying files can be read by any tool that supports Parquet.

Tools & Commands

All Subsets tools are listed below with their availability. Local tools run on your machine (CLI or local MCP server). Remote tools call the Subsets API (REST or hosted MCP).

local: CLI + Local MCP server
remote: Remote MCP (mcp.subsets.io) + REST API

Search Datasets

local, remote

Search the Subsets catalog for datasets by keyword.

local: CLI + Local MCP
CLI: subsets search <query> [--limit N]
MCP: search_datasets(q="gdp growth", limit=10)
remote: Remote MCP + REST API
MCP: search_datasets(q="gdp growth", limit=10)
REST: GET /datasets?q=gdp+growth&limit=10
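
As a rough illustration, the REST form of the search call can be assembled with Python's standard library. Only the /datasets path and the q and limit parameters come from the listing above; the base URL is a placeholder (the REST host is not stated here) and build_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder host (an assumption): substitute the real REST API base URL.
BASE_URL = "https://api.example.com"


def build_search_url(query: str, limit: int = 10, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /datasets using the q and limit parameters."""
    # urlencode escapes spaces as "+", matching the q=gdp+growth form shown above.
    params = urlencode({"q": query, "limit": limit})
    return f"{base_url}/datasets?{params}"


print(build_search_url("gdp growth", limit=10))
```

Encoding the query with urlencode rather than string concatenation keeps keywords containing spaces or "&" from corrupting the query string.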

Dataset Details

local, remote

Get full metadata for a dataset including schema, statistics, and preview rows.

local: CLI + Local MCP
CLI: subsets info <dataset_id>
MCP: get_dataset_details(dataset_id="wdi_gdp_growth")
remote: Remote MCP + REST API
MCP: get_dataset_details(dataset_id="wdi_gdp_growth")
REST: GET /datasets/{id}/summary
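
A minimal sketch of filling in the {id} segment of the summary route, assuming the dataset ID is substituted directly into the path. The base URL is a placeholder and build_summary_url is a hypothetical helper; the response format is not described here, so the sketch stops at building the URL.

```python
from urllib.parse import quote

# Placeholder host (an assumption): substitute the real REST API base URL.
BASE_URL = "https://api.example.com"


def build_summary_url(dataset_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /datasets/{id}/summary."""
    # safe="" percent-encodes "/" as well, so an unusual ID cannot alter the path.
    return f"{base_url}/datasets/{quote(dataset_id, safe='')}/summary"


print(build_summary_url("wdi_gdp_growth"))
```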

Execute SQL

local, remote

Run SQL queries against datasets. Local tools query installed datasets via DuckDB. Remote tools run queries through the Subsets API.

local: CLI + Local MCP
CLI: subsets query "SELECT * FROM wdi LIMIT 10"
MCP: execute_sql_query(query="SELECT * FROM wdi LIMIT 10")
Queries datasets in your local collection.
remote: Remote MCP + REST API
MCP: execute_sql_query(query="SELECT * FROM wdi LIMIT 10")
REST: POST /sql/query
Requires authentication.
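
The authenticated REST call might be prepared along the following lines. This is a sketch under several assumptions that are not confirmed here: the base URL is a placeholder, the JSON body with a "query" field simply mirrors the MCP parameter name, and the Bearer-token header is a guess at the authentication scheme. build_sql_request is a hypothetical helper.

```python
import json
from urllib.request import Request

# Placeholder host (an assumption): substitute the real REST API base URL.
BASE_URL = "https://api.example.com"


def build_sql_request(sql: str, api_key: str, base_url: str = BASE_URL) -> Request:
    """Prepare (but do not send) a POST /sql/query request."""
    # Assumed body shape: {"query": "<sql>"}, mirroring execute_sql_query(query=...).
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(
        url=f"{base_url}/sql/query",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme: the API key sent as a Bearer token.
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_sql_request("SELECT * FROM wdi LIMIT 10", api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it. Building the request separately makes it easy to inspect the method, URL, and headers first.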

Add Dataset

local

Download a dataset to your local collection for offline querying.

local: CLI + Local MCP
CLI: subsets add <dataset_id>
MCP: add_dataset(dataset_id="wdi_gdp_growth")

List Installed

local

List datasets in your local collection.

local: CLI + Local MCP
CLI: subsets list
MCP: list_installed_datasets()

Sync Datasets

local

Update installed datasets to their latest versions.

local: CLI + Local MCP
CLI: subsets sync [dataset_ids...]
MCP: sync_datasets(dataset_ids=["wdi_gdp_growth"])

Remove Dataset

local

Remove a dataset from your collection and delete its local data.

local: CLI + Local MCP
CLI: subsets remove <dataset_id>
MCP: remove_dataset(dataset_id="wdi_gdp_growth")

Login

local

Save your API key for catalog access and dataset downloads.

local: CLI + Local MCP
CLI: subsets login

Logout

local

Clear your saved API key.

local: CLI + Local MCP
CLI: subsets logout

Status

local

Show current configuration, installed datasets, and disk usage.

local: CLI + Local MCP
CLI: subsets status