Documentation
Use Subsets with MCP clients or integrate directly with our REST API
Introduction
Philosophy & Architecture
Subsets is a data lakehouse that connects LLMs with curated statistical datasets. Query thousands of datasets directly through our MCP server or REST API.
The platform is built on these principles:
- Open Infrastructure: Built on open formats (Parquet, Delta Lake) that you can use with any tool
- Fast Queries: DuckDB-powered analytical queries return results in milliseconds
- Authoritative Sources: Curated datasets from official statistical agencies and research institutions
- MCP Native: First-class integration with Claude and other MCP-compatible AI assistants
Data is stored in Delta Lake format on Cloudflare R2, with all queries executed through DuckDB for fast analytical performance.
Open Infrastructure
The entire Subsets platform is built on open standards and formats:
Data Formats
- Apache Parquet for columnar storage
- Delta Lake for table management and versioning
Query Engine
- DuckDB for analytical queries
- Full SQL support with extensions
- Zero-copy data access where possible
This open architecture means your data remains portable. You're never locked into proprietary formats, and the underlying files can be read by any tool that supports Parquet.
Tools & Commands
This section lists every Subsets tool and where it is available. Local tools run on your machine (CLI or local MCP). Remote tools call the Subsets API (REST or hosted MCP).
Search Datasets
Search the Subsets catalog for datasets by keyword.
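As a rough illustration of the REST form of this search (GET /datasets with q and limit query parameters), the sketch below only builds the request URL and does not send it. The base URL is a placeholder assumption, not the real Subsets host, and search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: placeholder host; substitute the actual Subsets API base URL.
BASE_URL = "https://api.example.com"


def search_url(query: str, limit: int = 10) -> str:
    """Build a GET /datasets URL with a form-encoded query string."""
    # urlencode escapes reserved characters and turns spaces into "+".
    params = urlencode({"q": query, "limit": limit})
    return f"{BASE_URL}/datasets?{params}"


print(search_url("gdp growth", limit=10))
# prints: https://api.example.com/datasets?q=gdp+growth&limit=10
```

Encoding the parameters with urlencode rather than string concatenation keeps keywords containing characters such as & or = from corrupting the query string.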
- CLI: subsets search <query> [--limit N]
- Local MCP: search_datasets(q="gdp growth", limit=10)
- Hosted MCP: search_datasets(q="gdp growth", limit=10)
- REST: GET /datasets?q=gdp+growth&limit=10
Dataset Details
Get full metadata for a dataset including schema, statistics, and preview rows.
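The REST form of this lookup, GET /datasets/{id}/summary, places the dataset ID in the URL path. A minimal sketch of filling in that path safely follows; the base URL is a placeholder assumption and summary_url is a hypothetical helper name.

```python
from urllib.parse import quote

# Assumption: placeholder host; substitute the actual Subsets API base URL.
BASE_URL = "https://api.example.com"


def summary_url(dataset_id: str) -> str:
    """Build a GET /datasets/{id}/summary URL for one dataset ID."""
    # safe="" also escapes "/", so the ID cannot add extra path segments.
    return f"{BASE_URL}/datasets/{quote(dataset_id, safe='')}/summary"


print(summary_url("wdi_gdp_growth"))
# prints: https://api.example.com/datasets/wdi_gdp_growth/summary
```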
- CLI: subsets info <dataset_id>
- Local MCP: get_dataset_details(dataset_id="wdi_gdp_growth")
- Hosted MCP: get_dataset_details(dataset_id="wdi_gdp_growth")
- REST: GET /datasets/{id}/summary
Execute SQL
Run SQL queries against datasets. Local queries run against your installed datasets via DuckDB. Remote queries run through the Subsets API.
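For the remote path, the REST form is POST /sql/query. The sketch below prepares such a request without sending it. The request body shape is not documented here: the JSON object with a "query" field is an assumption that mirrors the query parameter of execute_sql_query, and the base URL is a placeholder. Authentication is omitted because the way the API key is sent is not specified in this section.

```python
import json
from urllib.request import Request

# Assumption: placeholder host; substitute the actual Subsets API base URL.
BASE_URL = "https://api.example.com"


def sql_request(query: str) -> Request:
    """Prepare (but do not send) a POST /sql/query request."""
    # Assumption: the API accepts a JSON body of the form {"query": "<sql>"}.
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(
        f"{BASE_URL}/sql/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = sql_request("SELECT * FROM wdi LIMIT 10")
print(req.get_method(), req.full_url)
# prints: POST https://api.example.com/sql/query
```

Passing the prepared object to urllib.request.urlopen would send it, once the real host and authentication details are filled in.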
- CLI: subsets query "SELECT * FROM wdi LIMIT 10"
- Local MCP: execute_sql_query(query="SELECT * FROM wdi LIMIT 10")
- Hosted MCP: execute_sql_query(query="SELECT * FROM wdi LIMIT 10")
- REST: POST /sql/query
Add Dataset
Download a dataset to your local collection for offline querying.
- CLI: subsets add <dataset_id>
- Local MCP: add_dataset(dataset_id="wdi_gdp_growth")
List Installed
List datasets in your local collection.
- CLI: subsets list
- Local MCP: list_installed_datasets()
Sync Datasets
Update installed datasets to their latest versions.
- CLI: subsets sync [dataset_ids...]
- Local MCP: sync_datasets(dataset_ids=["wdi_gdp_growth"])
Remove Dataset
Remove a dataset from your collection and delete its local data.
- CLI: subsets remove <dataset_id>
- Local MCP: remove_dataset(dataset_id="wdi_gdp_growth")
Login
Save your API key for catalog access and dataset downloads.
- CLI: subsets login
Logout
Clear your saved API key.
- CLI: subsets logout
Status
Show current configuration, installed datasets, and disk usage.
- CLI: subsets status