# Working with Datasets in Terminal

Dataset operations for managing collections of tabular resources with metadata and schemas.
## Available Commands

The `fairspec dataset` command provides utilities for working with datasets:

- `infer` - Automatically infer a dataset descriptor from data files
- `copy` - Copy datasets to a local folder
- `validate` - Validate dataset descriptors and their resources
- `list` - List resources in a dataset
- `script` - Interactive REPL session with loaded dataset
## What is a Dataset?

A dataset is a collection of related data resources (tables) with:
- Metadata describing the dataset (title, description, license, etc.)
- Resource definitions for each table (path, format, schema)
- Table Schemas defining the structure of each resource
- Relationships and foreign keys between resources
Datasets use JSON descriptor files (often named `dataset.json`) following the Fairspec specification.
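A minimal hand-written descriptor might look like this (a sketch only; it shows just a `resources` array with `name` and `data` entries, the same fields used in the examples on this page):

```json
{
  "resources": [
    { "name": "users", "data": "users.csv" },
    { "name": "orders", "data": "orders.csv" }
  ]
}
```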
## Infer Dataset

Automatically generate a dataset descriptor from data files:

```bash
# Infer from single file
fairspec dataset infer data.csv

# Infer from multiple files
fairspec dataset infer users.csv products.csv orders.csv

# Infer with remote files
fairspec dataset infer https://example.com/data1.csv data2.csv

# Save to descriptor file
fairspec dataset infer *.csv --json > dataset.json
```

### Inference Process

The `infer` command automatically:
- Detects format for each file (CSV, JSON, Excel, etc.)
- Infers Table Schema for each resource
- Generates resource names from file names
- Creates a complete dataset descriptor
### Options

- `--debug` - Show debug information
- `--json` - Output as JSON
### Format Options

Format detection and schema inference can be customized:

- `--delimiter <char>` - CSV delimiter
- `--header-rows <numbers>` - Header row indices (JSON array)
- `--sample-rows <number>` - Sample size for schema inference
- `--confidence <number>` - Confidence threshold for type detection
- `--column-types <json>` - Override types for specific columns
- `--keep-strings` - Keep original string types
- `--comma-decimal` - Treat comma as decimal separator
- `--month-first` - Parse dates as month-first
### Generated Descriptor

Example generated dataset descriptor:

```json
{
  "resources": [
    {
      "name": "users",
      "data": "users.csv",
      "format": { "name": "csv", "delimiter": "," },
      "tableSchema": {
        "properties": {
          "id": { "type": "integer" },
          "name": { "type": "string" },
          "email": { "type": "string" },
          "created_at": { "type": "date" }
        },
        "required": ["id", "name", "email"]
      }
    },
    {
      "name": "orders",
      "data": "orders.csv",
      "format": { "name": "csv" },
      "tableSchema": {
        "properties": {
          "order_id": { "type": "integer" },
          "user_id": { "type": "integer" },
          "amount": { "type": "number" },
          "status": { "type": "string" }
        }
      }
    }
  ]
}
```

## Copy Dataset

Copy a dataset and all its resources to a local folder:
```bash
# Copy dataset to local folder
fairspec dataset copy dataset.json --to-path ./local-dataset

# Copy remote dataset
fairspec dataset copy https://example.com/dataset.json --to-path ./dataset

# Silent mode for automation
fairspec dataset copy dataset.json --to-path ./output --silent
```

### Copy Behavior

The `copy` command:
- Downloads all remote resources
- Preserves directory structure
- Updates resource paths in the descriptor to point to local files
- Creates the target directory if it doesn’t exist
- Saves the updated descriptor to the target location
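The path rewrite described above can be illustrated with jq (a sketch only, not the actual implementation; the real command also downloads the files and preserves directory structure):

```shell
# Descriptor with one remote resource (same shape as the examples below)
cat > dataset.json <<'EOF'
{"resources": [{"name": "users", "data": "https://example.com/data/users.csv"}]}
EOF

# Rewrite each remote URL to its local file name,
# e.g. "https://example.com/data/users.csv" becomes "users.csv"
jq '.resources[].data |= (split("/") | last)' dataset.json
```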
### Options

- `--to-path <path>` (required) - Target directory path
- `--silent` - Suppress output messages
- `--debug` - Show debug information
- `--json` - Output as JSON
### Example

Given a dataset with remote resources:

```json
{
  "resources": [
    { "name": "users", "data": "https://example.com/data/users.csv" },
    { "name": "products", "data": "https://example.com/data/products.csv" }
  ]
}
```

After copying:

```bash
fairspec dataset copy dataset.json --to-path ./local
```

Results in:

```
./local/
  dataset.json    # Updated descriptor
  users.csv       # Downloaded resource
  products.csv    # Downloaded resource
```

## Validate Dataset

Validate a dataset descriptor and all its resources:
```bash
# Validate local dataset
fairspec dataset validate dataset.json

# Validate remote dataset
fairspec dataset validate https://example.com/dataset.json

# Output validation report as JSON
fairspec dataset validate dataset.json --json
```

### Validation Checks

The `validate` command checks:
- Descriptor validity - Valid JSON and conforms to Data Package spec
- Resource existence - All referenced resources can be loaded
- Schema validation - Each resource validates against its Table Schema
- Referential integrity - Foreign key relationships are valid
- Format compliance - Resources match their declared formats
### Validation Report

Returns a validation report with:

- `valid` - Boolean indicating if validation passed
- `errors` - Array of validation errors (if any)
Example validation errors:
```json
{
  "valid": false,
  "errors": [
    {
      "type": "dataset/resource-not-found",
      "resourceName": "users",
      "message": "Resource file 'users.csv' not found"
    },
    {
      "type": "table/schema",
      "resourceName": "orders",
      "rowNumber": 15,
      "propertyName": "amount",
      "message": "value must be a number"
    },
    {
      "type": "dataset/foreign-key",
      "resourceName": "orders",
      "message": "Foreign key 'user_id' references non-existent value in 'users'"
    }
  ]
}
```

### Options

- `--debug` - Show debug information
- `--json` - Output as JSON
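In scripts, a saved JSON report can be post-processed with standard tools. For example (a sketch assuming jq is installed, using a hand-written report in the shape shown above):

```shell
# A report as saved by: fairspec dataset validate dataset.json --json > report.json
# (abbreviated to one error for the illustration)
cat > report.json <<'EOF'
{
  "valid": false,
  "errors": [
    { "type": "table/schema", "resourceName": "orders", "rowNumber": 15,
      "propertyName": "amount", "message": "value must be a number" }
  ]
}
EOF

# Print one line per error: resource name plus message
jq -r '.errors[] | "\(.resourceName): \(.message)"' report.json
# -> orders: value must be a number
```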
## List Resources

List all resources in a dataset:
```bash
# List resources
fairspec dataset list dataset.json

# List from remote dataset
fairspec dataset list https://example.com/dataset.json

# Output as JSON array
fairspec dataset list dataset.json --json
```

### Output

Returns an array of resource names in the dataset:
Text output:
```
users
products
orders
transactions
```

JSON output:

```json
["users", "products", "orders", "transactions"]
```

### Options

- `--debug` - Show debug information
- `--json` - Output as JSON
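The JSON array output is convenient for shell scripting (a sketch assuming jq is installed; the array here is hard-coded with the sample names above rather than produced by the command):

```shell
# Names as emitted by: fairspec dataset list dataset.json --json
names='["users", "products", "orders", "transactions"]'

# Iterate over each resource name
echo "$names" | jq -r '.[]' | while read -r name; do
  echo "resource: $name"
done
```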
## Interactive Scripting

Start an interactive REPL session with a loaded dataset:

```bash
# Load dataset and start REPL
fairspec dataset script dataset.json

# Script remote dataset
fairspec dataset script https://example.com/dataset.json
```

### Available in Session

- `fairspec` - Full fairspec library
- `dataset` - Loaded dataset descriptor
### Example Session

```
fairspec> dataset
{
  resources: [
    { name: 'users', data: 'users.csv', ... },
    { name: 'orders', data: 'orders.csv', ... }
  ]
}

fairspec> dataset.resources.length
2

fairspec> dataset.resources[0].name
'users'

fairspec> const table = await fairspec.loadTable(dataset.resources[0])
fairspec> await table.head(5).collect()
DataFrame { ... }
```

## Common Workflows
### Create Dataset from Files

```bash
# 1. Infer dataset from multiple files
fairspec dataset infer data/*.csv --json > dataset.json

# 2. Manually edit dataset.json to add:
#    - Title and description
#    - License information
#    - Foreign key relationships
#    - Additional metadata

# 3. Validate the dataset
fairspec dataset validate dataset.json

# 4. List resources to confirm
fairspec dataset list dataset.json
```

### Clone Remote Dataset
```bash
# 1. Copy remote dataset locally
fairspec dataset copy https://example.com/dataset.json --to-path ./local-data

# 2. Validate local copy
fairspec dataset validate ./local-data/dataset.json

# 3. List resources
fairspec dataset list ./local-data/dataset.json
```

### Dataset Quality Assurance
```bash
# 1. Validate the dataset
fairspec dataset validate dataset.json

# 2. If validation fails, check individual resources
fairspec table validate --from-dataset dataset.json --from-resource users

# 3. Inspect resource schemas
fairspec table infer-schema --from-dataset dataset.json --from-resource users

# 4. Generate schema documentation
fairspec table render-schema schema.json --to-format markdown --to-path docs/users-schema.md
```

### Dataset Evolution
```bash
# 1. Start with existing dataset
fairspec dataset validate old-dataset.json

# 2. Add new data files
fairspec dataset infer old-data/*.csv new-data/*.csv --json > dataset.json

# 3. Merge metadata from old descriptor
#    (manual step - copy title, license, etc.)

# 4. Validate updated dataset
fairspec dataset validate dataset.json

# 5. Verify all resources
fairspec dataset list dataset.json
```

### Automation and CI/CD
Section titled “Automation and CI/CD”#!/bin/bash
# Validate dataset in CI pipelineif fairspec dataset validate dataset.json --json | jq -e '.valid'; then echo "✓ Dataset validation passed" exit 0else echo "✗ Dataset validation failed" fairspec dataset validate dataset.json exit 1fiOutput Formats
### Text Output (default)

Human-readable output with colors and formatting:

```bash
fairspec dataset list dataset.json
```

Output:

```
users
products
orders
```

### JSON Output
Machine-readable JSON for automation and scripting:

```bash
fairspec dataset validate dataset.json --json
```

### Silent Mode
Suppress all output except errors (for the `copy` command):

```bash
fairspec dataset copy dataset.json --to-path ./output --silent
```

Use the exit code to check success:

```bash
if fairspec dataset copy dataset.json --to-path ./output --silent; then
  echo "Success"
else
  echo "Failed"
fi
```

## Examples
### Create Multi-Table Dataset

```bash
# Prepare your data files
# - customers.csv
# - orders.csv
# - products.csv

# Infer the dataset
fairspec dataset infer customers.csv orders.csv products.csv --json > dataset.json

# Enhance the descriptor
cat > dataset.json << 'EOF'
{
  "name": "sales-data",
  "title": "Sales Database Export",
  "description": "Customer orders and product catalog",
  "license": "CC-BY-4.0",
  "resources": [
    {
      "name": "customers",
      "data": "customers.csv",
      "tableSchema": { "properties": { ... } }
    },
    {
      "name": "orders",
      "data": "orders.csv",
      "tableSchema": {
        "properties": { ... },
        "foreignKeys": [
          {
            "columns": ["customer_id"],
            "reference": { "resource": "customers", "columns": ["id"] }
          }
        ]
      }
    }
  ]
}
EOF

# Validate
fairspec dataset validate dataset.json
```

### Download and Validate Public Dataset
```bash
# Copy public dataset
fairspec dataset copy https://data.example.org/climate/dataset.json \
  --to-path ./climate-data

# Validate local copy
fairspec dataset validate ./climate-data/dataset.json

# List available resources
fairspec dataset list ./climate-data/dataset.json

# Explore specific resource
fairspec table describe --from-dataset ./climate-data/dataset.json \
  --from-resource temperature
```

### Dataset Testing
Section titled “Dataset Testing”echo "Testing dataset integrity..."
# 1. Validate descriptorif ! fairspec dataset validate dataset.json --silent; then echo "✗ Dataset validation failed" fairspec dataset validate dataset.json exit 1fi
# 2. Check all resources existfor resource in $(fairspec dataset list dataset.json --json | jq -r '.[]'); do echo "Checking resource: $resource" if ! fairspec table describe --from-dataset dataset.json --from-resource "$resource" --silent; then echo "✗ Resource $resource could not be loaded" exit 1 fidone
echo "✓ All tests passed"Interactive Data Exploration
```bash
# Start interactive session
fairspec dataset script dataset.json
```

In the REPL, explore the dataset:

```js
// List all resources
dataset.resources.map(r => r.name)

// Load a specific resource
const users = await fairspec.loadTable(dataset.resources.find(r => r.name === 'users'))

// Query the data
const activeUsers = await users.filter(pl.col('active').eq(true)).collect()
console.log(activeUsers)

// Check schema
console.log(dataset.resources[0].tableSchema)
```

## Working with Resources
All dataset commands integrate with table commands through the `--from-dataset` and `--from-resource` options:
```bash
# Load resource from dataset
fairspec table describe --from-dataset dataset.json --from-resource users

# Query resource
fairspec table query --from-dataset dataset.json --from-resource orders \
  "SELECT * FROM self WHERE status = 'shipped'"

# Validate resource
fairspec table validate --from-dataset dataset.json --from-resource products

# Infer resource schema
fairspec table infer-schema --from-dataset dataset.json --from-resource users
```

This approach allows you to:
- Work with resources without specifying paths or formats
- Use embedded Table Schemas automatically
- Maintain consistency across your dataset
- Simplify command-line usage