Features
CloudCat is designed to make previewing cloud data effortless. Here's what it offers:
Cloud Storage Support
| Provider | URL Scheme | Status |
|---|---|---|
| Google Cloud Storage | gcs:// or gs:// | Supported |
| Amazon S3 | s3:// | Supported |
| Azure Blob Storage | az:// or azure:// | Supported |
File Format Support
CloudCat automatically detects file formats from extensions and handles them appropriately:
| Format | Read | Auto-Detect | Streaming | Use Case |
|---|---|---|---|---|
| CSV | Yes | Yes | Yes | General data files |
| JSON | Yes | Yes | Yes | API responses, configs |
| JSON Lines | Yes | Yes | Yes | Log files, streaming data |
| Parquet | Yes | Yes | Yes | Spark/analytics data |
| Avro | Yes | Yes | Yes | Kafka, data pipelines |
| ORC | Yes | Yes | Yes | Hive, Hadoop ecosystem |
| Text | Yes | Yes | Yes | Log files, plain text |
| TSV | Yes | Via --delimiter | Yes | Tab-separated data |
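For example, a tab-separated file can be read by passing the delimiter explicitly (the bucket and file names below are placeholders):
cloudcat -p gcs://my-bucket/data.tsv -d "\t"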
Compression Support
CloudCat automatically detects and decompresses files based on extension:
| Format | Extension | Built-in | Installation |
|---|---|---|---|
| Gzip | .gz, .gzip | Yes | Included |
| Bzip2 | .bz2 | Yes | Included |
| Zstandard | .zst, .zstd | Optional | pip install cloudcat[zstd] |
| LZ4 | .lz4 | Optional | pip install cloudcat[lz4] |
| Snappy | .snappy | Optional | pip install cloudcat[snappy] |
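For example, reading a Zstandard-compressed file only requires the optional extra to be installed first (the bucket and object names here are placeholders):
pip install 'cloudcat[zstd]'
cloudcat -p s3://my-bucket/events.json.zst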
Output Formats
| Format | Flag | Description |
|---|---|---|
| Table | -o table | Beautiful ASCII table with colored headers (default) |
| JSON | -o json | Standard JSON Lines output (one record per line) |
| Pretty JSON | -o jsonp | Syntax-highlighted, indented JSON with colors |
| CSV | -o csv | Comma-separated values for further processing |
Key Capabilities
- Schema Inspection - View column names and data types before previewing data
- Column Selection - Display only the columns you need with --columns
- Row Limiting - Control how many rows to preview with --num-rows
- Row Offset - Skip first N rows for pagination with --offset
- WHERE Filtering - Filter rows with SQL-like conditions using --where
- Record Counting - Get total record counts (instant for Parquet via metadata)
- Multi-File Reading - Combine data from multiple files in a directory
- Custom Delimiters - Support for tab, pipe, semicolon, and other delimiters
- Auto Decompression - Transparent handling of compressed files
- Directory Intelligence - Automatically discovers data files in Spark/Hive outputs
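Several of these capabilities combine in a single command. The example below is illustrative (bucket, file, and column names are placeholders):
# select columns, filter, and paginate in one call
cloudcat -p s3://my-bucket/events.parquet -c user_id,event_type -w "event_type=purchase" --offset 50 -n 25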
Installation
Homebrew (macOS Apple Silicon)
The easiest way to install on Apple Silicon Macs (M1/M2/M3/M4) — no Python required:
brew install jonathansudhakar1/cloudcat/cloudcat
This installs a self-contained binary that includes Python and all dependencies.
Intel Mac users: Homebrew bottles are not available for Intel. Please use pip install 'cloudcat[all]' instead.
To upgrade:
brew upgrade cloudcat
Note: On first run, macOS may block the app. Go to System Settings > Privacy & Security and click "Allow", or run:
xattr -d com.apple.quarantine $(which cloudcat)
pip (All Platforms)
Install CloudCat with all features enabled:
pip install 'cloudcat[all]'
This includes support for all cloud providers (GCS, S3, Azure), all file formats (Parquet, Avro, ORC), and all compression types (zstd, lz4, snappy).
Standard pip Installation
For basic functionality with GCS, S3, and Azure support:
pip install cloudcat
Includes CSV, JSON, and text format support with gzip and bz2 compression.
Install with Specific Features
Install only what you need:
| Extra | Command | Adds Support For |
|---|---|---|
| parquet | pip install 'cloudcat[parquet]' | Apache Parquet files |
| avro | pip install 'cloudcat[avro]' | Apache Avro files |
| orc | pip install 'cloudcat[orc]' | Apache ORC files |
| compression | pip install 'cloudcat[compression]' | zstd, lz4, snappy |
| zstd | pip install 'cloudcat[zstd]' | Zstandard compression only |
| lz4 | pip install 'cloudcat[lz4]' | LZ4 compression only |
| snappy | pip install 'cloudcat[snappy]' | Snappy compression only |
Requirements
- Homebrew: macOS Apple Silicon (M1/M2/M3/M4). Intel Mac users should use pip.
- pip: Python 3.7 or higher (all platforms)
- Cloud Credentials: Configured for your cloud provider (see Authentication)
Note: If using zsh (default on macOS), quotes around extras are required to prevent shell interpretation of brackets.
Upgrading
Upgrade to the latest version:
pip install --upgrade cloudcat
Or with all extras:
pip install --upgrade 'cloudcat[all]'
Verifying Installation
Check that CloudCat is installed correctly:
cloudcat --help
You should see the help output with all available options.
Quick Start
Get started with CloudCat in seconds. Here are the most common operations:
Preview a CSV File
# From Google Cloud Storage
cloudcat -p gcs://my-bucket/data.csv
# From Amazon S3
cloudcat -p s3://my-bucket/data.csv
# From Azure Blob Storage
cloudcat -p az://my-container/data.csv
Preview Parquet Files
# Preview first 10 rows (default)
cloudcat -p s3://my-bucket/analytics/events.parquet
# Preview 50 rows
cloudcat -p gcs://my-bucket/data.parquet -n 50
Preview JSON Data
# Standard JSON
cloudcat -p s3://my-bucket/config.json
# JSON Lines (newline-delimited JSON)
cloudcat -p gcs://my-bucket/events.jsonl
# With pretty formatting
cloudcat -p az://my-container/logs.json -o jsonp
Select Specific Columns
cloudcat -p gcs://bucket/users.json -c id,name,email
Filter Rows
# Exact match
cloudcat -p s3://bucket/users.parquet --where "status=active"
# Numeric comparison
cloudcat -p gcs://bucket/events.json --where "age>30"
# String contains
cloudcat -p s3://bucket/logs.csv --where "message contains error"
View Schema Only
cloudcat -p s3://bucket/events.parquet -s schema_only
Read Compressed Files
CloudCat automatically decompresses files:
# Gzip
cloudcat -p gcs://bucket/data.csv.gz
# Zstandard
cloudcat -p s3://bucket/events.parquet.zst
# LZ4
cloudcat -p s3://bucket/data.csv.lz4
Read from Spark Output Directory
cloudcat -p s3://my-bucket/spark-output/ -i parquet
CloudCat automatically discovers data files and ignores metadata files like _SUCCESS.
Pagination
# Skip first 100 rows, show next 10
cloudcat -p gcs://bucket/data.csv --offset 100 -n 10
Convert and Export
# Convert Parquet to CSV
cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv
# Export specific columns
cloudcat -p s3://bucket/users.parquet -c email,created_at -o csv -n 0 > emails.csv
# Pipe to jq for JSON processing
cloudcat -p s3://bucket/events.json -o json | jq '.status'
Command Reference
Complete reference for all CloudCat command-line options.
Usage
cloudcat [OPTIONS]
Required Options
| Option | Description |
|---|---|
| -p, --path TEXT | Cloud storage path (required). Format: gcs://bucket/path, s3://bucket/path, or az://container/path |
Output & Format Options
| Option | Default | Description |
|---|---|---|
| -o, --output-format TEXT | table | Output format: table, json, jsonp, csv |
| -i, --input-format TEXT | auto-detect | Input format: csv, json, parquet, avro, orc, text |
Data Selection Options
| Option | Default | Description |
|---|---|---|
| -c, --columns TEXT | all | Comma-separated list of columns to display |
| -n, --num-rows INTEGER | 10 | Number of rows to display (0 for all rows) |
| --offset INTEGER | 0 | Skip first N rows |
Filtering & Schema Options
| Option | Default | Description |
|---|---|---|
| -w, --where TEXT | none | Filter rows with SQL-like conditions |
| -s, --schema TEXT | show | Schema display: show, dont_show, schema_only |
| --no-count | false | Disable automatic record counting |
Directory Handling Options
| Option | Default | Description |
|---|---|---|
| -m, --multi-file-mode TEXT | auto | Directory handling: auto, first, all |
| --max-size-mb INTEGER | 25 | Max data size for multi-file mode in MB |
CSV Options
| Option | Default | Description |
|---|---|---|
| -d, --delimiter TEXT | comma | CSV delimiter (use \t for tab) |
Cloud Provider Authentication
| Option | Description |
|---|---|
| --profile TEXT | AWS profile name (for S3 access) |
| --project TEXT | GCP project ID (for GCS access) |
| --credentials TEXT | Path to GCP service account JSON file |
| --account TEXT | Azure storage account name |
General Options
| Option | Description |
|---|---|
| --help | Show help message and exit |
Examples
# Basic usage
cloudcat -p gcs://bucket/data.csv
# Select columns and limit rows
cloudcat -p s3://bucket/users.parquet -c id,name,email -n 20
# Filter with WHERE clause
cloudcat -p gcs://bucket/events.json --where "status=active"
# Output as JSON
cloudcat -p az://container/data.csv -o json
# Read from Spark output directory
cloudcat -p s3://bucket/spark-output/ -i parquet -m all
# Use custom delimiter for TSV
cloudcat -p gcs://bucket/data.tsv -d "\t"
# Pagination
cloudcat -p s3://bucket/large.csv --offset 100 -n 10
# Schema only
cloudcat -p gcs://bucket/events.parquet -s schema_only
# With AWS profile
cloudcat -p s3://bucket/data.csv --profile production
# With GCP credentials
cloudcat -p gcs://bucket/data.csv --credentials /path/to/key.json
WHERE Operators
CloudCat supports SQL-like filtering with the --where option. Filter your data before it's displayed to focus on exactly what you need.
Supported Operators
| Operator | Example | Description |
|---|---|---|
| = | status=active | Exact match |
| != | type!=deleted | Not equal |
| > | age>30 | Greater than |
| < | price<100 | Less than |
| >= | count>=10 | Greater than or equal |
| <= | score<=50 | Less than or equal |
| contains | name contains john | Case-insensitive substring match |
| startswith | email startswith admin | String prefix match |
| endswith | file endswith .csv | String suffix match |
Usage Examples
Exact Match
# Filter by status
cloudcat -p s3://bucket/users.parquet --where "status=active"
# Filter by category
cloudcat -p gcs://bucket/products.json --where "category=electronics"
Numeric Comparisons
# Greater than
cloudcat -p s3://bucket/users.parquet --where "age>30"
# Less than
cloudcat -p gcs://bucket/orders.csv --where "price<100"
# Greater than or equal
cloudcat -p s3://bucket/events.json --where "count>=10"
# Less than or equal
cloudcat -p gcs://bucket/scores.parquet --where "score<=50"
String Matching
# Contains (case-insensitive)
cloudcat -p s3://bucket/logs.json --where "message contains error"
# Starts with
cloudcat -p gcs://bucket/users.csv --where "email startswith admin"
# Ends with
cloudcat -p s3://bucket/files.json --where "filename endswith .csv"
Not Equal
# Exclude deleted records
cloudcat -p gcs://bucket/records.parquet --where "status!=deleted"
# Exclude specific type
cloudcat -p s3://bucket/events.json --where "type!=test"
Combining with Other Options
# Filter and select columns
cloudcat -p s3://bucket/users.parquet --where "status=active" -c id,name,email
# Filter and limit rows
cloudcat -p gcs://bucket/events.json --where "type=error" -n 50
# Filter with pagination
cloudcat -p s3://bucket/logs.csv --where "level=ERROR" --offset 100 -n 20
# Filter and export
cloudcat -p gcs://bucket/users.parquet --where "country=US" -o csv -n 0 > us_users.csv
Tips
- String values don't need quotes in the WHERE clause
- Comparisons are type-aware (numeric columns compare numerically)
- The contains, startswith, and endswith operators are case-insensitive
- For best performance, filter on columns that exist in your data
Authentication
CloudCat uses standard authentication methods for each cloud provider. Configure your credentials once and CloudCat will use them automatically.
Google Cloud Storage (GCS)
CloudCat uses Application Default Credentials (ADC) for GCS authentication.
Option 1: User Credentials (Development)
Best for local development:
gcloud auth application-default login
This opens a browser for Google account authentication.
Option 2: Service Account (Environment Variable)
Set the path to your service account JSON file:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
Then use CloudCat normally:
cloudcat -p gcs://bucket/data.csv
Option 3: Service Account (CLI Option)
Pass the credentials file directly:
cloudcat -p gcs://bucket/data.csv --credentials /path/to/service-account.json
Option 4: Specify GCP Project
If your credentials have access to multiple projects:
cloudcat -p gcs://bucket/data.csv --project my-gcp-project
Amazon S3
CloudCat uses the standard AWS credential chain.
Option 1: Environment Variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"
Option 2: AWS Credentials File
Configure credentials using the AWS CLI:
aws configure
This creates ~/.aws/credentials with your access keys.
Option 3: Named Profile
Use a specific AWS profile:
cloudcat -p s3://bucket/data.csv --profile production
Profiles are defined in ~/.aws/credentials:
[production]
aws_access_key_id = AKIA...
aws_secret_access_key = ...
region = us-west-2
Option 4: IAM Role (EC2/ECS/Lambda)
When running on AWS infrastructure (EC2, ECS, Lambda), CloudCat automatically uses the attached IAM role. No configuration needed.
Azure Blob Storage
CloudCat supports multiple authentication methods for Azure.
Option 1: Connection String (Simplest)
Set the full connection string:
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
Option 2: Azure AD Authentication
Use Azure CLI login with account URL:
# Set the account URL
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"
# Login with Azure CLI
az login
CloudCat will use DefaultAzureCredential to authenticate.
Option 3: Storage Account (CLI Option)
Specify the storage account directly:
cloudcat -p az://container/data.csv --account mystorageaccount
This requires either a connection string or Azure AD authentication to be configured.
Path Formats
| Provider | URL Format | Example |
|---|---|---|
| GCS | gcs://bucket/path or gs://bucket/path | gcs://my-bucket/data/file.csv |
| S3 | s3://bucket/path | s3://my-bucket/data/file.parquet |
| Azure | az://container/path or azure://container/path | az://my-container/data/file.json |
Troubleshooting Authentication
GCS: "Could not automatically determine credentials"
gcloud auth application-default login
S3: "Unable to locate credentials"
aws configure
# Or set environment variables
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
Azure: "Azure credentials not found"
# Option 1: Set connection string
export AZURE_STORAGE_CONNECTION_STRING="..."
# Option 2: Use Azure CLI
export AZURE_STORAGE_ACCOUNT_URL="https://account.blob.core.windows.net"
az login
Directory Operations
CloudCat intelligently handles directories containing multiple data files, common with Spark, Hive, and distributed processing outputs.
Multi-File Mode
Control how CloudCat handles directories with the -m, --multi-file-mode option:
| Mode | Description |
|---|---|
| auto | Smart selection based on directory contents (default) |
| first | Read only the first data file found |
| all | Combine data from all files in the directory |
Auto Mode (Default)
In auto mode, CloudCat analyzes the directory and makes smart decisions:
cloudcat -p s3://bucket/spark-output/
- Scans directory for data files
- Ignores metadata files (_SUCCESS, _metadata, .crc, etc.)
- Selects appropriate files based on format
- Reports which files were selected
First File Mode
Read only the first file for quick sampling:
cloudcat -p gcs://bucket/large-output/ -m first
Best for:
- Quick data validation
- Large directories with many files
- When you only need a sample
All Files Mode
Combine data from multiple files:
cloudcat -p s3://bucket/daily-logs/ -m all
Best for:
- Aggregating partitioned data
- Reading complete datasets
- Directories with related files
Size Limits
Control maximum data size when reading multiple files:
# Read up to 100MB of data
cloudcat -p gcs://bucket/events/ -m all --max-size-mb 100
Default is 25MB to prevent accidentally loading huge datasets.
Automatic File Filtering
CloudCat automatically ignores common metadata files:
- _SUCCESS - Spark/Hadoop success markers
- _metadata - Parquet metadata files
- _common_metadata - Parquet common metadata
- .crc files - Checksum files
- .committed - Transaction markers
- .pending - Pending transaction files
- _temporary directories - Temporary files
Examples
Spark Output Directory
# Typical Spark output structure:
# s3://bucket/output/
# _SUCCESS
# part-00000-abc.parquet
# part-00001-def.parquet
cloudcat -p s3://bucket/output/ -i parquet
# Automatically reads part files, ignores _SUCCESS
Hive Partitioned Data
# Partitioned structure:
# gcs://bucket/events/
# year=2024/month=01/data.parquet
# year=2024/month=02/data.parquet
cloudcat -p gcs://bucket/events/ -m all -i parquet
Daily Log Files
# Log directory:
# s3://bucket/logs/
# 2024-01-15.json
# 2024-01-16.json
# 2024-01-17.json
cloudcat -p s3://bucket/logs/ -m all -n 100
Large Directory Sampling
# Quick preview of first file only
cloudcat -p gcs://bucket/huge-dataset/ -m first -n 20
Format Detection in Directories
When reading from a directory, you may want to specify the format:
# Explicitly set format for directory
cloudcat -p s3://bucket/output/ -i parquet
# Auto-detect from first matching file
cloudcat -p gcs://bucket/data/
CloudCat examines file extensions to determine format when not specified.
Tips
- Use -m first for quick validation of large directories
- Use --max-size-mb to control memory usage with -m all
- Specify -i format when directory contains mixed file types
- CloudCat preserves column order across multiple files
Output Formats
CloudCat supports multiple output formats to suit different workflows. Use the -o, --output-format option to choose.
Table (Default)
Beautiful ASCII tables with colored headers, perfect for terminal viewing:
cloudcat -p gcs://bucket/data.csv
Output:
┌────────┬─────────────┬─────────────────────┬────────────────────┐
│ id │ name │ email │ created_at │
├────────┼─────────────┼─────────────────────┼────────────────────┤
│ 1 │ Alice │ alice@example.com │ 2024-01-15 10:30 │
│ 2 │ Bob │ bob@example.com │ 2024-01-15 11:45 │
│ 3 │ Charlie │ charlie@example.com │ 2024-01-16 09:00 │
└────────┴─────────────┴─────────────────────┴────────────────────┘
Best for:
- Interactive terminal use
- Quick data inspection
- Readable output
JSON Lines
Standard JSON Lines format (one JSON object per line):
cloudcat -p s3://bucket/data.parquet -o json
Output:
{"id": 1, "name": "Alice", "email": "alice@example.com"}
{"id": 2, "name": "Bob", "email": "bob@example.com"}
{"id": 3, "name": "Charlie", "email": "charlie@example.com"}Best for:
- Piping to
jqfor processing - Integration with other tools
- Machine-readable output
Processing with jq
# Filter by field
cloudcat -p s3://bucket/events.json -o json | jq 'select(.status == "error")'
# Extract specific fields
cloudcat -p gcs://bucket/users.parquet -o json | jq '.email'
# Count by field
cloudcat -p s3://bucket/logs.json -o json -n 0 | jq -s 'group_by(.level) | map({level: .[0].level, count: length})'
Pretty JSON
Syntax-highlighted, indented JSON for human readability:
cloudcat -p gcs://bucket/config.json -o jsonp
Output:
{
"id": 1,
"name": "Alice",
"metadata": {
"created": "2024-01-15",
"tags": ["user", "active"]
}
}
Best for:
- Viewing nested JSON structures
- Debugging API responses
- Human-readable inspection
CSV
Comma-separated values for export and further processing:
cloudcat -p s3://bucket/data.parquet -o csv
Output:
id,name,email,created_at
1,Alice,alice@example.com,2024-01-15 10:30
2,Bob,bob@example.com,2024-01-15 11:45
3,Charlie,charlie@example.com,2024-01-16 09:00
Best for:
- Exporting to spreadsheets
- Further data processing
- Format conversion
Export Examples
Convert Parquet to CSV
cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv
Export Specific Columns
cloudcat -p s3://bucket/users.parquet -c email,created_at -o csv -n 0 > emails.csv
Export Filtered Data
cloudcat -p gcs://bucket/events.json --where "status=error" -o csv -n 0 > errors.csv
Convert JSON to CSV
cloudcat -p s3://bucket/api-response.json -o csv > response.csv
Combining Output Formats with Other Options
# Table with column selection
cloudcat -p gcs://bucket/data.csv -c id,name,email -o table
# JSON with filtering
cloudcat -p s3://bucket/users.parquet --where "active=true" -o json
# CSV with row limit
cloudcat -p gcs://bucket/events.json -o csv -n 100
# Pretty JSON with schema
cloudcat -p s3://bucket/config.json -o jsonp -s show
Tips
- Use -n 0 to output all rows when exporting
- Use table for interactive inspection, json for piping, csv for export
- Redirect output to a file with > filename for large exports
- The jsonp format includes colors; color codes are lost when output is redirected to a file
Use Cases
Real-world scenarios where CloudCat shines.
Debugging Spark Jobs
Quickly validate Spark job output without downloading files:
# Check output of a Spark job
cloudcat -p gcs://data-lake/jobs/daily-etl/output/ -i parquet -n 20
# Verify schema matches expectations
cloudcat -p s3://analytics/spark-output/ -s schema_only
# Sample data from large output
cloudcat -p gcs://bucket/aggregations/ -m first -n 50
Log Analysis
Preview and filter log files stored in cloud storage:
# Preview recent logs
cloudcat -p gcs://logs/app/2024-01-15/ -m all -n 50
# Filter for errors
cloudcat -p s3://logs/api/ --where "level=ERROR" -n 100
# Search log messages
cloudcat -p gcs://logs/app/ --where "message contains timeout"
# Export errors for analysis
cloudcat -p s3://logs/errors/ -o json -n 0 | jq 'select(.status >= 500)'
Data Validation
Verify data quality and structure before processing:
# Quick sanity check on data export
cloudcat -p gcs://exports/daily/users.csv -s show
# Verify record count
cloudcat -p s3://warehouse/transactions.parquet
# Check for null values (preview and inspect)
cloudcat -p gcs://data/customers.parquet -n 100
# Validate schema before ETL
cloudcat -p s3://input/raw-data.json -s schema_only
Format Conversion
Convert between data formats using CloudCat:
# Convert Parquet to CSV
cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv
# Convert JSON to CSV for spreadsheet import
cloudcat -p s3://api-dumps/response.json -o csv > data.csv
# Convert tab-separated to comma-separated
cloudcat -p gcs://imports/data.tsv -d "\t" -o csv > converted.csv
# Export Avro as JSON Lines
cloudcat -p s3://kafka/events.avro -o json -n 0 > events.jsonl
Data Exploration
Understand unfamiliar datasets quickly:
# View schema of unknown file
cloudcat -p s3://vendor-data/export.parquet -s schema_only
# Preview first few rows
cloudcat -p gcs://bucket/new-data.csv -n 5
# Check all columns
cloudcat -p s3://bucket/wide-table.parquet -n 3
Data Sampling
Get representative samples from large datasets:
# Random-ish sample (use offset)
cloudcat -p gcs://bucket/huge-table.parquet --offset 10000 -n 100
# Sample from each partition
cloudcat -p s3://bucket/year=2024/month=01/ -m first -n 50
cloudcat -p s3://bucket/year=2024/month=02/ -m first -n 50
# Quick peek at different columns
cloudcat -p gcs://bucket/data.parquet -c user_id,event_type -n 20
cloudcat -p gcs://bucket/data.parquet -c timestamp,value -n 20
Pipeline Debugging
Debug data pipeline issues:
# Check intermediate outputs
cloudcat -p s3://pipeline/stage1-output/ -i parquet -n 10
cloudcat -p s3://pipeline/stage2-output/ -i parquet -n 10
# Compare schemas between stages
cloudcat -p gcs://etl/raw/ -s schema_only
cloudcat -p gcs://etl/transformed/ -s schema_only
# Find records with specific IDs
cloudcat -p s3://data/users.parquet --where "user_id=12345"
Kafka/Event Streaming
Preview data from Kafka exports:
# Read Avro files from Kafka Connect
cloudcat -p s3://kafka-exports/topic-name/ -i avro
# Filter events by type
cloudcat -p gcs://events/user-actions/ --where "event_type=purchase"
# Preview JSON events
cloudcat -p s3://kinesis/events.jsonl -o jsonp
Multi-Cloud Data Access
Work with data across multiple cloud providers:
# Compare data between clouds
cloudcat -p gcs://source/data.parquet -c id,value -n 100
cloudcat -p s3://destination/data.parquet -c id,value -n 100
# Verify replication
cloudcat -p gcs://primary/users.csv
cloudcat -p az://backup/users.csv
Integration with Other Tools
Combine CloudCat with other command-line tools:
# Count records with wc
cloudcat -p s3://bucket/data.csv -o csv -n 0 | wc -l
# Filter with grep
cloudcat -p gcs://logs/app.json -o json | grep "ERROR"
# Process with awk
cloudcat -p s3://data/report.csv -o csv | awk -F',' '{sum+=$3} END {print sum}'
# Sort and unique
cloudcat -p gcs://data/users.csv -c country -o csv -n 0 | sort | uniq -c
Performance Tips
Optimize CloudCat for faster performance and lower data transfer costs.
1. Use --no-count for Large Files
By default, CloudCat counts total records in the file. For large files, skip this:
cloudcat -p s3://bucket/huge-file.csv --no-count
This is especially helpful for CSV and JSON files where counting requires scanning the entire file.
2. Prefer Parquet Format
Parquet files offer the best performance with CloudCat:
- Instant record counts from metadata (no file scan needed)
- Column pruning when using --columns (only reads selected columns)
- Better compression means less data transfer
# Record count is instant for Parquet
cloudcat -p gcs://bucket/data.parquet
# Column selection reads only needed columns
cloudcat -p s3://bucket/wide-table.parquet -c id,name,email
3. Limit Rows with --num-rows
Reduce data transfer by limiting rows:
# Preview only 20 rows instead of default 10
cloudcat -p gcs://bucket/data.csv -n 20
# Don't use -n 0 (all rows) unless you need everything
4. Select Only Needed Columns
With columnar formats (Parquet, ORC), column selection reduces data transfer:
# Reads only 3 columns instead of all 50
cloudcat -p s3://bucket/wide-table.parquet -c user_id,event_type,timestamp
5. Use First File Mode for Directories
When you only need a sample from a directory with many files:
# Read only the first file
cloudcat -p gcs://bucket/spark-output/ -m first
# Instead of reading all files
cloudcat -p gcs://bucket/spark-output/ -m all
6. Set Appropriate Size Limits
Control memory usage when reading multiple files:
# Limit to 10MB for quick preview
cloudcat -p s3://bucket/logs/ -m all --max-size-mb 10
# Increase for complete datasets
cloudcat -p s3://bucket/data/ -m all --max-size-mb 100
7. Use Schema-Only for Structure Checks
When you only need to check the schema:
# Instant - doesn't read data
cloudcat -p gcs://bucket/data.parquet -s schema_only
8. Compression Considerations
CloudCat handles compressed files efficiently:
- Gzip/Bzip2 - Built-in, always available
- Zstandard - Fast decompression, good for large files
- LZ4 - Fastest decompression
- Snappy - Good balance of speed and ratio
For best performance with large files, prefer zstd or lz4:
cloudcat -p s3://bucket/data.csv.zst -n 100
9. Network Considerations
CloudCat streams data, so network latency matters:
- Run CloudCat close to your data (same region)
- Use AWS EC2/GCP Compute/Azure VMs in the same region as your buckets
- For local development, expect slower performance due to network transfer
10. Memory Management
For very large previews, be mindful of memory:
# This loads all data into memory
cloudcat -p s3://bucket/huge.parquet -n 0
# Better: limit rows
cloudcat -p s3://bucket/huge.parquet -n 1000
Performance Comparison
| Operation | CSV | JSON | Parquet |
|---|---|---|---|
| Record Count | Slow (full scan) | Slow (full scan) | Instant (metadata) |
| Column Selection | Full file read | Full file read | Reads only selected |
| First N Rows | Fast (stops early) | Fast (stops early) | Fast |
| Compression | Standard | Standard | Built-in, efficient |
Quick Reference
| Goal | Recommendation |
|---|---|
| Fastest preview | -n 10 --no-count |
| Check structure | -s schema_only |
| Large directories | -m first |
| Wide tables | -c col1,col2,col3 |
| Memory efficiency | Set reasonable -n value |
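As an illustration, several of these recommendations combine naturally when sampling a large Spark output directory (the bucket and column names below are placeholders):
# first file only, two columns, small preview, no record count
cloudcat -p gcs://my-bucket/spark-output/ -i parquet -m first -c id,status -n 10 --no-count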
Troubleshooting
Solutions to common issues when using CloudCat.
Missing Package Errors
"google-cloud-storage package is required"
pip install cloudcat
# or
pip install google-cloud-storage
"boto3 package is required"
pip install cloudcat
# or
pip install boto3
"azure-storage-blob package is required"
pip install cloudcat
# or
pip install azure-storage-blob azure-identity
"pyarrow package is required"
For Parquet or ORC file support:
pip install 'cloudcat[parquet]'
# or
pip install pyarrow
"fastavro package is required"
For Avro file support:
pip install 'cloudcat[avro]'
# or
pip install fastavro
"zstandard package is required for .zst files"
pip install 'cloudcat[zstd]'
# or for all compression:
pip install 'cloudcat[compression]'
"lz4 package is required for .lz4 files"
pip install 'cloudcat[lz4]'
"python-snappy package is required for .snappy files"
pip install 'cloudcat[snappy]'
Authentication Errors
GCS: "Could not automatically determine credentials"
Set up Google Cloud authentication:
# Option 1: User credentials
gcloud auth application-default login
# Option 2: Service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# Option 3: CLI option
cloudcat -p gcs://bucket/file.csv --credentials /path/to/key.json
S3: "Unable to locate credentials"
Set up AWS authentication:
# Option 1: Configure AWS CLI
aws configure
# Option 2: Environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"
# Option 3: Use named profile
cloudcat -p s3://bucket/file.csv --profile myprofile
Azure: "Azure credentials not found"
Set up Azure authentication:
# Option 1: Connection string
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
# Option 2: Azure AD
export AZURE_STORAGE_ACCOUNT_URL="https://account.blob.core.windows.net"
az login
# Option 3: Specify account
cloudcat -p az://container/file.csv --account mystorageaccount
Format Detection Issues
"Could not infer format from path"
When CloudCat can't determine the file format:
# Specify the format explicitly
cloudcat -p gcs://bucket/data -i parquet
cloudcat -p s3://bucket/file -i csv
cloudcat -p az://container/logs -i json
Reading files without extensions
cloudcat -p s3://bucket/data-file -i parquet
Access Permission Errors
"Access Denied" or "403 Forbidden"
Check that your credentials have the necessary permissions:
GCS:
- storage.objects.get for reading files
- storage.objects.list for listing directories
S3:
- s3:GetObject for reading files
- s3:ListBucket for listing directories
Azure:
- Storage Blob Data Reader role or equivalent
Network Issues
Timeout errors
For slow connections or large files:
- Use --num-rows to limit data transfer
- Use --no-count to skip record counting
- Check network connectivity to the cloud provider
"Connection reset" errors
May indicate network instability. Try:
# Smaller preview
cloudcat -p s3://bucket/file.csv -n 10 --no-count
Memory Issues
"MemoryError" or system slowdown
When previewing large files:
# Limit rows
cloudcat -p gcs://bucket/huge.parquet -n 100
# Don't load all rows
# Avoid: cloudcat -p s3://bucket/huge.csv -n 0
# Limit directory size
cloudcat -p s3://bucket/large-dir/ -m all --max-size-mb 10
CSV Issues
Wrong columns or parsing errors
For non-standard CSV files:
# Tab-separated
cloudcat -p gcs://bucket/data.tsv -d "\t"
# Pipe-delimited
cloudcat -p s3://bucket/data.txt -d "|"
# Semicolon-delimited
cloudcat -p gcs://bucket/data.csv -d ";"
Directory Issues
"No data files found in directory"
Check that:
- The directory contains files with recognized extensions
- Files aren't all metadata files (_SUCCESS, .crc, etc.)
- You have permission to list the directory
# Specify format explicitly
cloudcat -p s3://bucket/output/ -i parquet
Getting Help
If you're still having issues:
- Check you're using the latest version: pip install --upgrade cloudcat
- Try with --help to see all options
- Open an issue on GitHub with:
- CloudCat version
- Python version
- Full error message
- Command that caused the error
Contributing
Contributions are welcome! Here's how you can help improve CloudCat.
Ways to Contribute
- Report Bugs - Open an issue with reproduction steps
- Suggest Features - Open an issue describing the use case
- Submit PRs - Fork, create a branch, and submit a pull request
- Improve Docs - Help make the documentation better
- Share CloudCat - Star the repo and spread the word
Development Setup
Clone and set up the development environment:
# Clone the repository
git clone https://github.com/jonathansudhakar1/cloudcat.git
cd cloudcat
# Create virtual environment
python -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
# Install in development mode with all dependencies
pip install -e ".[all]"
# Run tests
pytest
Project Structure
cloudcat/
├── cloudcat/
│ ├── __init__.py # Version info
│ ├── cli.py # Main CLI entry point
│ ├── config.py # Configuration management
│ ├── compression.py # Compression handling
│ ├── filtering.py # WHERE clause parsing
│ ├── formatters.py # Output formatting
│ ├── readers/ # Format readers
│ │ ├── csv.py
│ │ ├── json.py
│ │ ├── parquet.py
│ │ ├── avro.py
│ │ ├── orc.py
│ │ └── text.py
│ └── storage/ # Cloud storage clients
│ ├── base.py
│ ├── gcs.py
│ ├── s3.py
│ └── azure.py
├── tests/
├── docs/
├── setup.py
└── README.md
Submitting Pull Requests
- Fork the repository
- Create a feature branch: git checkout -b feature/my-feature
- Make your changes
- Add tests if applicable
- Run tests: pytest
- Commit your changes: git commit -m "Add my feature"
- Push to your fork: git push origin feature/my-feature
- Open a pull request
Code Style
- Follow PEP 8 guidelines
- Use meaningful variable names
- Add docstrings to functions
- Keep functions focused and small
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=cloudcat
# Run specific test file
pytest tests/test_cli.py
Adding a New File Format
To add support for a new file format:
- Create a new reader in cloudcat/readers/
- Export a read_*_data() function that returns (DataFrame, schema)
- Update cli.py to handle the new format
- Add format detection in cli.py
cli.py - Add tests
- Update documentation
Adding a New Cloud Provider
To add support for a new cloud provider:
- Create a new client in cloudcat/storage/
- Implement get_stream() and list_directory() functions
- Update cloudcat/storage/base.py to route to the new provider
- Add authentication handling
- Add tests
- Update documentation
Reporting Issues
When reporting bugs, please include:
- CloudCat version (pip show cloudcat)
- Python version (python --version)
- Operating system
- Full error message/traceback
- Minimal reproduction steps
- Sample data (if possible and not sensitive)
Feature Requests
When suggesting features:
- Describe the use case
- Explain how you'd like it to work
- Consider if it fits CloudCat's scope
- Be open to discussion on implementation
License
By contributing, you agree that your contributions will be licensed under the MIT License.
Roadmap
CloudCat is actively developed. Here's what's been accomplished and what's planned.
Completed Features
- Google Cloud Storage support - Full GCS integration
- Amazon S3 support - Full S3 integration with profiles
- Azure Blob Storage support - Full Azure integration
- CSV format - With custom delimiters
- JSON format - Standard JSON and JSON Lines
- Parquet format - With efficient column selection
- Avro format - Full Avro support
- ORC format - Via PyArrow
- Plain text format - For log files
- SQL-like filtering - WHERE clause support
- Compression support - gzip, bz2, zstd, lz4, snappy
- Row offset/pagination - Skip and limit rows
- Schema inspection - View data types
- Multi-file directories - Spark/Hive output support
- Multiple output formats - table, json, jsonp, csv
Planned Features
- Interactive mode - Pagination with keyboard navigation
- Output to file - Direct --output-file option
- Configuration file - .cloudcatrc for defaults
- Multiple WHERE conditions - AND/OR operators
- Sampling - Random row sampling
- Profile support - Named configuration profiles
- Delta Lake support - Read Delta tables
- Iceberg support - Read Iceberg tables
Under Consideration
- Write support - Converting and writing data
- SQL queries - Full SQL query support via DuckDB
- Data profiling - Basic statistics and profiling
- Diff mode - Compare two files
- Watch mode - Monitor file changes
- Plugins - Custom reader/writer plugins
Version History
v0.2.2 (Current)
- Bug fixes and improvements
- Homebrew support for Apple Silicon
v0.2.0
- Azure Blob Storage support
- Avro and ORC format support
- WHERE clause filtering
- Row offset/pagination
- Compression support (zstd, lz4, snappy)
v0.1.0
- Initial release
- GCS and S3 support
- CSV, JSON, Parquet formats
- Basic functionality
Contributing to the Roadmap
Have a feature idea? We'd love to hear it!
- Check existing issues for similar requests
- Open a new issue describing:
- The use case
- How it would work
- Why it would be valuable
- Join the discussion
Feedback
Your feedback shapes CloudCat's development:
- Star the repo to show support
- Open issues for bugs or features
- Contribute pull requests