CloudCat

The Swiss Army knife for viewing cloud storage data from your terminal

$ cloudcat -p gcs://bucket/data.parquet -n 5

Reading from: gcs://bucket/data.parquet
Format: parquet | Records: 1,234,567

Schema:
  id: int64
  name: string
  email: string
  created_at: timestamp

┌────────┬─────────────┬─────────────────────┬────────────────────┐
│ id     │ name        │ email               │ created_at         │
├────────┼─────────────┼─────────────────────┼────────────────────┤
│ 1      │ Alice       │ alice@example.com   │ 2024-01-15 10:30   │
│ 2      │ Bob         │ bob@example.com     │ 2024-01-15 11:45   │
│ 3      │ Charlie     │ charlie@example.com │ 2024-01-16 09:00   │
│ 4      │ Diana       │ diana@example.com   │ 2024-01-16 14:20   │
│ 5      │ Eve         │ eve@example.com     │ 2024-01-17 08:15   │
└────────┴─────────────┴─────────────────────┴────────────────────┘

Features

CloudCat is designed to make previewing cloud data effortless. Here's what it offers:

Cloud Storage Support

Provider              URL Scheme         Status
Google Cloud Storage  gcs:// or gs://    Supported
Amazon S3             s3://              Supported
Azure Blob Storage    az:// or azure://  Supported

File Format Support

CloudCat automatically detects file formats from extensions and handles them appropriately:

Format      Read  Auto-Detect      Streaming  Use Case
CSV         Yes   Yes              Yes        General data files
JSON        Yes   Yes              Yes        API responses, configs
JSON Lines  Yes   Yes              Yes        Log files, streaming data
Parquet     Yes   Yes              Yes        Spark/analytics data
Avro        Yes   Yes              Yes        Kafka, data pipelines
ORC         Yes   Yes              Yes        Hive, Hadoop ecosystem
Text        Yes   Yes              Yes        Log files, plain text
TSV         Yes   Via --delimiter  Yes        Tab-separated data
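
To make the idea of extension-based detection concrete, here is a minimal, illustrative sketch. It is not CloudCat's actual code: the EXTENSION_TO_FORMAT mapping and the detect_format helper are hypothetical, and TSV is omitted because (per the table above) it is selected via --delimiter rather than by extension.

# Illustrative sketch only -- not CloudCat's actual detection code.
from pathlib import Path

EXTENSION_TO_FORMAT = {
    ".csv": "csv", ".json": "json", ".jsonl": "json",
    ".parquet": "parquet", ".avro": "avro", ".orc": "orc", ".txt": "text",
}

def detect_format(path: str, default: str = "text") -> str:
    # Walk the suffixes right to left so "data.csv.gz" resolves to csv
    # once the compression suffix has been skipped over.
    for suffix in reversed([s.lower() for s in Path(path).suffixes]):
        if suffix in EXTENSION_TO_FORMAT:
            return EXTENSION_TO_FORMAT[suffix]
    return default

print(detect_format("gcs://bucket/events.parquet.zst"))  # parquet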

Compression Support

CloudCat automatically detects and decompresses files based on extension:

Format     Extension    Built-in  Installation
Gzip       .gz, .gzip   Yes       Included
Bzip2      .bz2         Yes       Included
Zstandard  .zst, .zstd  Optional  pip install 'cloudcat[zstd]'
LZ4        .lz4         Optional  pip install 'cloudcat[lz4]'
Snappy     .snappy      Optional  pip install 'cloudcat[snappy]'
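
As a rough illustration of how extension-based decompression can work, here is a short sketch. It is not CloudCat's implementation; the open_maybe_compressed helper is hypothetical, and Snappy is left out because python-snappy does not expose a file-open helper of this shape.

# Sketch: choose a decompressor from the file extension.
# gzip/bz2 ship with Python; zstandard and lz4 are optional extras.
import bz2
import gzip
from pathlib import Path

def open_maybe_compressed(path: str):
    suffix = Path(path).suffix.lower()
    if suffix in (".gz", ".gzip"):
        return gzip.open(path, "rb")
    if suffix == ".bz2":
        return bz2.open(path, "rb")
    if suffix in (".zst", ".zstd"):
        import zstandard          # pip install 'cloudcat[zstd]'
        return zstandard.open(path, "rb")
    if suffix == ".lz4":
        import lz4.frame          # pip install 'cloudcat[lz4]'
        return lz4.frame.open(path, "rb")
    return open(path, "rb")       # not compressed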

Output Formats

Format       Flag      Description
Table        -o table  Beautiful ASCII table with colored headers (default)
JSON         -o json   Standard JSON Lines output (one record per line)
Pretty JSON  -o jsonp  Syntax-highlighted, indented JSON with colors
CSV          -o csv    Comma-separated values for further processing

Key Capabilities

  • Schema Inspection - View column names and data types before previewing data
  • Column Selection - Display only the columns you need with --columns
  • Row Limiting - Control how many rows to preview with --num-rows
  • Row Offset - Skip first N rows for pagination with --offset
  • WHERE Filtering - Filter rows with SQL-like conditions using --where
  • Record Counting - Get total record counts (instant for Parquet via metadata)
  • Multi-File Reading - Combine data from multiple files in a directory
  • Custom Delimiters - Support for tab, pipe, semicolon, and other delimiters
  • Auto Decompression - Transparent handling of compressed files
  • Directory Intelligence - Automatically discovers data files in Spark/Hive outputs

Installation

Homebrew (macOS Apple Silicon)

The easiest way to install on Apple Silicon Macs (M1/M2/M3/M4) — no Python required:

brew install jonathansudhakar1/cloudcat/cloudcat

This installs a self-contained binary that includes Python and all dependencies.

Intel Mac users: Homebrew bottles are not available for Intel. Please use pip install 'cloudcat[all]' instead.

To upgrade:

brew upgrade cloudcat

Note: On first run, macOS may block the app. Go to System Settings > Privacy & Security and click "Allow", or run:

xattr -d com.apple.quarantine $(which cloudcat)

pip (All Platforms)

Install CloudCat with all features enabled:

pip install 'cloudcat[all]'

This includes support for all cloud providers (GCS, S3, Azure), all file formats (Parquet, Avro, ORC), and all compression types (zstd, lz4, snappy).

Standard pip Installation

For basic functionality with GCS, S3, and Azure support:

pip install cloudcat

Includes CSV, JSON, and text format support with gzip and bz2 compression.

Install with Specific Features

Install only what you need:

Extra        Command                              Adds Support For
parquet      pip install 'cloudcat[parquet]'      Apache Parquet files
avro         pip install 'cloudcat[avro]'         Apache Avro files
orc          pip install 'cloudcat[orc]'          Apache ORC files
compression  pip install 'cloudcat[compression]'  zstd, lz4, snappy
zstd         pip install 'cloudcat[zstd]'         Zstandard compression only
lz4          pip install 'cloudcat[lz4]'          LZ4 compression only
snappy       pip install 'cloudcat[snappy]'       Snappy compression only

Requirements

  • Homebrew: macOS Apple Silicon (M1/M2/M3/M4). Intel Mac users should use pip.
  • pip: Python 3.7 or higher (all platforms)
  • Cloud Credentials: Configured for your cloud provider (see Authentication)

Note: If using zsh (default on macOS), quotes around extras are required to prevent shell interpretation of brackets.

Upgrading

Upgrade to the latest version:

pip install --upgrade cloudcat

Or with all extras:

pip install --upgrade 'cloudcat[all]'

Verifying Installation

Check that CloudCat is installed correctly:

cloudcat --help

You should see the help output with all available options.

Quick Start

Get started with CloudCat in seconds. Here are the most common operations:

Preview a CSV File

# From Google Cloud Storage
cloudcat -p gcs://my-bucket/data.csv

# From Amazon S3
cloudcat -p s3://my-bucket/data.csv

# From Azure Blob Storage
cloudcat -p az://my-container/data.csv

Preview Parquet Files

# Preview first 10 rows (default)
cloudcat -p s3://my-bucket/analytics/events.parquet

# Preview 50 rows
cloudcat -p gcs://my-bucket/data.parquet -n 50

Preview JSON Data

# Standard JSON
cloudcat -p s3://my-bucket/config.json

# JSON Lines (newline-delimited JSON)
cloudcat -p gcs://my-bucket/events.jsonl

# With pretty formatting
cloudcat -p az://my-container/logs.json -o jsonp

Select Specific Columns

cloudcat -p gcs://bucket/users.json -c id,name,email

Filter Rows

# Exact match
cloudcat -p s3://bucket/users.parquet --where "status=active"

# Numeric comparison
cloudcat -p gcs://bucket/events.json --where "age>30"

# String contains
cloudcat -p s3://bucket/logs.csv --where "message contains error"

View Schema Only

cloudcat -p s3://bucket/events.parquet -s schema_only

Read Compressed Files

CloudCat automatically decompresses files:

# Gzip
cloudcat -p gcs://bucket/data.csv.gz

# Zstandard
cloudcat -p s3://bucket/events.parquet.zst

# LZ4
cloudcat -p s3://bucket/data.csv.lz4

Read from Spark Output Directory

cloudcat -p s3://my-bucket/spark-output/ -i parquet

CloudCat automatically discovers data files and ignores metadata files like _SUCCESS.

Pagination

# Skip first 100 rows, show next 10
cloudcat -p gcs://bucket/data.csv --offset 100 -n 10

Convert and Export

# Convert Parquet to CSV
cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv

# Export specific columns
cloudcat -p s3://bucket/users.parquet -c email,created_at -o csv -n 0 > emails.csv

# Pipe to jq for JSON processing
cloudcat -p s3://bucket/events.json -o json | jq '.status'

Command Reference

Complete reference for all CloudCat command-line options.

Usage

cloudcat [OPTIONS]

Required Options

Option           Description
-p, --path TEXT  Cloud storage path (required). Format: gcs://bucket/path, s3://bucket/path, or az://container/path

Output & Format Options

Option                    Default      Description
-o, --output-format TEXT  table        Output format: table, json, jsonp, csv
-i, --input-format TEXT   auto-detect  Input format: csv, json, parquet, avro, orc, text

Data Selection Options

Option                   Default  Description
-c, --columns TEXT       all      Comma-separated list of columns to display
-n, --num-rows INTEGER   10       Number of rows to display (0 for all rows)
--offset INTEGER         0        Skip first N rows

Filtering & Schema Options

Option             Default  Description
-w, --where TEXT   none     Filter rows with SQL-like conditions
-s, --schema TEXT  show     Schema display: show, dont_show, schema_only
--no-count         false    Disable automatic record counting

Directory Handling Options

Option                      Default  Description
-m, --multi-file-mode TEXT  auto     Directory handling: auto, first, all
--max-size-mb INTEGER       25       Max data size for multi-file mode in MB

CSV Options

Option                Default  Description
-d, --delimiter TEXT  comma    CSV delimiter (use \t for tab)

Cloud Provider Authentication

Option              Description
--profile TEXT      AWS profile name (for S3 access)
--project TEXT      GCP project ID (for GCS access)
--credentials TEXT  Path to GCP service account JSON file
--account TEXT      Azure storage account name

General Options

Option  Description
--help  Show help message and exit

Examples

# Basic usage
cloudcat -p gcs://bucket/data.csv

# Select columns and limit rows
cloudcat -p s3://bucket/users.parquet -c id,name,email -n 20

# Filter with WHERE clause
cloudcat -p gcs://bucket/events.json --where "status=active"

# Output as JSON
cloudcat -p az://container/data.csv -o json

# Read from Spark output directory
cloudcat -p s3://bucket/spark-output/ -i parquet -m all

# Use custom delimiter for TSV
cloudcat -p gcs://bucket/data.tsv -d "\t"

# Pagination
cloudcat -p s3://bucket/large.csv --offset 100 -n 10

# Schema only
cloudcat -p gcs://bucket/events.parquet -s schema_only

# With AWS profile
cloudcat -p s3://bucket/data.csv --profile production

# With GCP credentials
cloudcat -p gcs://bucket/data.csv --credentials /path/to/key.json

WHERE Operators

CloudCat supports SQL-like filtering with the --where option. Filter your data before it's displayed to focus on exactly what you need.

Supported Operators

Operator    Example                 Description
=           status=active           Exact match
!=          type!=deleted           Not equal
>           age>30                  Greater than
<           price<100               Less than
>=          count>=10               Greater than or equal
<=          score<=50               Less than or equal
contains    name contains john      Case-insensitive substring match
startswith  email startswith admin  String prefix match
endswith    file endswith .csv      String suffix match
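
The semantics above can be summarized with a small sketch. The parser below is illustrative only (CloudCat's real filter implementation may differ); it splits an expression such as age>30 or message contains error into (column, operator, value) and applies it to a row dictionary.

# Illustrative sketch only -- not CloudCat's actual filter code.
import operator
import re

_OPS = {
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
    "=": operator.eq, ">": operator.gt, "<": operator.lt,
    "contains": lambda a, b: str(b).lower() in str(a).lower(),
    "startswith": lambda a, b: str(a).lower().startswith(str(b).lower()),
    "endswith": lambda a, b: str(a).lower().endswith(str(b).lower()),
}

def parse_where(expr: str):
    # Word operators ("x contains y") first, then symbolic operators.
    match = re.match(r"^\s*(\w+)\s+(contains|startswith|endswith)\s+(.+?)\s*$", expr)
    if match is None:
        match = re.match(r"^\s*(\w+)\s*(>=|<=|!=|>|<|=)\s*(.+?)\s*$", expr)
    if match is None:
        raise ValueError(f"cannot parse WHERE expression: {expr!r}")
    return match.group(1), match.group(2), match.group(3)

def row_matches(row: dict, expr: str) -> bool:
    column, op, raw = parse_where(expr)
    value = row.get(column)
    try:
        # Type-aware comparison: coerce the literal when the column is numeric.
        raw = type(value)(raw) if isinstance(value, (int, float)) else raw
    except (TypeError, ValueError):
        pass
    return _OPS[op](value, raw)

# Example: keep "active" users older than 30.
rows = [{"name": "Alice", "status": "active", "age": 34},
        {"name": "Bob", "status": "inactive", "age": 28}]
print([r for r in rows if row_matches(r, "status=active") and row_matches(r, "age>30")])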

Usage Examples

Exact Match

# Filter by status
cloudcat -p s3://bucket/users.parquet --where "status=active"

# Filter by category
cloudcat -p gcs://bucket/products.json --where "category=electronics"

Numeric Comparisons

# Greater than
cloudcat -p s3://bucket/users.parquet --where "age>30"

# Less than
cloudcat -p gcs://bucket/orders.csv --where "price<100"

# Greater than or equal
cloudcat -p s3://bucket/events.json --where "count>=10"

# Less than or equal
cloudcat -p gcs://bucket/scores.parquet --where "score<=50"

String Matching

# Contains (case-insensitive)
cloudcat -p s3://bucket/logs.json --where "message contains error"

# Starts with
cloudcat -p gcs://bucket/users.csv --where "email startswith admin"

# Ends with
cloudcat -p s3://bucket/files.json --where "filename endswith .csv"

Not Equal

# Exclude deleted records
cloudcat -p gcs://bucket/records.parquet --where "status!=deleted"

# Exclude specific type
cloudcat -p s3://bucket/events.json --where "type!=test"

Combining with Other Options

# Filter and select columns
cloudcat -p s3://bucket/users.parquet --where "status=active" -c id,name,email

# Filter and limit rows
cloudcat -p gcs://bucket/events.json --where "type=error" -n 50

# Filter with pagination
cloudcat -p s3://bucket/logs.csv --where "level=ERROR" --offset 100 -n 20

# Filter and export
cloudcat -p gcs://bucket/users.parquet --where "country=US" -o csv -n 0 > us_users.csv

Tips

  • String values don't need quotes in the WHERE clause
  • Comparisons are type-aware (numeric columns compare numerically)
  • The contains, startswith, and endswith operators are case-insensitive
  • Filter on columns that actually exist in the data; check the schema first with -s schema_only if you're unsure

Authentication

CloudCat uses standard authentication methods for each cloud provider. Configure your credentials once and CloudCat will use them automatically.

Google Cloud Storage (GCS)

CloudCat uses Application Default Credentials (ADC) for GCS authentication.

Option 1: User Credentials (Development)

Best for local development:

gcloud auth application-default login

This opens a browser for Google account authentication.

Option 2: Service Account (Environment Variable)

Set the path to your service account JSON file:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

Then use CloudCat normally:

cloudcat -p gcs://bucket/data.csv

Option 3: Service Account (CLI Option)

Pass the credentials file directly:

cloudcat -p gcs://bucket/data.csv --credentials /path/to/service-account.json

Option 4: Specify GCP Project

If your credentials have access to multiple projects:

cloudcat -p gcs://bucket/data.csv --project my-gcp-project
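
If you want to confirm that Application Default Credentials resolve before pointing CloudCat at a bucket, a quick check with the google-auth library (installed alongside google-cloud-storage) looks roughly like this:

# Optional sanity check: does ADC resolve to credentials and a project?
import google.auth

credentials, project = google.auth.default()
print("Using project:", project)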

Amazon S3

CloudCat uses the standard AWS credential chain.

Option 1: Environment Variables

export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

Option 2: AWS Credentials File

Configure credentials using the AWS CLI:

aws configure

This creates ~/.aws/credentials with your access keys.

Option 3: Named Profile

Use a specific AWS profile:

cloudcat -p s3://bucket/data.csv --profile production

Profiles are defined in ~/.aws/credentials:

[production]
aws_access_key_id = AKIA...
aws_secret_access_key = ...
region = us-west-2

Option 4: IAM Role (EC2/ECS/Lambda)

When running on AWS infrastructure (EC2, ECS, Lambda), CloudCat automatically uses the attached IAM role. No configuration needed.
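
To see which AWS identity the credential chain resolves to (useful before debugging --profile issues), a short boto3 check, assuming boto3 is installed, is:

# Optional sanity check: which AWS identity will be used?
import boto3

session = boto3.Session()   # or boto3.Session(profile_name="production")
identity = session.client("sts").get_caller_identity()
print(identity["Arn"])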

Azure Blob Storage

CloudCat supports multiple authentication methods for Azure.

Option 1: Connection String (Simplest)

Set the full connection string:

export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

Option 2: Azure AD Authentication

Use Azure CLI login with account URL:

# Set the account URL
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"

# Login with Azure CLI
az login

CloudCat will use DefaultAzureCredential to authenticate.

Option 3: Storage Account (CLI Option)

Specify the storage account directly:

cloudcat -p az://container/data.csv --account mystorageaccount

This requires either a connection string or Azure AD authentication to be configured.
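
To verify Azure credentials outside of CloudCat, a short check with the azure-identity and azure-storage-blob packages might look like this. The account URL comes from the environment variable shown above, and listing containers requires the corresponding permission:

# Optional sanity check: can we reach the storage account with these credentials?
import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account_url = os.environ["AZURE_STORAGE_ACCOUNT_URL"]
service = BlobServiceClient(account_url, credential=DefaultAzureCredential())
print([container.name for container in service.list_containers()])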

Path Formats

Provider  URL Format                                      Example
GCS       gcs://bucket/path or gs://bucket/path           gcs://my-bucket/data/file.csv
S3        s3://bucket/path                                s3://my-bucket/data/file.parquet
Azure     az://container/path or azure://container/path   az://my-container/data/file.json

Troubleshooting Authentication

GCS: "Could not automatically determine credentials"

gcloud auth application-default login

S3: "Unable to locate credentials"

aws configure
# Or set environment variables
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."

Azure: "Azure credentials not found"

# Option 1: Set connection string
export AZURE_STORAGE_CONNECTION_STRING="..."

# Option 2: Use Azure CLI
export AZURE_STORAGE_ACCOUNT_URL="https://account.blob.core.windows.net"
az login

Directory Operations

CloudCat intelligently handles directories containing multiple data files, common with Spark, Hive, and distributed processing outputs.

Multi-File Mode

Control how CloudCat handles directories with the -m, --multi-file-mode option:

Mode   Description
auto   Smart selection based on directory contents (default)
first  Read only the first data file found
all    Combine data from all files in the directory

Auto Mode (Default)

In auto mode, CloudCat analyzes the directory and makes smart decisions:

cloudcat -p s3://bucket/spark-output/

  • Scans the directory for data files
  • Ignores metadata files (_SUCCESS, _metadata, .crc, etc.)
  • Selects appropriate files based on format
  • Reports which files were selected

First File Mode

Read only the first file for quick sampling:

cloudcat -p gcs://bucket/large-output/ -m first

Best for:

  • Quick data validation
  • Large directories with many files
  • When you only need a sample

All Files Mode

Combine data from multiple files:

cloudcat -p s3://bucket/daily-logs/ -m all

Best for:

  • Aggregating partitioned data
  • Reading complete datasets
  • Directories with related files

Size Limits

Control maximum data size when reading multiple files:

# Read up to 100MB of data
cloudcat -p gcs://bucket/events/ -m all --max-size-mb 100

Default is 25MB to prevent accidentally loading huge datasets.

Automatic File Filtering

CloudCat automatically ignores common metadata files:

  • _SUCCESS - Spark/Hadoop success markers
  • _metadata - Parquet metadata files
  • _common_metadata - Parquet common metadata
  • .crc files - Checksum files
  • .committed - Transaction markers
  • .pending - Pending transaction files
  • _temporary directories - Temporary files
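
A rough sketch of this kind of filtering (illustrative only; CloudCat's actual selection logic may differ) could look like:

# Illustrative sketch: skip the metadata files listed above.
import fnmatch

IGNORE_PATTERNS = ["_SUCCESS", "_metadata", "_common_metadata",
                   "*.crc", "*.committed", "*.pending", "_temporary*"]

def is_data_file(name: str) -> bool:
    base = name.rsplit("/", 1)[-1]
    return not any(fnmatch.fnmatch(base, pattern) for pattern in IGNORE_PATTERNS)

files = ["_SUCCESS", "part-00000-abc.parquet", ".part-00000-abc.parquet.crc"]
print([f for f in files if is_data_file(f)])  # ['part-00000-abc.parquet']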

Examples

Spark Output Directory

# Typical Spark output structure:
# s3://bucket/output/
#   _SUCCESS
#   part-00000-abc.parquet
#   part-00001-def.parquet

cloudcat -p s3://bucket/output/ -i parquet
# Automatically reads part files, ignores _SUCCESS

Hive Partitioned Data

# Partitioned structure:
# gcs://bucket/events/
#   year=2024/month=01/data.parquet
#   year=2024/month=02/data.parquet

cloudcat -p gcs://bucket/events/ -m all -i parquet

Daily Log Files

# Log directory:
# s3://bucket/logs/
#   2024-01-15.json
#   2024-01-16.json
#   2024-01-17.json

cloudcat -p s3://bucket/logs/ -m all -n 100

Large Directory Sampling

# Quick preview of first file only
cloudcat -p gcs://bucket/huge-dataset/ -m first -n 20

Format Detection in Directories

When reading from a directory, you may want to specify the format:

# Explicitly set format for directory
cloudcat -p s3://bucket/output/ -i parquet

# Auto-detect from first matching file
cloudcat -p gcs://bucket/data/

CloudCat examines file extensions to determine format when not specified.

Tips

  • Use -m first for quick validation of large directories
  • Use --max-size-mb to control memory usage with -m all
  • Specify -i format when directory contains mixed file types
  • CloudCat preserves column order across multiple files

Output Formats

CloudCat supports multiple output formats to suit different workflows. Use the -o, --output-format option to choose.

Table (Default)

Beautiful ASCII tables with colored headers, perfect for terminal viewing:

cloudcat -p gcs://bucket/data.csv

Output:

┌────────┬─────────────┬─────────────────────┬────────────────────┐
│ id     │ name        │ email               │ created_at         │
├────────┼─────────────┼─────────────────────┼────────────────────┤
│ 1      │ Alice       │ alice@example.com   │ 2024-01-15 10:30   │
│ 2      │ Bob         │ bob@example.com     │ 2024-01-15 11:45   │
│ 3      │ Charlie     │ charlie@example.com │ 2024-01-16 09:00   │
└────────┴─────────────┴─────────────────────┴────────────────────┘

Best for:

  • Interactive terminal use
  • Quick data inspection
  • Readable output

JSON Lines

Standard JSON Lines format (one JSON object per line):

cloudcat -p s3://bucket/data.parquet -o json

Output:

{"id": 1, "name": "Alice", "email": "alice@example.com"}
{"id": 2, "name": "Bob", "email": "bob@example.com"}
{"id": 3, "name": "Charlie", "email": "charlie@example.com"}

Best for:

  • Piping to jq for processing
  • Integration with other tools
  • Machine-readable output

Processing with jq

# Filter by field
cloudcat -p s3://bucket/events.json -o json | jq 'select(.status == "error")'

# Extract specific fields
cloudcat -p gcs://bucket/users.parquet -o json | jq '.email'

# Count by field
cloudcat -p s3://bucket/logs.json -o json -n 0 | jq -s 'group_by(.level) | map({level: .[0].level, count: length})'
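
The same JSON Lines output is also easy to consume from a short script. The sketch below shells out to cloudcat (assumed to be on PATH; the bucket path is a placeholder) and parses one record per line:

# Sketch: consume CloudCat's JSON Lines output from Python.
import json
import subprocess

proc = subprocess.run(
    ["cloudcat", "-p", "s3://bucket/events.json", "-o", "json", "-n", "0", "--no-count"],
    capture_output=True, text=True, check=True,
)
records = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
errors = [r for r in records if r.get("status") == "error"]
print(len(errors))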

Pretty JSON

Syntax-highlighted, indented JSON for human readability:

cloudcat -p gcs://bucket/config.json -o jsonp

Output:

{
  "id": 1,
  "name": "Alice",
  "metadata": {
    "created": "2024-01-15",
    "tags": ["user", "active"]
  }
}

Best for:

  • Viewing nested JSON structures
  • Debugging API responses
  • Human-readable inspection

CSV

Comma-separated values for export and further processing:

cloudcat -p s3://bucket/data.parquet -o csv

Output:

id,name,email,created_at
1,Alice,alice@example.com,2024-01-15 10:30
2,Bob,bob@example.com,2024-01-15 11:45
3,Charlie,charlie@example.com,2024-01-16 09:00

Best for:

  • Exporting to spreadsheets
  • Further data processing
  • Format conversion

Export Examples

Convert Parquet to CSV

cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv

Export Specific Columns

cloudcat -p s3://bucket/users.parquet -c email,created_at -o csv -n 0 > emails.csv

Export Filtered Data

cloudcat -p gcs://bucket/events.json --where "status=error" -o csv -n 0 > errors.csv

Convert JSON to CSV

cloudcat -p s3://bucket/api-response.json -o csv > response.csv

Combining Output Formats with Other Options

# Table with column selection
cloudcat -p gcs://bucket/data.csv -c id,name,email -o table

# JSON with filtering
cloudcat -p s3://bucket/users.parquet --where "active=true" -o json

# CSV with row limit
cloudcat -p gcs://bucket/events.json -o csv -n 100

# Pretty JSON with schema
cloudcat -p s3://bucket/config.json -o jsonp -s show

Tips

  • Use -n 0 to output all rows when exporting
  • Use table for interactive inspection, json for piping, csv for export
  • Redirect output to a file with > filename for large exports
  • The jsonp format includes colors; redirecting to a file loses the color codes

Use Cases

Real-world scenarios where CloudCat shines.

Debugging Spark Jobs

Quickly validate Spark job output without downloading files:

# Check output of a Spark job
cloudcat -p gcs://data-lake/jobs/daily-etl/output/ -i parquet -n 20

# Verify schema matches expectations
cloudcat -p s3://analytics/spark-output/ -s schema_only

# Sample data from large output
cloudcat -p gcs://bucket/aggregations/ -m first -n 50

Log Analysis

Preview and filter log files stored in cloud storage:

# Preview recent logs
cloudcat -p gcs://logs/app/2024-01-15/ -m all -n 50

# Filter for errors
cloudcat -p s3://logs/api/ --where "level=ERROR" -n 100

# Search log messages
cloudcat -p gcs://logs/app/ --where "message contains timeout"

# Export errors for analysis
cloudcat -p s3://logs/errors/ -o json -n 0 | jq 'select(.status >= 500)'

Data Validation

Verify data quality and structure before processing:

# Quick sanity check on data export
cloudcat -p gcs://exports/daily/users.csv -s show

# Verify record count
cloudcat -p s3://warehouse/transactions.parquet

# Check for null values (preview and inspect)
cloudcat -p gcs://data/customers.parquet -n 100

# Validate schema before ETL
cloudcat -p s3://input/raw-data.json -s schema_only

Format Conversion

Convert between data formats using CloudCat:

# Convert Parquet to CSV
cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv

# Convert JSON to CSV for spreadsheet import
cloudcat -p s3://api-dumps/response.json -o csv > data.csv

# Convert tab-separated to comma-separated
cloudcat -p gcs://imports/data.tsv -d "\t" -o csv > converted.csv

# Export Avro as JSON Lines
cloudcat -p s3://kafka/events.avro -o json -n 0 > events.jsonl

Data Exploration

Understand unfamiliar datasets quickly:

# View schema of unknown file
cloudcat -p s3://vendor-data/export.parquet -s schema_only

# Preview first few rows
cloudcat -p gcs://bucket/new-data.csv -n 5

# Check all columns
cloudcat -p s3://bucket/wide-table.parquet -n 3

Data Sampling

Get representative samples from large datasets:

# Random-ish sample (use offset)
cloudcat -p gcs://bucket/huge-table.parquet --offset 10000 -n 100

# Sample from each partition
cloudcat -p s3://bucket/year=2024/month=01/ -m first -n 50
cloudcat -p s3://bucket/year=2024/month=02/ -m first -n 50

# Quick peek at different columns
cloudcat -p gcs://bucket/data.parquet -c user_id,event_type -n 20
cloudcat -p gcs://bucket/data.parquet -c timestamp,value -n 20

Pipeline Debugging

Debug data pipeline issues:

# Check intermediate outputs
cloudcat -p s3://pipeline/stage1-output/ -i parquet -n 10
cloudcat -p s3://pipeline/stage2-output/ -i parquet -n 10

# Compare schemas between stages
cloudcat -p gcs://etl/raw/ -s schema_only
cloudcat -p gcs://etl/transformed/ -s schema_only

# Find records with specific IDs
cloudcat -p s3://data/users.parquet --where "user_id=12345"

Kafka/Event Streaming

Preview data from Kafka exports:

# Read Avro files from Kafka Connect
cloudcat -p s3://kafka-exports/topic-name/ -i avro

# Filter events by type
cloudcat -p gcs://events/user-actions/ --where "event_type=purchase"

# Preview JSON events
cloudcat -p s3://kinesis/events.jsonl -o jsonp

Multi-Cloud Data Access

Work with data across multiple cloud providers:

# Compare data between clouds
cloudcat -p gcs://source/data.parquet -c id,value -n 100
cloudcat -p s3://destination/data.parquet -c id,value -n 100

# Verify replication
cloudcat -p gcs://primary/users.csv
cloudcat -p az://backup/users.csv

Integration with Other Tools

Combine CloudCat with other command-line tools:

# Count records with wc
cloudcat -p s3://bucket/data.csv -o csv -n 0 | wc -l

# Filter with grep
cloudcat -p gcs://logs/app.json -o json | grep "ERROR"

# Process with awk
cloudcat -p s3://data/report.csv -o csv | awk -F',' '{sum+=$3} END {print sum}'

# Sort and unique
cloudcat -p gcs://data/users.csv -c country -o csv -n 0 | sort | uniq -c

Performance Tips

Optimize CloudCat for faster performance and lower data transfer costs.

1. Use --no-count for Large Files

By default, CloudCat counts the total number of records in the file. For large files, skip this step:

cloudcat -p s3://bucket/huge-file.csv --no-count

This is especially helpful for CSV and JSON files where counting requires scanning the entire file.

2. Prefer Parquet Format

Parquet files offer the best performance with CloudCat:

  • Instant record counts from metadata (no file scan needed)
  • Column pruning when using --columns (only reads selected columns)
  • Better compression means less data transfer

# Record count is instant for Parquet
cloudcat -p gcs://bucket/data.parquet

# Column selection reads only needed columns
cloudcat -p s3://bucket/wide-table.parquet -c id,name,email
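
For context, the "instant count" comes from the Parquet footer, which stores row counts and the schema. A short pyarrow illustration (pyarrow is the library behind the parquet extra; the file name below is a placeholder):

# Illustration with pyarrow; "data.parquet" stands in for your file.
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
print(pf.metadata.num_rows)   # row count read from the footer, no data scan
print(pf.schema_arrow)        # the schema also lives in the footer

# Column pruning: only the listed columns are read from storage.
table = pq.read_table("data.parquet", columns=["id", "name", "email"])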

3. Limit Rows with --num-rows

Reduce data transfer by limiting rows:

# Preview only 20 rows instead of default 10
cloudcat -p gcs://bucket/data.csv -n 20

# Don't use -n 0 (all rows) unless you need everything

4. Select Only Needed Columns

With columnar formats (Parquet, ORC), column selection reduces data transfer:

# Reads only 3 columns instead of all 50
cloudcat -p s3://bucket/wide-table.parquet -c user_id,event_type,timestamp

5. Use First File Mode for Directories

When you only need a sample from a directory with many files:

# Read only the first file
cloudcat -p gcs://bucket/spark-output/ -m first

# Instead of reading all files
cloudcat -p gcs://bucket/spark-output/ -m all

6. Set Appropriate Size Limits

Control memory usage when reading multiple files:

# Limit to 10MB for quick preview
cloudcat -p s3://bucket/logs/ -m all --max-size-mb 10

# Increase for complete datasets
cloudcat -p s3://bucket/data/ -m all --max-size-mb 100

7. Use Schema-Only for Structure Checks

When you only need to check the schema:

# Instant - doesn't read data
cloudcat -p gcs://bucket/data.parquet -s schema_only

8. Compression Considerations

CloudCat handles compressed files efficiently:

  • Gzip/Bzip2 - Built-in, always available
  • Zstandard - Fast decompression, good for large files
  • LZ4 - Fastest decompression
  • Snappy - Good balance of speed and ratio

For best performance with large files, prefer zstd or lz4:

cloudcat -p s3://bucket/data.csv.zst -n 100

9. Network Considerations

CloudCat streams data, so network latency matters:

  • Run CloudCat close to your data (same region)
  • Use AWS EC2/GCP Compute/Azure VMs in the same region as your buckets
  • For local development, expect slower performance due to network transfer

10. Memory Management

For very large previews, be mindful of memory:

# This loads all data into memory
cloudcat -p s3://bucket/huge.parquet -n 0

# Better: limit rows
cloudcat -p s3://bucket/huge.parquet -n 1000

Performance Comparison

Operation         CSV                 JSON                Parquet
Record Count      Slow (full scan)    Slow (full scan)    Instant (metadata)
Column Selection  Full file read      Full file read      Reads only selected
First N Rows      Fast (stops early)  Fast (stops early)  Fast
Compression       Standard            Standard            Built-in, efficient

Quick Reference

Goal               Recommendation
Fastest preview    -n 10 --no-count
Check structure    -s schema_only
Large directories  -m first
Wide tables        -c col1,col2,col3
Memory efficiency  Set a reasonable -n value

Troubleshooting

Solutions to common issues when using CloudCat.

Missing Package Errors

"google-cloud-storage package is required"

pip install cloudcat
# or
pip install google-cloud-storage

"boto3 package is required"

pip install cloudcat
# or
pip install boto3

"azure-storage-blob package is required"

pip install cloudcat
# or
pip install azure-storage-blob azure-identity

"pyarrow package is required"

For Parquet or ORC file support:

pip install 'cloudcat[parquet]'   # or 'cloudcat[orc]' for ORC files
# or
pip install pyarrow

"fastavro package is required"

For Avro file support:

pip install 'cloudcat[avro]'
# or
pip install fastavro

"zstandard package is required for .zst files"

pip install 'cloudcat[zstd]'
# or for all compression:
pip install 'cloudcat[compression]'

"lz4 package is required for .lz4 files"

pip install 'cloudcat[lz4]'

"python-snappy package is required for .snappy files"

pip install 'cloudcat[snappy]'

Authentication Errors

GCS: "Could not automatically determine credentials"

Set up Google Cloud authentication:

# Option 1: User credentials
gcloud auth application-default login

# Option 2: Service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Option 3: CLI option
cloudcat -p gcs://bucket/file.csv --credentials /path/to/key.json

S3: "Unable to locate credentials"

Set up AWS authentication:

# Option 1: Configure AWS CLI
aws configure

# Option 2: Environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

# Option 3: Use named profile
cloudcat -p s3://bucket/file.csv --profile myprofile

Azure: "Azure credentials not found"

Set up Azure authentication:

# Option 1: Connection string
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

# Option 2: Azure AD
export AZURE_STORAGE_ACCOUNT_URL="https://account.blob.core.windows.net"
az login

# Option 3: Specify account
cloudcat -p az://container/file.csv --account mystorageaccount

Format Detection Issues

"Could not infer format from path"

When CloudCat can't determine the file format:

# Specify the format explicitly
cloudcat -p gcs://bucket/data -i parquet
cloudcat -p s3://bucket/file -i csv
cloudcat -p az://container/logs -i json

Reading files without extensions

cloudcat -p s3://bucket/data-file -i parquet

Access Permission Errors

"Access Denied" or "403 Forbidden"

Check that your credentials have the necessary permissions:

GCS:

  • storage.objects.get for reading files
  • storage.objects.list for listing directories

S3:

  • s3:GetObject for reading files
  • s3:ListBucket for listing directories

Azure:

  • Storage Blob Data Reader role or equivalent

Network Issues

Timeout errors

For slow connections or large files:

  • Use --num-rows to limit data transfer
  • Use --no-count to skip record counting
  • Check network connectivity to the cloud provider

"Connection reset" errors

May indicate network instability. Try:

# Smaller preview
cloudcat -p s3://bucket/file.csv -n 10 --no-count

Memory Issues

"MemoryError" or system slowdown

When previewing large files:

# Limit rows
cloudcat -p gcs://bucket/huge.parquet -n 100

# Don't load all rows
# Avoid: cloudcat -p s3://bucket/huge.csv -n 0

# Limit directory size
cloudcat -p s3://bucket/large-dir/ -m all --max-size-mb 10

CSV Issues

Wrong columns or parsing errors

For non-standard CSV files:

# Tab-separated
cloudcat -p gcs://bucket/data.tsv -d "\t"

# Pipe-delimited
cloudcat -p s3://bucket/data.txt -d "|"

# Semicolon-delimited
cloudcat -p gcs://bucket/data.csv -d ";"

Directory Issues

"No data files found in directory"

Check that:

  1. The directory contains files with recognized extensions
  2. Files aren't all metadata files (_SUCCESS, .crc, etc.)
  3. You have permission to list the directory

# Specify format explicitly
cloudcat -p s3://bucket/output/ -i parquet

Getting Help

If you're still having issues:

  1. Check you're using the latest version: pip install --upgrade cloudcat
  2. Try with --help to see all options
  3. Open an issue on GitHub with:
    • CloudCat version
    • Python version
    • Full error message
    • Command that caused the error

Contributing

Contributions are welcome! Here's how you can help improve CloudCat.

Ways to Contribute

  1. Report Bugs - Open an issue with reproduction steps
  2. Suggest Features - Open an issue describing the use case
  3. Submit PRs - Fork, create a branch, and submit a pull request
  4. Improve Docs - Help make the documentation better
  5. Share CloudCat - Star the repo and spread the word

Development Setup

Clone and set up the development environment:

# Clone the repository
git clone https://github.com/jonathansudhakar1/cloudcat.git
cd cloudcat

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in development mode with all dependencies
pip install -e ".[all]"

# Run tests
pytest

Project Structure

cloudcat/
├── cloudcat/
│   ├── __init__.py         # Version info
│   ├── cli.py              # Main CLI entry point
│   ├── config.py           # Configuration management
│   ├── compression.py      # Compression handling
│   ├── filtering.py        # WHERE clause parsing
│   ├── formatters.py       # Output formatting
│   ├── readers/            # Format readers
│   │   ├── csv.py
│   │   ├── json.py
│   │   ├── parquet.py
│   │   ├── avro.py
│   │   ├── orc.py
│   │   └── text.py
│   └── storage/            # Cloud storage clients
│       ├── base.py
│       ├── gcs.py
│       ├── s3.py
│       └── azure.py
├── tests/
├── docs/
├── setup.py
└── README.md

Submitting Pull Requests

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Make your changes
  4. Add tests if applicable
  5. Run tests: pytest
  6. Commit your changes: git commit -m "Add my feature"
  7. Push to your fork: git push origin feature/my-feature
  8. Open a pull request

Code Style

  • Follow PEP 8 guidelines
  • Use meaningful variable names
  • Add docstrings to functions
  • Keep functions focused and small

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=cloudcat

# Run specific test file
pytest tests/test_cli.py

Adding a New File Format

To add support for a new file format:

  1. Create a new reader in cloudcat/readers/
  2. Export a read_*_data() function that returns (DataFrame, schema)
  3. Update cli.py to handle the new format
  4. Add format detection in cli.py
  5. Add tests
  6. Update documentation
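
As a starting point, a hypothetical reader module (the file name, function name, and signature here are illustrative; follow the existing readers in cloudcat/readers/ for the exact contract) might look like:

# cloudcat/readers/myformat.py -- hypothetical skeleton, not real project code.
import io
from typing import Dict, Optional, Sequence, Tuple

import pandas as pd

def read_myformat_data(stream: io.IOBase,
                       num_rows: int = 10,
                       columns: Optional[Sequence[str]] = None
                       ) -> Tuple[pd.DataFrame, Dict[str, str]]:
    """Parse `stream` and return (DataFrame, schema)."""
    df = pd.read_csv(stream)   # placeholder parser; replace with the real format logic
    if columns:
        df = df[list(columns)]
    if num_rows:
        df = df.head(num_rows)
    schema = {name: str(dtype) for name, dtype in df.dtypes.items()}
    return df, schema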

Adding a New Cloud Provider

To add support for a new cloud provider:

  1. Create a new client in cloudcat/storage/
  2. Implement get_stream() and list_directory() functions
  3. Update cloudcat/storage/base.py to route to the new provider
  4. Add authentication handling
  5. Add tests
  6. Update documentation
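
And a hypothetical storage client, using the local filesystem purely as a stand-in for a real provider SDK (the function names follow step 2 above; check cloudcat/storage/base.py for the exact interface):

# cloudcat/storage/myprovider.py -- hypothetical skeleton, not real project code.
from pathlib import Path
from typing import BinaryIO, List

def get_stream(path: str) -> BinaryIO:
    """Return a readable binary stream for a single object."""
    return open(path, "rb")   # a real client would call the provider SDK here

def list_directory(path: str) -> List[str]:
    """Return the object paths under a directory-like prefix."""
    return sorted(str(p) for p in Path(path).iterdir() if p.is_file())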

Reporting Issues

When reporting bugs, please include:

  • CloudCat version (pip show cloudcat)
  • Python version (python --version)
  • Operating system
  • Full error message/traceback
  • Minimal reproduction steps
  • Sample data (if possible and not sensitive)

Feature Requests

When suggesting features:

  • Describe the use case
  • Explain how you'd like it to work
  • Consider if it fits CloudCat's scope
  • Be open to discussion on implementation

License

By contributing, you agree that your contributions will be licensed under the MIT License.

Roadmap

CloudCat is actively developed. Here's what's been accomplished and what's planned.

Completed Features

  • Google Cloud Storage support - Full GCS integration
  • Amazon S3 support - Full S3 integration with profiles
  • Azure Blob Storage support - Full Azure integration
  • CSV format - With custom delimiters
  • JSON format - Standard JSON and JSON Lines
  • Parquet format - With efficient column selection
  • Avro format - Full Avro support
  • ORC format - Via PyArrow
  • Plain text format - For log files
  • SQL-like filtering - WHERE clause support
  • Compression support - gzip, bz2, zstd, lz4, snappy
  • Row offset/pagination - Skip and limit rows
  • Schema inspection - View data types
  • Multi-file directories - Spark/Hive output support
  • Multiple output formats - table, json, jsonp, csv

Planned Features

  • Interactive mode - Pagination with keyboard navigation
  • Output to file - Direct --output-file option
  • Configuration file - .cloudcatrc for defaults
  • Multiple WHERE conditions - AND/OR operators
  • Sampling - Random row sampling
  • Profile support - Named configuration profiles
  • Delta Lake support - Read Delta tables
  • Iceberg support - Read Iceberg tables

Under Consideration

  • Write support - Converting and writing data
  • SQL queries - Full SQL query support via DuckDB
  • Data profiling - Basic statistics and profiling
  • Diff mode - Compare two files
  • Watch mode - Monitor file changes
  • Plugins - Custom reader/writer plugins

Version History

v0.2.2 (Current)

  • Bug fixes and improvements
  • Homebrew support for Apple Silicon

v0.2.0

  • Azure Blob Storage support
  • Avro and ORC format support
  • WHERE clause filtering
  • Row offset/pagination
  • Compression support (zstd, lz4, snappy)

v0.1.0

  • Initial release
  • GCS and S3 support
  • CSV, JSON, Parquet formats
  • Basic functionality

Contributing to the Roadmap

Have a feature idea? We'd love to hear it!

  1. Check existing issues for similar requests
  2. Open a new issue describing:
    • The use case
    • How it would work
    • Why it would be valuable
  3. Join the discussion

Feedback

Your feedback shapes CloudCat's development. If you have ideas or run into problems, open an issue on GitHub.