CloudCat

The Swiss Army knife for viewing cloud storage data from your terminal

$ cloudcat -p gcs://bucket/data.parquet -n 5

Reading from: gcs://bucket/data.parquet
Format: parquet | Records: 1,234,567

Schema:
  id: int64
  name: string
  email: string
  created_at: timestamp

┌────────┬─────────────┬─────────────────────┬────────────────────┐
│ id     │ name        │ email               │ created_at         │
├────────┼─────────────┼─────────────────────┼────────────────────┤
│ 1      │ Alice       │ alice@example.com   │ 2024-01-15 10:30   │
│ 2      │ Bob         │ bob@example.com     │ 2024-01-15 11:45   │
│ 3      │ Charlie     │ charlie@example.com │ 2024-01-16 09:00   │
│ 4      │ Diana       │ diana@example.com   │ 2024-01-16 14:20   │
│ 5      │ Eve         │ eve@example.com     │ 2024-01-17 08:15   │
└────────┴─────────────┴─────────────────────┴────────────────────┘

Features

CloudCat is designed to make previewing cloud data effortless. Here's what it offers:

Cloud Storage Support

Provider              URL Scheme         Status
Google Cloud Storage  gcs:// or gs://    Supported
Amazon S3             s3://              Supported
Azure Blob Storage    az:// or azure://  Supported

File Format Support

CloudCat automatically detects file formats from extensions and handles them appropriately:

Format      Read  Auto-Detect      Streaming  Use Case
CSV         Yes   Yes              Yes        General data files
JSON        Yes   Yes              Yes        API responses, configs
JSON Lines  Yes   Yes              Yes        Log files, streaming data
Parquet     Yes   Yes              Yes        Spark/analytics data
Avro        Yes   Yes              Yes        Kafka, data pipelines
ORC         Yes   Yes              Yes        Hive, Hadoop ecosystem
Text        Yes   Yes              Yes        Log files, plain text
TSV         Yes   Via --delimiter  Yes        Tab-separated data
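
To make the idea of extension-based detection concrete, here is a minimal, illustrative sketch. It is not CloudCat's actual code: the EXTENSION_TO_FORMAT mapping and the detect_format helper are hypothetical, and TSV is omitted because (per the table above) it is selected via --delimiter rather than by extension.

# Illustrative sketch only -- not CloudCat's actual detection code.
from pathlib import Path

EXTENSION_TO_FORMAT = {
    ".csv": "csv", ".json": "json", ".jsonl": "json",
    ".parquet": "parquet", ".avro": "avro", ".orc": "orc", ".txt": "text",
}

def detect_format(path: str, default: str = "text") -> str:
    # Walk the suffixes right to left so "data.csv.gz" resolves to csv
    # once the compression suffix has been skipped over.
    for suffix in reversed([s.lower() for s in Path(path).suffixes]):
        if suffix in EXTENSION_TO_FORMAT:
            return EXTENSION_TO_FORMAT[suffix]
    return default

print(detect_format("gcs://bucket/events.parquet.zst"))  # parquet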

Compression Support

CloudCat automatically detects and decompresses files based on extension:

Format     Extension    Built-in  Installation
Gzip       .gz, .gzip   Yes       Included
Bzip2      .bz2         Yes       Included
Zstandard  .zst, .zstd  Optional  pip install 'cloudcat[zstd]'
LZ4        .lz4         Optional  pip install 'cloudcat[lz4]'
Snappy     .snappy      Optional  pip install 'cloudcat[snappy]'
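
As a rough illustration of how extension-based decompression can work, here is a short sketch. It is not CloudCat's implementation; the open_maybe_compressed helper is hypothetical, and Snappy is left out because python-snappy does not expose a file-open helper of this shape.

# Sketch: choose a decompressor from the file extension.
# gzip/bz2 ship with Python; zstandard and lz4 are optional extras.
import bz2
import gzip
from pathlib import Path

def open_maybe_compressed(path: str):
    suffix = Path(path).suffix.lower()
    if suffix in (".gz", ".gzip"):
        return gzip.open(path, "rb")
    if suffix == ".bz2":
        return bz2.open(path, "rb")
    if suffix in (".zst", ".zstd"):
        import zstandard          # pip install 'cloudcat[zstd]'
        return zstandard.open(path, "rb")
    if suffix == ".lz4":
        import lz4.frame          # pip install 'cloudcat[lz4]'
        return lz4.frame.open(path, "rb")
    return open(path, "rb")       # not compressed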

Output Formats

Format       Flag      Description
Table        -o table  Beautiful ASCII table with colored headers (default)
JSON         -o json   Standard JSON Lines output (one record per line)
Pretty JSON  -o jsonp  Syntax-highlighted, indented JSON with colors
CSV          -o csv    Comma-separated values for further processing

Key Capabilities

  • Schema Inspection - View column names and data types before previewing data
  • Column Selection - Display only the columns you need with --columns
  • Row Limiting - Control how many rows to preview with --num-rows
  • Row Offset - Skip first N rows for pagination with --offset
  • WHERE Filtering - Filter rows with SQL-like conditions using --where
  • Record Counting - Get total record counts (instant for Parquet via metadata)
  • Multi-File Reading - Combine data from multiple files in a directory
  • Custom Delimiters - Support for tab, pipe, semicolon, and other delimiters
  • Auto Decompression - Transparent handling of compressed files
  • Directory Intelligence - Automatically discovers data files in Spark/Hive outputs

Installation

Homebrew (macOS Apple Silicon)

The easiest way to install on Apple Silicon Macs (M1/M2/M3/M4) — no Python required:

brew install jonathansudhakar1/cloudcat/cloudcat

This installs a self-contained binary that includes Python and all dependencies.

Intel Mac users: Homebrew bottles are not available for Intel. Please use pip install 'cloudcat[all]' instead.

To upgrade:

brew upgrade cloudcat

Note: On first run, macOS may block the app. Go to System Settings > Privacy & Security and click "Allow", or run:

xattr -d com.apple.quarantine $(which cloudcat)

pip (All Platforms)

Install CloudCat with all features enabled:

pip install 'cloudcat[all]'

This includes support for all cloud providers (GCS, S3, Azure), all file formats (Parquet, Avro, ORC), and all compression types (zstd, lz4, snappy).

Standard pip Installation

For basic functionality with GCS, S3, and Azure support:

pip install cloudcat

Includes CSV, JSON, and text format support with gzip and bz2 compression.

Install with Specific Features

Install only what you need:

Extra        Command                              Adds Support For
parquet      pip install 'cloudcat[parquet]'      Apache Parquet files
avro         pip install 'cloudcat[avro]'         Apache Avro files
orc          pip install 'cloudcat[orc]'          Apache ORC files
compression  pip install 'cloudcat[compression]'  zstd, lz4, snappy
zstd         pip install 'cloudcat[zstd]'         Zstandard compression only
lz4          pip install 'cloudcat[lz4]'          LZ4 compression only
snappy       pip install 'cloudcat[snappy]'       Snappy compression only

Requirements

  • Homebrew: macOS Apple Silicon (M1/M2/M3/M4). Intel Mac users should use pip.
  • pip: Python 3.7 or higher (all platforms)
  • Cloud Credentials: Configured for your cloud provider (see Authentication)

Note: If using zsh (default on macOS), quotes around extras are required to prevent shell interpretation of brackets.

Upgrading

Upgrade to the latest version:

pip install --upgrade cloudcat

Or with all extras:

pip install --upgrade 'cloudcat[all]'

Verifying Installation

Check that CloudCat is installed correctly:

cloudcat --help

You should see the help output with all available options.

Quick Start

Get started with CloudCat in seconds. Here are the most common operations:

Preview a CSV File

# From Google Cloud Storage
cloudcat -p gcs://my-bucket/data.csv

# From Amazon S3
cloudcat -p s3://my-bucket/data.csv

# From Azure Blob Storage
cloudcat -p az://my-container/data.csv

Preview Parquet Files

# Preview first 10 rows (default)
cloudcat -p s3://my-bucket/analytics/events.parquet

# Preview 50 rows
cloudcat -p gcs://my-bucket/data.parquet -n 50

Preview JSON Data

# Standard JSON
cloudcat -p s3://my-bucket/config.json

# JSON Lines (newline-delimited JSON)
cloudcat -p gcs://my-bucket/events.jsonl

# With pretty formatting
cloudcat -p az://my-container/logs.json -o jsonp

Select Specific Columns

cloudcat -p gcs://bucket/users.json -c id,name,email

Filter Rows

# Exact match
cloudcat -p s3://bucket/users.parquet --where "status=active"

# Numeric comparison
cloudcat -p gcs://bucket/events.json --where "age>30"

# String contains
cloudcat -p s3://bucket/logs.csv --where "message contains error"

View Schema Only

cloudcat -p s3://bucket/events.parquet -s schema_only

Read Compressed Files

CloudCat automatically decompresses files:

# Gzip
cloudcat -p gcs://bucket/data.csv.gz

# Zstandard
cloudcat -p s3://bucket/events.parquet.zst

# LZ4
cloudcat -p s3://bucket/data.csv.lz4

Read from Spark Output Directory

cloudcat -p s3://my-bucket/spark-output/ -i parquet

CloudCat automatically discovers data files and ignores metadata files like _SUCCESS.

Pagination

# Skip first 100 rows, show next 10
cloudcat -p gcs://bucket/data.csv --offset 100 -n 10

Convert and Export

# Convert Parquet to CSV
cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv

# Export specific columns
cloudcat -p s3://bucket/users.parquet -c email,created_at -o csv -n 0 > emails.csv

# Pipe to jq for JSON processing
cloudcat -p s3://bucket/events.json -o json | jq '.status'

Command Reference

Complete reference for all CloudCat command-line options.

Usage

cloudcat [OPTIONS]

Required Options

Option           Description
-p, --path TEXT  Cloud storage path (required). Format: gcs://bucket/path, s3://bucket/path, or az://container/path

Output & Format Options

Option                    Default      Description
-o, --output-format TEXT  table        Output format: table, json, jsonp, csv
-i, --input-format TEXT   auto-detect  Input format: csv, json, parquet, avro, orc, text

Data Selection Options

Option                   Default  Description
-c, --columns TEXT       all      Comma-separated list of columns to display
-n, --num-rows INTEGER   10       Number of rows to display (0 for all rows)
--offset INTEGER         0        Skip first N rows

Filtering & Schema Options

Option             Default  Description
-w, --where TEXT   none     Filter rows with SQL-like conditions
-s, --schema TEXT  show     Schema display: show, dont_show, schema_only
--no-count         false    Disable automatic record counting

Directory Handling Options

Option                      Default  Description
-m, --multi-file-mode TEXT  auto     Directory handling: auto, first, all
--max-size-mb INTEGER       25       Max data size for multi-file mode in MB

CSV Options

Option                Default  Description
-d, --delimiter TEXT  comma    CSV delimiter (use \t for tab)

Cloud Provider Authentication

Option              Description
--profile TEXT      AWS profile name (for S3 access)
--project TEXT      GCP project ID (for GCS access)
--credentials TEXT  Path to GCP service account JSON file
--account TEXT      Azure storage account name

General Options

Option  Description
--help  Show help message and exit

Examples

# Basic usage
cloudcat -p gcs://bucket/data.csv

# Select columns and limit rows
cloudcat -p s3://bucket/users.parquet -c id,name,email -n 20

# Filter with WHERE clause
cloudcat -p gcs://bucket/events.json --where "status=active"

# Output as JSON
cloudcat -p az://container/data.csv -o json

# Read from Spark output directory
cloudcat -p s3://bucket/spark-output/ -i parquet -m all

# Use custom delimiter for TSV
cloudcat -p gcs://bucket/data.tsv -d "\t"

# Pagination
cloudcat -p s3://bucket/large.csv --offset 100 -n 10

# Schema only
cloudcat -p gcs://bucket/events.parquet -s schema_only

# With AWS profile
cloudcat -p s3://bucket/data.csv --profile production

# With GCP credentials
cloudcat -p gcs://bucket/data.csv --credentials /path/to/key.json

WHERE Operators

CloudCat supports SQL-like filtering with the --where option. Filter your data before it's displayed to focus on exactly what you need.

Supported Operators

Operator    Example                 Description
=           status=active           Exact match
!=          type!=deleted           Not equal
>           age>30                  Greater than
<           price<100               Less than
>=          count>=10               Greater than or equal
<=          score<=50               Less than or equal
contains    name contains john      Case-insensitive substring match
startswith  email startswith admin  String prefix match
endswith    file endswith .csv      String suffix match
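
The semantics above can be summarized with a small sketch. The parser below is illustrative only (CloudCat's real filter implementation may differ); it splits an expression such as age>30 or message contains error into (column, operator, value) and applies it to a row dictionary.

# Illustrative sketch only -- not CloudCat's actual filter code.
import operator
import re

_OPS = {
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
    "=": operator.eq, ">": operator.gt, "<": operator.lt,
    "contains": lambda a, b: str(b).lower() in str(a).lower(),
    "startswith": lambda a, b: str(a).lower().startswith(str(b).lower()),
    "endswith": lambda a, b: str(a).lower().endswith(str(b).lower()),
}

def parse_where(expr: str):
    # Word operators ("x contains y") first, then symbolic operators.
    match = re.match(r"^\s*(\w+)\s+(contains|startswith|endswith)\s+(.+?)\s*$", expr)
    if match is None:
        match = re.match(r"^\s*(\w+)\s*(>=|<=|!=|>|<|=)\s*(.+?)\s*$", expr)
    if match is None:
        raise ValueError(f"cannot parse WHERE expression: {expr!r}")
    return match.group(1), match.group(2), match.group(3)

def row_matches(row: dict, expr: str) -> bool:
    column, op, raw = parse_where(expr)
    value = row.get(column)
    try:
        # Type-aware comparison: coerce the literal when the column is numeric.
        raw = type(value)(raw) if isinstance(value, (int, float)) else raw
    except (TypeError, ValueError):
        pass
    return _OPS[op](value, raw)

# Example: keep "active" users older than 30.
rows = [{"name": "Alice", "status": "active", "age": 34},
        {"name": "Bob", "status": "inactive", "age": 28}]
print([r for r in rows if row_matches(r, "status=active") and row_matches(r, "age>30")])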

Usage Examples

Exact Match

# Filter by status
cloudcat -p s3://bucket/users.parquet --where "status=active"

# Filter by category
cloudcat -p gcs://bucket/products.json --where "category=electronics"

Numeric Comparisons

# Greater than
cloudcat -p s3://bucket/users.parquet --where "age>30"

# Less than
cloudcat -p gcs://bucket/orders.csv --where "price<100"

# Greater than or equal
cloudcat -p s3://bucket/events.json --where "count>=10"

# Less than or equal
cloudcat -p gcs://bucket/scores.parquet --where "score<=50"

String Matching

# Contains (case-insensitive)
cloudcat -p s3://bucket/logs.json --where "message contains error"

# Starts with
cloudcat -p gcs://bucket/users.csv --where "email startswith admin"

# Ends with
cloudcat -p s3://bucket/files.json --where "filename endswith .csv"

Not Equal

# Exclude deleted records
cloudcat -p gcs://bucket/records.parquet --where "status!=deleted"

# Exclude specific type
cloudcat -p s3://bucket/events.json --where "type!=test"

Combining with Other Options

# Filter and select columns
cloudcat -p s3://bucket/users.parquet --where "status=active" -c id,name,email

# Filter and limit rows
cloudcat -p gcs://bucket/events.json --where "type=error" -n 50

# Filter with pagination
cloudcat -p s3://bucket/logs.csv --where "level=ERROR" --offset 100 -n 20

# Filter and export
cloudcat -p gcs://bucket/users.parquet --where "country=US" -o csv -n 0 > us_users.csv

Tips

  • String values don't need quotes in the WHERE clause
  • Comparisons are type-aware (numeric columns compare numerically)
  • The contains, startswith, and endswith operators are case-insensitive
  • Filter on columns that actually exist in the data; check the schema first with -s schema_only if you're unsure

Authentication

CloudCat uses standard authentication methods for each cloud provider. Configure your credentials once and CloudCat will use them automatically.

Google Cloud Storage (GCS)

CloudCat uses Application Default Credentials (ADC) for GCS authentication.

Option 1: User Credentials (Development)

Best for local development:

gcloud auth application-default login

This opens a browser for Google account authentication.

Option 2: Service Account (Environment Variable)

Set the path to your service account JSON file:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

Then use CloudCat normally:

cloudcat -p gcs://bucket/data.csv

Option 3: Service Account (CLI Option)

Pass the credentials file directly:

cloudcat -p gcs://bucket/data.csv --credentials /path/to/service-account.json

Option 4: Specify GCP Project

If your credentials have access to multiple projects:

cloudcat -p gcs://bucket/data.csv --project my-gcp-project
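
If you want to confirm that Application Default Credentials resolve before pointing CloudCat at a bucket, a quick check with the google-auth library (installed alongside google-cloud-storage) looks roughly like this:

# Optional sanity check: does ADC resolve to credentials and a project?
import google.auth

credentials, project = google.auth.default()
print("Using project:", project)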

Amazon S3

CloudCat uses the standard AWS credential chain.

Option 1: Environment Variables

export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

Option 2: AWS Credentials File

Configure credentials using the AWS CLI:

aws configure

This creates ~/.aws/credentials with your access keys.

Option 3: Named Profile

Use a specific AWS profile:

cloudcat -p s3://bucket/data.csv --profile production

Profiles are defined in ~/.aws/credentials:

[production]
aws_access_key_id = AKIA...
aws_secret_access_key = ...
region = us-west-2

Option 4: IAM Role (EC2/ECS/Lambda)

When running on AWS infrastructure (EC2, ECS, Lambda), CloudCat automatically uses the attached IAM role. No configuration needed.
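
To see which AWS identity the credential chain resolves to (useful before debugging --profile issues), a short boto3 check, assuming boto3 is installed, is:

# Optional sanity check: which AWS identity will be used?
import boto3

session = boto3.Session()   # or boto3.Session(profile_name="production")
identity = session.client("sts").get_caller_identity()
print(identity["Arn"])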

Azure Blob Storage

CloudCat supports multiple authentication methods for Azure.

Option 1: Connection String (Simplest)

Set the full connection string:

export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

Option 2: Azure AD Authentication

Use Azure CLI login with account URL:

# Set the account URL
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"

# Login with Azure CLI
az login

CloudCat will use DefaultAzureCredential to authenticate.

Option 3: Storage Account (CLI Option)

Specify the storage account directly:

cloudcat -p az://container/data.csv --account mystorageaccount

This requires either a connection string or Azure AD authentication to be configured.
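
To verify Azure credentials outside of CloudCat, a short check with the azure-identity and azure-storage-blob packages might look like this. The account URL comes from the environment variable shown above, and listing containers requires the corresponding permission:

# Optional sanity check: can we reach the storage account with these credentials?
import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account_url = os.environ["AZURE_STORAGE_ACCOUNT_URL"]
service = BlobServiceClient(account_url, credential=DefaultAzureCredential())
print([container.name for container in service.list_containers()])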

Path Formats

Provider  URL Format                                      Example
GCS       gcs://bucket/path or gs://bucket/path           gcs://my-bucket/data/file.csv
S3        s3://bucket/path                                s3://my-bucket/data/file.parquet
Azure     az://container/path or azure://container/path   az://my-container/data/file.json

Troubleshooting Authentication

GCS: "Could not automatically determine credentials"

gcloud auth application-default login

S3: "Unable to locate credentials"

aws configure
# Or set environment variables
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."

Azure: "Azure credentials not found"

# Option 1: Set connection string
export AZURE_STORAGE_CONNECTION_STRING="..."

# Option 2: Use Azure CLI
export AZURE_STORAGE_ACCOUNT_URL="https://account.blob.core.windows.net"
az login

Directory Operations

CloudCat intelligently handles directories containing multiple data files, common with Spark, Hive, and distributed processing outputs.

Multi-File Mode

Control how CloudCat handles directories with the -m, --multi-file-mode option:

Mode   Description
auto   Smart selection based on directory contents (default)
first  Read only the first data file found
all    Combine data from all files in the directory

Auto Mode (Default)

In auto mode, CloudCat analyzes the directory and makes smart decisions:

cloudcat -p s3://bucket/spark-output/

  • Scans the directory for data files
  • Ignores metadata files (_SUCCESS, _metadata, .crc, etc.)
  • Selects appropriate files based on format
  • Reports which files were selected

First File Mode

Read only the first file for quick sampling:

cloudcat -p gcs://bucket/large-output/ -m first

Best for:

  • Quick data validation
  • Large directories with many files
  • When you only need a sample

All Files Mode

Combine data from multiple files:

cloudcat -p s3://bucket/daily-logs/ -m all

Best for:

  • Aggregating partitioned data
  • Reading complete datasets
  • Directories with related files

Size Limits

Control maximum data size when reading multiple files:

# Read up to 100MB of data
cloudcat -p gcs://bucket/events/ -m all --max-size-mb 100

Default is 25MB to prevent accidentally loading huge datasets.

Automatic File Filtering

CloudCat automatically ignores common metadata files:

  • _SUCCESS - Spark/Hadoop success markers
  • _metadata - Parquet metadata files
  • _common_metadata - Parquet common metadata
  • .crc files - Checksum files
  • .committed - Transaction markers
  • .pending - Pending transaction files
  • _temporary directories - Temporary files
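
A rough sketch of this kind of filtering (illustrative only; CloudCat's actual selection logic may differ) could look like:

# Illustrative sketch: skip the metadata files listed above.
import fnmatch

IGNORE_PATTERNS = ["_SUCCESS", "_metadata", "_common_metadata",
                   "*.crc", "*.committed", "*.pending", "_temporary*"]

def is_data_file(name: str) -> bool:
    base = name.rsplit("/", 1)[-1]
    return not any(fnmatch.fnmatch(base, pattern) for pattern in IGNORE_PATTERNS)

files = ["_SUCCESS", "part-00000-abc.parquet", ".part-00000-abc.parquet.crc"]
print([f for f in files if is_data_file(f)])  # ['part-00000-abc.parquet']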

Examples

Spark Output Directory

# Typical Spark output structure:
# s3://bucket/output/
#   _SUCCESS
#   part-00000-abc.parquet
#   part-00001-def.parquet

cloudcat -p s3://bucket/output/ -i parquet
# Automatically reads part files, ignores _SUCCESS

Hive Partitioned Data

# Partitioned structure:
# gcs://bucket/events/
#   year=2024/month=01/data.parquet
#   year=2024/month=02/data.parquet

cloudcat -p gcs://bucket/events/ -m all -i parquet

Daily Log Files

# Log directory:
# s3://bucket/logs/
#   2024-01-15.json
#   2024-01-16.json
#   2024-01-17.json

cloudcat -p s3://bucket/logs/ -m all -n 100

Large Directory Sampling

# Quick preview of first file only
cloudcat -p gcs://bucket/huge-dataset/ -m first -n 20

Format Detection in Directories

When reading from a directory, you may want to specify the format:

# Explicitly set format for directory
cloudcat -p s3://bucket/output/ -i parquet

# Auto-detect from first matching file
cloudcat -p gcs://bucket/data/

CloudCat examines file extensions to determine format when not specified.

Tips

  • Use -m first for quick validation of large directories
  • Use --max-size-mb to control memory usage with -m all
  • Specify -i format when directory contains mixed file types
  • CloudCat preserves column order across multiple files

Output Formats

CloudCat supports multiple output formats to suit different workflows. Use the -o, --output-format option to choose.

Table (Default)

Beautiful ASCII tables with colored headers, perfect for terminal viewing:

cloudcat -p gcs://bucket/data.csv

Output:

┌────────┬─────────────┬─────────────────────┬────────────────────┐
│ id     │ name        │ email               │ created_at         │
├────────┼─────────────┼─────────────────────┼────────────────────┤
│ 1      │ Alice       │ alice@example.com   │ 2024-01-15 10:30   │
│ 2      │ Bob         │ bob@example.com     │ 2024-01-15 11:45   │
│ 3      │ Charlie     │ charlie@example.com │ 2024-01-16 09:00   │
└────────┴─────────────┴─────────────────────┴────────────────────┘

Best for:

  • Interactive terminal use
  • Quick data inspection
  • Readable output

JSON Lines

Standard JSON Lines format (one JSON object per line):

cloudcat -p s3://bucket/data.parquet -o json

Output:

{"id": 1, "name": "Alice", "email": "alice@example.com"}
{"id": 2, "name": "Bob", "email": "bob@example.com"}
{"id": 3, "name": "Charlie", "email": "charlie@example.com"}

Best for:

  • Piping to jq for processing
  • Integration with other tools
  • Machine-readable output

Processing with jq

# Filter by field
cloudcat -p s3://bucket/events.json -o json | jq 'select(.status == "error")'

# Extract specific fields
cloudcat -p gcs://bucket/users.parquet -o json | jq '.email'

# Count by field
cloudcat -p s3://bucket/logs.json -o json -n 0 | jq -s 'group_by(.level) | map({level: .[0].level, count: length})'
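
The same JSON Lines output is also easy to consume from a short script. The sketch below shells out to cloudcat (assumed to be on PATH; the bucket path is a placeholder) and parses one record per line:

# Sketch: consume CloudCat's JSON Lines output from Python.
import json
import subprocess

proc = subprocess.run(
    ["cloudcat", "-p", "s3://bucket/events.json", "-o", "json", "-n", "0", "--no-count"],
    capture_output=True, text=True, check=True,
)
records = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
errors = [r for r in records if r.get("status") == "error"]
print(len(errors))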

Pretty JSON

Syntax-highlighted, indented JSON for human readability:

cloudcat -p gcs://bucket/config.json -o jsonp

Output:

{
  "id": 1,
  "name": "Alice",
  "metadata": {
    "created": "2024-01-15",
    "tags": ["user", "active"]
  }
}

Best for:

  • Viewing nested JSON structures
  • Debugging API responses
  • Human-readable inspection

CSV

Comma-separated values for export and further processing:

cloudcat -p s3://bucket/data.parquet -o csv

Output:

id,name,email,created_at
1,Alice,alice@example.com,2024-01-15 10:30
2,Bob,bob@example.com,2024-01-15 11:45
3,Charlie,charlie@example.com,2024-01-16 09:00

Best for:

  • Exporting to spreadsheets
  • Further data processing
  • Format conversion

Export Examples

Convert Parquet to CSV

cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv

Export Specific Columns

cloudcat -p s3://bucket/users.parquet -c email,created_at -o csv -n 0 > emails.csv

Export Filtered Data

cloudcat -p gcs://bucket/events.json --where "status=error" -o csv -n 0 > errors.csv

Convert JSON to CSV

cloudcat -p s3://bucket/api-response.json -o csv > response.csv

Combining Output Formats with Other Options

# Table with column selection
cloudcat -p gcs://bucket/data.csv -c id,name,email -o table

# JSON with filtering
cloudcat -p s3://bucket/users.parquet --where "active=true" -o json

# CSV with row limit
cloudcat -p gcs://bucket/events.json -o csv -n 100

# Pretty JSON with schema
cloudcat -p s3://bucket/config.json -o jsonp -s show

Tips

  • Use -n 0 to output all rows when exporting
  • Use table for interactive inspection, json for piping, csv for export
  • Redirect output to a file with > filename for large exports
  • The jsonp format includes colors; redirecting to a file loses the color codes

Use Cases

Real-world scenarios where CloudCat shines.

Debugging Spark Jobs

Quickly validate Spark job output without downloading files:

# Check output of a Spark job
cloudcat -p gcs://data-lake/jobs/daily-etl/output/ -i parquet -n 20

# Verify schema matches expectations
cloudcat -p s3://analytics/spark-output/ -s schema_only

# Sample data from large output
cloudcat -p gcs://bucket/aggregations/ -m first -n 50

Log Analysis

Preview and filter log files stored in cloud storage:

# Preview recent logs
cloudcat -p gcs://logs/app/2024-01-15/ -m all -n 50

# Filter for errors
cloudcat -p s3://logs/api/ --where "level=ERROR" -n 100

# Search log messages
cloudcat -p gcs://logs/app/ --where "message contains timeout"

# Export errors for analysis
cloudcat -p s3://logs/errors/ -o json -n 0 | jq 'select(.status >= 500)'

Data Validation

Verify data quality and structure before processing:

# Quick sanity check on data export
cloudcat -p gcs://exports/daily/users.csv -s show

# Verify record count
cloudcat -p s3://warehouse/transactions.parquet

# Check for null values (preview and inspect)
cloudcat -p gcs://data/customers.parquet -n 100

# Validate schema before ETL
cloudcat -p s3://input/raw-data.json -s schema_only

Format Conversion

Convert between data formats using CloudCat:

# Convert Parquet to CSV
cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv

# Convert JSON to CSV for spreadsheet import
cloudcat -p s3://api-dumps/response.json -o csv > data.csv

# Convert tab-separated to comma-separated
cloudcat -p gcs://imports/data.tsv -d "\t" -o csv > converted.csv

# Export Avro as JSON Lines
cloudcat -p s3://kafka/events.avro -o json -n 0 > events.jsonl

Data Exploration

Understand unfamiliar datasets quickly:

# View schema of unknown file
cloudcat -p s3://vendor-data/export.parquet -s schema_only

# Preview first few rows
cloudcat -p gcs://bucket/new-data.csv -n 5

# Check all columns
cloudcat -p s3://bucket/wide-table.parquet -n 3

Data Sampling

Get representative samples from large datasets:

# Random-ish sample (use offset)
cloudcat -p gcs://bucket/huge-table.parquet --offset 10000 -n 100

# Sample from each partition
cloudcat -p s3://bucket/year=2024/month=01/ -m first -n 50
cloudcat -p s3://bucket/year=2024/month=02/ -m first -n 50

# Quick peek at different columns
cloudcat -p gcs://bucket/data.parquet -c user_id,event_type -n 20
cloudcat -p gcs://bucket/data.parquet -c timestamp,value -n 20

Pipeline Debugging

Debug data pipeline issues:

# Check intermediate outputs
cloudcat -p s3://pipeline/stage1-output/ -i parquet -n 10
cloudcat -p s3://pipeline/stage2-output/ -i parquet -n 10

# Compare schemas between stages
cloudcat -p gcs://etl/raw/ -s schema_only
cloudcat -p gcs://etl/transformed/ -s schema_only

# Find records with specific IDs
cloudcat -p s3://data/users.parquet --where "user_id=12345"

Kafka/Event Streaming

Preview data from Kafka exports:

# Read Avro files from Kafka Connect
cloudcat -p s3://kafka-exports/topic-name/ -i avro

# Filter events by type
cloudcat -p gcs://events/user-actions/ --where "event_type=purchase"

# Preview JSON events
cloudcat -p s3://kinesis/events.jsonl -o jsonp

Multi-Cloud Data Access

Work with data across multiple cloud providers:

# Compare data between clouds
cloudcat -p gcs://source/data.parquet -c id,value -n 100
cloudcat -p s3://destination/data.parquet -c id,value -n 100

# Verify replication
cloudcat -p gcs://primary/users.csv
cloudcat -p az://backup/users.csv

Integration with Other Tools

Combine CloudCat with other command-line tools:

# Count records with wc
cloudcat -p s3://bucket/data.csv -o csv -n 0 | wc -l

# Filter with grep
cloudcat -p gcs://logs/app.json -o json | grep "ERROR"

# Process with awk
cloudcat -p s3://data/report.csv -o csv | awk -F',' '{sum+=$3} END {print sum}'

# Sort and unique
cloudcat -p gcs://data/users.csv -c country -o csv -n 0 | sort | uniq -c

Performance Tips

Optimize CloudCat for faster performance and lower data transfer costs.

1. Use --no-count for Large Files

By default, CloudCat counts the total number of records in the file. For large files, skip this step:

cloudcat -p s3://bucket/huge-file.csv --no-count

This is especially helpful for CSV and JSON files where counting requires scanning the entire file.

2. Prefer Parquet Format

Parquet files offer the best performance with CloudCat:

  • Instant record counts from metadata (no file scan needed)
  • Column pruning when using --columns (only reads selected columns)
  • Better compression means less data transfer

# Record count is instant for Parquet
cloudcat -p gcs://bucket/data.parquet

# Column selection reads only needed columns
cloudcat -p s3://bucket/wide-table.parquet -c id,name,email
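
For context, the "instant count" comes from the Parquet footer, which stores row counts and the schema. A short pyarrow illustration (pyarrow is the library behind the parquet extra; the file name below is a placeholder):

# Illustration with pyarrow; "data.parquet" stands in for your file.
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
print(pf.metadata.num_rows)   # row count read from the footer, no data scan
print(pf.schema_arrow)        # the schema also lives in the footer

# Column pruning: only the listed columns are read from storage.
table = pq.read_table("data.parquet", columns=["id", "name", "email"])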

3. Limit Rows with --num-rows

Reduce data transfer by limiting rows:

# Preview only 20 rows instead of default 10
cloudcat -p gcs://bucket/data.csv -n 20

# Don't use -n 0 (all rows) unless you need everything

4. Select Only Needed Columns

With columnar formats (Parquet, ORC), column selection reduces data transfer:

# Reads only 3 columns instead of all 50
cloudcat -p s3://bucket/wide-table.parquet -c user_id,event_type,timestamp

5. Use First File Mode for Directories

When you only need a sample from a directory with many files:

# Read only the first file
cloudcat -p gcs://bucket/spark-output/ -m first

# Instead of reading all files
cloudcat -p gcs://bucket/spark-output/ -m all

6. Set Appropriate Size Limits

Control memory usage when reading multiple files:

# Limit to 10MB for quick preview
cloudcat -p s3://bucket/logs/ -m all --max-size-mb 10

# Increase for complete datasets
cloudcat -p s3://bucket/data/ -m all --max-size-mb 100

7. Use Schema-Only for Structure Checks

When you only need to check the schema:

# Instant - doesn't read data
cloudcat -p gcs://bucket/data.parquet -s schema_only

8. Compression Considerations

CloudCat handles compressed files efficiently:

  • Gzip/Bzip2 - Built-in, always available
  • Zstandard - Fast decompression, good for large files
  • LZ4 - Fastest decompression
  • Snappy - Good balance of speed and ratio

For best performance with large files, prefer zstd or lz4:

cloudcat -p s3://bucket/data.csv.zst -n 100

9. Network Considerations

CloudCat streams data, so network latency matters:

  • Run CloudCat close to your data (same region)
  • Use AWS EC2/GCP Compute/Azure VMs in the same region as your buckets
  • For local development, expect slower performance due to network transfer

10. Memory Management

For very large previews, be mindful of memory:

# This loads all data into memory
cloudcat -p s3://bucket/huge.parquet -n 0

# Better: limit rows
cloudcat -p s3://bucket/huge.parquet -n 1000

Performance Comparison

Operation         CSV                 JSON                Parquet
Record Count      Slow (full scan)    Slow (full scan)    Instant (metadata)
Column Selection  Full file read      Full file read      Reads only selected
First N Rows      Fast (stops early)  Fast (stops early)  Fast
Compression       Standard            Standard            Built-in, efficient

Quick Reference

Goal               Recommendation
Fastest preview    -n 10 --no-count
Check structure    -s schema_only
Large directories  -m first
Wide tables        -c col1,col2,col3
Memory efficiency  Set a reasonable -n value

Troubleshooting

Solutions to common issues when using CloudCat.

Missing Package Errors

"google-cloud-storage package is required"

pip install cloudcat
# or
pip install google-cloud-storage

"boto3 package is required"

pip install cloudcat
# or
pip install boto3

"azure-storage-blob package is required"

pip install cloudcat
# or
pip install azure-storage-blob azure-identity

"pyarrow package is required"

For Parquet or ORC file support:

pip install 'cloudcat[parquet]'   # or 'cloudcat[orc]' for ORC files
# or
pip install pyarrow

"fastavro package is required"

For Avro file support:

pip install 'cloudcat[avro]'
# or
pip install fastavro

"zstandard package is required for .zst files"

pip install 'cloudcat[zstd]'
# or for all compression:
pip install 'cloudcat[compression]'

"lz4 package is required for .lz4 files"

pip install 'cloudcat[lz4]'

"python-snappy package is required for .snappy files"

pip install 'cloudcat[snappy]'

Authentication Errors

GCS: "Could not automatically determine credentials"

Set up Google Cloud authentication:

# Option 1: User credentials
gcloud auth application-default login

# Option 2: Service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Option 3: CLI option
cloudcat -p gcs://bucket/file.csv --credentials /path/to/key.json

S3: "Unable to locate credentials"

Set up AWS authentication:

# Option 1: Configure AWS CLI
aws configure

# Option 2: Environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

# Option 3: Use named profile
cloudcat -p s3://bucket/file.csv --profile myprofile

Azure: "Azure credentials not found"

Set up Azure authentication:

# Option 1: Connection string
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

# Option 2: Azure AD
export AZURE_STORAGE_ACCOUNT_URL="https://account.blob.core.windows.net"
az login

# Option 3: Specify account
cloudcat -p az://container/file.csv --account mystorageaccount

Format Detection Issues

"Could not infer format from path"

When CloudCat can't determine the file format:

# Specify the format explicitly
cloudcat -p gcs://bucket/data -i parquet
cloudcat -p s3://bucket/file -i csv
cloudcat -p az://container/logs -i json

Reading files without extensions

cloudcat -p s3://bucket/data-file -i parquet

Access Permission Errors

"Access Denied" or "403 Forbidden"

Check that your credentials have the necessary permissions:

GCS:

  • storage.objects.get for reading files
  • storage.objects.list for listing directories

S3:

  • s3:GetObject for reading files
  • s3:ListBucket for listing directories

Azure:

  • Storage Blob Data Reader role or equivalent

Network Issues

Timeout errors

For slow connections or large files:

  • Use --num-rows to limit data transfer
  • Use --no-count to skip record counting
  • Check network connectivity to the cloud provider

"Connection reset" errors

May indicate network instability. Try:

# Smaller preview
cloudcat -p s3://bucket/file.csv -n 10 --no-count

Memory Issues

"MemoryError" or system slowdown

When previewing large files:

# Limit rows
cloudcat -p gcs://bucket/huge.parquet -n 100

# Don't load all rows
# Avoid: cloudcat -p s3://bucket/huge.csv -n 0

# Limit directory size
cloudcat -p s3://bucket/large-dir/ -m all --max-size-mb 10

CSV Issues

Wrong columns or parsing errors

For non-standard CSV files:

# Tab-separated
cloudcat -p gcs://bucket/data.tsv -d "\t"

# Pipe-delimited
cloudcat -p s3://bucket/data.txt -d "|"

# Semicolon-delimited
cloudcat -p gcs://bucket/data.csv -d ";"

Directory Issues

"No data files found in directory"

Check that:

  1. The directory contains files with recognized extensions
  2. Files aren't all metadata files (_SUCCESS, .crc, etc.)
  3. You have permission to list the directory

# Specify format explicitly
cloudcat -p s3://bucket/output/ -i parquet

Getting Help

If you're still having issues:

  1. Check you're using the latest version: pip install --upgrade cloudcat
  2. Try with --help to see all options
  3. Open an issue on GitHub with:
    • CloudCat version
    • Python version
    • Full error message
    • Command that caused the error

Contributing

Contributions are welcome! Here's how you can help improve CloudCat.

Ways to Contribute

  1. Report Bugs - Open an issue with reproduction steps
  2. Suggest Features - Open an issue describing the use case
  3. Submit PRs - Fork, create a branch, and submit a pull request
  4. Improve Docs - Help make the documentation better
  5. Share CloudCat - Star the repo and spread the word

Development Setup

Clone and set up the development environment:

# Clone the repository
git clone https://github.com/jonathansudhakar1/cloudcat.git
cd cloudcat

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in development mode with all dependencies
pip install -e ".[all]"

# Run tests
pytest

Project Structure

cloudcat/
├── cloudcat/
│   ├── __init__.py         # Version info
│   ├── cli.py              # Main CLI entry point
│   ├── config.py           # Configuration management
│   ├── compression.py      # Compression handling
│   ├── filtering.py        # WHERE clause parsing
│   ├── formatters.py       # Output formatting
│   ├── readers/            # Format readers
│   │   ├── csv.py
│   │   ├── json.py
│   │   ├── parquet.py
│   │   ├── avro.py
│   │   ├── orc.py
│   │   └── text.py
│   └── storage/            # Cloud storage clients
│       ├── base.py
│       ├── gcs.py
│       ├── s3.py
│       └── azure.py
├── tests/
├── docs/
├── setup.py
└── README.md

Submitting Pull Requests

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Make your changes
  4. Add tests if applicable
  5. Run tests: pytest
  6. Commit your changes: git commit -m "Add my feature"
  7. Push to your fork: git push origin feature/my-feature
  8. Open a pull request

Code Style

  • Follow PEP 8 guidelines
  • Use meaningful variable names
  • Add docstrings to functions
  • Keep functions focused and small

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=cloudcat

# Run specific test file
pytest tests/test_cli.py

Adding a New File Format

To add support for a new file format:

  1. Create a new reader in cloudcat/readers/
  2. Export a read_*_data() function that returns (DataFrame, schema)
  3. Update cli.py to handle the new format
  4. Add format detection in cli.py
  5. Add tests
  6. Update documentation
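
As a starting point, a hypothetical reader module (the file name, function name, and signature here are illustrative; follow the existing readers in cloudcat/readers/ for the exact contract) might look like:

# cloudcat/readers/myformat.py -- hypothetical skeleton, not real project code.
import io
from typing import Dict, Optional, Sequence, Tuple

import pandas as pd

def read_myformat_data(stream: io.IOBase,
                       num_rows: int = 10,
                       columns: Optional[Sequence[str]] = None
                       ) -> Tuple[pd.DataFrame, Dict[str, str]]:
    """Parse `stream` and return (DataFrame, schema)."""
    df = pd.read_csv(stream)   # placeholder parser; replace with the real format logic
    if columns:
        df = df[list(columns)]
    if num_rows:
        df = df.head(num_rows)
    schema = {name: str(dtype) for name, dtype in df.dtypes.items()}
    return df, schema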

Adding a New Cloud Provider

To add support for a new cloud provider:

  1. Create a new client in cloudcat/storage/
  2. Implement get_stream() and list_directory() functions
  3. Update cloudcat/storage/base.py to route to the new provider
  4. Add authentication handling
  5. Add tests
  6. Update documentation
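
And a hypothetical storage client, using the local filesystem purely as a stand-in for a real provider SDK (the function names follow step 2 above; check cloudcat/storage/base.py for the exact interface):

# cloudcat/storage/myprovider.py -- hypothetical skeleton, not real project code.
from pathlib import Path
from typing import BinaryIO, List

def get_stream(path: str) -> BinaryIO:
    """Return a readable binary stream for a single object."""
    return open(path, "rb")   # a real client would call the provider SDK here

def list_directory(path: str) -> List[str]:
    """Return the object paths under a directory-like prefix."""
    return sorted(str(p) for p in Path(path).iterdir() if p.is_file())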

Reporting Issues

When reporting bugs, please include:

  • CloudCat version (pip show cloudcat)
  • Python version (python --version)
  • Operating system
  • Full error message/traceback
  • Minimal reproduction steps
  • Sample data (if possible and not sensitive)

Feature Requests

When suggesting features:

  • Describe the use case
  • Explain how you'd like it to work
  • Consider if it fits CloudCat's scope
  • Be open to discussion on implementation

License

By contributing, you agree that your contributions will be licensed under the MIT License.

Roadmap

CloudCat is actively developed. Here's what's been accomplished and what's planned.

Completed Features

  • Google Cloud Storage support - Full GCS integration
  • Amazon S3 support - Full S3 integration with profiles
  • Azure Blob Storage support - Full Azure integration
  • CSV format - With custom delimiters
  • JSON format - Standard JSON and JSON Lines
  • Parquet format - With efficient column selection
  • Avro format - Full Avro support
  • ORC format - Via PyArrow
  • Plain text format - For log files
  • SQL-like filtering - WHERE clause support
  • Compression support - gzip, bz2, zstd, lz4, snappy
  • Row offset/pagination - Skip and limit rows
  • Schema inspection - View data types
  • Multi-file directories - Spark/Hive output support
  • Multiple output formats - table, json, jsonp, csv

Planned Features

  • Interactive mode - Pagination with keyboard navigation
  • Output to file - Direct --output-file option
  • Configuration file - .cloudcatrc for defaults
  • Multiple WHERE conditions - AND/OR operators
  • Sampling - Random row sampling
  • Profile support - Named configuration profiles
  • Delta Lake support - Read Delta tables
  • Iceberg support - Read Iceberg tables

Under Consideration

  • Write support - Converting and writing data
  • SQL queries - Full SQL query support via DuckDB
  • Data profiling - Basic statistics and profiling
  • Diff mode - Compare two files
  • Watch mode - Monitor file changes
  • Plugins - Custom reader/writer plugins

Version History

v0.2.2 (Current)

  • Bug fixes and improvements
  • Homebrew support for Apple Silicon

v0.2.0

  • Azure Blob Storage support
  • Avro and ORC format support
  • WHERE clause filtering
  • Row offset/pagination
  • Compression support (zstd, lz4, snappy)

v0.1.0

  • Initial release
  • GCS and S3 support
  • CSV, JSON, Parquet formats
  • Basic functionality

Contributing to the Roadmap

Have a feature idea? We'd love to hear it!

  1. Check existing issues for similar requests
  2. Open a new issue describing:
    • The use case
    • How it would work
    • Why it would be valuable
  3. Join the discussion

Feedback

Your feedback shapes CloudCat's development. If you have ideas or run into problems, open an issue on GitHub.