
Spark Engine Partitioning Requirements

The Spark Engine requires a partition column to process large datasets efficiently, for two reasons:

  1. Parallel Processing: Each partition can be processed independently across different Spark workers
  2. Resource Optimization: Prevents memory overflow and ensures stable processing of large datasets
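
For intuition, this is what a partitioned read looks like in plain PySpark. The connection details, table, and bounds below are placeholders, not Collate configuration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-read").getOrCreate()

# Hypothetical JDBC source; replace the URL, credentials, and table.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "profiler")
    .option("password", "***")
    # The partition column drives parallelism: Spark splits the
    # [lowerBound, upperBound] range of order_id into numPartitions
    # slices and issues one query per slice, so each worker holds
    # only a bounded chunk of the table in memory.
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "16")
    .load()
)
```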

The Spark Engine automatically detects and uses partition columns based on this logic:

  1. Manual Configuration: You can explicitly specify a partition column in the table configuration
  2. Primary Key Columns: If the table's primary key includes a column with a numeric or date/time data type, that column is selected automatically
  • Numeric: SMALLINT, INT, BIGINT, NUMBER
  • Date/Time: DATE, DATETIME, TIMESTAMP, TIMESTAMPZ, TIME

If no suitable partition column is found, the table is skipped during profiling. This ensures the Spark Engine only processes tables it can partition safely, preventing performance degradation or outright failures.
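
The selection rules can be pictured as a small function. The sketch below is illustrative only, assuming a simple column-name-to-type mapping; it is not Collate's actual implementation:

```python
from typing import Optional

# Data types the Spark Engine accepts for partitioning (see the lists above).
SUPPORTED_TYPES = {
    "SMALLINT", "INT", "BIGINT", "NUMBER",                    # numeric
    "DATE", "DATETIME", "TIMESTAMP", "TIMESTAMPZ", "TIME",    # date/time
}

def pick_partition_column(
    columns: dict[str, str],          # column name -> data type
    primary_keys: list[str],          # primary key column names, in order
    configured: Optional[str] = None,
) -> Optional[str]:
    """Illustrative sketch of the detection logic described above."""
    # 1. Manual configuration always wins.
    if configured:
        return configured
    # 2. Otherwise, take the first primary-key column with a supported type.
    for pk in primary_keys:
        if columns.get(pk, "").upper() in SUPPORTED_TYPES:
            return pk
    # 3. No candidate found: the table is skipped during profiling.
    return None
```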

Choosing a Good Partition Column

  1. High Cardinality: Choose columns with many unique values to ensure even data distribution
  2. Even Distribution: Avoid columns with heavily skewed data, such as mostly NULL values (a quick PySpark check follows the lists below)
  3. Query Performance: Prefer columns that already have an index, since each partition is read with a range query on that column
  4. Data Type Compatibility: Ensure the column uses a supported data type for partitioning
| Column Type | Good Partition Column | Poor Partition Column |
|-------------|-----------------------|-----------------------|
| Numeric | user_id, order_id, age | status_code (limited values) |
| Date/Time | created_date, updated_at, event_timestamp | last_login (many NULLs) |
Column types that typically work well:

  • Primary Keys: Usually excellent partition columns
  • Timestamps: Great for time-based partitioning
  • Foreign Keys: Good if they have high cardinality
  • Business Keys: Customer IDs, order IDs, etc.
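
To sanity-check a candidate against the cardinality and distribution criteria, a quick aggregation works. This sketch reuses the spark session from the earlier example; the table and column names are placeholders:

```python
from pyspark.sql import functions as F

# Distribution check for a hypothetical candidate column.
stats = spark.table("orders").agg(
    F.count("*").alias("rows"),
    F.countDistinct("created_date").alias("distinct_values"),
    F.sum(F.col("created_date").isNull().cast("int")).alias("nulls"),
).first()

# A ratio near 100% means high cardinality; a high NULL fraction
# means skewed partitions (most rows land in the NULL bucket).
print(f"cardinality ratio: {stats['distinct_values'] / stats['rows']:.2%}")
print(f"null fraction:     {stats['nulls'] / stats['rows']:.2%}")
```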
Common Issues

  1. No Suitable Partition Column: Ensure the table has at least one column with a supported data type
  2. Low Cardinality: Choose a column with more unique values
  3. Data Type Mismatch: Verify that the configured column uses a supported data type
  4. Missing Index: Consider adding an index on the partition column to improve read performance
Debugging Steps

  1. Check Table Schema: Verify the available columns and their data types
  2. Analyze Column Distribution: Check for NULL values and cardinality, as shown in the earlier distribution check
  3. Test Partition Column: Validate that the chosen column works with your data (a PySpark sketch of steps 1 and 3 follows)
  4. Review Logs: Check the profiler logs for specific partitioning errors
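
As a starting point, steps 1 and 3 can be run directly in PySpark. The table and column names below are placeholders, and the session is assumed from the earlier sketches:

```python
from pyspark.sql import functions as F

orders = spark.table("orders")

# Step 1: inspect the available columns and their data types.
orders.printSchema()

# Step 3: compute the candidate column's bounds. Non-NULL, sensible
# min/max values indicate the column can be split into ranges.
bounds = orders.agg(
    F.min("order_id").alias("lo"),
    F.max("order_id").alias("hi"),
).first()
print(f"bounds: {bounds['lo']} .. {bounds['hi']}")
```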