Spark Engine External Configuration
Overview
To configure your profiler pipeline to use the Spark Engine, add the `processingEngine` configuration to your existing YAML file.
Before configuring, ensure you have completed the Spark Engine Prerequisites and understand the Partitioning Requirements.
Step 1: Add Spark Engine Configuration
In your existing profiler YAML, add the `processingEngine` section under `sourceConfig.config`:
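A minimal sketch of the change, assuming a Spark Connect endpoint. The source type, service name, and connect URL are placeholders, and the exact fields accepted under `processingEngine` (such as `remote`) depend on your schema version, so verify them against the Spark Engine Prerequisites:

```yaml
source:
  type: postgres                 # your existing source type, unchanged
  serviceName: my_service        # placeholder service name
  sourceConfig:
    config:
      type: Profiler
      # New: route profiling work to the Spark Engine
      processingEngine:
        type: Spark
        remote: sc://spark-connect-host:15002   # assumed Spark Connect URL
```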
Step 2: Add Partition Configuration
In the `processor.config.tableConfig` section, add the `sparkTableProfilerConfig`:
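For example (the fully qualified table name is a placeholder, and the partition-column field name is an assumption; check it against the `sparkTableProfilerConfig` schema):

```yaml
processor:
  type: orm-profiler             # assumed profiler processor type
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.my_schema.my_table   # placeholder FQN
        sparkTableProfilerConfig:
          partitionColumn: created_at   # assumed field; pick an evenly distributed column
```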
Complete Example
Before (Native Engine)
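A representative Native Engine configuration, with placeholder source type, service name, and table FQN; the sink and workflow sections are omitted since they are unaffected by this change:

```yaml
source:
  type: postgres
  serviceName: my_service
  sourceConfig:
    config:
      type: Profiler
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.my_schema.my_table
# sink and workflowConfig omitted; they are unchanged by the engine switch
```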
After (Spark Engine)
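The same pipeline with the two additions from Steps 1 and 2 (all placeholder values as above; `remote` and `partitionColumn` are assumed field names):

```yaml
source:
  type: postgres
  serviceName: my_service
  sourceConfig:
    config:
      type: Profiler
      processingEngine:                         # added in Step 1
        type: Spark
        remote: sc://spark-connect-host:15002   # assumed Spark Connect URL
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.my_schema.my_table
        sparkTableProfilerConfig:               # added in Step 2
          partitionColumn: created_at           # assumed field name
# sink and workflowConfig omitted; they are unchanged by the engine switch
```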
Required Changes
- Add `processingEngine` to `sourceConfig.config`
- Add `sparkTableProfilerConfig` to your table configuration
- Specify a partition column for Spark processing; otherwise the profiler falls back to the primary key (if one exists) or skips the table entirely
Run the Pipeline
Use the same command as before:
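For instance, if you run the workflow through the `metadata` CLI, the invocation is unchanged (the config file name is a placeholder):

```bash
metadata profile -c profiler.yaml
```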
The pipeline will now use the Spark Engine instead of the Native Engine for processing.
Troubleshooting Configuration
Common Issues
- Missing Partition Column: Ensure you've specified a suitable partition column
- Network Connectivity: Verify Spark Connect and database connectivity
- Driver Issues: Check that the appropriate database drivers are installed in the Spark cluster
- Configuration Errors: Validate YAML syntax and required fields
Debugging Steps
- Check Logs: Review profiler logs for specific error messages
- Test Connectivity: Verify all network connections are working
- Validate Configuration: Ensure all required fields are properly set
- Test with Small Dataset: Start with a small table to verify the setup
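As a sketch of that last step, you might temporarily scope `tableConfig` to a single small table until the Spark setup is verified (placeholder names; `partitionColumn` as assumed above):

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      # Limit the run to one small table while validating the Spark setup
      - fullyQualifiedName: my_service.my_db.my_schema.small_table   # placeholder FQN
        sparkTableProfilerConfig:
          partitionColumn: id   # placeholder; use a column that exists on this table
```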