Spark Engine External Configuration
Overview
To configure your profiler pipeline to use the Spark Engine, add the `processingEngine` configuration to your existing YAML file.
Before configuring, ensure you have completed the Spark Engine Prerequisites and understand the Partitioning Requirements.
Step 1: Add Spark Engine Configuration
In your existing profiler YAML, add the `processingEngine` section under `sourceConfig.config`:
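A minimal sketch of the change, assuming a Spark Connect endpoint. The source type, service name, and connect URL are placeholders, and the exact fields accepted under `processingEngine` (such as `remote`) depend on your schema version, so verify them against the Spark Engine Prerequisites:

```yaml
source:
  type: postgres                 # your existing source type, unchanged
  serviceName: my_service        # placeholder service name
  sourceConfig:
    config:
      type: Profiler
      # New: route profiling work to the Spark Engine
      processingEngine:
        type: Spark
        remote: sc://spark-connect-host:15002   # assumed Spark Connect URL
```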
Step 2: Add Partition Configuration
In the `processor.config.tableConfig` section, add the `sparkTableProfilerConfig`:
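For example (the fully qualified table name is a placeholder, and the partition-column field name is an assumption; check it against the `sparkTableProfilerConfig` schema):

```yaml
processor:
  type: orm-profiler             # assumed profiler processor type
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.my_schema.my_table   # placeholder FQN
        sparkTableProfilerConfig:
          partitionColumn: created_at   # assumed field; pick an evenly distributed column
```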
Complete Example
Before (Native Engine)
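A representative Native Engine configuration, with placeholder source type, service name, and table FQN; the sink and workflow sections are omitted since they are unaffected by this change:

```yaml
source:
  type: postgres
  serviceName: my_service
  sourceConfig:
    config:
      type: Profiler
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.my_schema.my_table
# sink and workflowConfig omitted; they are unchanged by the engine switch
```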
After (Spark Engine)
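The same pipeline with the two additions from Steps 1 and 2 (all placeholder values as above; `remote` and `partitionColumn` are assumed field names):

```yaml
source:
  type: postgres
  serviceName: my_service
  sourceConfig:
    config:
      type: Profiler
      processingEngine:                         # added in Step 1
        type: Spark
        remote: sc://spark-connect-host:15002   # assumed Spark Connect URL
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.my_schema.my_table
        sparkTableProfilerConfig:               # added in Step 2
          partitionColumn: created_at           # assumed field name
# sink and workflowConfig omitted; they are unchanged by the engine switch
```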
Required Changes
- Add `processingEngine` to `sourceConfig.config`
- Add `sparkTableProfilerConfig` to your table configuration
- Specify a partition column for Spark processing; otherwise the profiler falls back to the primary key (if one exists) or skips the table entirely
Run the Pipeline
Use the same command as before:
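For instance, if you run the workflow through the `metadata` CLI, the invocation is unchanged (the config file name is a placeholder):

```bash
metadata profile -c profiler.yaml
```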
The pipeline will now use the Spark Engine instead of the Native Engine for processing.
Troubleshooting Configuration
Common Issues
- Missing Partition Column: Ensure you've specified a suitable partition column
- Network Connectivity: Verify Spark Connect and database connectivity
- Driver Issues: Check that the appropriate database drivers are installed in the Spark cluster
- Configuration Errors: Validate YAML syntax and required fields
Debugging Steps
- Check Logs: Review profiler logs for specific error messages
- Test Connectivity: Verify all network connections are working
- Validate Configuration: Ensure all required fields are properly set
- Test with Small Dataset: Start with a small table to verify the setup
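As a sketch of that last step, you might temporarily scope `tableConfig` to a single small table until the Spark setup is verified (placeholder names; `partitionColumn` as assumed above):

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      # Limit the run to one small table while validating the Spark setup
      - fullyQualifiedName: my_service.my_db.my_schema.small_table   # placeholder FQN
        sparkTableProfilerConfig:
          partitionColumn: id   # placeholder; use a column that exists on this table
```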