Spark Engine External Configuration
Overview
To configure your profiler pipeline to use the Spark Engine, add the `processingEngine` configuration to your existing YAML file.
Step 1: Add Spark Engine Configuration
In your existing profiler YAML, add the `processingEngine` section under `sourceConfig.config`:
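A minimal sketch of the addition, assuming a MySQL source; the Spark Connect endpoint and the exact nested field names are placeholders to adapt to your deployment:

```yaml
source:
  type: mysql                  # your existing source stays unchanged
  sourceConfig:
    config:
      type: Profiler
      # New: route profiling computation to an external Spark cluster
      processingEngine:
        type: Spark
        remote: sc://your-spark-connect-host:15002   # placeholder endpoint
```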
Step 2: Add Partition Configuration
In the `processor.config.tableConfig` section, add the `sparkTableProfilerConfig`:
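A matching sketch for the processor side; the `orm-profiler` type, the fully qualified table name, and the `partitioning.partitionColumn` layout are assumptions to adjust to your setup:

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.my_schema.customers   # placeholder
        # New: tells Spark which column to use when splitting the table
        # into parallel reads
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: id   # numeric or date column with good spread
```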
Complete Example
Before (Native Engine)
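A representative native-engine pipeline, with service names, table names, and server settings as placeholders:

```yaml
source:
  type: mysql
  serviceName: my_mysql_service
  sourceConfig:
    config:
      type: Profiler
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_mysql_service.my_db.my_schema.customers
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
```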
After (Spark Engine)
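The same pipeline with the two Spark additions in place; only the commented sections change:

```yaml
source:
  type: mysql
  serviceName: my_mysql_service
  sourceConfig:
    config:
      type: Profiler
      # Added: delegate profiling to Spark via Spark Connect
      processingEngine:
        type: Spark
        remote: sc://your-spark-connect-host:15002   # placeholder endpoint
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_mysql_service.my_db.my_schema.customers
        # Added: partition hint so Spark can parallelize the table scan
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: id   # placeholder column
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
```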
Required Changes
- Add `processingEngine` to `sourceConfig.config`
- Add `sparkTableProfilerConfig` to your table configuration
- Specify a partition column for Spark processing (otherwise the profiler falls back to the primary key, if one exists, or skips the table entirely)
Run the Pipeline
Use the same command as before:
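For example, if your pipeline runs through the `metadata` CLI with a config file named `profiler.yaml` (both assumptions):

```bash
# Run the profiler workflow against the updated YAML
metadata profile -c profiler.yaml
```

Troubleshooting Configuration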
Common Issues
- Missing Partition Column: Ensure you’ve specified a suitable partition column
- Network Connectivity: Verify Spark Connect and database connectivity
- Driver Issues: Check that the appropriate database drivers are installed on the Spark cluster
- Configuration Errors: Validate YAML syntax and required fields
Debugging Steps
- Check Logs: Review profiler logs for specific error messages
- Test Connectivity: Verify all network connections are working
- Validate Configuration: Ensure all required fields are properly set
- Test with Small Dataset: Start with a small table to verify the setup
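A quick sanity-check sketch for the steps above, assuming placeholder hostnames and a local `profiler.yaml`:

```bash
# Verify the Spark Connect endpoint is reachable (host/port are placeholders)
nc -vz your-spark-connect-host 15002

# Verify database connectivity from the machine running the profiler
nc -vz your-database-host 3306

# Validate YAML syntax before re-running the pipeline (requires PyYAML)
python -c "import yaml, sys; yaml.safe_load(open(sys.argv[1]))" profiler.yaml
```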