Spark Engine Overview

The Spark Engine is a distributed processing engine in OpenMetadata that enables large-scale data profiling using Apache Spark. It's an alternative to the default Native engine, designed specifically for handling massive datasets that would be impractical or impossible to profile directly on the source database.

Use the Spark Engine when:

  • You have access to a Spark cluster (local, standalone, YARN, or Kubernetes)
  • Your datasets are too large to profile directly on the source database
  • You need distributed processing capabilities for enterprise-scale data profiling
  • Your source database doesn't have built-in distributed processing capabilities

Stick with the Native engine when:

  • Your source database already processes queries in a distributed way, as BigQuery and Snowflake do
  • Your profiler pipeline already runs smoothly directly on the source database
  • You're doing development or testing with small tables
  • You don't have access to a Spark cluster
  • You need the simplest possible setup
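
The checklist above can be read as a small decision rule. The sketch below is only an illustration of that reasoning: `ProfilingContext` and `choose_engine` are hypothetical names invented for this example and are not part of OpenMetadata's API or configuration.

```python
from dataclasses import dataclass


@dataclass
class ProfilingContext:
    """Hypothetical summary of a profiling workload (not an OpenMetadata type)."""
    has_spark_cluster: bool            # local, standalone, YARN, or Kubernetes
    source_is_distributed: bool        # e.g. BigQuery or Snowflake
    native_profiling_runs_well: bool   # the pipeline already works on the source
    small_dev_or_test_tables: bool = False


def choose_engine(ctx: ProfilingContext) -> str:
    """Return "Spark" or "Native" by applying the checklist in order."""
    # Without a Spark cluster, the Spark Engine is not an option.
    if not ctx.has_spark_cluster:
        return "Native"
    # A source that already distributes the work, a healthy native pipeline,
    # or small tables all favour the simpler Native setup.
    if (
        ctx.source_is_distributed
        or ctx.native_profiling_runs_well
        or ctx.small_dev_or_test_tables
    ):
        return "Native"
    # Otherwise the data is too large for the source database to profile itself.
    return "Spark"


if __name__ == "__main__":
    workload = ProfilingContext(
        has_spark_cluster=True,
        source_is_distributed=False,
        native_profiling_runs_well=False,
    )
    print(choose_engine(workload))
```

In practice the choice is a judgment call rather than a strict rule; the function only makes the ordering of the criteria explicit.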

The Spark Engine works within OpenMetadata's existing profiling framework and adds the distributed processing needed to profile data at enterprise scale.