Ingestion Framework External Deployment
Any tool capable of running Python code can be used to configure the metadata extraction from your sources.

1. How does the Ingestion Framework work?
The Ingestion Framework contains all the logic about how to connect to the sources, extract their metadata, and send it to the OpenMetadata server. We have built it from scratch with the main idea of making it an independent component that can be run from - literally - anywhere. In order to install it, you just need to get it from PyPI (the openmetadata-ingestion package).

2. Ingestion Configuration
The Workflow classes of the Ingestion Framework are created from a YAML configuration. Any Workflow that you execute (ingestion,
profiler, lineage,…) will have its own YAML representation.
You can think about this configuration as the recipe you want to execute: where your source is, which pieces you
extract, how they are processed, and where they are sent.
An example YAML config for extracting MySQL metadata looks like this:
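Below is a minimal sketch, kept as a Python string so it can later be handed directly to the Workflow classes covered in the Testing section. The host, credentials, and JWT token are placeholders, and the exact connection fields may vary slightly between releases:

```python
# Hypothetical MySQL ingestion recipe: source -> sink -> workflowConfig.
# All hosts, credentials and the JWT token below are placeholders.
mysql_metadata_yaml = """
source:
  type: mysql
  serviceName: mysql_local
  serviceConnection:
    config:
      type: Mysql
      username: openmetadata_user
      authType:
        password: openmetadata_password
      hostPort: localhost:3306
  sourceConfig:
    config:
      type: DatabaseMetadata
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  loggerLevel: INFO
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <jwt-token>
"""
```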
The following sections walk through the main options of the workflowConfig block.
Workflow Config
Here you will define information such as where you are hosting the OpenMetadata server and the JWT token to authenticate.

Logger Level

You can specify the loggerLevel depending on your needs. If you are trying to troubleshoot an ingestion, running
with DEBUG will give you far more traces for identifying issues.
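As a sketch (shown as a Python string, matching the pattern used in the Testing examples below), only the relevant fragment of the workflowConfig is needed:

```python
# workflowConfig fragment only: merge it into your full recipe.
debug_logging_yaml = """
workflowConfig:
  loggerLevel: DEBUG  # far more traces than the default INFO
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <jwt-token>
"""
```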
JWT Token
JWT tokens will allow your clients to authenticate against the OpenMetadata server.
You can find more details on how to enable JWT tokens here.
You can refer to the JWT Troubleshooting section for any issues in
your JWT configuration.
Store Service Connection
If set to true (default), we will store the sensitive information either encrypted via the Fernet Key in the database
or externally, if you have configured any Secrets Manager.
If set to false, the service will be created, but the service connection information will only be used by the Ingestion
Framework at runtime, and won’t be sent to the OpenMetadata server.
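As a sketch, assuming the flag sits under openMetadataServerConfig as in recent releases:

```python
# workflowConfig fragment only: keep the connection details out of OpenMetadata
# and provide them to the Ingestion Framework at runtime instead.
no_store_connection_yaml = """
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    storeServiceConnection: false
    securityConfig:
      jwtToken: <jwt-token>
"""
```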
Secrets Manager Configuration
If you have configured any Secrets Manager, you need to let the Ingestion Framework know
how to retrieve the credentials securely.
Follow the docs to configure the secret retrieval based on your environment.
SSL Configuration
If you have added SSL to the OpenMetadata server, then you will need to handle
the certificates when running the ingestion too. You can either set verifySSL to ignore, or have it as validate,
which will require you to set sslConfig.caCertificate to a path, local to where the ingestion runs, that points
to the server certificate file.
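A sketch of the relevant keys, with a hypothetical host and certificate path:

```python
# workflowConfig fragment only: validate the server certificate from the
# machine where the ingestion runs.
ssl_yaml = """
workflowConfig:
  openMetadataServerConfig:
    hostPort: https://openmetadata.example.com/api
    verifySSL: validate  # or "ignore" to skip certificate validation
    sslConfig:
      caCertificate: /local/path/to/openmetadata-server.pem
    authProvider: openmetadata
    securityConfig:
      jwtToken: <jwt-token>
"""
```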
Find more information on how to troubleshoot SSL issues here.
JWT Token with Secrets Manager
If you are using the Secrets Manager, you can let the Ingestion client pick up the JWT token dynamically from the Secrets Manager at runtime.

Let's show an example. We have an OpenMetadata server running with the managed-aws Secrets Manager. Since we used the OPENMETADATA_CLUSTER_NAME env var
as test, our ingestion-bot JWT token is safely stored under the secret ID /test/bot/ingestion-bot/config/jwttoken.
Now, we can use the following workflow config to run the ingestion without having to pass the token, but just pointing to the secret itself:
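A sketch of such a config, showing only the keys relevant to this scenario:

```python
# workflowConfig fragment only: the JWT token is resolved from the Secrets Manager
# at runtime instead of being written in the YAML.
secrets_manager_yaml = """
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    secretsManagerProvider: aws
    secretsManagerLoader: env
    securityConfig:
      jwtToken: secret:/test/bot/ingestion-bot/config/jwttoken
"""
```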
- We specify the secretsManagerProvider pointing to aws, since that's the manager we are using.
- We set secretsManagerLoader as env. Since we're running this locally, we'll let the AWS credentials be loaded from the local env vars. (When running this using the UI, note that the generated workflows will have this value set as airflow!)
- We set the jwtToken value as secret:/test/bot/ingestion-bot/config/jwttoken, which tells the client that this value is a secret located under /test/bot/ingestion-bot/config/jwttoken.
You can then run the ingestion as usual: metadata ingest -c <path to yaml>.
3. (Optional) Ingestion Pipeline
Additionally, if you want to see your runs logged in the Ingestions tab of the connectors page in the UI as you would
when running the connectors natively with OpenMetadata, you can add the following configuration to your YAMLs:
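As a sketch, with a hypothetical Ingestion Pipeline FQN:

```python
# Fragment only: ingestionPipelineFQN is a root-level key of the workflow YAML,
# alongside source, sink and workflowConfig. The FQN below is hypothetical.
pipeline_fqn_yaml = """
ingestionPipelineFQN: mysql_local.mysql_metadata_pipeline
"""
```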
ingestionPipelineFQN - the Ingestion Pipeline Fully Qualified Name - will tell the Ingestion Framework
to log the executions and update the ingestion status, which will appear on the UI. Note that the action buttons
will be disabled, since OpenMetadata won’t be able to interact with external systems.
4. (Optional) Disable the Pipeline Service Client
If you want to run your workflows ONLY externally, without relying on OpenMetadata for any workflow management or scheduling, you can disable the Pipeline Service Client in the server configuration: either set enabled: false in its configuration block, or set PIPELINE_SERVICE_CLIENT_ENABLED=false as an environment variable.
This will stop certain APIs and monitors related to the Pipeline Service Client (e.g., Airflow) from being operative.
Examples
Airflow
Run the ingestion process externally from Airflow
MWAA
Run the ingestion process externally using AWS MWAA
GCP Composer
Run the ingestion process externally from GCP Composer
GitHub Actions
Run the ingestion process externally from GitHub Actions
Testing
You can easily test every YAML configuration using the metadata CLI from the Ingestion Framework.
In order to install it, you just need to get it from PyPI.
In each of the examples below, we'll showcase how to run the workflow from Python, assuming you have a YAML configuration for it.
Metadata Workflow
This is the first workflow you have to configure and run. It will take care of fetching the metadata from your sources, be it Database Services, Dashboard Services, Pipelines, etc. The rest of the workflows (Lineage, Profiler,…) will be executed on top of the metadata already available in the platform.

1. Adding the imports

The first step is to import the MetadataWorkflow class, which will take care of the full ingestion logic. We'll
add the import for printing the results at the end.
2. Defining the YAML

Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can
read from a file, parse secrets from your environment, or any other approach you'd need. In the end, it's just
Python code; a combined sketch of the three steps follows below.
3. Preparing the Workflow

Finally, we'll prepare a function that we can execute anywhere. It will take care of instantiating the workflow, executing it, and giving us the results.
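Putting the three steps together, a minimal sketch could look like the one below. The import paths (metadata.workflow.metadata.MetadataWorkflow, metadata.workflow.workflow_output_handler.print_status) follow recent Ingestion Framework releases and may differ in yours (newer versions may also expose workflow.print_status()); the connection details are placeholders.

```python
import yaml

# 1. Imports: the workflow class plus the helper that prints the results.
from metadata.workflow.metadata import MetadataWorkflow
from metadata.workflow.workflow_output_handler import print_status

# 2. The YAML configuration, here as an inline string.
CONFIG = """
source:
  type: mysql
  serviceName: mysql_local
  serviceConnection:
    config:
      type: Mysql
      username: openmetadata_user
      authType:
        password: openmetadata_password
      hostPort: localhost:3306
  sourceConfig:
    config:
      type: DatabaseMetadata
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <jwt-token>
"""


# 3. A function that can be executed anywhere: a script, a cron job, an Airflow task...
def run():
    workflow_config = yaml.safe_load(CONFIG)
    workflow = MetadataWorkflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    print_status(workflow)
    workflow.stop()


if __name__ == "__main__":
    run()
```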
Lineage Workflow
This workflow will take care of scanning your query history and defining lineage relationships between your tables. You can find more information about this workflow here.

1. Adding the imports

The first step is to import the MetadataWorkflow class, which will take care of the full ingestion logic. We'll
add the import for printing the results at the end. Note that we are using the same class as in the Metadata Ingestion.
2. Defining the YAML

Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can
read from a file, parse secrets from your environment, or any other approach you'd need.

Note how we have not added the serviceConnection here. Since the service would have been created during the
metadata ingestion, we can let the Ingestion Framework dynamically fetch the Service Connection information. If, however, you are configuring the workflow with storeServiceConnection: false, you'll need to explicitly
define the serviceConnection.
3. Preparing the Workflow

Finally, we'll prepare a function that we can execute anywhere. It will take care of instantiating the workflow, executing it, and giving us the results.
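A minimal sketch, following the same pattern and import assumptions as the Metadata Workflow above. The source type and service name (snowflake-lineage, snowflake_prod) are hypothetical examples of a connector that supports lineage; note the absence of a serviceConnection block.

```python
import yaml

# Same workflow class as the metadata ingestion.
from metadata.workflow.metadata import MetadataWorkflow
from metadata.workflow.workflow_output_handler import print_status

# No serviceConnection: it is fetched dynamically from the OpenMetadata server.
LINEAGE_CONFIG = """
source:
  type: snowflake-lineage
  serviceName: snowflake_prod
  sourceConfig:
    config:
      type: DatabaseLineage
      queryLogDuration: 1
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <jwt-token>
"""


def run():
    workflow = MetadataWorkflow.create(yaml.safe_load(LINEAGE_CONFIG))
    workflow.execute()
    workflow.raise_from_status()
    print_status(workflow)
    workflow.stop()
```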
Usage Workflow
As with the lineage workflow, we'll scan the query history for any DML statements. The goal is to ingest queries into the platform and figure out the relevancy of your assets and the frequently joined tables.

1. Adding the imports

The first step is to import the UsageWorkflow class, which will take care of the full ingestion logic. We'll
add the import for printing the results at the end.
2. Defining the YAML

Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can
read from a file, parse secrets from your environment, or any other approach you'd need.

Note how we have not added the serviceConnection here. Since the service would have been created during the
metadata ingestion, we can let the Ingestion Framework dynamically fetch the Service Connection information. If, however, you are configuring the workflow with storeServiceConnection: false, you'll need to explicitly
define the serviceConnection.
3. Preparing the Workflow

Finally, we'll prepare a function that we can execute anywhere. It will take care of instantiating the workflow, executing it, and giving us the results.
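A minimal sketch, with the same import assumptions as above (the UsageWorkflow path may differ in your release). The source type, service name, and staging path are hypothetical, and the exact recipe shape can vary by connector:

```python
import yaml

from metadata.workflow.usage import UsageWorkflow
from metadata.workflow.workflow_output_handler import print_status

# Usage recipes stage the parsed queries on disk before bulk-sinking them.
USAGE_CONFIG = """
source:
  type: snowflake-usage
  serviceName: snowflake_prod
  sourceConfig:
    config:
      type: DatabaseUsage
      queryLogDuration: 1
processor:
  type: query-parser
  config: {}
stage:
  type: table-usage
  config:
    filename: /tmp/snowflake_usage
bulkSink:
  type: metadata-usage
  config:
    filename: /tmp/snowflake_usage
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <jwt-token>
"""


def run():
    workflow = UsageWorkflow.create(yaml.safe_load(USAGE_CONFIG))
    workflow.execute()
    workflow.raise_from_status()
    print_status(workflow)
    workflow.stop()
```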
Profiler Workflow
This workflow will execute queries against your database and send the results into OpenMetadata. The goal is to compute metrics about your data and give you a high-level view of its shape, together with the sample data. This is a useful step to take before creating Data Quality Workflows. You can find more information about this workflow here.

1. Adding the imports

The first step is to import the ProfilerWorkflow class, which will take care of the full ingestion logic. We'll
add the import for printing the results at the end.
2. Defining the YAML

Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can
read from a file, parse secrets from your environment, or any other approach you'd need.

Note how we have not added the serviceConnection here. Since the service would have been created during the
metadata ingestion, we can let the Ingestion Framework dynamically fetch the Service Connection information. If, however, you are configuring the workflow with storeServiceConnection: false, you'll need to explicitly
define the serviceConnection.
3. Preparing the Workflow

Finally, we'll prepare a function that we can execute anywhere. It will take care of instantiating the workflow, executing it, and giving us the results.
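A minimal sketch, reusing the MySQL service from the metadata example and the same import assumptions as above (the ProfilerWorkflow path may differ in your release); the processor options shown are illustrative only:

```python
import yaml

from metadata.workflow.profiler import ProfilerWorkflow
from metadata.workflow.workflow_output_handler import print_status

PROFILER_CONFIG = """
source:
  type: mysql
  serviceName: mysql_local
  sourceConfig:
    config:
      type: Profiler
      generateSampleData: true
processor:
  type: orm-profiler
  config: {}
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <jwt-token>
"""


def run():
    workflow = ProfilerWorkflow.create(yaml.safe_load(PROFILER_CONFIG))
    workflow.execute()
    workflow.raise_from_status()
    print_status(workflow)
    workflow.stop()
```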
Data Quality Workflow
This workflow will execute the Data Quality tests configured against your tables and send the results into OpenMetadata. You can find more information about this workflow here.

1. Adding the imports

The first step is to import the TestSuiteWorkflow class, which will take care of the full ingestion logic. We'll
add the import for printing the results at the end.
2. Defining the YAML

Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can
read from a file, parse secrets from your environment, or any other approach you'd need.

Note how we have not added the serviceConnection here. Since the service would have been created during the
metadata ingestion, we can let the Ingestion Framework dynamically fetch the Service Connection information. If, however, you are configuring the workflow with storeServiceConnection: false, you'll need to explicitly
define the serviceConnection.

Moreover, see how we are not configuring any tests in the processor. You can do that,
but even if nothing gets defined in the YAML, we will execute all the tests configured against the table.
3. Preparing the Workflow

Finally, we'll prepare a function that we can execute anywhere. It will take care of instantiating the workflow, executing it, and giving us the results.
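A minimal sketch, with the same import assumptions as above (the TestSuiteWorkflow path may differ in your release). The entityFullyQualifiedName below is a hypothetical table FQN; with an empty processor config, all tests already configured against that table will run:

```python
import yaml

from metadata.workflow.data_quality import TestSuiteWorkflow
from metadata.workflow.workflow_output_handler import print_status

# No tests defined in the processor: every test configured against the table runs.
DQ_CONFIG = """
source:
  type: mysql
  serviceName: mysql_local
  sourceConfig:
    config:
      type: TestSuite
      entityFullyQualifiedName: mysql_local.default.shop.customers
processor:
  type: orm-test-runner
  config: {}
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <jwt-token>
"""


def run():
    workflow = TestSuiteWorkflow.create(yaml.safe_load(DQ_CONFIG))
    workflow.execute()
    workflow.raise_from_status()
    print_status(workflow)
    workflow.stop()
```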