
Hybrid Ingestion Runner

The Hybrid Ingestion Runner enables Collate customers operating in hybrid environments to securely execute ingestion workflows within their own cloud infrastructure. In this setup, your SaaS instance is hosted on Collate’s cloud, while the workflows are deployed and executed within your private cloud. The Hybrid Runner acts as a bridge between these two environments, allowing ingestion workflows to be triggered and managed remotely, without requiring you to share secrets or sensitive credentials with Collate. It securely receives workflow execution requests and orchestrates them locally, keeping full control and data privacy within your environment.

Prerequisites

Before setting up the Hybrid Ingestion Runner, ensure the following:
  • The Hybrid Runner has been set up. Contact the Collate team for assistance with setting it up in your infrastructure.
  • A secrets manager is configured in your cloud.

Configuration Steps for Admins

Once your DevOps team has installed and configured the Hybrid Runner, follow these steps as a Collate Admin to configure services and manage ingestion workflows.

1. Validate Hybrid Runner Setup

  • Go to Settings > Preferences > Ingestion Runners in the Collate UI.
  • Look for your runner in the list.
  • The status should display as Connected.
If the runner is not connected, reach out to Collate support.

2. Create a New Service

  • Navigate to Settings > Services.
  • Click + Add New Service.
  • Fill in the service details.
  • In the “Ingestion Runner” dropdown, choose the hybrid runner.
Even if you’re operating in hybrid mode, you can still choose “Collate SaaS Runner” to run the ingestion workflow within Collate’s SaaS environment.

3. Manage Secrets Securely

When executing workflows in your hybrid environment, use your existing cloud provider’s secrets manager to store sensitive credentials (such as passwords or tokens), and reference them securely in Collate via the Hybrid Runner. Collate never stores or accesses these secrets directly; only the Hybrid Runner retrieves them at runtime from your own infrastructure. Steps:
  • Create your secret in your Secrets Manager of choice:
    • AWS Secrets Manager
    • Azure Key Vault
    • GCP Secret Manager
When creating a secret, store the value as-is (e.g., password123) without any additional formatting or encoding; the Hybrid Runner handles retrieval and decryption of the secret value at runtime. For example, in AWS Secrets Manager, click Store a new secret > Other type of secret > Plaintext and paste the secret as-is, without any other formatting (such as quotes or JSON wrapping).
Finally, in the service connection form in Collate, reference the secret using the secret: prefix followed by the full path to your secret. 📌 For example, if your secret is stored in AWS Secrets Manager at /my/database/password, you would reference it in the service connection form as:
password: secret:/my/database/password
Note that this approach to handling secrets only works for values that are treated as secrets in the connection form. You can identify these fields because they mask the input and show an icon on the right that toggles showing or hiding the value.
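Conceptually, the runner distinguishes literal values from secret references by the secret: prefix and only fetches the latter at runtime. The sketch below illustrates that resolution logic; fetch_secret is a hypothetical stand-in for a real secrets-manager client call (e.g., an AWS, Azure, or GCP SDK lookup), not part of Collate’s API.

```python
# Sketch of how a "secret:"-prefixed connection value is resolved at runtime.
# fetch_secret is a hypothetical callable standing in for a real
# secrets-manager client; it is not part of Collate's actual API.

SECRET_PREFIX = "secret:"

def resolve_connection_value(value: str, fetch_secret) -> str:
    """Return the literal value, or fetch it when prefixed with 'secret:'."""
    if value.startswith(SECRET_PREFIX):
        secret_path = value[len(SECRET_PREFIX):]  # e.g. "/my/database/password"
        return fetch_secret(secret_path)
    return value

# In-memory stand-in for a cloud secrets manager:
store = {"/my/database/password": "password123"}
resolved = resolve_connection_value("secret:/my/database/password", store.__getitem__)
print(resolved)  # password123
```

Non-prefixed values pass through unchanged, which is why the prefix is required for the runner to know a lookup is needed.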

Helm Deployment

The Hybrid Runner is deployed using a Helm chart that handles the installation of the Hybrid Runner Server, Argo Workflows, and all supporting components.

Prerequisites

  • Kubernetes cluster running version 1.28 or later
  • Helm and kubectl installed
  • AWS credentials provided by Collate (for pulling Docker images from ECR)

Minimal Configuration

Create a values.yaml with the following minimal configuration:
config:
  agentId: "aws-prod"  # A meaningful name, just for UI reference
  authToken: <Provided by Collate>
  serverHost: <mycluster>.getcollate.io
ecrRegistryHelper:
  collateCredentials:
    values:
      accessKeyId: <Provided by Collate>
      secretAccessKey: <Provided by Collate>
installArgoWorkflows: true
If you’re unsure about any values, reach out to your Collate support contact for credentials and configuration details.

Getting the Authentication Token

Log in to your Collate instance using an administrator account, navigate to Settings > Bots and search for ingestion. Click on IngestionBot and copy the OpenMetadata JWT Token.

Deploy

helm repo add collate-hybrid https://open-metadata.github.io/hybrid-ingestion-runner-helm-chart
helm repo update
helm install collate-prod collate-hybrid/hybrid-ingestion-runner --values values.yaml

Configuring Node Scheduling for Ingestion Pods

Common production issue: If every node in your Kubernetes cluster has a NoSchedule taint, ingestion pods will fail to schedule (FailedScheduling) unless matching tolerations are added. This is the most frequent cause of pods stuck in Pending state after deploying the Hybrid Runner.
Ingestion pods run as independent Kubernetes pods. If your cluster uses node taints to isolate workloads, you must configure tolerations and node affinity so that ingestion pods can be scheduled on the correct nodes. The configuration differs depending on your executor type. Use the section that matches your setup:
  • Argo Workflows executor (Hybrid Runner with Argo): use ARGO_PIPELINE_TYPE_CONFIGS
  • Simple Kubernetes executor (default, Hybrid Runner without Argo): use SIMPLEK8S_PIPELINE_TYPE_CONFIGS
Do not mix these two environment variables. Each executor reads from its own configuration key. Setting SIMPLEK8S_PIPELINE_TYPE_CONFIGS when using Argo will have no effect, and vice versa.
The configuration value is a JSON string scoped by pipeline type: automation, metadata, profiler, lineage.
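Because the environment variable value is a single JSON string, a malformed or misspelled key fails silently at scheduling time. A quick local sanity check before pasting it into values.yaml can catch that early; this is an illustrative helper (not a Collate tool), with the allowed keys taken from this page.

```python
import json

# Allowed pipeline types and per-type fields, as documented on this page.
PIPELINE_TYPES = {"automation", "metadata", "profiler", "lineage"}
ALLOWED_FIELDS = {"tolerations", "affinity", "nodeSelector",
                  "priorityClassName", "resources"}

def validate_pipeline_configs(raw: str) -> dict:
    """Parse the env var value and reject unknown keys early."""
    config = json.loads(raw)  # raises ValueError on malformed JSON
    for pipeline_type, fields in config.items():
        if pipeline_type not in PIPELINE_TYPES:
            raise ValueError(f"unknown pipeline type: {pipeline_type}")
        unknown = set(fields) - ALLOWED_FIELDS
        if unknown:
            raise ValueError(f"unknown fields for {pipeline_type}: {unknown}")
    return config

raw = ('{"metadata": {"tolerations": [{"key": "nodetype", "operator": "Equal",'
       ' "value": "openmetadata-hybrid-runner", "effect": "NoSchedule"}]}}')
config = validate_pipeline_configs(raw)
print(config["metadata"]["tolerations"][0]["key"])  # nodetype
```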

Argo Workflows Executor

Add the following to your values.yaml:
extraEnvs:
  - name: ARGO_PIPELINE_TYPE_CONFIGS
    value: >-
      {
        "automation": {
          "tolerations": [
            {
              "key": "nodetype",
              "operator": "Equal",
              "value": "openmetadata-hybrid-runner",
              "effect": "NoSchedule"
            }
          ],
          "affinity": {
            "nodeAffinity": {
              "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                  {
                    "matchExpressions": [
                      {
                        "key": "nodetype",
                        "operator": "In",
                        "values": ["openmetadata-hybrid-runner"]
                      }
                    ]
                  }
                ]
              }
            }
          }
        },
        "metadata": {
          "tolerations": [
            {
              "key": "nodetype",
              "operator": "Equal",
              "value": "openmetadata-hybrid-runner",
              "effect": "NoSchedule"
            }
          ]
        },
        "profiler": {
          "tolerations": [
            {
              "key": "nodetype",
              "operator": "Equal",
              "value": "openmetadata-hybrid-runner",
              "effect": "NoSchedule"
            }
          ]
        },
        "lineage": {
          "tolerations": [
            {
              "key": "nodetype",
              "operator": "Equal",
              "value": "openmetadata-hybrid-runner",
              "effect": "NoSchedule"
            }
          ]
        }
      }
Replace nodetype and openmetadata-hybrid-runner with the taint key and value used in your cluster. You can check your node taints by running:
kubectl get nodes -o=custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Simple Kubernetes Executor

If you are using the default Simple K8s executor (no Argo Workflows), configure the equivalent using SIMPLEK8S_PIPELINE_TYPE_CONFIGS:
extraEnvs:
  - name: SIMPLEK8S_PIPELINE_TYPE_CONFIGS
    value: >-
      {
        "automation": {
          "tolerations": [
            {
              "key": "nodetype",
              "operator": "Equal",
              "value": "openmetadata-hybrid-runner",
              "effect": "NoSchedule"
            }
          ],
          "nodeSelector": {
            "nodetype": "openmetadata-hybrid-runner"
          }
        }
      }

Supported Configuration Fields per Pipeline Type

Each pipeline type entry (automation, metadata, profiler, lineage) supports the following fields:
  • tolerations — List of Kubernetes toleration objects to allow scheduling on tainted nodes
  • affinity — Node/pod affinity rules (supports nodeAffinity, podAffinity, podAntiAffinity)
  • nodeSelector — Key-value labels to target specific nodes
  • priorityClassName — Kubernetes priority class for the pod
  • resources — CPU and memory requests and limits
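One way to keep the JSON value valid is to compose it programmatically and serialize it. The sketch below builds a single pipeline-type entry using the documented fields; the taint key/value and resource figures are placeholders to replace with your cluster’s values.

```python
import json

# Illustrative: compose one pipeline-type entry from the documented fields.
# "nodetype"/"openmetadata-hybrid-runner" and the resource figures are
# placeholders for your own cluster's taints, labels, and sizing.
toleration = {
    "key": "nodetype",
    "operator": "Equal",
    "value": "openmetadata-hybrid-runner",
    "effect": "NoSchedule",
}
profiler_entry = {
    "tolerations": [toleration],
    "nodeSelector": {"nodetype": "openmetadata-hybrid-runner"},
    "priorityClassName": "high-priority",
    "resources": {
        "requests": {"cpu": "500m", "memory": "1Gi"},
        "limits": {"cpu": "2", "memory": "4Gi"},
    },
}
configs = {"profiler": profiler_entry}
# The one-line JSON output is what goes into the env var value:
print(json.dumps(configs))
```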

Additional Settings

Hosting Your Own Docker Images

If you need to use your own Docker registry instead of Collate’s ECR, update the following Helm values. For the Hybrid Runner pod:
image:
  repository: my-repo.com/my-image
  tag: my-tag
imagePullSecrets: my-credentials
For Ingestion pods:
config:
  ingestionPods:
    repository: my-repo.com/my-image
    tag: my-tag
    imagePullSecrets: my-credentials
If you host your own images, make sure to do so for both the Hybrid Runner and Ingestion pods.
By default, the Hybrid Runner dynamically resolves ingestion pod image tags to match the Collate server version (e.g., server version 1.11.1 → image tag om-1.11.1-cl-1.11.1). If you manage your own tags, disable this behavior:
extraEnvs:
  - name: DYNAMIC_INGESTION_VERSION_ENABLED
    value: 'false'
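The tag pattern inferred from the example above (server 1.11.1 → om-1.11.1-cl-1.11.1) can be sketched as a simple mapping; useful when mirroring Collate’s images into your own registry under matching tags.

```python
def ingestion_image_tag(server_version: str) -> str:
    """Derive the ingestion image tag from the Collate server version,
    following the om-<version>-cl-<version> pattern shown above."""
    return f"om-{server_version}-cl-{server_version}"

print(ingestion_image_tag("1.11.1"))  # om-1.11.1-cl-1.11.1
```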

Configuring Cloud Secret Stores

AWS (EKS Pod Identity or IRSA)

Configure the ingestion pods’ service account to assume an IAM role via EKS Pod Identity or IRSA. The serviceAccount name is ingestion by default. Required IAM permissions:
  • secretsmanager:GetSecretValue
  • secretsmanager:DescribeSecret
  • secretsmanager:ListSecrets
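A minimal IAM policy granting these permissions might look like the following sketch. The resource ARN is a placeholder to scope down to your actual secrets; ListSecrets does not support resource-level restriction, so it keeps a wildcard resource.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
      "Resource": "arn:aws:secretsmanager:<region>:<account>:secret:*"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:ListSecrets",
      "Resource": "*"
    }
  ]
}
```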
For IRSA, add the following to your values.yaml:
config:
  secretsManager: "managed-aws"
  ingestionPods:
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::<account>:role/<role-name>

Azure (Workload Identity)

Configure Workload Identity with a User Assigned Managed Identity. Required role: Key Vault Secrets Officer.
config:
  secretsManager: "managed-azure-kv"
  ingestionPods:
    serviceAccount:
      annotations:
        "azure.workload.identity/client-id": <user_assigned_managed_identity_client_id>
    extraEnvs: "AZURE_KEY_VAULT_NAME:<azure_key_vault_name>"
argoWorkflows:
  controller:
    workflowDefaults:
      spec:
        podMetadata:
          labels:
            azure.workload.identity/use: "true"

Metrics Exposure for Prometheus

The Collate hybrid agent exposes operational metrics in a Prometheus-compatible format via an HTTP endpoint. These metrics are designed to support observability, enabling integration with Prometheus and similar monitoring systems. The exposed metrics provide insight into agent state, activity, and performance. These may evolve over time, and users are encouraged to inspect the /metrics endpoint directly for the latest set of available metrics.

Example

An example metric exposed by the agent:

# HELP collate_hybrid_agent_connected Is the agent connected to the server? (0 = No, 1 = Yes)
# TYPE collate_hybrid_agent_connected gauge
collate_hybrid_agent_connected 1.0
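For a quick health check outside of Prometheus, the gauge can be read directly from the text exposition format. This illustrative parser extracts a metric value by name, using the sample output above; in practice you would fetch the text from the /metrics endpoint first.

```python
def parse_gauge(metrics_text: str, name: str) -> float:
    """Extract a gauge value from Prometheus text exposition format."""
    for line in metrics_text.splitlines():
        # Skip "# HELP" / "# TYPE" comment lines; match the sample line.
        if line.startswith(name):
            return float(line.split()[-1])  # last token is the value
    raise KeyError(name)

sample = """\
# HELP collate_hybrid_agent_connected Is the agent connected to the server? (0 = No, 1 = Yes)
# TYPE collate_hybrid_agent_connected gauge
collate_hybrid_agent_connected 1.0
"""
connected = parse_gauge(sample, "collate_hybrid_agent_connected")
print(connected)  # 1.0
```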

Configuration

Metrics exposure can be configured via the following settings:

metricsServerConfiguration:
  port: ${METRICS_SERVER_PORT:-8989}
  path: ${METRICS_SERVER_PATH:-/metrics}

  • port: Port on which the metrics endpoint will be served (default: 8989)
  • path: HTTP path for accessing metrics (default: /metrics)
These parameters support environment variable overrides for flexibility across deployment environments.
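The ${VAR:-default} syntax means “use the environment variable if it is set and non-empty, otherwise fall back to the default.” A Python equivalent of that expansion, shown with the documented defaults:

```python
def resolve(env: dict, var: str, default: str) -> str:
    """Python equivalent of the shell-style ${VAR:-default} expansion:
    an unset or empty variable falls back to the default."""
    return env.get(var) or default

# With no overrides set, the documented defaults apply:
port = resolve({}, "METRICS_SERVER_PORT", "8989")
path = resolve({}, "METRICS_SERVER_PATH", "/metrics")
print(f"http://localhost:{port}{path}")  # http://localhost:8989/metrics

# An explicit override takes precedence:
port = resolve({"METRICS_SERVER_PORT": "9100"}, "METRICS_SERVER_PORT", "8989")
```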

Accessing Metrics

Once configured, metrics can be accessed at the following endpoint:

http://<agent-host>:<port>/<path>

For example, with default settings:

http://localhost:8989/metrics