Guide to Deploy Collate Binaries in AWS

This guide shows how to run the OpenMetadata Application in Kubernetes on Amazon EKS using Collate Docker Images, with Argo Workflows running the ingestion pipelines that are triggered from the OpenMetadata Application itself.

Architecture

Collate OpenMetadata requires 4 components:

Collate Server
Database: Collate Server stores its metadata in a relational database, either MySQL or Postgres. Amazon RDS is recommended for production.
- MySQL version 8.0.42 or greater
- Postgres version 17.6 or greater
Search Engine: OpenSearch 3.4 (Amazon OpenSearch Service recommended). Elasticsearch is not supported in Collate BYOC because Collate AI relies on OpenSearch’s vector capabilities for Semantic and Hybrid Search.
Workflow Orchestration: Argo Workflows is the orchestrator for ingestion pipelines.

Sizing Requirements

Hardware Requirements

The required configuration is a Kubernetes Cluster with at least 1 Master Node and 3 Worker Nodes (on Amazon EKS, the managed control plane takes the place of the Master Node). Each Worker Node should have at least:

4 vCPUs
16 GiB Memory
128 GiB Storage capacity

Software Requirements

Collate OpenMetadata supports Kubernetes Cluster version 1.24 or greater.
Collate Docker Images are available via private AWS Elastic Container Registry (ECR). The Collate Team will share credentials and steps to configure Kubernetes to pull Docker Images from AWS ECR.
For Argo Workflows, Collate OpenMetadata is currently compatible with application version 3.4+.
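
One way to confirm that a cluster meets the 1.24 minimum is to compare version strings. The following is a minimal sketch, not Collate tooling: version_ge is a hypothetical helper, the example minor version is made up, and the comparison assumes a sort that supports -V (GNU coreutils).

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  # The smaller version sorts first; if that is $2, then $1 >= $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example value only: managed clusters may report the minor version with a
# trailing "+", so strip every non-digit character before comparing.
server_minor="27+"
server_version="1.$(printf '%s' "$server_minor" | tr -cd '0-9')"

if version_ge "$server_version" "1.24"; then
  echo "Kubernetes $server_version is supported"
else
  echo "Kubernetes $server_version is older than 1.24"
fi
```

The real major and minor values can be read from the serverVersion field printed by kubectl version -o json.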

Recommended AWS Instance Types

Component	Instance Type
Collate Server	t4g.large / m6a.large
Argo Workflows runners	m7i.large

Database Sizing and Capacity

We recommend Amazon RDS for PostgreSQL. For 100,000 Data Assets and 1,000 Users, provision:

8 vCPUs
64 GiB Memory
256 GiB Storage Capacity
3,500 IOPS storage

Search Client Sizing and Capacity

For 100,000 Data Assets and 1,000 Users:

8 vCPUs
64 GiB Memory
256 GiB Storage Capacity

Use Amazon OpenSearch Service for production. We recommend deploying across multiple Availability Zones with a minimum of 2 nodes.

Argo Workflows Ingestion Runners

The recommended resources are 4 vCPUs and 16 GiB of Memory. Ingestion workloads can be scheduled on spot instances to reduce costs.

AWS Prerequisites

Enable EKS OIDC Provider

EKS clusters use an OIDC provider to enable IAM Roles for Service Accounts (IRSA). Check if your cluster already has one:

aws eks describe-cluster --name <CLUSTER_NAME> --query "cluster.identity.oidc.issuer" --output text

If no output is returned, associate an OIDC provider:

eksctl utils associate-iam-oidc-provider \
  --cluster <CLUSTER_NAME> \
  --region <AWS_REGION> \
  --approve

Retrieve the OIDC issuer URL for use in subsequent steps:

OIDC_ISSUER=$(aws eks describe-cluster --name <CLUSTER_NAME> \
  --query "cluster.identity.oidc.issuer" --output text | sed 's|https://||')
echo "OIDC Issuer: $OIDC_ISSUER"
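
The IAM trust policies later in this guide also expand ${AWS_ACCOUNT_ID}, which the steps above do not set. A minimal sketch for resolving it and for checking that both variables are populated is shown here. The helper names resolve_account_id and require_var are hypothetical, and aws sts get-caller-identity returns the account of whichever credentials are active in your shell, which is assumed to be the account that owns the EKS cluster.

```shell
# Hypothetical helper: print the 12-digit account ID of the active AWS credentials.
resolve_account_id() {
  aws sts get-caller-identity --query Account --output text
}

# Hypothetical helper: fail with a message if the named variable is unset or empty.
#   $1 = variable name (without the leading $)
require_var() {
  eval "_value=\${$1:-}"
  if [ -z "$_value" ]; then
    echo "ERROR: $1 is not set" >&2
    return 1
  fi
}

# Usage, once the AWS CLI is configured:
#   AWS_ACCOUNT_ID=$(resolve_account_id)
#   require_var AWS_ACCOUNT_ID && require_var OIDC_ISSUER
```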

Create an S3 Bucket for Argo Workflows Artifacts

Argo Workflows archives ingestion logs to S3:

aws s3 mb s3://collate-argo-artifacts-<AWS_REGION> --region <AWS_REGION>

Create IAM Roles for Service Accounts

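We need 4 IAM roles, one for each service account (Collate Server, Collate Ingestion, Argo Controller, Argo Server). The trust policies are written with unquoted heredocs, so the shell expands ${AWS_ACCOUNT_ID} and ${OIDC_ISSUER} when each file is created; both variables must be set before running the commands.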

IAM Role for Argo Workflows Controller

cat > argo-controller-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ISSUER}:sub": "system:serviceaccount:argo-workflows:argo-workflows-controller-sa"
      }
    }
  }]
}
EOF

aws iam create-role \
  --role-name collate-argo-controller-role \
  --assume-role-policy-document file://argo-controller-trust-policy.json

aws iam attach-role-policy \
  --role-name collate-argo-controller-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

IAM Role for Argo Workflows Server

cat > argo-server-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ISSUER}:sub": "system:serviceaccount:argo-workflows:argo-workflows-server-sa"
      }
    }
  }]
}
EOF

aws iam create-role \
  --role-name collate-argo-server-role \
  --assume-role-policy-document file://argo-server-trust-policy.json

aws iam attach-role-policy \
  --role-name collate-argo-server-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

IAM Role for Collate Server Application

cat > collate-server-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ISSUER}:sub": "system:serviceaccount:collate:openmetadata"
      }
    }
  }]
}
EOF

aws iam create-role \
  --role-name collate-server-role \
  --assume-role-policy-document file://collate-server-trust-policy.json

aws iam attach-role-policy \
  --role-name collate-server-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

IAM Role for Collate Ingestion

cat > collate-ingestion-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ISSUER}:sub": "system:serviceaccount:collate:om-role"
      }
    }
  }]
}
EOF

aws iam create-role \
  --role-name collate-ingestion-role \
  --assume-role-policy-document file://collate-ingestion-trust-policy.json

aws iam attach-role-policy \
  --role-name collate-ingestion-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
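
The four trust policies above differ only in the namespace and service account named in the sub condition. As a sketch of how that repetition could be factored out, the hypothetical helper render_trust_policy prints the same JSON for any pair; it is an illustration rather than part of the Collate tooling, and it assumes AWS_ACCOUNT_ID and OIDC_ISSUER are already set.

```shell
# Hypothetical helper: print an IRSA trust policy for one service account.
#   $1 = Kubernetes namespace, $2 = service account name
# Expects AWS_ACCOUNT_ID and OIDC_ISSUER to be set in the shell.
render_trust_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ISSUER}:sub": "system:serviceaccount:$1:$2"
      }
    }
  }]
}
EOF
}

# Usage:
#   render_trust_policy collate om-role > collate-ingestion-trust-policy.json
```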

Setup AWS ECR

Collate will provide the credentials to pull Docker Images from a private registry located in AWS ECR.

Install AWS CLI

Follow the AWS CLI installation guide to install AWS CLI on your machine.

Configure AWS Credentials

aws configure --profile ecr-collate

The command will prompt for credentials. The Collate team will securely share these via a 1Password link. Confirm the credentials are correctly set:

aws configure list --profile ecr-collate

Kubernetes Docker Registry Secrets for AWS ECR

kubectl create secret docker-registry ecr-registry-creds \
  --docker-server=118146679784.dkr.ecr.eu-west-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region eu-west-1 --profile ecr-collate)" \
  --namespace <NAMESPACE_NAME>

Replace <NAMESPACE_NAME> with the namespace where you want to deploy Collate OpenMetadata Server (this guide uses collate). If the namespace does not exist yet, create it first with kubectl create namespace <NAMESPACE_NAME>.

AWS ECR Token Refresh: ECR tokens expire after 12 hours. If a pod is rescheduled to another node after the token has expired, the image pull fails with an ImagePullBackOff error. Delete the secret and recreate it using the command above.
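
As an alternative to deleting and recreating the secret by hand, the refresh can be made repeatable. The sketch below wraps the same kubectl create secret command in a hypothetical helper, refresh_ecr_secret, and pipes a client-side dry run into kubectl apply so that an existing secret is overwritten. The --region eu-west-1 flag is an assumption taken from the registry hostname, and running the helper on a schedule shorter than 12 hours is a suggestion, not a Collate requirement.

```shell
# Hypothetical helper: create or update the ECR pull secret in namespace $1.
# `--dry-run=client -o yaml` renders the secret locally, and `kubectl apply`
# creates it or overwrites the existing one, so no delete step is needed.
refresh_ecr_secret() {
  kubectl create secret docker-registry ecr-registry-creds \
    --docker-server=118146679784.dkr.ecr.eu-west-1.amazonaws.com \
    --docker-username=AWS \
    --docker-password="$(aws ecr get-login-password --region eu-west-1 --profile ecr-collate)" \
    --namespace "$1" \
    --dry-run=client -o yaml |
    kubectl apply --namespace "$1" -f -
}

# Usage:
#   refresh_ecr_secret collate
```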

Install Argo Workflows

Add Helm Repository

helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

Create the Argo Namespace

kubectl create namespace argo-workflows

Kubernetes Secret for Argo Workflows DB Credentials

kubectl create secret generic argo-db-credentials \
  --from-literal=username=<DB_USERNAME> \
  --from-literal=password=<DB_PASSWORD> \
  --namespace argo-workflows

Create Custom Helm Values for Argo Workflows

Create a file named argo-workflows.values.yml:

# argo-workflows.values.yml
controller:
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"
  serviceAccount:
    create: true
    name: argo-workflows-controller-sa
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<AWS_ACCOUNT_ID>:role/collate-argo-controller-role"
  name: workflow-controller
  workflowDefaults:
    spec:
      serviceAccountName: om-role
  persistence:
    archive: true
    postgresql:
      host: <DATABASE_INSTANCE_ENDPOINT>
      database: <DATABASE_NAME>
      tableName: argo_workflows
      userNameSecret:
        name: argo-db-credentials
        key: username
      passwordSecret:
        name: argo-db-credentials
        key: password
      ssl: true
      sslMode: require

server:
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"
  serviceAccount:
    create: true
    name: argo-workflows-server-sa
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<AWS_ACCOUNT_ID>:role/collate-argo-server-role"
  extraArgs:
    - "--auth-mode=server"
    - "--request-timeout=5m"

useDefaultArtifactRepo: true
useStaticCredentials: false
artifactRepository:
  archiveLogs: true
  s3:
    endpoint: s3.amazonaws.com
    bucket: collate-argo-artifacts-<AWS_REGION>
    keyFormat: 'workflows/{{workflow.namespace}}/{{workflow.name}}/{{pod.name}}'
    insecure: false
    region: <AWS_REGION>
    encryptionOptions:
      enableEncryption: true

For further customisation, refer to the community Helm chart values.
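
The values file above contains several <PLACEHOLDER> tokens that must be replaced before it is passed to Helm. A minimal sketch of doing that with sed follows. The helper names, the DB_ENDPOINT and DB_NAME variable names, and the template file name in the usage comment are all hypothetical, and the sed expressions assume the substituted values contain no | character.

```shell
# Hypothetical helper: replace the <PLACEHOLDER> tokens used in the Argo values file.
# Reads a template on stdin and writes the filled-in YAML to stdout.
# Expects AWS_ACCOUNT_ID, AWS_REGION, DB_ENDPOINT and DB_NAME to be set.
fill_argo_values() {
  sed -e "s|<AWS_ACCOUNT_ID>|${AWS_ACCOUNT_ID}|g" \
      -e "s|<AWS_REGION>|${AWS_REGION}|g" \
      -e "s|<DATABASE_INSTANCE_ENDPOINT>|${DB_ENDPOINT}|g" \
      -e "s|<DATABASE_NAME>|${DB_NAME}|g"
}

# Hypothetical helper: succeed only if no <PLACEHOLDER> token is left in file $1.
# Any remaining tokens are printed with their line numbers.
no_placeholders_left() {
  ! grep -n '<[A-Z_][A-Z_]*>' "$1"
}

# Usage:
#   fill_argo_values < argo-workflows.values.tpl.yml > argo-workflows.values.yml
#   no_placeholders_left argo-workflows.values.yml
```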

Deploy Argo Workflows

We target application version 3.7.1 using Helm chart version 0.45.23 (Artifact Hub):

helm upgrade --install argo-workflows argo/argo-workflows \
  --version 0.45.23 \
  --namespace argo-workflows \
  --values argo-workflows.values.yml
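
Before moving on, it can help to confirm that the Argo Workflows controller and server are running. The sketch below uses kubectl wait on every Deployment in the namespace, so it does not depend on the Deployment names the chart generates. wait_for_deployments is a hypothetical helper and the 300s default timeout is an arbitrary choice.

```shell
# Hypothetical helper: block until every Deployment in namespace $1 is Available,
# or fail once the timeout ($2, default 300s) is reached.
wait_for_deployments() {
  kubectl wait deployment --all \
    --for=condition=Available \
    --namespace "$1" \
    --timeout="${2:-300s}"
}

# Usage:
#   wait_for_deployments argo-workflows
```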

[Optional] Enable Prometheus Metrics

If you have a Prometheus Application running on your cluster, enable metrics using:

controller:
  serviceMonitor:
    enabled: true
server:
  serviceMonitor:
    enabled: true

Refer to the official Argo Workflows documentation for further configuration.

Install OpenMetadata/Collate

Create the Collate Namespace

kubectl create namespace collate

Kubernetes Service Account for Ingestion

kubectl create serviceaccount om-role -n collate

Annotate Service Account with IRSA Role

kubectl annotate serviceaccount -n collate om-role \
  eks.amazonaws.com/role-arn=arn:aws:iam::<AWS_ACCOUNT_ID>:role/collate-ingestion-role

Create Long-Lived API Token for the ServiceAccount

kubectl apply -n collate -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: om-role.service-account-token
  annotations:
    kubernetes.io/service-account.name: om-role
type: kubernetes.io/service-account-token
EOF
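
Kubernetes fills in the token field of this secret after it is created, and leaves it empty if the om-role service account does not exist. The following sketch reads the token back and performs a rough shape check. read_sa_token and looks_like_jwt are hypothetical helpers, and the check only counts the three dot-separated parts of a JWT; it does not validate the signature.

```shell
# Hypothetical helper: read and decode the token stored in the secret created above.
read_sa_token() {
  kubectl get secret om-role.service-account-token \
    --namespace collate \
    -o jsonpath='{.data.token}' | base64 --decode
}

# Hypothetical helper: succeed if $1 has the three dot-separated parts of a JWT.
looks_like_jwt() {
  case "$1" in
    *.*.*.*) return 1 ;;   # more than three parts
    ?*.?*.?*) return 0 ;;  # exactly three non-empty parts
    *) return 1 ;;
  esac
}

# Usage:
#   looks_like_jwt "$(read_sa_token)" && echo "token is populated"
```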

Configure Kubernetes Roles for the Service Account

Create a file om-argo-role.yml:

# om-argo-role.yml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: om-argo-role
  namespace: collate
rules:
  - verbs: [list, watch, create, update, patch, get, delete]
    apiGroups:
      - argoproj.io
    resources:
      - workflows
  - verbs: [list, watch, patch, get]
    apiGroups:
      - ''
    resources:
      - pods/log
      - pods
  - verbs: [list, watch, create, update, patch, get, delete]
    apiGroups:
      - argoproj.io
    resources:
      - cronworkflows
  - verbs: [create, patch]
    apiGroups:
      - argoproj.io
    resources:
      - workflowtaskresults

Apply the role and create the role binding:

kubectl apply -f om-argo-role.yml

kubectl create rolebinding om-argo-role-binding \
  --role=om-argo-role \
  --serviceaccount=collate:om-role \
  --namespace collate
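
To check that the binding took effect, kubectl auth can-i can impersonate the service account and ask the API server about each verb granted on workflows. This is a sketch under the assumption that your own user is allowed to impersonate service accounts (cluster administrators normally are); check_om_role_permissions is a hypothetical helper name.

```shell
# Hypothetical helper: print whether om-role may perform each verb on Argo workflows.
# `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no",
# so `|| true` keeps the loop going and reports every verb.
check_om_role_permissions() {
  for verb in get list watch create update patch delete; do
    printf '%s workflows: ' "$verb"
    kubectl auth can-i "$verb" workflows.argoproj.io \
      --as=system:serviceaccount:collate:om-role \
      --namespace collate || true
  done
}

# Usage:
#   check_om_role_permissions
```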

Install OpenMetadata Helm Chart

Create Kubernetes Secrets for the database connection:

kubectl create secret generic db-credentials \
  --from-literal=password=<DATABASE_PASSWORD> \
  --namespace collate
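
The openmetadata.values.yml file in this section also reads the search engine password from a secret named es-credentials (key password), which the command above does not create. A minimal sketch for creating it the same way follows. create_password_secret is a hypothetical helper name and <OPENSEARCH_PASSWORD> is a placeholder for your own value.

```shell
# Hypothetical helper: create a generic secret with a single `password` key
# in the collate namespace.
#   $1 = secret name, $2 = password value
create_password_secret() {
  kubectl create secret generic "$1" \
    --from-literal=password="$2" \
    --namespace collate
}

# Usage (placeholder value shown):
#   create_password_secret es-credentials '<OPENSEARCH_PASSWORD>'
```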

Add the Helm chart repository:

helm repo add open-metadata https://helm.open-metadata.org/
helm repo update

If you plan to use the DeltaLake connector, set ingestionImage in openmetadata.values.yml (the ARGO_INGESTION_IMAGE value) to the non-slim ingestion image: 118146679784.dkr.ecr.eu-west-1.amazonaws.com/collate-customers-ingestion-eu-west-1:om-1.12.8-cl-1.12.8

Create a file openmetadata.values.yml, replacing each ${...} placeholder (for example ${db_host}) with the value for your environment:

# openmetadata.values.yml
replicaCount: 1
resources:
  limits:
    cpu: 3000m
    memory: 12Gi
  requests:
    cpu: 1000m
    memory: 10Gi
openmetadata:
  config:
    elasticsearch:
      host: ${es_host}
      port: ${es_port}
      scheme: ${es_scheme}
      searchType: opensearch
      auth:
        enabled: true
        username: ${es_username}
        password:
          secretRef: es-credentials
          secretKey: password
    database:
      host: ${db_host}
      port: ${db_port}
      driverClass: org.postgresql.Driver
      dbScheme: postgresql
      auth:
        username: ${db_user}
        password:
          secretRef: db-credentials
          secretKey: password
      dbParams: "sslmode=require"
    pipelineServiceClientConfig:
      enabled: true
      type: "argoWorkflows"
      metadataApiEndpoint: "http://openmetadata:8585/api"
      argoWorkflows:
        namespace: collate
        serviceAccountName: om-role
        ingestionImage: "118146679784.dkr.ecr.eu-west-1.amazonaws.com/collate-customers-ingestion-slim-eu-west-1:om-1.12.8-cl-1.12.8"
        imagePullPolicy: "IfNotPresent"
        imagePullSecrets: "ecr-registry-creds"
        apiEndpoint: "http://argo-workflows-server.argo-workflows:2746"
image:
  repository: 118146679784.dkr.ecr.eu-west-1.amazonaws.com/collate-customers-eu-west-1
  tag: om-1.12.8-cl-1.12.8
  imagePullPolicy: IfNotPresent
imagePullSecrets:
  - name: ecr-registry-creds
extraEnvs:
  - name: ARGO_TOKEN
    valueFrom:
      secretKeyRef:
        name: "om-role.service-account-token"
        key: "token"
  - name: OPENMETADATA_HEAP_OPTS
    value: "-Xmx8G -Xms8G"
  - name: ASSET_UPLOADER_PROVIDER
    value: "s3"
  - name: ASSET_UPLOADER_MAX_FILE_SIZE
    value: "10485760"
  - name: ASSET_UPLOADER_S3_BUCKET_NAME
    value: "<S3_BUCKET_NAME>"
  - name: ASSET_UPLOADER_S3_REGION
    value: "<AWS_REGION>"
  - name: ASSET_UPLOADER_S3_PREFIX_PATH
    value: "assets/collate"
serviceAccount:
  name: "openmetadata"
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<AWS_ACCOUNT_ID>:role/collate-server-role

Install the Collate OpenMetadata Application:

helm upgrade --install openmetadata open-metadata/openmetadata \
  --values openmetadata.values.yml \
  --namespace collate

[Optional] Enable Prometheus Metrics

Collate Application exposes Prometheus metrics on port 8586. Enable the integration using:

serviceMonitor:
  enabled: true

Post Installation/Upgrade Steps

Configure ReIndexing

After installation or upgrade, configure ReIndexing from the OpenMetadata UI. For detailed steps, refer to the OpenMetadata upgrade documentation.

Environment Variables for Collate OpenMetadata Argo

Environment Variable	Description	Default Value	Required
`ARGO_IMAGE_PULL_SECRETS`	Image Pull Secret Name to pull Docker Images for Ingestion from a Private Registry. Multiple secrets can be supplied comma-separated.	Empty String	False
`ARGO_INGESTION_IMAGE`	Docker Image and Tag for Ingestion Images	`openmetadata/ingestion-base:1.4.3`	True
`ARGO_NAMESPACE`	Namespace in which Argo Workflows will be executed. Must match the namespace where OpenMetadata is deployed.	`argo`	True
`ARGO_SERVER_CERTIFICATE_PATH`	SSL Certificate Path to connect to Argo Server	Empty String	False
`ARGO_TEST_CONNECTION_BACKOFF_TIME`	Backoff retry time in seconds to test the connection	`5`	False
`ARGO_TOKEN`	JWT Token to authenticate with Argo Workflow API	Empty String	True
`ARGO_WORKFLOW_CPU_LIMIT`	Kubernetes CPU Limits for Argo Workflows created with Ingestion	`1000m`	False
`ARGO_WORKFLOW_CPU_REQUEST`	Kubernetes CPU Requests for Argo Workflows created with Ingestion	`200m`	False
`ARGO_WORKFLOW_CUSTOMER_TOLERATION`	Kubernetes Node Toleration to schedule Ingestion Workflow Pods to specific Nodes	`argo`	False
`ARGO_WORKFLOW_EXECUTOR_SERVICE_ACCOUNT_NAME`	Service Account Name to be used for Argo Workflows for Ingestion	`om-role`	True
`ARGO_WORKFLOW_MEMORY_LIMIT`	Kubernetes Memory Limits for Argo Workflows created with Ingestion	`4096Mi`	False
`ARGO_WORKFLOW_MEMORY_REQUEST`	Kubernetes Memory Requests for Argo Workflows created with Ingestion	`256Mi`	False
`ASSET_UPLOADER_ENABLE`	Enable Asset Upload Feature	`True`	False
`ASSET_UPLOADER_PROVIDER`	Asset Upload Provider Name. Can be `s3` or `azure`.	`s3`	False
`ASSET_UPLOADER_MAX_FILE_SIZE`	Max File Size to support for Asset Upload (in bytes)	`5242880`	False
`ASSET_UPLOADER_S3_BUCKET_NAME`	Asset Upload S3 Bucket Name	Empty String	False
`ASSET_UPLOADER_S3_REGION`	Asset Upload S3 Region	Empty String	False
`ASSET_UPLOADER_S3_PREFIX_PATH`	Asset Upload S3 Prefix Path	`assets/default`	False

Appendix: List of AWS ECR Public IPs

If your company policy blocks access to external resources, ensure the public IPs of AWS ECR are reachable from your cluster.

Using Terraform

data "aws_ip_ranges" "ip_ranges" {
  regions  = ["eu-west-1"]
  services = ["amazon"]
}

output "ireland_ip_ranges" {
  value = data.aws_ip_ranges.ip_ranges.cidr_blocks
}

Using curl and jq

curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | jq '.prefixes[] | select(.region=="eu-west-1") | select(.service=="AMAZON")'
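
The command above prints one JSON object per range. For firewall allow-lists a plain list of CIDR blocks is often easier to work with, so the sketch below narrows the same filter down to the ip_prefix field. amazon_cidrs_eu_west_1 is a hypothetical helper name and the function assumes jq is installed.

```shell
# Hypothetical helper: read ip-ranges.json on stdin and print one CIDR per line
# for the AMAZON service in eu-west-1.
amazon_cidrs_eu_west_1() {
  jq -r '.prefixes[]
         | select(.region == "eu-west-1" and .service == "AMAZON")
         | .ip_prefix'
}

# Usage:
#   curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | amazon_cidrs_eu_west_1
```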
