Guide to Deploy Collate Binaries in AWS
This guide will help you start using Collate Docker Images to run the OpenMetadata Application in Kubernetes on Amazon EKS, connecting with Argo Workflows for running ingestion from the OpenMetadata Application itself.
Architecture
Collate OpenMetadata requires 4 components:
- Collate Server
- Database: Collate Server stores the metadata in a relational database. We support MySQL or Postgres. Amazon RDS is recommended for production.
  - MySQL version 8.0.0 or greater
  - Postgres version 12.0 or greater
- Search Engine: we support the following.
  - Elasticsearch 9.3.0
  - OpenSearch 3.4 (Amazon OpenSearch Service recommended)
- Workflow Orchestration: We use Argo Workflows as the orchestrator for ingestion pipelines.
Sizing Requirements
Hardware Requirements
The cluster needs at least 1 control plane (master) node and 3 Worker Nodes. On Amazon EKS the control plane is managed by AWS, so only the Worker Nodes need to be provisioned. Each Worker Node should have at least:
- 4 vCPUs
- 16 GiB Memory
- 128 GiB Storage capacity
Software Requirements
- Collate OpenMetadata supports Kubernetes Cluster version 1.24 or greater.
- Collate Docker Images are available via private AWS Elastic Container Registry (ECR). The Collate Team will share credentials and steps to configure Kubernetes to pull Docker Images from AWS ECR.
- For Argo Workflows, Collate OpenMetadata is currently compatible with application version 3.4+.
Recommended AWS Instance Types
| Component | Instance Type |
|---|---|
| Collate Server | t4g.large / m6a.large |
| Argo Workflows runners | m7i.large |
Database Sizing and Capacity
Our recommendation is to configure Amazon RDS PostgreSQL. For 100,000 Data Assets and 1,000 Users:
- 8 vCPUs
- 64 GiB Memory
- 256 GiB Storage Capacity
- 3,500 IOPS storage
Search Client Sizing and Capacity
For 100,000 Data Assets and 1,000 Users:
- 8 vCPUs
- 64 GiB Memory
- 256 GiB Storage Capacity
Use Amazon OpenSearch Service for production. We recommend deploying the domain across multiple Availability Zones with a minimum of 2 nodes.
Argo Workflows Ingestion Runners
The recommended resources are 4 vCPUs and 16 GiB of Memory. Ingestion workloads can be scheduled on spot instances to reduce costs.
AWS Prerequisites
Enable EKS OIDC Provider
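EKS clusters use an OIDC provider to enable IAM Roles for Service Accounts (IRSA). Every EKS cluster has an OIDC issuer URL, but the matching IAM OIDC provider has to be registered separately. Look up the issuer URL, then list the providers already registered in IAM:
aws eks describe-cluster --name <CLUSTER_NAME> --query "cluster.identity.oidc.issuer" --output text
aws iam list-open-id-connect-providers
If no provider ARN in the list ends with the ID from the issuer URL, associate an OIDC provider: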
eksctl utils associate-iam-oidc-provider \
--cluster <CLUSTER_NAME> \
--region <AWS_REGION> \
--approve
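The check above can be wrapped in a small function so it can be reused in scripts. This is a minimal sketch, assuming the AWS CLI is already configured for the target account; `oidc_provider_exists` is a hypothetical helper name, and it compares the ID at the end of the issuer URL with the provider ARNs registered in IAM.

```shell
# Sketch only: oidc_provider_exists is a hypothetical helper, not an AWS tool.
# Returns 0 when an IAM OIDC provider matching the cluster's issuer exists,
# 1 when none matches, and 2 when the issuer could not be looked up.
oidc_provider_exists() {
  local cluster="$1" issuer oidc_id
  issuer="$(aws eks describe-cluster --name "$cluster" \
    --query "cluster.identity.oidc.issuer" --output text)" || return 2
  # The issuer URL ends in a unique ID; a registered provider ARN ends in the same ID.
  oidc_id="${issuer##*/}"
  [ -n "$oidc_id" ] || return 2
  aws iam list-open-id-connect-providers --output text | grep -q "$oidc_id"
}
```

Calling `oidc_provider_exists <CLUSTER_NAME>` and checking for exit status 1 tells you whether the `eksctl utils associate-iam-oidc-provider` step is still needed.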
Retrieve the OIDC issuer URL for use in subsequent steps:
OIDC_ISSUER=$(aws eks describe-cluster --name <CLUSTER_NAME> \
--query "cluster.identity.oidc.issuer" --output text | sed 's|https://||')
echo "OIDC Issuer: $OIDC_ISSUER"
Create an S3 Bucket for Argo Workflows Artifacts
Argo Workflows archives ingestion logs to S3:
aws s3 mb s3://collate-argo-artifacts-<AWS_REGION> --region <AWS_REGION>
Create IAM Roles for Service Accounts
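We need 4 IAM roles, one for each service account (Collate Server, Collate Ingestion, Argo Controller, Argo Server). The trust policies below expand the AWS_ACCOUNT_ID and OIDC_ISSUER shell variables. OIDC_ISSUER was set earlier; AWS_ACCOUNT_ID can be set with:
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)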
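The four trust policies in this section differ only in the namespace and service account name in the `sub` condition. As an illustration, the shared template could be rendered by one function. This is a sketch, not part of the Collate tooling: `render_trust_policy` is a hypothetical helper, and it assumes AWS_ACCOUNT_ID and OIDC_ISSUER are set in the current shell.

```shell
# Sketch only: render_trust_policy is a hypothetical helper.
# Usage: render_trust_policy <namespace> <service-account-name>
render_trust_policy() {
  local namespace="$1" service_account="$2"
  # Stop early if a variable is empty: the JSON would still be valid,
  # but the resulting trust policy would never match the service account.
  : "${AWS_ACCOUNT_ID:?AWS_ACCOUNT_ID must be set}"
  : "${OIDC_ISSUER:?OIDC_ISSUER must be set}"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ISSUER}:sub": "system:serviceaccount:${namespace}:${service_account}"
      }
    }
  }]
}
EOF
}
```

For example, `render_trust_policy argo-workflows argo-workflows-controller-sa > argo-controller-trust-policy.json` would produce the same file as the first heredoc below.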
IAM Role for Argo Workflows Controller
cat > argo-controller-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_ISSUER}:sub": "system:serviceaccount:argo-workflows:argo-workflows-controller-sa"
}
}
}]
}
EOF
aws iam create-role \
--role-name collate-argo-controller-role \
--assume-role-policy-document file://argo-controller-trust-policy.json
aws iam attach-role-policy \
--role-name collate-argo-controller-role \
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
IAM Role for Argo Workflows Server
cat > argo-server-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_ISSUER}:sub": "system:serviceaccount:argo-workflows:argo-workflows-server-sa"
}
}
}]
}
EOF
aws iam create-role \
--role-name collate-argo-server-role \
--assume-role-policy-document file://argo-server-trust-policy.json
aws iam attach-role-policy \
--role-name collate-argo-server-role \
--policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
IAM Role for Collate Server Application
cat > collate-server-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_ISSUER}:sub": "system:serviceaccount:collate:openmetadata"
}
}
}]
}
EOF
aws iam create-role \
--role-name collate-server-role \
--assume-role-policy-document file://collate-server-trust-policy.json
aws iam attach-role-policy \
--role-name collate-server-role \
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
IAM Role for Collate Ingestion
cat > collate-ingestion-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_ISSUER}:sub": "system:serviceaccount:collate:om-role"
}
}
}]
}
EOF
aws iam create-role \
--role-name collate-ingestion-role \
--assume-role-policy-document file://collate-ingestion-trust-policy.json
aws iam attach-role-policy \
--role-name collate-ingestion-role \
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
Setup AWS ECR
Collate will provide the credentials to pull Docker Images from a private registry located in AWS ECR.
Install AWS CLI
Follow the AWS CLI installation guide to install the AWS CLI on your machine, then configure a named profile for the Collate credentials:
aws configure --profile ecr-collate
The command will prompt for credentials. The Collate team will securely share these via a 1Password link.
Confirm the credentials are correctly set:
aws configure list --profile ecr-collate
Kubernetes Docker Registry Secrets for AWS ECR
kubectl create secret docker-registry ecr-registry-creds \
--docker-server=118146679784.dkr.ecr.eu-west-1.amazonaws.com \
--docker-username=AWS \
--docker-password=$(aws ecr get-login-password --profile ecr-collate --region eu-west-1) \
--namespace <NAMESPACE_NAME>
Replace <NAMESPACE_NAME> with the namespace where you want to deploy Collate OpenMetadata Server. If the namespace does not exist yet, create it first with kubectl create namespace <NAMESPACE_NAME>.
AWS ECR Token Refresh
ECR tokens expire after 12 hours. If a pod is rescheduled to another node after 12 hours, you will get an ImagePullBackOff error. Delete the secret and recreate it using the command above.
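As an alternative to deleting and recreating the secret by hand, the refresh can be done in place. The following is a minimal sketch, assuming the `ecr-collate` profile and the `ecr-registry-creds` secret name used in this guide; `refresh_ecr_secret` is a hypothetical helper name. It renders the Secret manifest locally and applies it, so the secret is updated rather than removed.

```shell
# Sketch only: refresh_ecr_secret is a hypothetical helper.
# Usage: refresh_ecr_secret <namespace>
refresh_ecr_secret() {
  local namespace="$1"
  local registry="118146679784.dkr.ecr.eu-west-1.amazonaws.com"
  local token
  token="$(aws ecr get-login-password --profile ecr-collate --region eu-west-1)" || return 1
  # --dry-run=client only renders the manifest; kubectl apply then creates
  # the secret or updates the existing one with the fresh token.
  kubectl create secret docker-registry ecr-registry-creds \
    --docker-server="$registry" \
    --docker-username=AWS \
    --docker-password="$token" \
    --namespace "$namespace" \
    --dry-run=client -o yaml | kubectl apply --namespace "$namespace" -f -
}
```

Running a function like this from a scheduler at an interval shorter than 12 hours is one way to keep the token valid; how it is scheduled is left to your environment.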
Install Argo Workflows
Add Helm Repository
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
Create the Argo Namespace
kubectl create namespace argo-workflows
Kubernetes Secret for Argo Workflows DB Credentials
kubectl create secret generic argo-db-credentials \
--from-literal=username=<DB_USERNAME> \
--from-literal=password=<DB_PASSWORD> \
--namespace argo-workflows
Create Custom Helm Values for Argo Workflows
Create a file named argo-workflows.values.yml:
# argo-workflows.values.yml
controller:
  serviceAccount:
    create: true
    name: argo-workflows-controller-sa
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<AWS_ACCOUNT_ID>:role/collate-argo-controller-role"
  name: workflow-controller
  workflowDefaults:
    spec:
      serviceAccountName: om-role
  # The argo-workflows chart reads persistence settings from controller.persistence
  persistence:
    archive: true
    postgresql:
      host: <DATABASE_INSTANCE_ENDPOINT>
      database: <DATABASE_NAME>
      tableName: argo_workflows
      userNameSecret:
        name: argo-db-credentials
        key: username
      passwordSecret:
        name: argo-db-credentials
        key: password
      ssl: true
      sslMode: require
server:
  serviceAccount:
    create: true
    name: argo-workflows-server-sa
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<AWS_ACCOUNT_ID>:role/collate-argo-server-role"
  extraArgs:
    - "--auth-mode=server"
    - "--request-timeout=5m"
useDefaultArtifactRepo: true
useStaticCredentials: false
artifactRepository:
  archiveLogs: true
  s3:
    endpoint: s3.amazonaws.com
    bucket: collate-argo-artifacts-<AWS_REGION>
    keyFormat: 'workflows/{{workflow.namespace}}/{{workflow.name}}/{{pod.name}}'
    insecure: false
    region: <AWS_REGION>
    encryptionOptions:
      enableEncryption: true
For further customisation, refer to the community Helm chart values.
Deploy Argo Workflows
We target application version 3.7.1 using Helm chart version 0.45.23 (Artifact Hub):
helm upgrade --install argo-workflows argo/argo-workflows \
--version 0.45.23 \
--namespace argo-workflows \
--values argo-workflows.values.yml
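After the Helm release is installed, it can be useful to wait for both Argo components to become ready before continuing. This is a minimal sketch: `wait_for_argo` is a hypothetical helper, and the deployment names are an assumption derived from the release name `argo-workflows` and the `controller.name` value `workflow-controller`. Confirm them with `kubectl get deployments -n argo-workflows` if the rollout check reports a missing resource.

```shell
# Sketch only: wait_for_argo is a hypothetical helper.
# Deployment names are assumed from the release name and controller.name above.
wait_for_argo() {
  local ns="argo-workflows" deploy
  for deploy in argo-workflows-workflow-controller argo-workflows-server; do
    # Blocks until the deployment is rolled out or the timeout expires.
    kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=300s || return 1
  done
}
```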
[Optional] Enable Prometheus Metrics
If you have a Prometheus Application running on your cluster, enable metrics using:
controller:
  serviceMonitor:
    enabled: true
server:
  serviceMonitor:
    enabled: true
Refer to the official Argo Workflows documentation for further configuration.
Create the Collate Namespace
kubectl create namespace collate
Kubernetes Service Account for Ingestion
kubectl create serviceaccount om-role -n collate
Annotate Service Account with IRSA Role
kubectl annotate serviceaccount -n collate om-role \
eks.amazonaws.com/role-arn=arn:aws:iam::<AWS_ACCOUNT_ID>:role/collate-ingestion-role
Create Long-Lived API Token for the ServiceAccount
kubectl apply -n collate -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: om-role.service-account-token
  annotations:
    kubernetes.io/service-account.name: om-role
type: kubernetes.io/service-account-token
EOF
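Kubernetes fills in the `token` key of a `kubernetes.io/service-account-token` Secret shortly after the Secret is created, and the Collate Server later reads that key as ARGO_TOKEN. A quick check that the token has been populated could look like the sketch below; `sa_token_ready` is a hypothetical helper name.

```shell
# Sketch only: sa_token_ready is a hypothetical helper.
# Succeeds once the Secret created above carries a non-empty "token" key.
sa_token_ready() {
  local token
  token="$(kubectl get secret om-role.service-account-token \
    --namespace collate -o jsonpath='{.data.token}')" || return 1
  [ -n "$token" ]
}
```

If the function keeps failing, check that the `om-role` service account exists in the `collate` namespace, since the token is only issued for an existing service account.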
Create a file om-argo-role.yml:
# om-argo-role.yml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: om-argo-role
  namespace: collate
rules:
  - verbs: [list, watch, create, update, patch, get, delete]
    apiGroups:
      - argoproj.io
    resources:
      - workflows
  - verbs: [list, watch, patch, get]
    apiGroups:
      - ''
    resources:
      - pods/log
      - pods
  - verbs: [list, watch, create, update, patch, get, delete]
    apiGroups:
      - argoproj.io
    resources:
      - cronworkflows
  - verbs: [create, patch]
    apiGroups:
      - argoproj.io
    resources:
      - workflowtaskresults
Apply the role and create the role binding:
kubectl apply -f om-argo-role.yml
kubectl create rolebinding om-argo-role-binding \
--role=om-argo-role \
--serviceaccount=collate:om-role \
--namespace collate
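To confirm that the role binding took effect, the permissions can be checked with `kubectl auth can-i` while impersonating the service account. This is a minimal sketch: `check_om_role_rbac` is a hypothetical helper, it samples only a few of the verbs from the role, and it assumes your own user is allowed to impersonate service accounts (cluster administrators usually are).

```shell
# Sketch only: check_om_role_rbac is a hypothetical helper.
# Prints one line per sampled permission and returns 1 if any is denied.
check_om_role_rbac() {
  local sa="system:serviceaccount:collate:om-role" failed=0 check verb resource
  for check in "create workflows.argoproj.io" \
               "create cronworkflows.argoproj.io" \
               "create workflowtaskresults.argoproj.io" \
               "get pods"; do
    read -r verb resource <<<"$check"
    # kubectl auth can-i exits 0 for "yes" and non-zero for "no".
    if kubectl auth can-i "$verb" "$resource" --as="$sa" -n collate >/dev/null; then
      echo "ok: $verb $resource"
    else
      echo "DENIED: $verb $resource"
      failed=1
    fi
  done
  return "$failed"
}
```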
Create Kubernetes Secrets for the database connection:
kubectl create secret generic db-credentials \
--from-literal=password=<DATABASE_PASSWORD> \
--namespace collate
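The openmetadata.values.yml file in this guide also references a secret named `es-credentials` (key `password`) for the search engine, which none of the steps create. A minimal sketch for creating it follows; `create_es_secret` is a hypothetical helper, and prompting for the password instead of passing it on the command line is a choice made here to keep it out of shell history.

```shell
# Sketch only: create_es_secret is a hypothetical helper.
# The secret name and key match the secretRef/secretKey in openmetadata.values.yml.
create_es_secret() {
  local es_password
  # Read the password without echoing it to the terminal.
  read -rs -p "Search engine password: " es_password
  echo
  kubectl create secret generic es-credentials \
    --from-literal=password="$es_password" \
    --namespace collate
}
```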
Add the Helm chart repository:
helm repo add open-metadata https://helm.open-metadata.org/
helm repo update
If you plan to use the DeltaLake connector, the ARGO_INGESTION_IMAGE value should be:
118146679784.dkr.ecr.eu-west-1.amazonaws.com/collate-customers-ingestion-eu-west-1:om-1.12.3-cl-1.12.3
Create a file openmetadata.values.yml:
# openmetadata.values.yml
replicaCount: 1
openmetadata:
  config:
    elasticsearch:
      host: ${es_host}
      port: ${es_port}
      scheme: ${es_scheme}
      searchType: opensearch
      auth:
        enabled: true
        username: ${es_username}
        password:
          secretRef: es-credentials
          secretKey: password
    database:
      host: ${db_host}
      port: ${db_port}
      driverClass: org.postgresql.Driver
      dbScheme: postgresql
      auth:
        username: ${db_user}
        password:
          secretRef: db-credentials
          secretKey: password
      dbParams: "allowPublicKeyRetrieval=true&useSSL=true&serverTimezone=UTC"
    pipelineServiceClientConfig:
      className: "io.collate.pipeline.argo.ArgoServiceClient"
      apiEndpoint: "http://argo-workflows-server.argo-workflows:2746"
      metadataApiEndpoint: "http://openmetadata:8585/api"
      auth:
        enabled: false
image:
  repository: 118146679784.dkr.ecr.eu-west-1.amazonaws.com/collate-customers-eu-west-1
  tag: om-1.12.3-cl-1.12.3
  imagePullPolicy: IfNotPresent
imagePullSecrets:
  - name: ecr-registry-creds
extraEnvs:
  - name: ARGO_NAMESPACE
    value: collate
  - name: ARGO_TOKEN
    valueFrom:
      secretKeyRef:
        name: "om-role.service-account-token"
        key: "token"
  - name: ARGO_INGESTION_IMAGE
    value: "118146679784.dkr.ecr.eu-west-1.amazonaws.com/collate-customers-ingestion-slim-eu-west-1:om-1.12.3-cl-1.12.3"
  - name: ARGO_WORKFLOW_EXECUTOR_SERVICE_ACCOUNT_NAME
    value: om-role
  - name: ARGO_IMAGE_PULL_SECRETS
    value: ecr-registry-creds
  - name: ASSET_UPLOADER_PROVIDER
    value: "s3"
  - name: ASSET_UPLOADER_MAX_FILE_SIZE
    value: "10485760"
  - name: ASSET_UPLOADER_S3_BUCKET_NAME
    value: "<S3_BUCKET_NAME>"
  - name: ASSET_UPLOADER_S3_REGION
    value: "<AWS_REGION>"
  - name: ASSET_UPLOADER_S3_PREFIX_PATH
    value: "assets/collate"
serviceAccount:
  name: "openmetadata"
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<AWS_ACCOUNT_ID>:role/collate-server-role
Install the Collate OpenMetadata Application:
helm upgrade --install openmetadata open-metadata/openmetadata \
--values openmetadata.values.yml \
--namespace collate
[Optional] Enable Prometheus Metrics
Collate Application exposes Prometheus metrics on port 8586. Enable the integration using:
serviceMonitor:
  enabled: true
Post Installation/Upgrade Steps
After installation or upgrade, configure ReIndexing from the OpenMetadata UI. For detailed steps, refer to the OpenMetadata upgrade documentation.
| Environment Name | Description | Default Value | Required |
|---|---|---|---|
| ARGO_IMAGE_PULL_SECRETS | Image Pull Secret Name to pull Docker Images for Ingestion from a Private Registry. Multiple secrets can be supplied comma-separated. | Empty String | False |
| ARGO_INGESTION_IMAGE | Docker Image and Tag for Ingestion Images | openmetadata/ingestion-base:1.4.3 | True |
| ARGO_NAMESPACE | Namespace in which Argo Workflows will be executed. Must match the namespace where OpenMetadata is deployed. | argo | True |
| ARGO_SERVER_CERTIFICATE_PATH | SSL Certificate Path to connect to Argo Server | Empty String | False |
| ARGO_TEST_CONNECTION_BACKOFF_TIME | Backoff retry time in seconds to test the connection | 5 | False |
| ARGO_TOKEN | JWT Token to authenticate with Argo Workflow API | Empty String | True |
| ARGO_WORKFLOW_CPU_LIMIT | Kubernetes CPU Limits for Argo Workflows created with Ingestion | 1000m | False |
| ARGO_WORKFLOW_CPU_REQUEST | Kubernetes CPU Requests for Argo Workflows created with Ingestion | 200m | False |
| ARGO_WORKFLOW_CUSTOMER_TOLERATION | Kubernetes Node Toleration to schedule Ingestion Workflow Pods to specific Nodes | argo | False |
| ARGO_WORKFLOW_EXECUTOR_SERVICE_ACCOUNT_NAME | Service Account Name to be used for Argo Workflows for Ingestion | om-role | True |
| ARGO_WORKFLOW_MEMORY_LIMIT | Kubernetes Memory Limits for Argo Workflows created with Ingestion | 4096Mi | False |
| ARGO_WORKFLOW_MEMORY_REQUEST | Kubernetes Memory Requests for Argo Workflows created with Ingestion | 256Mi | False |
| ASSET_UPLOADER_ENABLE | Enable Asset Upload Feature | True | False |
| ASSET_UPLOADER_PROVIDER | Asset Upload Provider Name. Can be s3 or azure. | s3 | False |
| ASSET_UPLOADER_MAX_FILE_SIZE | Max File Size to support for Asset Upload (in bytes) | 5242880 | False |
| ASSET_UPLOADER_S3_BUCKET_NAME | Asset Upload S3 Bucket Name | Empty String | False |
| ASSET_UPLOADER_S3_REGION | Asset Upload S3 Region | Empty String | False |
| ASSET_UPLOADER_S3_PREFIX_PATH | Asset Upload S3 Prefix Path | assets/default | False |
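ASSET_UPLOADER_MAX_FILE_SIZE is expressed in bytes, which is easy to mistype. As a small worked example, the conversion from MiB can be done with shell arithmetic; `mib_to_bytes` is a hypothetical helper name used only for illustration.

```shell
# Sketch only: mib_to_bytes is a hypothetical helper.
# Converts a size in MiB to the byte value the variable expects.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 5    # prints 5242880, the default listed in the table
mib_to_bytes 10   # prints 10485760, the value used in openmetadata.values.yml
```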
Appendix: List of AWS ECR Public IPs
If your company policy blocks access to external resources, ensure the public IPs of AWS ECR are reachable from your cluster.
Using Terraform
data "aws_ip_ranges" "ip_ranges" {
  regions  = ["eu-west-1"]
  services = ["amazon"]
}
output "ireland_ip_ranges" {
  value = data.aws_ip_ranges.ip_ranges.cidr_blocks
}
Using curl and jq
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | jq '.prefixes[] | select(.region=="eu-west-1") | select(.service=="AMAZON")'