
Troubleshooting

Use this section to diagnose the most common issues after deployment. For each symptom, work through the steps in order.

Could Not Get the Secret Value or Forbidden

ERROR (metadata.utils.kubernetes_secrets_manager:159) - Could not get the secret value of <path>
Reason: Forbidden
Your ingestion pod’s service account doesn’t have permission to read the secret. Work through the causes below to find the root issue. Before you start, confirm which secrets manager your Runner is using by reviewing the pod logs:
kubectl logs -l app.kubernetes.io/name=hybrid-ingestion-runner,app.kubernetes.io/instance=collate-prod | grep secretsManager

Cause 1 — Missing IAM or Workload Identity

Your ingestion service account isn’t bound to the correct IAM (Identity and Access Management) role or Workload Identity. Run the following checks:
  • Verify the annotation on the ingestion service account.
  • Confirm the cloud IAM binding is in place for your provider.
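The annotation check above can be sketched as a small shell helper. This is an illustration only: has_identity_annotation is a hypothetical name, the service account and namespace are placeholders, and the three annotation keys are the common workload-identity conventions for EKS, GKE, and AKS, which may not match how your cluster binds identities.

```shell
# has_identity_annotation: read a service account's annotations on stdin and
# succeed if one of the common workload-identity annotation keys is present.
# Assumption: your provider uses one of these three keys.
has_identity_annotation() {
  grep -Eq 'eks\.amazonaws\.com/role-arn|iam\.gke\.io/gcp-service-account|azure\.workload\.identity/client-id'
}

# Usage against a cluster (placeholders, not names from this guide):
#   kubectl get serviceaccount <ingestion-sa> -n <namespace> \
#     -o jsonpath='{.metadata.annotations}' | has_identity_annotation \
#     && echo "annotation present" || echo "annotation missing"
```

A present annotation only shows the Kubernetes side of the binding; the cloud-side IAM binding still has to be confirmed in your provider’s console or CLI.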

Cause 2 — Secret Name Mismatch

The name you entered in the Collate UI doesn’t match the name under which the secret is stored in your secrets store. When you enter secret:my-db-password in the Collate UI, the Runner strips the secret: prefix and looks up my-db-password directly in your secrets store. If the secret is stored under a different name (for example, with a path prefix such as /collate/hybrid-ingestion-runner/my-db-password), the lookup fails because the Runner searches for my-db-password, not the full path. Run the following checks:
  • Open your secrets store (AWS Secrets Manager, Azure Key Vault, or GCP Secret Manager) and confirm the exact name the secret is stored under.
  • In the Collate UI connection form, verify the masked field contains secret:<secret-name>, where <secret-name> matches the name in your secrets store character for character.
  • Check for typos, extra slashes, or path segments that aren’t part of the stored secret name.
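The lookup behaviour described above can be illustrated with a short shell sketch. It only mimics the prefix stripping and exact-match comparison; it is not the Runner’s code, and resolve_secret_name and names_match are hypothetical helper names.

```shell
# resolve_secret_name: strip the "secret:" prefix from a UI field value,
# mirroring the lookup name the Runner is described as using.
resolve_secret_name() {
  printf '%s\n' "${1#secret:}"
}

# names_match: succeed only if the resolved name equals the stored name exactly.
names_match() {
  [ "$(resolve_secret_name "$1")" = "$2" ]
}

resolve_secret_name "secret:my-db-password"   # prints my-db-password

# A stored name with a path prefix does not match the resolved name.
if ! names_match "secret:my-db-password" "/collate/hybrid-ingestion-runner/my-db-password"; then
  echo "mismatch: stored name includes a path prefix"
fi
```

Comparing the resolved name against the exact name shown in your secrets store makes stray slashes or path segments easy to spot.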

Cause 3 — Missing secretsManager Helm Value

To use a cloud secrets manager, set config.secretsManager explicitly in your values.yaml. Without it, the Runner falls back to Kubernetes Secrets and can’t resolve cloud secrets manager paths. Follow these steps:
  1. Open your values.yaml.
  2. Confirm config.secretsManager is set to the correct value for your provider (managed-aws, gcp, or managed-azure-kv).
  3. Run helm upgrade to apply the change.
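Steps 1 and 2 can be roughly automated before running the upgrade. The helper below is a sketch under stated assumptions: check_secrets_manager is a hypothetical name, it uses a simple text match rather than a YAML parser, and it does not verify that the key is nested under config.

```shell
# check_secrets_manager FILE: read the first secretsManager value in a values
# file and succeed only if it is one of the provider values listed above.
# Assumption: the key appears on a single line as "secretsManager: <value>".
check_secrets_manager() {
  value=$(sed -n 's/^[[:space:]]*secretsManager:[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1" \
    | head -n 1 | tr -d "\"'")
  case "$value" in
    managed-aws|gcp|managed-azure-kv)
      echo "secretsManager=$value" ;;
    "")
      echo "secretsManager is not set"; return 1 ;;
    *)
      echo "unexpected secretsManager value: $value"; return 1 ;;
  esac
}

# Usage (release and chart names are placeholders):
#   check_secrets_manager values.yaml && helm upgrade <release> <chart> -f values.yaml
```

Gating helm upgrade on the check avoids rolling out a release that would silently fall back to Kubernetes Secrets.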

Runner Shows as Inactive in the Collate UI

  • Check that the authToken in values.yaml is a valid, unexpired JWT issued for the IngestionBot.
  • Verify that outbound HTTPS traffic (TCP port 443) is allowed from your cluster to <your-instance>.getcollate.io.
  • Confirm the pod is running: kubectl get pods.
  • Check the Runner pod logs for connection or authentication errors:
    kubectl logs -l app.kubernetes.io/name=hybrid-ingestion-runner,app.kubernetes.io/instance=collate-prod
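To check the token expiry mentioned in the first bullet, you can decode the JWT payload locally. The sketch below assumes the authToken is a standard three-segment JWT that carries an exp claim and that a base64 command supporting -d is available; jwt_exp is a hypothetical helper name, and the function does not validate the signature.

```shell
# jwt_exp TOKEN: print the "exp" claim (Unix seconds) from a JWT payload.
jwt_exp() {
  # Take the second dot-separated segment and map base64url to base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d \
    | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Usage (TOKEN holds the authToken value from values.yaml):
#   if [ "$(jwt_exp "$TOKEN")" -gt "$(date +%s)" ]; then
#     echo "token not yet expired"
#   else
#     echo "token expired"
#   fi
```

An unexpired token can still be the wrong one, so this only rules out expiry as the cause.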
    

ImagePullBackOff on the Runner Pod

The ECR credentials cron job may not have run yet. Trigger it manually:
kubectl create job --from=cronjob/ecr-registry-helper manual

Ingestion Pod Not Found — Diagnostics Unavailable

Issue: The ingestion job fails and the exit handler reports:
Could not retrieve pod diagnostics (pod may be deleted, missing RBAC permissions, or other Kubernetes errors).

WARNING - No main pod found for workflow <workflow-id>
WARNING - Could not find main pod for workflow <workflow-id> - skipping diagnostics
Cause: The ingestion pod was removed before the exit handler could retrieve diagnostics. This is different from an application crash: a crashed or OOM-killed pod remains in Errored or OOMKilled state. An absent pod, or a pod in ContainerStatusUnknown state, indicates that the pod was removed externally, typically by one of the following:
  • Cluster autoscaling scaled down the node running the ingestion pod.
  • A pod cleanup policy or TTL controller removed the pod.
  • The node was rotated or replaced during the ingestion run.
Resolution:
  1. Check the pod state immediately after the next failure:
    kubectl get pods -n argo-workflows
    kubectl describe pod <ingestion-pod-name> -n argo-workflows
    
  2. Review cluster events around the time of failure:
    kubectl get events -n argo-workflows --sort-by='.lastTimestamp'
    
  3. Once you have identified the cause, work with your infrastructure team to address it, for example by configuring scale-down protection for ingestion workloads or excluding the argo-workflows namespace from pod cleanup policies.
If the pod is already absent when you run the commands in step 1, an external process removed it before Argo’s configured TTL elapsed. Check your ARGO_SECONDS_AFTER_COMPLETION_TTL setting to confirm the expected retention window.
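When reviewing the events from step 2, it can help to narrow the output to reasons that usually accompany external pod removal. The filter below is a sketch: filter_removal_events is a hypothetical name, and the reason keywords are common Kubernetes and cluster-autoscaler event reasons, not an exhaustive or Collate-specific list.

```shell
# filter_removal_events: read `kubectl get events` output on stdin and keep
# lines mentioning reasons commonly tied to external pod removal.
# Assumption: the keywords below cover the removal paths in your cluster.
filter_removal_events() {
  grep -E 'ScaleDown|Evict|Preempt|NodeNotReady|Killing'
}

# Usage:
#   kubectl get events -n argo-workflows --sort-by='.lastTimestamp' | filter_removal_events
```

No matching lines does not rule out external removal, because events expire after a retention period set on the cluster.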