dbt Artifact Storage: Google Cloud Storage Configuration

This guide walks you through configuring Google Cloud Storage (GCS) as the artifact storage layer for the dbt Core + Collate integration. It is a natural fit for deployments already running on Google Cloud Platform.

Prerequisites Checklist

| Requirement | Details | How to Verify |
|---|---|---|
| GCP Account | With permissions to create GCS buckets | gcloud auth list |
| gcloud CLI | Installed and configured | gcloud --version |
| dbt Project | Existing dbt project | dbt debug |
| Orchestration | Cloud Composer or Airflow | Access to DAG configuration |
| Database Service | Data warehouse already ingested | Check Settings → Services |
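To run the verification commands from the checklist in one pass (the dbt project path is a placeholder; adjust it to your own):
# Pre-flight checks
gcloud auth list        # an active GCP account should be listed
gcloud --version        # the Cloud SDK should be installed
cd /path/to/your/dbt/project && dbt debug   # dbt should reach your warehouse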

Step 1: GCS Setup

1.1 Create GCS Bucket

# Set your variables
export GCP_PROJECT="your-gcp-project-id"
export BUCKET_NAME="your-company-dbt-artifacts"
export REGION="us-central1"

# Set active project
gcloud config set project ${GCP_PROJECT}

# Create the bucket
gsutil mb -p ${GCP_PROJECT} -c STANDARD -l ${REGION} gs://${BUCKET_NAME}

# Verify bucket creation
gsutil ls | grep ${BUCKET_NAME}
Expected output:
gs://your-company-dbt-artifacts/

1.2 Create Service Account for dbt (Write Access)

Your dbt environment needs permission to write to GCS.
# Create service account for dbt
gcloud iam service-accounts create dbt-artifacts-writer \
    --display-name="dbt Artifacts Writer" \
    --project=${GCP_PROJECT}

# Grant Storage Object Creator role
gsutil iam ch \
    serviceAccount:dbt-artifacts-writer@${GCP_PROJECT}.iam.gserviceaccount.com:roles/storage.objectCreator \
    gs://${BUCKET_NAME}

# Create and download service account key
gcloud iam service-accounts keys create ~/dbt-sa-key.json \
    --iam-account=dbt-artifacts-writer@${GCP_PROJECT}.iam.gserviceaccount.com

echo "✓ Service account key saved to: ~/dbt-sa-key.json"
Store the service account key securely. Never commit it to version control.
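A minimal way to lock the key down locally (a sketch; adapt to however you manage secrets):
# Restrict the key file to your user only
chmod 600 ~/dbt-sa-key.json

# Keep key files out of version control (pattern assumes the naming used above)
echo "*-sa-key.json" >> .gitignore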

1.3 Create Service Account for Collate (Read Access)

Collate needs permission to read from GCS.
# Create service account for Collate
gcloud iam service-accounts create collate-dbt-reader \
    --display-name="Collate dbt Reader" \
    --project=${GCP_PROJECT}

# Grant Storage Object Viewer role (read-only)
gsutil iam ch \
    serviceAccount:collate-dbt-reader@${GCP_PROJECT}.iam.gserviceaccount.com:roles/storage.objectViewer \
    gs://${BUCKET_NAME}

# Create and download service account key
gcloud iam service-accounts keys create ~/collate-sa-key.json \
    --iam-account=collate-dbt-reader@${GCP_PROJECT}.iam.gserviceaccount.com

echo "✓ Service account key saved to: ~/collate-sa-key.json"

1.4 (Alternative) Use Workload Identity on GKE

If running on GKE, use Workload Identity instead of service account keys:
# Create GCP Service Account
gcloud iam service-accounts create dbt-workload-identity \
    --project=${GCP_PROJECT}

# Grant bucket access
gsutil iam ch \
    serviceAccount:dbt-workload-identity@${GCP_PROJECT}.iam.gserviceaccount.com:roles/storage.objectCreator \
    gs://${BUCKET_NAME}

# Bind Kubernetes Service Account to GCP Service Account
gcloud iam service-accounts add-iam-policy-binding \
    dbt-workload-identity@${GCP_PROJECT}.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:${GCP_PROJECT}.svc.id.goog[namespace/k8s-sa-name]"
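Workload Identity also requires the Kubernetes ServiceAccount to be annotated with the GCP service account it impersonates; a sketch using the same namespace and k8s-sa-name placeholders as the binding above:
# Annotate the Kubernetes Service Account so pods using it act as the GCP SA
kubectl annotate serviceaccount k8s-sa-name \
    --namespace namespace \
    iam.gke.io/gcp-service-account=dbt-workload-identity@${GCP_PROJECT}.iam.gserviceaccount.com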

1.5 Verify GCS Access

# Set credentials
export GOOGLE_APPLICATION_CREDENTIALS=~/dbt-sa-key.json

# Create test file
echo "test" > /tmp/test.txt

# Upload to GCS
gsutil cp /tmp/test.txt gs://${BUCKET_NAME}/dbt/test.txt

# Verify it exists
gsutil ls gs://${BUCKET_NAME}/dbt/

# Clean up
gsutil rm gs://${BUCKET_NAME}/dbt/test.txt
rm /tmp/test.txt
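Optionally, confirm the Collate service account really is read-only (a sketch; assumes gsutil uses gcloud credentials, the default for Cloud SDK installs):
# Switch to the read-only Collate service account
gcloud auth activate-service-account --key-file ~/collate-sa-key.json

# Listing should succeed
gsutil ls gs://${BUCKET_NAME}/

# Writing should be rejected with AccessDeniedException
echo "test" | gsutil cp - gs://${BUCKET_NAME}/dbt/should-fail.txt

# Switch back to your own account afterwards (replace with your account email)
gcloud config set account your-account@example.com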

Step 2: Upload Artifacts from dbt

2.1 Understanding dbt Artifacts

Collate requires these dbt-generated files:
| File | Generated By | Required? | What It Contains |
|---|---|---|---|
| manifest.json | dbt run, dbt compile, dbt build | Yes | Models, sources, lineage, descriptions, tests |
| catalog.json | dbt docs generate | Recommended | Column names, types, descriptions |
| run_results.json | dbt run, dbt test, dbt build | Optional | Test pass/fail results, timing |
Generate all artifacts:
dbt run           # Generates manifest.json
dbt test          # Updates run_results.json
dbt docs generate # Generates catalog.json
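Before wiring up the upload, confirm the files actually landed in the target/ directory (run from the dbt project root):
ls -lh target/manifest.json target/catalog.json target/run_results.json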

2.2 Complete Cloud Composer DAG

This is a complete, working DAG for Cloud Composer or GKE-based Airflow. Save as dbt_with_gcs.py in your Cloud Composer DAGs folder:
"""
dbt + Collate Integration DAG (GCS Method)

This DAG:
1. Runs dbt models
2. Runs dbt tests
3. Generates dbt documentation (catalog.json)
4. Uploads all artifacts to Google Cloud Storage

Perfect for Cloud Composer or GKE deployments.
"""

import os
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup


# =============================================================================
# CONFIGURATION
# =============================================================================

# dbt Configuration
DBT_PROJECT_DIR = os.getenv("DBT_PROJECT_DIR", "/home/airflow/gcs/dbt/my_project")
DBT_PROFILES_DIR = os.getenv("DBT_PROFILES_DIR", "/home/airflow/gcs/dbt")

# GCS Configuration
GCS_BUCKET = os.getenv("GCS_BUCKET", "your-company-dbt-artifacts")
GCS_PREFIX = os.getenv("GCS_PREFIX", "dbt")
GCP_PROJECT = os.getenv("GCP_PROJECT", "your-gcp-project")

# Service Account (if not using Workload Identity)
GOOGLE_APPLICATION_CREDENTIALS = os.getenv(
    "GOOGLE_APPLICATION_CREDENTIALS",
    "/home/airflow/gcs/dbt-sa-key.json"
)

# =============================================================================
# DAG DEFAULT ARGUMENTS
# =============================================================================

default_args = {
    "owner": "data-engineering",
    "depends_on_past": False,
    "email": ["data-team@yourcompany.com"],
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(hours=2),
}

# =============================================================================
# PYTHON FUNCTIONS
# =============================================================================

def upload_artifacts_to_gcs(**context):
    """
    Upload dbt artifacts to Google Cloud Storage.

    Uses google-cloud-storage library (pre-installed in Cloud Composer).
    For self-hosted: pip install google-cloud-storage
    """
    from google.cloud import storage

    # Initialize GCS client
    if os.path.exists(GOOGLE_APPLICATION_CREDENTIALS):
        client = storage.Client.from_service_account_json(
            GOOGLE_APPLICATION_CREDENTIALS
        )
    else:
        # Use default credentials (Workload Identity or ADC)
        client = storage.Client(project=GCP_PROJECT)

    bucket = client.bucket(GCS_BUCKET)
    target_dir = os.path.join(DBT_PROJECT_DIR, "target")

    # Files to upload
    artifacts = [
        ("manifest.json", True),      # Required
        ("catalog.json", False),      # Optional but recommended
        ("run_results.json", False),  # Optional
        ("sources.json", False),      # Optional
    ]

    uploaded = []
    failed = []

    for filename, required in artifacts:
        local_path = os.path.join(target_dir, filename)
        gcs_path = f"{GCS_PREFIX}/{filename}"

        if os.path.exists(local_path):
            try:
                blob = bucket.blob(gcs_path)
                blob.upload_from_filename(local_path)
                uploaded.append(filename)
                print(f"✓ Uploaded {filename} to gs://{GCS_BUCKET}/{gcs_path}")
            except Exception as e:
                error_msg = f"✗ Failed to upload {filename}: {e}"
                print(error_msg)
                if required:
                    raise Exception(error_msg)
                failed.append(filename)
        else:
            if required:
                raise FileNotFoundError(
                    f"Required artifact not found: {local_path}\n"
                    f"Make sure 'dbt run' completed successfully."
                )
            else:
                failed.append(filename)
                print(f"⊘ Skipping {filename} (not found - optional)")

    # Log summary
    print(f"\n{'='*50}")
    print(f"Upload Summary:")
    print(f"  Uploaded: {', '.join(uploaded) or 'None'}")
    print(f"  Skipped:  {', '.join(failed) or 'None'}")
    print(f"  GCS Location: gs://{GCS_BUCKET}/{GCS_PREFIX}/")
    print(f"{'='*50}")

    return {"uploaded": uploaded, "bucket": GCS_BUCKET, "prefix": GCS_PREFIX}


# =============================================================================
# DAG DEFINITION
# =============================================================================

with DAG(
    dag_id="dbt_with_gcs",
    default_args=default_args,
    description="Run dbt models and sync metadata to Collate via GCS",
    schedule_interval="0 6 * * *",  # Daily at 6 AM UTC
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=1,
    tags=["dbt", "collate", "gcs", "data-pipeline"],
) as dag:

    # Task Group: dbt Execution
    with TaskGroup(group_id="dbt_execution") as dbt_tasks:

        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command=f"""
                cd {DBT_PROJECT_DIR} && \
                dbt run --profiles-dir {DBT_PROFILES_DIR}
            """,
        )

        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command=f"""
                cd {DBT_PROJECT_DIR} && \
                dbt test --profiles-dir {DBT_PROFILES_DIR}
            """,
            trigger_rule="all_done",
        )

        dbt_docs = BashOperator(
            task_id="dbt_docs_generate",
            bash_command=f"""
                cd {DBT_PROJECT_DIR} && \
                dbt docs generate --profiles-dir {DBT_PROFILES_DIR}
            """,
        )

        dbt_run >> dbt_test >> dbt_docs

    # Upload to GCS
    upload_to_gcs = PythonOperator(
        task_id="upload_artifacts_to_gcs",
        python_callable=upload_artifacts_to_gcs,
    )

    # DAG Dependencies
    dbt_tasks >> upload_to_gcs

2.3 Alternative: Simple gsutil Upload

For simpler setups, use gsutil directly in a BashOperator (the trailing || true keeps the task from failing when optional artifacts such as run_results.json are missing):
upload_with_gsutil = BashOperator(
    task_id="upload_to_gcs",
    bash_command=f"""
        cd {DBT_PROJECT_DIR}/target && \
        gsutil -m cp manifest.json catalog.json run_results.json \
            gs://{GCS_BUCKET}/{GCS_PREFIX}/ || true
    """,
)

2.4 Verify DAG Deployment

# For Cloud Composer - upload DAG
gcloud composer environments storage dags import \
    --environment your-composer-env \
    --location us-central1 \
    --source dbt_with_gcs.py

# Check GCS after DAG completes
gsutil ls gs://your-company-dbt-artifacts/dbt/
Expected output:
gs://your-company-dbt-artifacts/dbt/manifest.json
gs://your-company-dbt-artifacts/dbt/catalog.json
gs://your-company-dbt-artifacts/dbt/run_results.json
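To check that each run refreshes the artifacts, gsutil stat reports the object's update time (bucket and prefix are the examples used throughout this guide):
gsutil stat gs://your-company-dbt-artifacts/dbt/manifest.json | grep -i "update time"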

Step 3: Configure Collate

Configuration

  1. Go to Settings → Services → Database Services
  2. Click on your database service (e.g., “production-bigquery”)
  3. Go to the Ingestion tab
  4. Click Add Ingestion
  5. Select dbt from the dropdown
Configure dbt Source (GCS):
| Field | Value | Notes |
|---|---|---|
| dbt Configuration Source | GCS | Select from dropdown |
| GCS Bucket Name | your-company-dbt-artifacts | Your bucket name |
| GCS Object Prefix | dbt | Folder path (no leading /) |
GCP Credentials: Upload the Collate service account key JSON:
  1. Click Upload Credentials
  2. Select ~/collate-sa-key.json
  3. Or paste the JSON content directly
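To confirm the key belongs to the expected project and service account before uploading it, note that the standard service account key JSON carries both fields:
grep -E '"project_id"|"client_email"' ~/collate-sa-key.json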
Configure dbt Options:
| Field | Recommended Value |
|---|---|
| Update Descriptions | Enabled |
| Update Owners | Enabled |
| Include Tags | Enabled |
| Classification Name | dbtTags |
Test & Deploy:
  1. Click Test Connection
  2. If successful, click Deploy
  3. Click Run to trigger immediately

Verification

After running the full pipeline, verify:
| Check | How to Verify | Expected Result |
|---|---|---|
| GCS artifacts exist | gsutil ls gs://bucket/dbt/ | manifest.json, catalog.json listed |
| Ingestion completed | Collate UI → Service → Ingestion tab | Green status, no errors |
| Lineage appears | Click on a dbt model → Lineage tab | Upstream/downstream connections |
| Descriptions synced | Click on a table → Schema tab | Column descriptions visible |
| Tags appear | Click on a table → Tags section | dbt tags shown |

Cloud Composer Specific Setup

Upload Service Account Key to Composer

export COMPOSER_ENV="your-composer-env"
export COMPOSER_LOCATION="us-central1"

# Get Composer bucket
COMPOSER_BUCKET=$(gcloud composer environments describe ${COMPOSER_ENV} \
    --location ${COMPOSER_LOCATION} \
    --format="get(config.dagGcsPrefix)" | sed 's|/dags||')

# Upload service account key
gsutil cp ~/dbt-sa-key.json ${COMPOSER_BUCKET}/dbt/dbt-sa-key.json

Set Environment Variables

# The key=value list must be passed as a single comma-separated argument (no spaces or line breaks)
gcloud composer environments update ${COMPOSER_ENV} \
    --location ${COMPOSER_LOCATION} \
    --update-env-variables \
        DBT_PROJECT_DIR=/home/airflow/gcs/dbt/my_project,GCS_BUCKET=your-company-dbt-artifacts,GCS_PREFIX=dbt,GOOGLE_APPLICATION_CREDENTIALS=/home/airflow/gcs/dbt/dbt-sa-key.json
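To confirm the variables were applied, describe the environment (the field path follows the Cloud Composer API; output format may differ slightly between Composer versions):
gcloud composer environments describe ${COMPOSER_ENV} \
    --location ${COMPOSER_LOCATION} \
    --format="get(config.softwareConfig.envVariables)"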

Troubleshooting

| Issue | Symptom | Cause | Solution |
|---|---|---|---|
| Access Denied | "403 Forbidden" error | Insufficient permissions | Verify the service account has the required role (storage.objectViewer for Collate, storage.objectCreator for dbt) |
| Bucket Not Found | "404 Not Found" | Bucket name incorrect | Check that the bucket name matches the actual bucket |
| Invalid Credentials | "Authentication failed" | Wrong service account key | Verify the JSON key belongs to the correct project and service account |
| No objects found | Artifacts not appearing | Wrong prefix or upload failed | Check that GCS_PREFIX matches the upload path |
| Stale data | Old lineage/descriptions | Old artifacts in GCS | Verify the dbt DAG uploads fresh artifacts after each run |
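For the permission-related issues above, two commands usually narrow things down: the bucket's IAM policy should list the service accounts and roles granted in Step 1, and gcloud auth list shows which identity your tooling is currently using:
# Inspect who holds which role on the bucket
gsutil iam get gs://${BUCKET_NAME}

# Show the currently active credentials
gcloud auth list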

Next Steps

See other storage options: S3 | Azure | HTTP | Local | dbt Cloud