
Collate Automations

Overview

Collate’s Automation feature is a powerful tool designed to simplify and streamline metadata management tasks. By automating repetitive actions such as assigning owners, domains, or tagging data, Collate helps maintain consistency in metadata across an organization’s datasets. These automations reduce manual effort and ensure that metadata is always up-to-date, accurate, and governed according to predefined policies.

Why Automations are Useful

Managing metadata manually can be challenging, particularly in dynamic environments where data constantly evolves. Collate’s Automation feature addresses several key pain points:
  • Maintaining Consistency: Automation helps ensure that metadata such as ownership, tags, and descriptions is applied consistently across all data assets.
  • Saving Time: Automations allow data teams to focus on higher-value tasks by eliminating the need for manual updates and maintenance.
  • Enforcing Governance Policies: Automations help ensure that data follows organizational policies at all times by automatically applying governance rules (e.g., assigning data owners or domains).
  • Data Quality and Accountability: Data quality suffers without clear ownership. Automating ownership assignments helps ensure that data quality issues are addressed efficiently.

Key Use Cases for Collate Automations

1. Bulk Description

  • Problem: Many datasets lack descriptions, making it difficult for users to understand the data’s purpose and contents. Sometimes, the same column description needs to be added to multiple datasets.
  • Solution: Automations can bulk-apply descriptions to tables and columns, ensuring that all data assets are consistently documented.
  • Benefit: This use case improves data discoverability and understanding, making it easier for users to find and use the data effectively.
For the Action Configuration:
  • Apply to Children: Lets you apply the description to the selected child assets (e.g., columns) within an asset.
  • Overwrite Metadata: Allows you to overwrite existing descriptions with the new description. Otherwise, the description is only applied to tables or columns that do not already have one.
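The Overwrite Metadata behavior for descriptions can be illustrated with a minimal sketch. This is not Collate's implementation or API; the function name `apply_description` and its parameters are hypothetical and only model the rule stated above.

```python
from typing import Optional


def apply_description(existing: Optional[str], new: str, overwrite: bool) -> str:
    """Return the description an asset ends up with (hypothetical helper).

    With overwrite enabled, the new description always wins. Otherwise it
    is only applied when the asset has no description (None or blank).
    """
    has_description = bool((existing or "").strip())
    if overwrite or not has_description:
        return new
    return existing


# An empty column picks up the new description either way.
print(apply_description("", "Customer email address", overwrite=False))
# A documented column keeps its text unless overwrite is enabled.
print(apply_description("Legacy text", "Customer email address", overwrite=False))
print(apply_description("Legacy text", "Customer email address", overwrite=True))
```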

2. Bulk Ownership and Domain Assignment

  • Problem: Many data assets lack proper ownership and domain assignment, leading to governance and accountability issues. Manually assigning owners can be error-prone and time-consuming.
  • Solution: Automations can bulk-assign ownership and domains to datasets, ensuring all data assets are correctly categorized and owned. This process can be applied to tables, schemas, or other assets within Collate.
  • Benefit: This use case ensures data assets have a designated owner and are organized under the appropriate domain, making data more discoverable and accountable.
For the Action Configuration:
  • Overwrite Metadata: Allows you to overwrite existing owner or domain with the configured one. Otherwise, we will only apply the owner or domain to assets that do not have an existing owner or domain.

3. Bulk Tagging and Glossary Term Assignment

  • Problem: Manually applying the same tags or glossary terms to multiple datasets can be inefficient and inconsistent.
  • Solution: Automations allow users to bulk-apply tags (e.g., PII) or glossary terms (e.g., Customer ID) to specific datasets, ensuring uniformity across the platform.
  • Benefit: This automation reduces the risk of missing important tags like PII-sensitive and ensures that key metadata elements are applied consistently across datasets.
For the Action Configuration:
  • Apply to Children: Lets you apply the Tags or Glossary Terms to the selected child assets (e.g., columns) within an asset.
  • Overwrite Metadata: Allows you to overwrite existing Tags or Terms with the configured one. Otherwise, we will add the new Tags or Terms to the existing ones.
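Unlike descriptions, tags and glossary terms are merged when Overwrite Metadata is disabled. The sketch below models that difference. The function `apply_tags` is a hypothetical illustration, not part of Collate, and the tag names are example values.

```python
def apply_tags(existing: list[str], configured: list[str], overwrite: bool) -> list[str]:
    """Return the tags an asset ends up with (hypothetical helper).

    With overwrite enabled, the configured tags replace the existing ones.
    Otherwise the configured tags are added to the existing ones, skipping
    duplicates and preserving order.
    """
    if overwrite:
        return list(configured)
    merged = list(existing)
    for tag in configured:
        if tag not in merged:
            merged.append(tag)
    return merged


print(apply_tags(["Tier.Tier1"], ["PII.Sensitive"], overwrite=False))
print(apply_tags(["Tier.Tier1"], ["PII.Sensitive"], overwrite=True))
```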

4. Bulk Test Case Assignment

  • Problem: Manually assigning or removing test cases for individual data assets is time-consuming and error-prone, especially at scale. For teams managing hundreds or thousands of assets, this repetitive process creates friction, reduces consistency, and delays the rollout of standardized data quality checks.
  • Solution: The Add Test Cases and Remove Test Cases actions in Automator allow users to manage test case assignments in bulk. Instead of individually configuring test cases on each asset, users can apply or remove a common test case (of the same type and configuration) across all filtered data assets in a single step.
Apply to Children option allows you to apply test cases at the column level instead of the entire table. When enabled:
  • You can specify and select individual column names from the table.
  • Only column-level test definitions are applied (table-level tests are automatically excluded).
  • Before a test is added, the column's data type is checked for compatibility with the test to ensure accuracy.
This functionality is not designed to trigger actions based on test case results (e.g., fail/pass/aborted).
For foundational concepts on test cases and their role in Collate’s metadata model, refer to: Test Cases in Collate.
  • Benefit:
    • Saves manual effort by enabling one-click bulk operations (add or remove) on test cases.
    • Enforces standardization of similar data quality checks across filtered assets.
    • Reduces human error and speeds up test deployment.
    • Helps maintain consistency in validation strategies across domains, asset types, or tags.

Action Configuration:

Add Test Cases: Apply the same test case configuration to all filtered data assets in one go. This is useful for bulk-enforcing validation rules like “not null”, “regex match”, etc.
Example Use Case:
  • Apply a “not null” test to every column tagged as Sensitive.
Remove Test Cases: Remove a specific test case (with a defined type/config) from all filtered data assets at once. This is ideal for cleaning up deprecated or incorrectly assigned tests.
Example Use Case:
  • Remove all “row count threshold” test cases from tables within a deprecated domain.
Supported Assets: Test cases currently only work with Table data assets. Other entities such as Topics, Containers, and Pipelines do not support test cases at this time.

5. Metadata Propagation via Lineage

  • Problem: When metadata such as tags, descriptions, or glossary terms is updated in one part of the data lineage, it may not be propagated across related datasets, leading to inconsistencies.
  • Solution: Use automations to propagate metadata across related datasets, ensuring that all relevant data inherits the correct metadata properties from the source dataset.
  • Benefit: Metadata consistency is ensured across the entire data lineage, reducing the need for manual updates and maintaining a single source of truth.
For the Action Configuration:
  1. First, we can choose if we want the propagation to happen at the Parent level (e.g., Table), Column Level, or both. This can be configured by selecting Propagate Parent and/or Propagate Column Level.
  2. Then, we can control which pieces of metadata we want to propagate via lineage:
    • Propagate Description: Propagates the description from the source asset to the downstream assets. Works for both parent and column-level.
    • Propagate Tags: Propagates the tags from the source asset to the downstream assets. Works for both parent and column-level.
    • Propagate Glossary Terms: Propagates the glossary terms from the source asset to the downstream assets. Works for both parent and column-level.
    • Propagate Owners: Only applicable for Parent assets. Propagates the owner information to downstream assets.
    • Propagate Tier: Only applicable for Parent assets. Propagates the tier information to downstream assets.
    • Propagate Domain: Only applicable for Parent assets. Propagates the domain information to downstream assets.
As with other actions, you can choose to Overwrite Metadata or keep the existing metadata and only apply the new metadata to assets that do not have the metadata already.

Advanced Propagation Controls

Propagation Depth
The Propagation Depth feature allows you to control how far metadata propagates through your lineage graph. It includes two modes:
  • Root (Default)
  • Data Asset (New)
Propagation Depth Mode only takes effect when the Propagation Depth field is set. If the field is left blank, propagation is unrestricted and the mode selection is ignored.
Propagation Depth Mode - Root (Default)
Description: This mode calculates depth starting from root entities (nodes with no upstream lineage). It is the default behavior.
Use Case: Use this mode to limit how far metadata propagates downstream, counting depth from the root.
Configuration:
  • Only valid if a Propagation Depth value is set (must be a positive integer, e.g., 1, 2, 3)
    • Limits metadata propagation from the root to that many levels.
  • Select Root as Propagation Depth Mode
How it works:
  • Depth starts at 0 for the root node.
  • Each downstream node increases depth by +1.
  • Depth is evaluated per path.
Example: Lineage: A → B → C → D
  • A (Depth 0 - root)
  • B (Depth 1)
  • C (Depth 2)
  • D (Depth 3)
If Propagation Depth = 2, metadata will only reach B and C. D will be excluded.
Advanced Example with Multiple Parents:
Lineage:
  • A → B → C
  • D → C
If Propagation Depth = 1:
  • C receives metadata from D (depth = 1)
  • C does not receive metadata from A (A → B → C = depth 2)
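The two Root Mode examples above can be reproduced with a small sketch. It is an illustration under stated assumptions, not Collate's implementation: the lineage is modeled as a plain dictionary mapping each node to its downstream nodes, and the function name `root_mode_reach` is hypothetical.

```python
from collections import deque


def root_mode_reach(downstream: dict[str, list[str]], max_depth: int) -> dict[str, dict[str, int]]:
    """For each root, return the downstream nodes reached and their depth.

    Depth is 0 at the root and grows by 1 per hop. It is tracked separately
    for every root, so a node can be in range of one root and out of range
    of another.
    """
    children = {n for targets in downstream.values() for n in targets}
    nodes = set(downstream) | children
    roots = sorted(nodes - children)  # roots have no upstream lineage

    reach: dict[str, dict[str, int]] = {}
    for root in roots:
        depths = {root: 0}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            if depths[node] == max_depth:
                continue  # depth limit reached on this path
            for child in downstream.get(node, []):
                if child not in depths:
                    depths[child] = depths[node] + 1
                    queue.append(child)
        del depths[root]  # the root is the source, not a receiver
        reach[root] = depths
    return reach


# Lineage A → B → C → D with Propagation Depth = 2: B and C are reached, D is not.
print(root_mode_reach({"A": ["B"], "B": ["C"], "C": ["D"]}, 2))
# Lineage A → B → C and D → C with Propagation Depth = 1: C is reached from D only.
print(root_mode_reach({"A": ["B"], "B": ["C"], "D": ["C"]}, 1))
```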
When to use:
  • You want a strict top-down metadata flow from root entities
  • Your lineage tree is deep and you want to control data flow scope
  • You aim to optimize performance in large graphs
Propagation Depth Mode - Data Asset (New)
Description: This mode limits propagation based on distance from the target asset (i.e., the asset being propagated to). Depth is calculated in reverse, from the selected asset upstream.
Use Case: Use this mode to propagate metadata to specific data assets while only including metadata from a fixed number of upstream levels.
Configuration:
  • Only valid if a Propagation Depth value is set (must be a positive integer, e.g., 1, 2, 3)
    • Limits the metadata a data asset receives to sources within that many upstream levels.
  • Select Data Asset as Propagation Depth Mode
How it works:
  • Depth starts at 0 from the target asset (e.g., D)
  • Each upstream node increases depth by +1
  • Only upstream nodes within the defined depth are considered for propagation
Example:
Lineage: A → B → C → D
Target: D
Propagation Depth: 2
Mode: Data Asset
  • D (Depth 0 - target)
  • C (Depth 1)
  • B (Depth 2)
  • A (Depth 3)
Only B and C will contribute metadata. A will be excluded because it is more than 2 levels upstream of D.
Comparison with Root Mode (same example):
  • In Root Mode, propagation starts from A and flows downwards
  • In Data Asset Mode, propagation still flows downwards but depth calculation starts from D.
  • The flow of metadata is always upstream → downstream. Only the depth calculation changes
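The Data Asset Mode example can be sketched in the same way. Again, this is only an illustration: the lineage is modeled as a dictionary mapping each node to its upstream nodes, and `data_asset_mode_sources` is a hypothetical name, not a Collate function.

```python
from collections import deque


def data_asset_mode_sources(upstream: dict[str, list[str]], target: str, max_depth: int) -> dict[str, int]:
    """Return the upstream nodes that may contribute metadata to target.

    Depth is 0 at the target and grows by 1 per upstream hop. Only nodes
    within max_depth hops are kept. Metadata itself still flows from
    upstream to downstream; only the depth count is reversed.
    """
    depths = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not look further upstream than the limit
        for parent in upstream.get(node, []):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    del depths[target]  # the target receives metadata, it is not a source
    return depths


# Lineage A → B → C → D, expressed as "node: its upstream nodes".
lineage = {"D": ["C"], "C": ["B"], "B": ["A"]}
# With Propagation Depth = 2, C and B contribute to D; A is out of range.
print(data_asset_mode_sources(lineage, "D", 2))
```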
When to use:
  • You are propagating metadata to a specific data asset and only want nearby (proximate) sources
  • For more targeted metadata control in complex graphs
Notes
  • Propagation Depth Mode selection is only applicable if a numeric depth is specified.
  • If Propagation Depth is empty, propagation is unrestricted, and the mode selection is ignored.
  • Default behavior (if no mode selected) is Root Mode.
Stop Propagation Conditions
The Stop Propagation feature lets you halt metadata flow when certain conditions are matched (e.g., sensitive data markers are encountered):
  • Use Case: Prevent metadata from propagating past assets that match specific conditions.
  • Supported Attributes:
    • description: Stop when specific description text is found
    • tags: Stop when specific tags are present
    • glossaryTerms: Stop when specific glossary terms are found
    • owner: Stop when specific owners are assigned
    • tier: Stop when specific tier levels are encountered
    • domain: Stop when specific domains are assigned
Important Note: When a stop condition is matched at a node, the propagation stops AT that node. The node retains its original metadata, and propagation does not continue to its downstream assets. Examples:
  1. Sensitive Data Boundaries: Stop at nodes tagged as “Confidential” or “PII-Sensitive”
  2. Organizational Boundaries: Halt at assets owned by specific teams
  3. Domain Transitions: Stop when crossing into different business domains
  4. Quality Thresholds: Pause at specific tier levels
How it works:
  • The system evaluates metadata at each node during propagation
  • When a node matches any specified condition, propagation stops at that node
  • Intelligent matching handles various formats (HTML in descriptions, object types)
  • Existing metadata at the stop point remains unchanged
Configuration Example:
Stop Conditions Examples:
  - Metadata: Tags
    Values: ["PII.Sensitive", "Confidential"]
  - Metadata: Domain
    Values: ["Finance", "Legal"]
  - Metadata: Description
    Values: ["DO NOT PROPAGATE"]
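The stop behavior described above can be sketched as follows. This is a simplified model and not Collate's matching logic: assets are plain dictionaries, the condition format (`metadata` and `values` keys) mirrors the configuration example only loosely, descriptions are matched by simple substring, and the names `hits_stop_condition` and `propagate` are hypothetical.

```python
from collections import deque


def hits_stop_condition(asset: dict, conditions: list[dict]) -> bool:
    """Return True if the asset matches any configured stop condition."""
    for cond in conditions:
        field, values = cond["metadata"], cond["values"]
        current = asset.get(field)
        if current is None:
            continue
        if field == "description":
            # Assumption: plain substring match; real matching may differ.
            if any(v in current for v in values):
                return True
        elif isinstance(current, list):
            if any(v in current for v in values):
                return True
        elif current in values:
            return True
    return False


def propagate(source: str, downstream: dict, assets: dict, conditions: list[dict]) -> list[str]:
    """Return the downstream assets that would receive metadata from source."""
    received, seen, queue = [], {source}, deque([source])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            if hits_stop_condition(assets.get(child, {}), conditions):
                continue  # stop AT this node: it is unchanged and nothing flows past it
            received.append(child)
            queue.append(child)
    return received


lineage = {"A": ["B"], "B": ["C"], "C": ["D"]}
assets = {"B": {"tags": []}, "C": {"tags": ["PII.Sensitive"]}, "D": {}}
stops = [{"metadata": "tags", "values": ["PII.Sensitive", "Confidential"]}]
# C matches a stop condition, so only B receives metadata; C and D are untouched.
print(propagate("A", lineage, assets, stops))
```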

6. Automatic PII Detection and Tagging

Note: We recommend using the Auto Classification workflow instead, which allows you to discover PII data automatically, even in cases where you don’t want to ingest the Sample Data into Collate. This automation (ML Tagging) will be deprecated in future releases.
  • Problem: Manually identifying and tagging Personally Identifiable Information (PII) across large datasets is labor-intensive and prone to errors.
  • Solution: Automations can automatically detect PII data (e.g., emails, usernames) and apply relevant tags to ensure that sensitive data is flagged appropriately for compliance.
  • Benefit: Ensures compliance with data protection regulations by consistently tagging sensitive data, reducing the risk of non-compliance.

Best Practices

  • Validate Assets Before Applying Actions: Always use the Explore page to verify the assets that will be affected by the automation. This ensures that only the intended datasets are updated.
  • Use Automation Logs: Regularly check the Recent Runs logs to monitor automation activity and ensure that automations are running as expected.
  • Propagate Metadata Thoughtfully: When propagating metadata via lineage, make sure that the source metadata is correct before applying it across multiple datasets.
  • Start with Controlled Propagation: For complex and large lineage trees, begin the propagation with a limited propagation depth (e.g., 2-3 levels/depth) and gradually increase as needed to avoid unintended widespread changes.
  • Understand Path-Aware Depth Behavior: In complex lineage with multiple parent paths, remember that propagation depth is calculated separately for each path from each root entity. This ensures precise control over which upstream sources contribute metadata to downstream assets.
  • Set Up Stop Conditions for Critical Data: Configure strategic stop conditions around critical ownership boundaries or sensitive data boundaries (for example, the tags PII or Confidential) to prevent accidental metadata overwrites.