
S3 Storage
Stage: PROD

How to Run the Connector Externally
To run the ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with custom Airflow plugins to handle the workflow deployment. If, instead, you want to manage your workflows externally on your preferred orchestrator, you can check the following docs to run the Ingestion Framework anywhere.

Requirements
OpenMetadata 1.0 or later
To deploy OpenMetadata, check the Deployment guides.
S3 Permissions
For all the buckets that we want to ingest, we need to provide the following permissions:

- s3:ListBucket
- s3:GetObject
- s3:GetBucketLocation
- s3:ListAllMyBuckets

Note that the Resources should be all the buckets that you'd like to scan. A possible policy could be:
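The sketch below assumes a single placeholder bucket named my-bucket; add one pair of Resource entries per bucket you want to scan (s3:ListAllMyBuckets is account-wide, so it is granted on all resources):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadBucketsToIngest",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::my-bucket",
                "arn:aws:s3:::my-bucket/*"
            ]
        },
        {
            "Sid": "ListAllBuckets",
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        }
    ]
}
```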
CloudWatch Permissions
This is used to fetch the total size in bytes for a bucket and the total number of files it contains. It requires:

- cloudwatch:GetMetricData
- cloudwatch:ListMetrics

The policy would look like:
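A minimal sketch, granted on all resources since these CloudWatch read actions are not scoped to individual buckets:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "cloudwatch:ListMetrics"
            ],
            "Resource": "*"
        }
    ]
}
```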
Python Requirements
To run the S3 ingestion, you will need to install:
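A likely install command, assuming the datalake-s3 plugin of the openmetadata-ingestion package (the exact extra name can vary between releases, so check the requirements listed for your version):

```bash
pip3 install "openmetadata-ingestion[datalake-s3]"
```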
OpenMetadata Manifest
In any other connector, extracting metadata happens automatically. In this case, we will be able to extract high-level metadata from buckets, but in order to understand their internal structure we need users to provide an openmetadata.json file at the bucket root.
Supported File Formats: [ "csv", "tsv", "avro", "parquet", "json", "json.gz", "json.zip" ]
You can learn more about this here. Keep reading for an example of the shape of the manifest file.
OpenMetadata Manifest
Our manifest file is defined as a JSON Schema and can look like this:
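The example below is a sketch of a local openmetadata.json manifest; the entry fields (dataPath, structureFormat, isPartitioned, partitionColumns) are an assumption based on the published JSON Schema, so confirm them against the schema for your version. Each dataPath is a prefix inside the bucket whose contents should be modeled as one container:

```json
{
    "entries": [
        {
            "dataPath": "transactions",
            "structureFormat": "csv",
            "isPartitioned": false
        },
        {
            "dataPath": "cities",
            "structureFormat": "parquet",
            "isPartitioned": true,
            "partitionColumns": [
                {
                    "name": "State",
                    "dataType": "STRING"
                }
            ]
        }
    ]
}
```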
Global Manifest
You can also manage a single manifest file to centralize the ingestion process for any container, named openmetadata_storage_manifest.json.
You can also keep local openmetadata.json manifests in each container, but if possible we will always try to pick up the global manifest during the ingestion.
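A sketch of such a global manifest, assuming the same entry fields as above plus a containerName field that ties each entry to its bucket (treat the field name as an assumption and verify it against the JSON Schema):

```json
{
    "entries": [
        {
            "containerName": "my-bucket",
            "dataPath": "transactions",
            "structureFormat": "csv",
            "isPartitioned": false
        }
    ]
}
```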