Syncing CSV Files from AWS S3
Load CSV data from an AWS S3 bucket into Snowflake using Matillion when a direct connection to the source application is not available.
Matillion cannot pull data directly from some on-premises external applications because they do not accept requests from external sources. In such cases, push data from the on-premises application to an AWS S3 bucket. Matillion then fetches the data from the S3 bucket and loads it into Snowflake.
Because CSV files are a list of values separated by commas, Matillion assumes that the first value maps to the first column, the second value to the second column, and so on. Matillion cannot determine column names from the file structure: it only knows the position of each value. You must create a Snowflake table with columns that match both the names and positions of the fields in the CSV file before loading data.
Perform the following steps to sync CSV data from AWS S3 into Snowflake.
1. On the Matillion project page, click the More tab, then click Cloud credentials.
2. On the Cloud credentials page, click Add cloud credential. Select aws as the type. Enter the following fields:
Credential name: A descriptive name for the AWS credential.
Access key ID: The AWS access key ID.
Secret access key: The AWS secret access key.
Save the cloud credential, then associate it with the existing project environment.
3. Add a Create Table component to the orchestration pipeline. Define columns that match the names and positions of the fields in the CSV file. Configure the following properties:
Create Method: Select one of the following options:
Create: Creates a new table. Generates an error if a table with the same name already exists. Existing data is not destroyed.
Create if not exists: Creates a new table only when a table of the same name does not already exist. Does not destroy existing data.
Replace: Drops any existing table of the same name and creates a new table. Data from the replaced table is lost.
Table Type: Select one of the following options:
Permanent: Holds data indefinitely. This table type can be restored using Snowflake Time Travel.
Transient: Holds data indefinitely. This table type cannot be restored.
Columns: Define the columns of the table. Ensure their positions match the positions of the columns in the CSV file.
4. Add an S3 Load component after the Create Table component. Configure the following properties:
Authentication: Select one of the following options:
Credentials: Uses the AWS security credentials created in step 2.
Storage Integration: Uses a Snowflake storage integration object that stores a generated IAM entity for external cloud storage.
S3 Object Prefix: Enter the container path where the S3 bucket is located, or select from the list of S3 buckets using the file explorer.
Pattern: A string that partially matches all file paths and names to include in the load. Defaults to .*, which includes all files within the S3 Object Prefix. The pattern matches the complete file path, not just the path relative to the configured prefix.
Target Table: Select the table created in step 3.
Skip Header: Enter the number of header rows to skip. The default is 0. Set to 1 if the CSV file contains a header row.
Purge Files: When set to True, purges the data files from S3 after the data is successfully loaded. The default is False.
Force Load: When set to True, loads all files regardless of whether they have been loaded previously and have not changed since. This can lead to duplicated data. The default is False.
5. Add a Run Transformation component to transform the loaded data into the PTC Orbit format. Configure field mappings in the transformation pipeline using the Calculator component.
6. Add an SQL Script component to validate and group the stage records. Execute the appropriate stored procedures for the target object type.
7. Validate the pipeline by clicking Validate. If validation fails, click the component with the error icon to view the detailed error message. Fix the error, then click Run to execute the pipeline.
The CSV data from AWS S3 is loaded into Snowflake, transformed into the PTC Orbit format, and saved to the stage table for downstream processing.
Was this helpful?