Categories
Analytics AWS

AWS — ETL transformation

Introduction

ETL (Extract, Transform, and Load) data process to copy data from one or more sources into the destination system. Data-warehousing projects combine data from the different source systems or able to read a portion of data from a large set. Data is stored in a flat-file will be converted into Quarriable format at the presentation layer where analytics can be performed.

Meanwhile, AWS glue will be used for transforming data into the requested format.

Image for post
Architecture Design (image-1)

Extract

Data is placed in the S3 bucket as a flat-file with CSV format. These files or files will get transformed by glue.

The input file to test can be download from below link —

Transform

AWS Glue used to fetch data from flat files and create a Glue table using that file. I am working on a very small set of datasets with “US National Park” information. I have updated some of the data with additional fields that will break in normal transformation.

Input data — This has listed all US National Parks and its information

Image for post
image-2

Yellowstone Park has spanned across 3 states so quotes will be used to consider that field as part of a single-column “state”. The comma (“,”) as a separator will cause issues. State value can be just “‘Idaho” rather than “Idado, Montana. Wyoming”. So we need to transform this data into a meaningful one.

Image for post
image-3

Load

Upon creating a table from a flat-file. AWS Athena will be able to fetch data for analytics purposes or load it into a new environment. I am using “AwsDataCatalog” as an Athena data connector for AWS Glue.


Implementing using Terraform

Terraform software tools can be used to deploy infrastructure as code. We can change infrastructure efficiently and repeatedly with changing input parameters.

Following information needed –

· S3 bucket information and folder name where input data is present. As a best practice, I have added my bucket information into the parameter store rather than adding it into code. Make sure parameter “s3bucket” is created with value <s3://bucketname/>

Image for post

· Glue Database and table name

· Table Schema matching input parameter

Provider information. I have intentionally added the “access_key” and “secret_key” parameter and hashed it out to ensure no one should add that information in provider information instead create AWS profile using “aws configure” command. My profile name is “ya”. A profile can be different for different application or implementation.

Image for post

Download source code of Terraform from below location –

https://github.com/yogeshagrawal11/cloud/tree/master/aws/athena/s3_to_athena/

Once source code is downloaded in the same directory as terraforming software. run following command —

aws configure : To configure AWS profile. Make sure appropriate roles are assigned.

terraform init : to initialize terraform

terraform plan : To review plan

terraform apply : To deploy your infrastructure

Run terraforming code to deploy as Infrastructure as code. This code will create two resources, AWS Glue database and AWS Glue Table

Image for post

Implementing using AWS console

AWS Glue

I am not using the crawler here since this document is the first step in the ETL process. The crawler is needed in case input data is not static. We need to create a schedule to run crawler periodically for new data.

Create the AWS Glue database

Image for post
Image for post

Create the AWS Glue table. The location is the S3 input location. Select S3 bucket and folder name where input data is stored. Add Glue table name. Select the appropriate Glue database name.

Image for post
Image for post

Select S3 as a type of source and bucket name by clicking the folder image next to the text box

Image for post

Select classification and Delimiter as per inputs. Avro and Parquet is a new growing data format technique. JSON still widely excepted by many IoT devices worldwide as output technique. Select the appropriate classification and delimiter. I

Image for post

Add all column

Image for post
Image for post
Image for post

Table properties

Since my data do have a double quote, I have to transform that data from multiple fields into a single field using quote Character. The default “serde serialization” library is also changed to openCSVSerde.

Modify table property Add appropriate Serde Serialization and Serde Pameters as mentioned below. Each parameter will de define how each row is parsed. If quote char found in a field, that field will be a single entity until closing quotes. For ex. in case of “Yellowstone” national park.

Image for post
Image for post

Column and Data type can be defined or automatically generated. Defining manually normally does have better results.

Image for post

Athena Magic

Use data source at “AwsDataCatalog” to connect Athena with AWS Glue.

Image for post

Enjoy running queries on data.

Image for post

Most National Park by state.

Image for post

Athena is a good tool to get data from different connector like AWS Document DB, AWS DynamoDB, Redshift, or even Apache HBase. That will be another story.

Conclusion

Amazon glue is a great tool to perform different types of ETL operations. Use a crawler schedule for running and updating data periodically. AWS makes things movement seamless. New jobs can be scheduled via Scala or Python.

Resources :

https://www.terraform.io/intro/index.html

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

https://docs.aws.amazon.com/athena/latest/ug/what-is.html

Categories
AWS Compute

AWS Lambda 101

Lambda function Introduction

Lambda function is AWS offering commonly known for Function as a Service. AWS Lambda function help running code without provisioning or managing underline server. Many languages are supported by the Lambda function and the list is keep on growing. As of Jul2020, .Net, Go, Java, Node.js, Ruby, and my personal favorite Python among supported languages. Lambda is developed with high availability in mind and It is capable to scale during burstable requests.

We need to grant access to the Lambda function as per its use. Normally access is granted via the IAM role.

IAM Policy

Create a policy that will be able to create a log group and log stream. This is the basic execution rule required for the lambda function. Without this access, the Lambda function will not able to generate logs. Logging can be used for custom triggering events or tracking\debugging purposes.

Image for post
image-1
Image for post
image-2

IAM Role

The role is created to grant permission for specific tasks. In case, the function needed to access S3 get access to add appropriate policy into a role or create a custom policy. Always grant Least Privilege to function as per AWS security best practice.

Image for post

Select newly created policy

Image for post
image-4

Attach appropriate policy(image-5). Adding role description and tags are good practice in IAM. Click on Create Role.

Image for post
image-5

Lambda function

To create Lambda function, Goto Services and select Lambda.

Click the Lambda function.

Image for post
image-6

We have three options to choose from. Simplest on “Author from scratch”. In this option we will create “Hello world function” will also verify logging is working as per expectation.

“Use a blueprint”: AWS already created lots of useful functions that we can use to get started. Like, returns current status on AWS Batch job or retrieve an object from S3.

“Browse serverless app repository”: This will deploy sample Lambda application from different application repository. We also can use a private repository to pull code from.

Select an appropriate runtime environment. A select role that we have created in image-5.

Image for post
image-7

The designer will guide you on how the Lambda function is triggered. It can be triggered by different events like SNS topics, SQS, or event cloud watch logs. There are multiple different ways to trigger the Lambda function.

The Lambda function can be used for batch-oriented work or scripting purposes, you can use cloud watch rule to trigger cloud function using crontab at scheduled interval.

A destination can be an SQS event or SNS topic or event cloud watch log stream. We can also upload a read file from S3 and upload it to S3 after performing transformation within the Lambda function.

Image for post
image-8
Image for post
image-8a

This Hello world function is very simple, if it’s invoked from webhook it will return status code 200 with the body “Hello from Lamda!” It also writes log into the log stream. “event” and “context” been used to get input values as well as get HTTP context information that invokes the Lambda function.

An environment variable can be used to pass any static parameter to function like incase of downloading a file from S3 bucket, bucket name. Or while writing data into Dynamodb database. Its table name.

For security reasons, do not add your “Access Key” or “Secret key” values as an environment variable. One shall still use an encrypted parameter store for this purpose.

Image for post
image-9

The handler is the most important config parameter in the Lambda function. It has two parts separated by period (.)(image-10). The first part is nothing but a file name and the second part is a function definition that will read when the Lambda function is invoked(image-8a). I kept handler value default but I always recommend giving some meaning full name.

Memory is the amount of memory dedicated to Lambda function this depends upon activity you are performing and can be changed.

Timeout value determines how much time this function runs before times out. If the activity you like to perform will take more than the timeout value specified the Lambda function will be abruptly stopped. So, give some buffer time in the timeout value.

Image for post
image-10

Be default, the Lambda function does not require to be part of any VPC but in case Lambda functions needed to be able to communicate with EC2 instances or on-premise environment for data communication or invoking lambda function from EC2 instance we needed VPC configuration. You can still trigger the Lambda function using AWS SDK with VPC configuration.

It’s very common to store output data generated via Lambda function and store into Elastic Filesystem(EFS) since the Lambda function does not have static storage. We can use temporary ephemeral storage that lasted till the execution of function so EFS can be used to store all output.

Image for post
image-11

Permissions tab allows you to verify actions permitted for Lambda function on a given resource. The dropdown can be used to select on multiple resources. As per the below screenshot, the Lambda function can create Log group, Log stream, and able to put log messages on Amazon cloud watch logs.

Image for post
image-12

We can trigger the lambda function with different triggers. Here, I am creating a test event that will trigger the lambda function. Click on “Configure test events”. We are not sending any input value to function while invoking it. Key-value pair can be used (image-14) to send values to the Lambda function.

Image for post
image-13
Image for post
image-14

Once an event is created click on “Test” to invoke the Lambda function.

Image for post
image-15

The log output is shown below. Click on the “logs” URL to open the Cloud watch log group created by the Lambda function. This Log group will be available.

Image for post
image-16

Lambda event will create either a new Log stream or update into the existing log stream. Logs are put into the Log stream(image-17). Each log stream consists of many logs.

Image for post
image-16
Image for post
image-17

Deleting Lambda function

To delete the lambda function just select the Lambda function click on Actions and Delete.

Image for post
image-18
Image for post
image-19

The Lambda function successfully deleted.

Image for post
image-20

Cloud watch Log group and streams are not deleted by default. You can delete logs from cloud watch or exported into S3 for a cheaper cost.

To delete the log group. Go to Cloud watch select logs -> Log groups and select the appropriate log group names in the below format. Click on Action and delete Log groups.

/aws/lambda/<functionname>

Image for post
image-21
Image for post
image-22

Conclusion

Lambda function is the best way to run short jobs or create a script to run. This can each work with a webhook API or SNS/SQS environment. Have fun exploring multiple lambda function usage.