
Convert object-oriented data to NoSQL DynamoDB — 101

The IoT ecosystem is full of buzzwords and needs a lot of data management. We receive data, but making use of that data is what matters most. This design is a very small portion of a bigger portfolio, and many more applications can be integrated into it. There are many ways to perform this transformation; Athena and Glue could certainly be used here as well.

Design overview

Consider this design the bare minimum required to convert object-oriented data into data that can be used for analytics. I try to use managed services as much as possible in this design, but third-party tools could be used as well.

An application or IoT device will dump data into the S3 bucket. The data can have variable fields as long as the file naming is consistent. S3 will trigger the Lambda function when a put request completes. The Lambda function will download the file from the S3 bucket into its temporary storage. For historical trending of time-series data, the DynamoDB hash key can be built from a combination of "name" and "timestamp".

Lambda will convert the CSV file into JSON and add each row as an item in an Amazon DynamoDB table. Upon success, Lambda will send a notification to an SNS topic. The SNS topic is configured with two types of subscriptions, "SMS" and "SQS".

Failure events can be sent to another topic for retry.
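As a rough idea of what this Lambda handler could look like, here is a minimal sketch in Python with boto3. The environment variable names, table name, and topic ARN are placeholders; the actual implementation is in the GitHub repository linked further below.

```python
import csv
import json
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

TABLE_NAME = os.environ.get("TABLE_NAME", "nps_parks")  # placeholder names
TOPIC_ARN = os.environ.get("TOPIC_ARN", "")


def lambda_handler(event, context):
    # S3 put event: read bucket/key and download the CSV into Lambda's /tmp storage
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])
    local_path = "/tmp/" + os.path.basename(key)
    s3.download_file(bucket, key, local_path)

    table = dynamodb.Table(TABLE_NAME)
    inserted = 0
    with open(local_path, newline="") as f:
        for row in csv.DictReader(f):
            if not row.get("name"):
                continue                 # the hash key is mandatory; skip rows without it
            table.put_item(Item=row)     # each CSV row becomes one DynamoDB item
            inserted += 1

    # Notify the success topic; a failure topic could be handled the same way
    if TOPIC_ARN:
        sns.publish(TopicArn=TOPIC_ARN,
                    Message=json.dumps({"file": key, "items_inserted": inserted}))
    return {"statusCode": 200, "body": f"{inserted} items inserted"}
```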

Architecture Diagram (image 1)

Use Case:

This code can be used with a little tweak to take a set of S3 data and perform analysis on it (instead of a put trigger, use a copy or post event). Say a team wants to perform analytics on all data from last month: we can spin up this type of environment and provide the DynamoDB table for that specific analysis. Once the work is done, the whole configuration can be torn down.

Infrastructure as code (IaC)

IaC is one of the most important application deployment practices. It reduces errors and provides highly repeatable infrastructure, and it saves me from configuring parameters manually. All resource names are prefixed with the "appname" variable, so the same configuration can be used for different application environments or teams.

I chose Terraform to implement this so that a hybrid implementation remains an option if a customer requires it. Terraform supports all major cloud environments; obviously, we need to change the resources appropriately.

Terraform provider information: I highly recommend setting up a profile when running the "terraform init" command so that different environments are used with different access.

Avoid hard-coding "access_key" and "secret_key". You can also run Terraform from an EC2 instance with a proper IAM role for the deployment.

The following resource configurations will be added to the environment for this implementation:

  • app_lambda_role

The Lambda function will use this role. It should include S3 read access, CloudWatch log group and log stream write access, DynamoDB add/read/update item access, and SNS publish access.

  • app-lambda-cloud-watch-policy

Policy created with the access described above.

  • app-lambda-cloud-watch-policy-attachment

Attaches the policy to the role.

  • allow_bucket

Permission that allows S3 to trigger the Lambda function.

  • app-lambda-func

The Lambda function that runs when triggered by S3.

  • bucket_notification

S3 notification resource that triggers the Lambda function on the events specified in the notification. The "prefix" and "suffix" settings can be used to separate different types of environments.

  • app-snstopic

SNS topic where the Lambda function sends notifications of successful events. Note: I have not configured notifications for failure events; create another topic for those and update the Lambda code accordingly.

  • app-sns-target

The SNS target connects "SQS" as a subscription to "app-snstopic".

  • app-snstopic-sms

A separate topic for SMS. We could combine the topics with just another SMS subscription, but I wanted different topics so they can carry different kinds of data: to SQS we can send information about which rows failed so they can be retried, while the SMS topic carries concise information.

  • app-sms-target

The SMS target connects an SMS phone number, or a list of phone numbers, to which events are sent.

  • app-sqs

Queue that receives the notification information. It can be used to track items that were not successful; a Lambda function could be triggered to resolve those issues or retry. I have not added that functionality.

  • app-dynamodb-table

The table will be created per the input schema. The hash key is important and all input data should include it; if the hash key is not present, the item will not be inserted into the DynamoDB table. In my input, the "name" field is used as the hash key.

Source code

Download the source code from the GitHub link below:

https://github.com/yogeshagrawal11/cloud/tree/master/aws/DynamoDB/S3%20to%20Dynamodb

Download the zip file along with main.tf and terraform.tfvars, and change the appropriate values in the "terraform.tfvars" file.

Place the zip file in the same location as the Terraform files. The Lambda function will be created by the Terraform resource shown below.

Terraform apply command output

The resources are created with the "terraform apply -auto-approve" command. All 12 resources will be created.

Lambda function created.

Input file uploaded to S3.

Input file format.

DynamoDB table created by Terraform.

Lambda function triggered after uploading the input file.

Data inserted into the nps_parks table by the insertS3intoDynamodb Lambda function.

SNS topic created.

SQS queue: the message is posted into the SQS queue.

Next Step

  • Add an application to analyze data from DynamoDB and present visualizations of the information.
  • Add realistic, changing data instead of the static data used in this case study.

Disclaimer

1. The code is available under the Apache license.

2. Do not use this code for production; it is for educational purposes only.

3. Security around the environment needs to be improved.

4. The IAM policies must be tightened for production use.

5. I have not created a topic for failure events.

6. A failure domain is not considered in this design.

7. The Lambda function is created with bare-minimum code and does not perform data validation.


AWS EC2 instance management — Starting and Stopping instances using tagging

Shut down and start instances during non-business hours to reduce operating costs.

Architecture Diagram

The AWS Lambda function will be invoked by a CloudWatch rule at regular intervals, every hour. The function reviews the tags of each instance and verifies whether the instance needs to be shut down or started up depending on local time. The Lambda code can be placed on S3 for safekeeping and imported from there.

A new log stream and log group are created to keep track of shutdown and startup events. Amazon SNS topics can be used for event notification. A custom role is created to get EC2 instance information and write logs to CloudWatch.

Instance tags determine whether an instance needs to be shut down or started up.

Architecture Diagram (Image 1)

Tagging format

Key name: "shutdown"

Value format:

weekday:<hours to shut down, separated by commas>;weekend:<hours to shut down, separated by commas>;

weekday: Monday to Friday

weekend: Saturday and Sunday

This format can be changed, but that requires a code change.

Note: the format uses colon (:) and semicolon (;) as delimiters.
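A minimal sketch of how such a tag value might be parsed, assuming a value like "weekday:18,19;weekend:0,1" (the hour lists are illustrative):

```python
def parse_shutdown_tag(value):
    """Parse 'weekday:18,19;weekend:0,1' into {'weekday': [18, 19], 'weekend': [0, 1]}."""
    schedule = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        day_type, _, hours = part.partition(":")
        schedule[day_type.strip()] = [int(h) for h in hours.split(",") if h.strip()]
    return schedule


print(parse_shutdown_tag("weekday:18,19;weekend:0,1"))
# {'weekday': [18, 19], 'weekend': [0, 1]}
```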

Role Policy

Create a role policy with the following permissions.

image 2

Create a role and attach the newly created policy to it.

image 3

Lambda Function

Create a Lambda function with the Python 3.8 runtime and select the newly created role for it.

image 4

Download the Archive.zip file from my GitHub link below:

https://github.com/yogeshagrawal11/cloud/tree/master/aws/EC2/instance_shutdown_script

To add the function code, click on Actions and upload the zip file.

image 5

After uploading the zip file you will see that "lambda_function.py" contains the actual script for the instances. I am using the default handler in the script; if the handler information is changed, make the appropriate change to either the function name or the file name.

Filename: lambda_function.py

Python function name: lambda_handler

image 6

I am using the "pytz" module for correct timezone conversion, so that module's code needs to be uploaded along with the function.

For region-to-timezone mapping, I have created a dictionary. If your region is not in the list, feel free to add a new key with the region name and its timezone.

image 7
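A rough sketch of that mapping and lookup; the region list shown here is only an illustrative subset, so extend it with your own regions.

```python
from datetime import datetime

import pytz

# Illustrative subset of the region-to-timezone dictionary
REGION_TZ = {
    "us-west-2": "US/Pacific",
    "us-east-1": "US/Eastern",
    "ap-south-1": "Asia/Kolkata",
}


def local_hour(region):
    """Return the current hour and weekday/weekend flag in the region's local time."""
    tz = pytz.timezone(REGION_TZ[region])
    now = datetime.now(tz)
    day_type = "weekend" if now.weekday() >= 5 else "weekday"
    return now.hour, day_type
```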

On our magical Earth, sunset in one region happens at the same time as sunrise in another, so I call the function to perform both activities in the same run. "system_function" generates the lists of all instances that need to be stopped and started.

image 8

To handle a large instance count, I am using a paginator.

image 9

The paginator reduces the list by filtering on the tag value: if the "shutdown" tag is not set, the instance will not be part of the list. The same function is used for starting and stopping instances, so when shutting down I also filter for running instances, and when starting I filter for stopped instances.

image 10
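A minimal sketch of that filtering with a boto3 paginator (the region and helper name are illustrative; the actual code is in the GitHub repository above):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # region is illustrative


def instances_with_shutdown_tag(state):
    """Yield IDs of instances that carry the 'shutdown' tag and are in the given state."""
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(Filters=[
        {"Name": "tag-key", "Values": ["shutdown"]},
        {"Name": "instance-state-name", "Values": [state]},
    ])
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                yield instance["InstanceId"]


# Running instances are candidates to stop; stopped instances are candidates to start
to_stop = list(instances_with_shutdown_tag("running"))
to_start = list(instances_with_shutdown_tag("stopped"))
```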

Set a timeout value of more than the default 3 seconds; my runs normally take 10 to 15 seconds. This value can vary with the environment and the number of instances.

Please use the following handler or make the appropriate changes.

Lambda handler = lambda_function.lambda_handler

image 11

Lambda Permissions

The Lambda function has permission to create the log group/log stream and add log events in CloudWatch.

image 12

The Lambda function also has permission to get instance status, read instance tags, and start and stop instances.

image 13

CloudWatch Rule

Create a CloudWatch rule that triggers the Lambda function at the start of every hour, every day.

image 14

image 15

image 16
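The same rule can also be created programmatically. A hedged boto3 sketch equivalent to the console steps above (rule name, function name, and ARN are placeholders):

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_NAME = "instance-scheduler"                                   # placeholder
FUNCTION_ARN = "arn:aws:lambda:us-west-2:123456789012:function:" + FUNCTION_NAME

# Fire at minute 0 of every hour, every day
rule = events.put_rule(
    Name="hourly-instance-scheduler",
    ScheduleExpression="cron(0 * * * ? *)",
    State="ENABLED",
)
events.put_targets(
    Rule="hourly-instance-scheduler",
    Targets=[{"Id": "instance-scheduler-lambda", "Arn": FUNCTION_ARN}],
)
# Allow the rule to invoke the Lambda function
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="allow-hourly-rule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
```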

Testing

Current time: Fri 4:18 pm PST

In the Oregon region, instance 1 will be shut down and instance 2 will be started.

image 17

In the North Virginia region (+3 hours), east 2 and east 3 should be shut down.

image 18

Manually run the Lambda function.

image 19

CloudWatch log events

image 20

Oregon instances started and stopped as expected.

image 21

North Virginia instances stopped.

image 22

The Mumbai region instance stopped as well.

image 23

Disclaimer

An SNS topic is not triggered, but one can be added to notify on any stopping and starting of instances.

A database could be used to track the number of hours saved by the EC2 start/stop script.


AWS — ETL transformation

Introduction

ETL (Extract, Transform, and Load) is the process of copying data from one or more sources into a destination system. Data-warehousing projects combine data from different source systems, or read a portion of data from a larger set. Data stored in a flat file will be converted into a queryable format at the presentation layer, where analytics can be performed.

AWS Glue will be used to transform the data into the requested format.

Architecture Design (image-1)

Extract

Data is placed in the S3 bucket as a flat file in CSV format. This file, or files, will be transformed by Glue.

The input file for testing can be downloaded from the link below.

Transform

AWS Glue is used to fetch data from the flat files and create a Glue table from them. I am working with a very small dataset of "US National Park" information. I have updated some of the data with additional fields that would break a naive transformation.

Input data: this lists all US National Parks and their information.

image-2

Yellowstone National Park spans three states, so quotes are used to treat that field as a single "state" column. With a comma (",") as the separator, this causes issues: the state value can end up as just "Idaho" rather than "Idaho, Montana, Wyoming". So we need to transform this data into something meaningful.

image-3
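Python's csv module illustrates the same quoting behavior the Glue table needs (the sample row is made up for this sketch):

```python
import csv
import io

# Illustrative row; the real dataset lives in the S3 input file
raw = '"Yellowstone","Idaho, Montana, Wyoming","1872"\n'

# Without a quote character, the commas inside the state field split it into extra columns
print(next(csv.reader(io.StringIO(raw), quoting=csv.QUOTE_NONE)))
# ['"Yellowstone"', '"Idaho', ' Montana', ' Wyoming"', '"1872"']

# With a quote character (the default '"', the same role a quoteChar table property plays),
# the quoted value is kept as a single "state" column
print(next(csv.reader(io.StringIO(raw))))
# ['Yellowstone', 'Idaho, Montana, Wyoming', '1872']
```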

Load

Once the table is created from the flat file, AWS Athena will be able to fetch the data for analytics purposes or load it into a new environment. I am using "AwsDataCatalog" as the Athena data connector for AWS Glue.


Implementing using Terraform

Terraform can be used to deploy the infrastructure as code. We can change the infrastructure efficiently and repeatably by changing the input parameters.

The following information is needed:

· S3 bucket information and the folder name where the input data is present. As a best practice, I have added my bucket information to the Parameter Store rather than hard-coding it. Make sure the parameter "s3bucket" is created with the value <s3://bucketname/>.

· Glue Database and table name

· Table schema matching the input data

Provider information: I have intentionally added the "access_key" and "secret_key" parameters and commented them out as a reminder that no one should put that information in the provider block; instead, create an AWS profile using the "aws configure" command. My profile name is "ya". The profile can be different for different applications or implementations.

Download the Terraform source code from the location below:

https://github.com/yogeshagrawal11/cloud/tree/master/aws/athena/s3_to_athena/

Once the source code is downloaded into the same directory as the Terraform binary, run the following commands:

aws configure : to configure the AWS profile. Make sure the appropriate permissions are assigned.

terraform init : to initialize Terraform

terraform plan : to review the plan

terraform apply : to deploy your infrastructure

Run the Terraform code to deploy the infrastructure as code. This code will create two resources: the AWS Glue database and the AWS Glue table.

Implementing using AWS console

AWS Glue

I am not using a crawler here since this document covers only the first step in the ETL process. A crawler is needed when the input data is not static; in that case, create a schedule to run the crawler periodically for new data.

Create the AWS Glue database.

Create the AWS Glue table. The location is the S3 input location: select the S3 bucket and folder where the input data is stored. Add the Glue table name and select the appropriate Glue database.

Select S3 as the source type and choose the bucket by clicking the folder icon next to the text box.

Select the classification and delimiter per your input. Avro and Parquet are newer, growing data formats, and JSON is still widely accepted by many IoT devices worldwide as an output format. Select the appropriate classification and delimiter.

Add all columns.

Table properties

Since my data contains double quotes, I have to combine what would otherwise be multiple fields into a single field using the quote character. The default SerDe serialization library is also changed to OpenCSVSerde.

Modify the table properties: add the appropriate SerDe serialization library and SerDe parameters as mentioned below. These parameters define how each row is parsed: if a quote character is found in a field, that field is treated as a single entity until the closing quote, as in the case of the Yellowstone National Park row.
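For reference, the same table definition can also be created programmatically with boto3; this is a hedged sketch where the database, table, columns, and bucket location are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="nps_db",                      # placeholder database name
    TableInput={
        "Name": "nps_parks",                    # placeholder table name
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "name", "Type": "string"},
                {"Name": "state", "Type": "string"},
                {"Name": "date_established", "Type": "string"},
            ],
            "Location": "s3://my-bucket/nps/",  # placeholder input location
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
                "Parameters": {"separatorChar": ",", "quoteChar": '"', "escapeChar": "\\"},
            },
        },
    },
)
```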

Columns and data types can be defined manually or generated automatically; defining them manually normally gives better results.

Athena Magic

Use the "AwsDataCatalog" data source to connect Athena with AWS Glue.

Enjoy running queries on data.


Most national parks by state.
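The same kind of query can also be run programmatically. A hedged boto3 sketch (database, table, and output location are placeholders):

```python
import time

import boto3

athena = boto3.client("athena")

query = "SELECT state, COUNT(*) AS parks FROM nps_parks GROUP BY state ORDER BY parks DESC"
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Catalog": "AwsDataCatalog", "Database": "nps_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```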

Athena is a good tool for getting data from different connectors like Amazon DocumentDB, Amazon DynamoDB, Redshift, or even Apache HBase. But that is another story.

Conclusion

AWS Glue is a great tool for performing different types of ETL operations. Use a crawler schedule to run and update data periodically. AWS makes data movement seamless, and new jobs can be written and scheduled in Scala or Python.

Resources:

https://www.terraform.io/intro/index.html

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

https://docs.aws.amazon.com/athena/latest/ug/what-is.html


AWS Lambda 101

Lambda function Introduction

The Lambda function is AWS's offering commonly known as Function as a Service. AWS Lambda helps you run code without provisioning or managing the underlying servers. Many languages are supported by Lambda, and the list keeps growing; as of July 2020, .NET, Go, Java, Node.js, Ruby, and my personal favorite, Python, are among the supported languages. Lambda is designed with high availability in mind and is capable of scaling during bursts of requests.

We need to grant the Lambda function access according to its use. Normally, access is granted via an IAM role.

IAM Policy

Create a policy that is able to create a log group and log stream. This is the basic execution permission required for the Lambda function; without it, the Lambda function will not be able to generate logs. Logging can be used for custom triggering events or for tracking/debugging purposes.

image-1

image-2

IAM Role

A role is created to grant permission for specific tasks. If the function needs to access S3, add the appropriate policy to the role or create a custom policy. Always grant the function least privilege, as per AWS security best practice.

Select the newly created policy.

image-4

Attach the appropriate policy (image-5). Adding a role description and tags is good practice in IAM. Click on Create role.

image-5

Lambda function

To create a Lambda function, go to Services, select Lambda, and then click on Create function.

image-6

We have three options to choose from. The simplest is "Author from scratch"; with this option we will create a "Hello world" function and also verify that logging works as expected.

"Use a blueprint": AWS has already created lots of useful functions that we can use to get started, for example returning the current status of an AWS Batch job or retrieving an object from S3.

"Browse serverless app repository": this deploys a sample Lambda application from an application repository. We can also use a private repository to pull code from.

Select an appropriate runtime environment and select the role we created in image-5.

image-7

The designer shows how the Lambda function is triggered. It can be triggered by different events like SNS topics, SQS, or even CloudWatch logs; there are many different ways to trigger a Lambda function.

The Lambda function can be used for batch-oriented work or scripting purposes; you can use a CloudWatch rule with a cron schedule to trigger the function at a scheduled interval.

A destination can be an SQS queue, an SNS topic, or even a CloudWatch log stream. We can also read a file from S3 and upload it back to S3 after performing a transformation within the Lambda function.

image-8

image-8a

This Hello world function is very simple: if it is invoked from a webhook, it returns status code 200 with the body "Hello from Lambda!". It also writes a log into the log stream. "event" and "context" are used to get the input values as well as the HTTP context information of the invocation.
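The handler in question looks roughly like the console's default template, with a little logging added for this sketch:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    # "event" carries the input payload; "context" carries invocation metadata
    logger.info("Request id: %s", context.aws_request_id)
    logger.info("Event: %s", json.dumps(event))
    return {
        "statusCode": 200,
        "body": json.dumps("Hello from Lambda!"),
    }
```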

An environment variable can be used to pass any static parameter to the function, such as the bucket name when downloading a file from an S3 bucket, or the table name when writing data into a DynamoDB table.

For security reasons, do not add your "Access Key" or "Secret Key" values as environment variables. Use an encrypted parameter store for that purpose instead.
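A minimal sketch of that split, with plain configuration read from environment variables and secrets read from an encrypted SSM parameter (all names are placeholders):

```python
import os

import boto3

ssm = boto3.client("ssm")

# Plain configuration, such as bucket or table names, can come from environment variables
BUCKET_NAME = os.environ.get("BUCKET_NAME", "my-input-bucket")  # placeholder
TABLE_NAME = os.environ.get("TABLE_NAME", "my-table")           # placeholder


def get_secret(name):
    """Read a secret from an encrypted SSM parameter instead of an environment variable."""
    response = ssm.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]
```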

image-9

The handler is the most important configuration parameter of the Lambda function. It has two parts separated by a period (.) (image-10): the first part is the file name and the second part is the function that runs when the Lambda function is invoked (image-8a). I kept the default handler value, but I always recommend giving it a meaningful name.

Memory is the amount of memory dedicated to the Lambda function; it depends on the activity you are performing and can be changed.

The timeout value determines how long the function can run before it is stopped. If the activity you want to perform takes longer than the specified timeout, the Lambda function will be stopped abruptly, so give the timeout value some buffer.

image-10

By default, the Lambda function does not need to be part of any VPC. But if the Lambda function needs to communicate with EC2 instances or an on-premises environment for data exchange, or is invoked from an EC2 instance over private networking, VPC configuration is needed. You can still trigger the Lambda function using the AWS SDK when it has a VPC configuration.

It is very common to store output data generated by the Lambda function in Elastic File System (EFS), since the Lambda function does not have persistent storage; the temporary ephemeral storage lasts only for the duration of the invocation, so EFS can be used to keep all output.

image-11

The Permissions tab allows you to verify the actions permitted for the Lambda function on a given resource. The dropdown can be used to select among multiple resources. As per the screenshot below, the Lambda function can create a log group and log stream and put log messages into Amazon CloudWatch Logs.

image-12

We can trigger the Lambda function with different triggers. Here, I am creating a test event that will trigger the Lambda function: click on "Configure test events". We are not sending any input values to the function while invoking it; key-value pairs can be used (image-14) to send values to the Lambda function.

image-13

image-14

Once the event is created, click on "Test" to invoke the Lambda function.

image-15

The log output is shown below. Click on the "logs" URL to open the CloudWatch log group created by the Lambda function.

image-16

Each Lambda invocation either creates a new log stream or appends to an existing one. Logs are put into the log stream (image-17), and each log stream consists of many log events.

image-16

image-17

Deleting Lambda function

To delete the Lambda function, just select it, click on Actions, and choose Delete.

image-18

image-19

The Lambda function is successfully deleted.

image-20

The CloudWatch log group and streams are not deleted by default. You can delete the logs from CloudWatch or export them to S3 for cheaper storage.

To delete the log group, go to CloudWatch, select Logs -> Log groups, and select the appropriate log group name in the format below. Click on Actions and delete the log group.

/aws/lambda/<functionname>
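The same cleanup can be scripted; a small boto3 sketch (the function name is a placeholder):

```python
import boto3

logs = boto3.client("logs")

function_name = "my-function"  # placeholder; use your Lambda function's name
logs.delete_log_group(logGroupName=f"/aws/lambda/{function_name}")
```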

image-21

image-22

Conclusion

The Lambda function is a great way to run short jobs or scripts, and it works well with a webhook API or an SNS/SQS environment. Have fun exploring the many uses of Lambda functions.