
CSV, JSON conversion into Parquet

In this world of data explosion, every company is working on consolidating its data into a common format. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

AWS Glue

Source Data

I am using AWS Glue to convert CSV and JSON files into Parquet files. At this time I have some data in CSV format and some in JSON format. The CSV data is stored in S3 under the source/movies/csv folder, and the JSON data is stored under the source/movies/json folder. All input files live in those two locations.

CSV input data

JSON input data

AWS Glue Implementation

Glue Classifier

A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was.

AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on the results that are returned from custom classifiers, AWS Glue might also invoke built-in classifiers. If a classifier returns certainty=1.0 during processing, it indicates that it’s 100 percent certain that it can create the correct schema. AWS Glue then uses the output of that classifier.

CSV Classifier

I am creating a CSV classifier. The column delimiter is "," and the quote symbol is a double quote. The files also have a header row.
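The same classifier can also be created programmatically with boto3; this is a minimal sketch in which the classifier name is a placeholder, while the delimiter, quote symbol, and header setting match the values above.

import boto3

glue = boto3.client("glue")

# CSV classifier matching the settings described above; the name is hypothetical.
glue.create_classifier(
    CsvClassifier={
        "Name": "movies-csv-classifier",
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",  # the input files have a header row
    }
)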

JSON Classifier

Create a JSON classifier.
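A comparable boto3 sketch for the JSON classifier; the name and the JSON path are assumptions (the "$[*]" path treats each element of a top-level array as a record), so adjust them to your input layout.

import boto3

glue = boto3.client("glue")

# JSON classifier; the name and JsonPath are placeholders.
glue.create_classifier(
    JsonClassifier={
        "Name": "movies-json-classifier",
        "JsonPath": "$[*]",
    }
)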

AWS Glue Studio

Create a CSV-reading job. Point the source location to the S3 path where the csv folder is located.

Change the long data type to integer in the mapping.

Enter the S3 location where the Parquet files need to be stored.
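For readers who prefer the script view, a Glue Studio job of this shape roughly boils down to the PySpark script below. This is a sketch, not the generated job itself: the bucket name, target path, and column names are placeholders, and only the source folder comes from the layout described earlier.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV input; the bucket name is a placeholder.
csv_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<bucket>/source/movies/csv/"]},
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)

# Cast long columns to int, mirroring the datatype change made in Glue Studio.
# The column names here are hypothetical.
mapped = ApplyMapping.apply(
    frame=csv_frame,
    mappings=[
        ("title", "string", "title", "string"),
        ("year", "long", "year", "int"),
    ],
)

# Write the Parquet output; the target path is a placeholder.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://<bucket>/target/movies/parquet/"},
    format="parquet",
)
job.commit()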

Create an IAM role with the following permissions (sketched below the list) –

  • S3 – read, list, and write permissions
  • CloudWatch – log group and log stream creation, plus permission to put log events
  • Glue – service role
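A rough boto3 sketch of that role setup follows; the role name, policy name, and bucket ARN are hypothetical, and the managed AWSGlueServiceRole policy covers the Glue service-role part.

import json

import boto3

iam = boto3.client("iam")

# Trust policy letting the Glue service assume the role
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(
    RoleName="glue-parquet-job-role",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust),
)

# Glue service role (AWS managed policy)
iam.attach_role_policy(
    RoleName="glue-parquet-job-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# S3 read/list/write plus CloudWatch Logs permissions
inline = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::<bucket>", "arn:aws:s3:::<bucket>/*"]},
        {"Effect": "Allow",
         "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
         "Resource": "*"},
    ],
}
iam.put_role_policy(
    RoleName="glue-parquet-job-role",
    PolicyName="glue-s3-logs-access",
    PolicyDocument=json.dumps(inline),
)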

Create the JSON job the same way, with the JSON folder as the source. The target should be the same folder as the CSV job.

Now the Parquet files are generated and saved at the location below.

Create an Athena table over the Parquet files and list your records.
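One way to create that table is with a CREATE EXTERNAL TABLE statement submitted through boto3. This is a sketch with hypothetical database, table, and column names; the LOCATION should point at wherever the Glue jobs wrote their Parquet output.

import boto3

athena = boto3.client("athena")

# Hypothetical names and paths; adjust to your own bucket and schema.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS movies_db.movies_parquet (
  title string,
  year  int
)
STORED AS PARQUET
LOCATION 's3://<bucket>/target/movies/parquet/'
"""
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://<bucket>/athena-results/"},
)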

PS: After looking more carefully, my Athena query did not handle the long integer values coming from my Parquet files; that needed to be fixed.

After fixing that, I am able to read the integer columns.

Conclusion

Parquet is comparatively faster to query than CSV and JSON due to its columnar structure, which is why most data lakes in the industry have started using it.


AWS — ETL transformation

Introduction

ETL (Extract, Transform, and Load) is the process of copying data from one or more sources into a destination system. Data-warehousing projects combine data from different source systems, or read a portion of data from a larger set. Data stored in flat files is converted into a queryable format at the presentation layer, where analytics can be performed.

Here, AWS Glue will be used to transform the data into the requested format.

Architecture Design (image-1)

Extract

Data is placed in the S3 bucket as flat files in CSV format. These files will be transformed by Glue.

The input file used for testing can be downloaded from the link below —

Transform

AWS Glue is used to fetch data from the flat files and create a Glue table from them. I am working with a very small dataset of "US National Park" information. I have updated some of the data with additional fields that would break a naive transformation.

Input data — this lists all US National Parks and their information.

image-2

Yellowstone Park spans three states, so quotes are used to keep that value inside the single "state" column. With a comma (",") as the separator, this causes issues: the state value can end up as just "Idaho" rather than "Idaho, Montana, Wyoming". So we need to transform this data into something meaningful.

image-3

Load

Once a table is created from the flat files, AWS Athena is able to fetch the data for analytics or load it into a new environment. I am using "AwsDataCatalog" as the Athena data connector for AWS Glue.


Implementing using Terraform

Terraform can be used to deploy infrastructure as code. It lets us change infrastructure efficiently and repeatably by changing input parameters.

The following information is needed –

· S3 bucket information and the folder name where the input data is present. As a best practice, I have added my bucket information to the Parameter Store rather than putting it in code. Make sure a parameter named "s3bucket" is created with the value <s3://bucketname/> (see the sketch after this list).


· Glue Database and table name

· Table schema matching the input data
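The "s3bucket" parameter can be created from the console or the CLI; as a minimal boto3 sketch (the value shown is the placeholder from above):

import boto3

ssm = boto3.client("ssm")

# Store the bucket location once so the Terraform code can read it back
# instead of hard-coding the bucket in source control.
ssm.put_parameter(
    Name="s3bucket",
    Value="s3://bucketname/",
    Type="String",
    Overwrite=True,
)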

Provider information: I have intentionally added the "access_key" and "secret_key" parameters and commented them out, to make the point that nobody should put that information in the provider block; instead, create an AWS profile using the "aws configure" command. My profile name is "ya". The profile can differ for different applications or implementations.


Download the Terraform source code from the location below –

https://github.com/yogeshagrawal11/cloud/tree/master/aws/athena/s3_to_athena/

Once the source code is downloaded into the same directory as the Terraform binary, run the following commands —

aws configure : to configure the AWS profile (make sure appropriate roles are assigned)

terraform init : to initialize Terraform

terraform plan : to review the plan

terraform apply : to deploy your infrastructure

Run the Terraform code to deploy the infrastructure as code. This code creates two resources: an AWS Glue database and an AWS Glue table.


Implementing using AWS console

AWS Glue

I am not using a crawler here, since this document covers the first step in the ETL process. A crawler is needed when the input data is not static; in that case we create a schedule to run the crawler periodically over new data.
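If a scheduled crawler were needed, a minimal boto3 sketch might look like this; the crawler name, role, database, path, and cron expression are all assumptions.

import boto3

glue = boto3.client("glue")

# Crawl the input folder daily at 02:00 UTC; every name here is a placeholder.
glue.create_crawler(
    Name="parks-input-crawler",
    Role="glue-service-role",
    DatabaseName="parks_db",
    Targets={"S3Targets": [{"Path": "s3://<bucket>/parks/"}]},
    Schedule="cron(0 2 * * ? *)",
)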

Create the AWS Glue database


Create the AWS Glue table. The location is the S3 input location: select the S3 bucket and folder name where the input data is stored. Add the Glue table name and select the appropriate Glue database.


Select S3 as the source type and pick the bucket by clicking the folder icon next to the text box.


Select the classification and delimiter as per the input. Avro and Parquet are newer, growing data formats, while JSON is still widely accepted by many IoT devices worldwide as an output format. Select the appropriate classification and delimiter.


Add all columns.


Table properties

Since my data contains double quotes, I have to collapse what would otherwise parse as multiple fields into a single field using the quote character. The default SerDe serialization library is therefore changed to OpenCSVSerde.

Modify the table properties: add the appropriate SerDe serialization library and SerDe parameters as mentioned below. Each parameter defines how a row is parsed. If a quote character is found in a field, that field is treated as a single entity until the closing quote, as in the "Yellowstone" national park example.
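For reference, the same table definition, including the OpenCSVSerde settings, can be expressed through the Glue API. This is a sketch only: the database, table, column names, and location are placeholders, while the SerDe library and the quoteChar/separatorChar parameters are the ones discussed above.

import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="parks_db",           # placeholder
    TableInput={
        "Name": "national_parks",      # placeholder
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "name", "Type": "string"},
                {"Name": "state", "Type": "string"},
            ],
            "Location": "s3://<bucket>/parks/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
                "Parameters": {"separatorChar": ",", "quoteChar": '"', "escapeChar": "\\"},
            },
        },
    },
)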


Columns and data types can be defined manually or generated automatically. Defining them manually normally gives better results.


Athena Magic

Use the "AwsDataCatalog" data source to connect Athena with AWS Glue.


Enjoy running queries on data.


Most national parks by state.
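The aggregation behind that result could also be submitted programmatically; this sketch assumes the hypothetical parks_db database, national_parks table, and state column used earlier.

import boto3

athena = boto3.client("athena")

# Count parks per state; table and column names are placeholders.
query = """
SELECT state, COUNT(*) AS parks
FROM national_parks
GROUP BY state
ORDER BY parks DESC
"""
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "parks_db", "Catalog": "AwsDataCatalog"},
    ResultConfiguration={"OutputLocation": "s3://<bucket>/athena-results/"},
)
print(response["QueryExecutionId"])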


Athena is a good tool for getting data from different connectors such as Amazon DocumentDB, Amazon DynamoDB, Redshift, or even Apache HBase. That will be another story.

Conclusion

AWS Glue is a great tool for performing different types of ETL operations. Use a crawler schedule to run and update data periodically. AWS makes data movement seamless, and new jobs can be written in Scala or Python.

Resources :

https://www.terraform.io/intro/index.html

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

https://docs.aws.amazon.com/athena/latest/ug/what-is.html