
CSV, JSON conversion into Parquet

In this world of data explosion, every company is working on consolidating its data into a common format. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.
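
As a rough illustration of per-column compression, here is a minimal sketch using the pyarrow library; the movie columns and the codec choices are made up for this example and are not part of the Glue walkthrough below.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical movie records; columns and values are only for illustration.
table = pa.table({
    "title": ["Movie A", "Movie B"],
    "year": [1999, 2005],
    "rating": [7.4, 8.1],
})

# Parquet allows a different compression codec to be chosen per column.
pq.write_table(
    table,
    "movies.parquet",
    compression={"title": "gzip", "year": "snappy", "rating": "snappy"},
)
```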

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

AWS Glue

Source Data

I am using AWS Glue to convert CSV and JSON files into Parquet files. At this point I have some data in CSV format and some in JSON format. The CSV data is stored in AWS S3 under the source/movies/csv folder, and the JSON data is stored under the source/movies/json folder. All input files are stored in those two locations.

CSV input data

JSON input data

AWS Glue Implementation

Glue Classifier

A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was.

AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on the results that are returned from custom classifiers, AWS Glue might also invoke built-in classifiers. If a classifier returns certainty=1.0 during processing, it indicates that it’s 100 percent certain that it can create the correct schema. AWS Glue then uses the output of that classifier.

CSV Classifier

I am creating a CSV classifier. The column delimiter is "," and the quote symbol is the double quote. The files also have a header row.
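
As a minimal sketch, the same classifier can also be created with boto3; the classifier name below is my own choice, while the delimiter, quote symbol, and header setting match the options described above.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical classifier name; the other settings mirror the console choices.
glue.create_classifier(
    CsvClassifier={
        "Name": "movies-csv-classifier",
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)
```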

JSON Classifier

Create a JSON classifier in the same way.
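
A corresponding boto3 sketch for the JSON classifier might look like this; the name and the JSON path are assumptions, and the path should be adjusted to match how your JSON records are actually laid out.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical name and JsonPath; "$[*]" assumes each file holds an array of records.
glue.create_classifier(
    JsonClassifier={
        "Name": "movies-json-classifier",
        "JsonPath": "$[*]",
    }
)
```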

AWS Glue Studio

Create a CSV reading job. Point the source location to the S3 path where the csv folder is located.

Change the long data type to the integer data type in the job's transform step.

Enter the S3 location where your Parquet files need to be stored.
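
For reference, a Glue Studio job configured like this corresponds roughly to the following PySpark script sketch; the bucket name, paths, and column names are placeholders I have made up, not the exact ones used above.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV source (bucket and column names are hypothetical).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/source/movies/csv/"]},
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)

# Cast the long column to int, matching the data type change made in the console.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("title", "string", "title", "string"),
        ("year", "long", "year", "int"),
    ],
)

# Write the result as Parquet to the target folder.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/target/movies/parquet/"},
    format="parquet",
)

job.commit()
```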

Create an IAM role with the following permissions (a boto3 sketch follows the list) –

  • S3 – read, list, and write permissions
  • CloudWatch – log group and log stream creation, as well as permission to put log events
  • Glue – service role
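
As a rough sketch, such a role could be created with boto3 as shown below; the role name, policy name, and bucket are hypothetical, and in practice you may want to scope the S3 and CloudWatch Logs actions more tightly.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical role name; Glue must be allowed to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="movies-glue-job-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Managed Glue service role policy.
iam.attach_role_policy(
    RoleName="movies-glue-job-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Inline policy for S3 access and CloudWatch Logs (bucket name is hypothetical).
inline_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*",
        },
    ],
}

iam.put_role_policy(
    RoleName="movies-glue-job-role",
    PolicyName="movies-glue-job-access",
    PolicyDocument=json.dumps(inline_policy),
)
```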

Create a JSON job in the same way, with the JSON folder as the source. The target should be the same folder.

The Parquet files are now generated and saved at the location shown below.

Create an Athena table over the Parquet files and query your records.
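
One way to do this is to run an external table DDL through boto3; the database, table, column, and bucket names here are assumptions, and the column types reflect the long-to-integer fix described in the PS below.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical external table over the Parquet output folder.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS movies_db.movies (
    title string,
    year int,
    rating double
)
STORED AS PARQUET
LOCATION 's3://my-bucket/target/movies/parquet/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```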

PS: After a closer look, my Athena query did not handle the long integer values coming from my Parquet files, so that needed to be fixed.

After fixing that, I am able to query the integer values correctly.

Conclusion

Parquet is comparatively faster to query than CSV and JSON due to its columnar data structure, which is why most data lakes in the industry have started using it.


By Yogesh Agrawal

