In this world of data explosion, every company is working on consolidating its data into a common format. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to the simple flattening of nested namespaces.
Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.
Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.
I am using AWS Glue to convert CSV and JSON files into Parquet files. At this time I have some data in CSV format and some in JSON format. The CSV data is stored in AWS S3 in the source/movies/csv folder, and the JSON data is stored in AWS S3 in the source/movies/json folder.
CSV input data
JSON input data
AWS Glue Implementation
A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was.
AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on the results that are returned from custom classifiers, AWS Glue might also invoke built-in classifiers. If a classifier returns
certainty=1.0 during processing, it indicates that it’s 100 percent certain that it can create the correct schema. AWS Glue then uses the output of that classifier.
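To make the selection logic above concrete, here is a minimal pure-Python sketch of how a crawler might pick a classifier result. The `run_classifiers` helper and the toy classifiers are hypothetical; AWS Glue performs this internally and does not expose such an API.

```python
# Hypothetical sketch: custom classifiers run first, built-ins after;
# the first classifier that reports certainty 1.0 wins outright,
# otherwise the highest-certainty result is used.

def run_classifiers(custom, builtin, sample):
    """Return the schema from the first fully certain classifier,
    falling back to the highest-certainty match otherwise."""
    best = None
    for classifier in custom + builtin:  # custom classifiers are invoked first
        certainty, schema = classifier(sample)
        if certainty == 1.0:             # 100 percent certain: use this schema
            return schema
        if best is None or certainty > best[0]:
            best = (certainty, schema)
    return best[1] if best else None

# Toy classifiers for illustration only (real ones inspect far more).
def csv_classifier(sample):
    return (1.0, "csv-schema") if "," in sample else (0.0, None)

def json_classifier(sample):
    return (1.0, "json-schema") if sample.lstrip().startswith("{") else (0.0, None)

print(run_classifiers([csv_classifier], [json_classifier], "id,title,year"))
# -> csv-schema
```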
I am creating a CSV classifier. The column delimiter is "," and the quote symbol is a double quote. The files also have a header row.
Create a JSON classifier
AWS Job Studio
The CSV file reading job: point the source to the S3 location where the csv folder resides.
Change the long data type to the integer data type.
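In a Glue job this type change is expressed through the ApplyMapping transform, where each mapping is a (source name, source type, target name, target type) tuple. The following is a pure-Python sketch of what such a mapping does to a record; the record, column names, and `apply_mapping` helper are illustrative only, not the Glue API.

```python
# Hypothetical sketch of an ApplyMapping-style transform:
# each tuple is (source_name, source_type, target_name, target_type).
MAPPINGS = [
    ("id", "long", "id", "int"),        # long -> int, as in the job above
    ("title", "string", "title", "string"),
    ("year", "long", "year", "int"),    # long -> int
]

# How each target type coerces a value in this toy model.
CASTS = {"int": int, "long": int, "string": str}

def apply_mapping(record, mappings):
    """Rename and cast the fields of one record according to the mappings."""
    return {dst: CASTS[dtype](record[src]) for src, _, dst, dtype in mappings}

row = {"id": 1, "title": "Heat", "year": 1995}
print(apply_mapping(row, MAPPINGS))
```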
Enter the S3 location where the Parquet files should be stored.
Create an IAM role with the following permissions:
- S3 – read, list, and write permissions
- CloudWatch – log group and log stream creation, as well as permission to put log events
- Glue – service role
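A minimal example policy covering the S3 and CloudWatch permissions above might look like this; the bucket name is a placeholder, and the Glue service-role part is typically handled by also attaching the AWS managed policy AWSGlueServiceRole.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```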
Create the JSON job with a different source. The target should be the same folder.
The Parquet files are now generated and saved at the location below.
Create an Athena table to access the Parquet files and list your records.
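Assuming the movie data has id, title, and year columns (the column names and bucket path are placeholders, not the actual ones from this setup), the Athena table can be created with DDL along these lines:

```sql
CREATE EXTERNAL TABLE movies (
  id    int,
  title string,
  year  int
)
STORED AS PARQUET
LOCATION 's3://my-bucket/target/movies/';
```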
P.S. After a closer look, my Athena query did not support the long integer values in my Parquet files; that needed to be fixed.
After fixing that, I am able to get the integer values.
Parquet is comparatively faster than CSV and JSON due to its columnar data structure, which is why most data lakes in the industry have started using it. As long as