Categories
Analytics AWS miscellaneous

CSV, JSON conversion into Parquet

In this world of data explosion, every company working on consolidate data into common data format. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

AWS Glue

Source Data

I am using AWS Glue convert csv and json file to create parquet file. At this time I have some data in csv and some data in json format. CSV Data is stored in AWS S3 into source/movies/csv folder. JSON data is stored in AWS S3 into source/movies/json folder. All files are stored in those locations.

CSV input data

JSON input data

AWS Glue Implementation

Glue Classifier

A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was.

AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on the results that are returned from custom classifiers, AWS Glue might also invoke built-in classifiers. If a classifier returns certainty=1.0 during processing, it indicates that it’s 100 percent certain that it can create the correct schema. AWS Glue then uses the output of that classifier.

CSV Classifier

I am creating CSV Classifier. Column delimiter is “,” and quote symbols are double-quote. Will also have heading.

JSON Classifier

Create json classifier

AWS Job Studio

CSV file reading job. Point source location to S3 location where csv folder is located

Please change long datatype to integer datatype.

Enter location where your parquet file needed to be store.

Create IAM role with following permissions –

  • S3 – read, list and write permission
  • Cloud watch – Log group and log stream creation as well as log insert permission
  • Glue – Service role

Create json job and difference source. Target should be same folder.

Now Parquet file is generated and saved at below location.

Create Athena table to access parquet file and list your records.

PS. After carefully looking my Athena query does not support long integer value from my parquet that needed to be fixed.

After fixing that I am able to get integer information.

Conclusion

Parquet file is comparatively faster than csv and json due to its columnar data structure so most data lake in industry started using it. As long as

Categories
Machine Learning Pandas Python Python ML

Pandas Example’s

Pandas is python Library will be used for reading/writing large tabular dataset. Perform arithmetic operations on number data and manipulate textual data. Pandas’s Dataframes are highly used with pytorch environment.

Pandas Installation

Pandas can be installed using Anaconda or python virtual environment use following commands for different environment –

  • For Anacondas
    • conda install pandas
  • For Python virtual environment
    • pip install pandas

To import pandas in python program use.

import pandas as pd

Note : I assume that pd object is created in all my examples below

Pandas Series

Pandas series is one of the most used datatype. It is similar to Numpy array with one difference is that this series has axis labels which is treated as indexes. This can be number or string or any other python object.

Creating Series

  • Using Lists
    • To create Series using List. First create data and index list. Using those create ur pandas series. One thing to note here, data is number but index is string. I like to make is reverse but don’t get confused.
  • Using Python Dictionary
    • First create dictionary and use that dictionary to create pandas series.
  • Using NumPy
    • Creating Pandas Series using NumPy array. If Index is not defined will creating Pandas Series then automatic numeric index created starting from Zero(“0”) and incremented for each row.

Pandas Dataframe

While working with tabular format of data Pandas DataFrame is correct tool. Dataframe will help you cleaning and process your input data. With column and row indexing property and data can be retrieved easily. Each Dataframe object consist of multiple Pandas Series. When we recall any column information from Pandas Dataframe, its output is Pandas Series.

Each row is presented by row index of Dataframe. Each row present on Axis=0 where as each column is on Axis=1.

To create Dataframe we still use NumPy library. Please follow my NumPy examples webpage in case you need information on NumPy. Each DataFrame object need 3 types of data –

  • Data
  • Row id or Row no it also called as “index”
  • Column name also called as “headers”

Dataframe can be created with dictionary object with key as column name. Each array shape should be of same value

If index and column is not mentioned while creating DataFrame then default column and Index starts from Zero(0)

Example shows Pandas dataframe created with index name and column names.

To get object type from the dataframe. Use dtype function. String object is considered as object type.

To get head and tail or each dataframe use head() and tail() function. To create random array I am using numpy. To view specific number of rows use integer value in function default is 5.

To get Column names are Row names(indexes) use <DataFrame>.index and <DataFrame>.columns

To get all statistics about data for your columns

To transpose your data use <DataFrame>.T. I have total 20 rows earlier that transposes to columns.

Sorting

Sorting by row(index)

Sorting by Column(value). So the values of col2 will sorted in ascending order. Use “ascending=False” to make it in descending order.

Selecting Data

To get all data for a given column use <DataFrame>[“column Name”] . To get multiple column provide list of column names.

List of columns

To get specific rows of data use, Row id’s

To get Specific rows and column use following multi-axis <DataFrame>.loc function

To get specific scaler value use <DataFrame>.at function.

Conditional Selection

To get all rows where col4 value is greater than Zero. Any arithmetic conditional statements can be u

Groupby

Groupby function is used as aggregation function for common columns. Multiple column can be used as list while grouping.

Merge

Two dataframe can be merge together with merge function on a given rows. If row value does not present that pd.NAN will be added in the group.

Daterange

Daterange function uses period as “D” for daily , “M” for month etc. That can be used as indexes for values

Exporting Data

To CSV

To write pandas data to csv use to_csv function. If path is not specified file is saved at same location as ur notebook\python file location.

read data from csv

Conclusion

Pandas is vast topic. My objective to get you started. More pandas documentation can be viewed https://pandas.pydata.org/