In this era of data explosion, every company is working on consolidating its data into a common format. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.
Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.
Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.
I am using AWS Glue to convert CSV and JSON files into Parquet files. I have some data in CSV format and some in JSON format. The CSV data is stored in AWS S3 in the source/movies/csv folder, and the JSON data is stored in the source/movies/json folder.
CSV input data
JSON input data
AWS Glue Implementation
A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was.
AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on the results that are returned from custom classifiers, AWS Glue might also invoke built-in classifiers. If a classifier returns certainty=1.0 during processing, it indicates that it’s 100 percent certain that it can create the correct schema. AWS Glue then uses the output of that classifier.
I am creating the CSV classifier. The column delimiter is “,” and the quote symbol is the double quote. The files will also have a heading row.
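As a sketch of the same classifier in code (assuming boto3 is installed and credentials are configured; the classifier name is my own choice), the definition can be built as a request dict for Glue’s create_classifier call:

```python
# Sketch of the CSV classifier as a boto3 request (the name is hypothetical).
# Building the dict needs no AWS credentials; the commented call would send it.
csv_classifier_request = {
    "CsvClassifier": {
        "Name": "movies-csv-classifier",   # my own name, pick yours
        "Delimiter": ",",                  # column delimiter
        "QuoteSymbol": '"',                # double quote as the quote symbol
        "ContainsHeader": "PRESENT",       # the files have a heading row
    }
}
print(csv_classifier_request)

# import boto3
# glue = boto3.client("glue")
# glue.create_classifier(**csv_classifier_request)
```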
Create the JSON classifier
AWS Job Studio
CSV file reading job. Point the source location to the S3 folder where the csv data is located.
Change the long datatype to the integer datatype.
Enter the location where your Parquet files need to be stored.
Create an IAM role with the following permissions –
S3 – read, list, and write permissions
CloudWatch – log group and log stream creation, as well as log insert permission
Glue – Service role
Create the JSON job with a different source location. The target should be the same folder.
Now the Parquet files are generated and saved at the location below.
Create an Athena table to access the Parquet files and list your records.
PS: After looking carefully, my Athena query did not support the long integer value from my Parquet file; that needed to be fixed.
After fixing that, I am able to get the integer information.
Parquet is comparatively faster than CSV and JSON due to its columnar data structure, so most data lakes in the industry have started using it.
Pandas is a Python library used for reading and writing large tabular datasets. It can perform arithmetic operations on numeric data and manipulate textual data. Pandas DataFrames are heavily used in PyTorch environments.
Pandas can be installed using Anaconda or a Python virtual environment. Use the following commands for the different environments –
conda install pandas
For a Python virtual environment –
pip install pandas
To import pandas in a Python program, use:
import pandas as pd
Note : I assume the pd object is created in all my examples below.
A pandas Series is one of the most used datatypes. It is similar to a NumPy array, with one difference: a Series has axis labels, which are treated as indexes. These labels can be numbers, strings, or any other Python objects.
To create a Series from a list, first create a data list and an index list, then use them to create your pandas Series. One thing to note here: the data are numbers while the indexes are strings. It could just as well be the reverse, so don’t get confused.
Using Python Dictionary
First create a dictionary, then use that dictionary to create the pandas Series.
Creating a pandas Series using a NumPy array. If no index is defined while creating the Series, an automatic numeric index is created, starting from zero (“0”) and incremented for each row.
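A minimal sketch of all three ways to create a Series (the values and labels are my own examples):

```python
import numpy as np
import pandas as pd

# From a list, with string labels as the index (data are numbers, labels are strings)
data = [10, 20, 30]
labels = ["a", "b", "c"]
s_list = pd.Series(data, index=labels)

# From a dictionary: keys become the index, values become the data
s_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a NumPy array with no index given: a default 0, 1, 2, ... index is created
s_arr = pd.Series(np.array([10, 20, 30]))
print(s_list, s_dict, s_arr, sep="\n")
```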
When working with data in tabular format, the pandas DataFrame is the correct tool. A DataFrame helps you clean and process your input data. With its column and row indexing, data can be retrieved easily. Each DataFrame object consists of multiple pandas Series: when we select any column from a DataFrame, the output is a pandas Series.
Each row is represented by a row index of the DataFrame. Rows lie along axis=0, whereas columns lie along axis=1.
To create a DataFrame we still use the NumPy library (please see my NumPy examples webpage if you need background on NumPy). Each DataFrame object needs 3 types of data –
Row IDs or row numbers, also called the “index”
Column names, also called “headers”
The data values themselves
A DataFrame can be created from a dictionary object with keys as the column names. Each array must be of the same length.
If the index and columns are not mentioned while creating the DataFrame, the default columns and index start from zero (0).
The example shows a pandas DataFrame created with index names and column names.
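A small sketch of DataFrame creation (the column and index names here are my own examples):

```python
import pandas as pd

# Each dictionary key becomes a column; every array must have the same length
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4.0, 5.0, 6.0]},
    index=["r1", "r2", "r3"],   # row labels; omit to get the default 0,1,2 index
)

# With no index/columns given, the defaults start from 0
default_df = pd.DataFrame({"col1": [1, 2, 3]})

# Selecting a single column returns a pandas Series
col = df["col1"]
print(df)
```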
To get the object types from the DataFrame, use the dtypes attribute. Strings are treated as the object type.
To get the head and tail of a DataFrame, use the head() and tail() functions (to create a random array I am using NumPy). Pass an integer to view a specific number of rows; the default is 5.
To get the row names (indexes) and column names, use <DataFrame>.index and <DataFrame>.columns.
To get summary statistics about the data in your columns, use describe().
To transpose your data, use <DataFrame>.T. I had 20 rows earlier; those transpose into columns.
Sorting by row(index)
Sorting by column (value): the values of col2 will be sorted in ascending order. Use “ascending=False” for descending order.
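The inspection and sorting functions above can be sketched together (the random data is seeded so the shapes are repeatable; the column names are my own):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is repeatable
df = pd.DataFrame(rng.integers(0, 100, size=(20, 3)),
                  columns=["col1", "col2", "col3"])

first5 = df.head()        # first 5 rows by default; head(3) for 3 rows
last2 = df.tail(2)        # last 2 rows
wide = df.T               # the 20 rows become 20 columns

by_index = df.sort_index(ascending=False)  # sort by the row index, descending
by_value = df.sort_values("col2")          # col2 ascending; ascending=False flips it
print(by_value.head())
```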
To get all data for a given column, use <DataFrame>[“column name”]. To get multiple columns, provide a list of column names.
To get specific rows of data, use the row IDs.
To get specific rows and columns, use the multi-axis <DataFrame>.loc function as follows.
To get a specific scalar value, use the <DataFrame>.at function.
To get all rows where the col4 value is greater than zero, use the following. Any arithmetic conditional statement can be used.
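The selection methods above, sketched on a small example frame (the names and values are my own):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": [4, 5, 6],
                   "col4": [-1, 0, 7]},
                  index=["r1", "r2", "r3"])

one_col = df["col1"]                        # single column -> Series
two_cols = df[["col1", "col2"]]             # list of names -> DataFrame
rows = df.loc["r1":"r2", ["col1", "col2"]]  # multi-axis selection by label
scalar = df.at["r3", "col2"]                # fast scalar access
positive = df[df["col4"] > 0]               # boolean filter: rows where col4 > 0
print(positive)
```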
The groupby function is used to aggregate over common columns. Multiple columns can be passed as a list while grouping.
Two DataFrames can be merged together with the merge function on given key columns. If a key value is not present on one side, NaN is filled in for that group.
The date_range function takes a frequency such as “D” for daily, “M” for monthly, etc. The result can be used as the index for your values.
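A compact sketch of groupby, merge, and date_range (the data is made up for illustration):

```python
import pandas as pd

# groupby aggregates over a common column (pass a list to group by several columns)
sales = pd.DataFrame({"team": ["a", "a", "b"], "amount": [10, 20, 5]})
totals = sales.groupby("team")["amount"].sum()

# merge joins two DataFrames on a key column; keys missing on one side become NaN
left = pd.DataFrame({"key": ["k1", "k2"], "x": [1, 2]})
right = pd.DataFrame({"key": ["k2", "k3"], "y": [3, 4]})
merged = pd.merge(left, right, on="key", how="outer")

# date_range builds a datetime index; freq="D" is daily
dates = pd.date_range("2023-01-01", periods=3, freq="D")
ts = pd.Series([1.0, 2.0, 3.0], index=dates)
print(totals, merged, ts, sep="\n")
```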
To write pandas data to CSV, use the to_csv function. If a path is not specified, the file is saved in the same location as your notebook or Python file.
NumPy is a package for scientific computing. It provides a library for multidimensional arrays and can perform many mathematical operations on large datasets. It is also helpful for sorting large datasets and performing I/O operations. This Python library can be used as a random data generator for simulations.
The main NumPy object is the “ndarray”. This object can be a single-dimensional or multi-dimensional array of the same data type. An “ndarray” can be compared to a Python list, but it works quite differently. An “ndarray” is a fixed-size object: if you want to change the size or shape of an ndarray, NumPy creates a new object and deletes the old one.
NumPy Use cases
Importing large datasets
Performing mathematical computation over large datasets
Performing sorting efficiently
Generating random data for AI workflows
You can use Anaconda or a Python environment to install NumPy. Installation is very simple; just enter the following command
pip install numpy
To check that NumPy installed successfully, try importing the module from the Python CLI
>>> import numpy
NumPy Basic Array
A single-dimensional array is created using the following command. First import numpy as np (it is generally imported as np, but you can use another name), then create your array.
np.zeros() and np.ones()
To create an array of all zeros or all ones, use the following commands; just provide the length of the array. I am using a Jupyter notebook for simplicity, but the same commands can be run from the Python CLI or a Python script.
By default the zeros() and ones() functions create arrays of float type, but you can create integer arrays by passing the argument dtype=np.int64.
Create a sequential array using the “arange” function. This creates integers starting from 0 (zero), incremented by 1 (one).
The arange function can be used with a step argument. If we want to create an array from 1000 (included) to 10000 (excluded), we can use the following command. The first argument is the low number, which is included; the second is the high number, which is excluded; and the step controls how far we jump. To get even numbers, step by 2.
Use np.linspace() to create an array with values that are spaced linearly over a specified interval. This function creates equally spaced values between the first and second arguments; the first and last numbers are both included in the interval.
Create a random array between zero (0) and one (1). This creates a float array.
Create a random integer array. The low number is included and the high number is excluded. Size is the array size; it can be a single-dimensional or multi-dimensional array.
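The creation functions above can be sketched together (I am using the newer np.random.default_rng API here rather than the legacy np.random functions; the seed is my own choice):

```python
import numpy as np

z = np.zeros(3)                    # [0., 0., 0.] (float64 by default)
o = np.ones(3, dtype=np.int64)     # integer ones via dtype=
seq = np.arange(5)                 # 0..4, step 1
evens = np.arange(1000, 10000, 2)  # 1000 included, 10000 excluded, step 2
lin = np.linspace(0, 10, 5)        # 5 evenly spaced values, both ends included

rng = np.random.default_rng(42)    # seeded random generator
floats = rng.random(4)             # floats in [0, 1)
ints = rng.integers(1, 7, size=(2, 3))  # low included, high excluded; any shape
print(lin, ints, sep="\n")
```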
NumPy N-dimensional Array
Create two dimensional array using numpy.
Create an N-dimensional integer array –
The reshape command changes an array’s shape. Ensure the array has the same number of elements as the target shape. For example, to create a 4×3 array the source array must have exactly 12 elements; for a 4×4 target array the source needs 16 elements.
Getting shape and size of Array
The following are attributes of the array, not functions.
To get the total number of elements in the array:
To get the current shape of an array. This comes in very handy when array or matrix multiplication is performed.
To get the current array dimension:
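The reshape rule and the three attributes above, sketched on one array:

```python
import numpy as np

a = np.arange(12)     # 12 elements, so it can become 4x3 (or 3x4, 2x6, ...)
b = a.reshape(4, 3)   # element count must match: 4 * 3 == 12

# These are attributes, not functions -- no parentheses
total = b.size        # total number of elements
shape = b.shape       # current shape as a tuple
dims = b.ndim         # number of dimensions
print(total, shape, dims)
```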
Sorting and Joining Arrays
np.sort() is the simplest way to sort any NumPy array.
np.concatenate is used to join arrays. You can only concatenate arrays of compatible dimensions and shapes –
2-D concatenation
To add an element at the end of an array
To delete an element from an array –
To reverse an array, use np.flip().
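The sorting and joining functions above in one sketch (the arrays are my own examples):

```python
import numpy as np

a = np.array([3, 1, 2])
sorted_a = np.sort(a)                   # returns a sorted copy

b = np.array([4, 5])
joined = np.concatenate((sorted_a, b))  # shapes must be compatible

# 2-D concatenation along axis=0 stacks the rows
stacked = np.concatenate((np.ones((2, 2)), np.zeros((2, 2))), axis=0)

appended = np.append(joined, 6)         # add at the end (returns a new array)
removed = np.delete(appended, 0)        # delete the element at index 0
reversed_a = np.flip(appended)          # reverse the array
print(reversed_a)
```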
Indexing and slicing
NumPy arrays work much the same as Python lists for indexing. Please find examples below; indexes start from zero.
Store a sliced array into a new array to use the slice later –
To update the value of an element, use its index –
Indexing for multidimensional arrays works the same as 1-D; just use a tuple with one index per dimension. For example, for 2-D use arr_name[3, 4] to get the 5th element of the 4th row.
Normally, if not reshaped, the output of selecting a column is a single flat array. Use the reshape command to get the output properly stored in the respective row/column format.
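The indexing and slicing points above can be sketched as:

```python
import numpy as np

a = np.arange(10)
sliced = a[2:5]                # elements at indexes 2, 3, 4 (like a Python list)
copied = a[2:5].copy()         # slices are views; copy() keeps an independent array
a[0] = 99                      # update an element through its index

m = np.arange(20).reshape(4, 5)
val = m[3, 4]                  # 4th row, 5th element (indexes start at zero)
col = m[:, 1]                  # a column comes back as a flat 1-D array...
col2d = m[:, 1].reshape(4, 1)  # ...reshape it to keep the column layout
print(val, col2d, sep="\n")
```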
Arithmetic Manipulation of array Elements
The main use case for NumPy is arithmetic manipulation, and NumPy makes this very easy.
In the example below, we add 5 to each element in the array. The operation is applied to every element. It does not change the existing array; if you want to keep the change, perform an assignment. Again, as I mentioned earlier, this will allocate new memory.
NumPy also makes boolean operations easy. A boolean operation creates an array of all elements that satisfy the condition. In this case, b[b > 5] returns all such elements of the array b.
To get the indices where a condition is satisfied, use np.nonzero().
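The three points above in one sketch (the array values are my own):

```python
import numpy as np

b = np.array([1, 4, 6, 8, 2])

plus5 = b + 5            # adds 5 to every element; b itself is unchanged
b2 = b + 5               # assign the result to keep it (new memory is allocated)

big = b[b > 5]           # boolean indexing: the elements satisfying the condition
idx = np.nonzero(b > 5)  # the indices where the condition holds
print(plus5, big, idx, sep="\n")
```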
Arithmetic Manipulation of Array
Adding and subtracting arrays only works on arrays of the same shape (or shapes that can be broadcast together). If the shapes are incompatible, it will not work.
As long as the columns match, a one-row array will be broadcast and added to every row.
Matrix axis definition
Each array has axes. In a two-dimensional array –
Aggregating with axis=0 collapses the rows and gives one result per column
Aggregating with axis=1 collapses the columns and gives one result per row
As in the example below, if you select axis=0 with max(), it returns the maximum value of each column.
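The broadcasting and axis rules above, sketched on a small matrix (my own values):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# Same-shape addition works element by element
same = m + m

# Broadcasting: a one-row array whose columns match is added to every row
row = np.array([10, 20, 30])
shifted = m + row

# axis=0 aggregates down each column; axis=1 aggregates along each row
col_max = m.max(axis=0)   # max of every column
row_max = m.max(axis=1)   # max of every row
print(col_max, row_max, sep="\n")
```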
NumPy is a very good library for large datasets and a useful tool for data manipulation with minimal code. We could perform the same tasks in plain Python, but with the NumPy library it is easy.
An old colleague of mine reached out to me about creating a random string within CloudFormation. If one has not used it in the past, it can get tricky. I wish Amazon had created a function for this, but then how would I have showcased my love for SERVERLESS in this blog? I will be using a Lambda function to create random strings, and a CloudFormation custom resource to call that Lambda function.
The only parameter needed for a random string is the length of the string. By default, this template creates a 6-character string.
Next come the Lambda function and its execution role. I have created this Lambda function in Python. Since it is created once and used many times in the future, you don’t need any Python expertise; just use this function as-is and that will suffice.
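For reference, a minimal sketch of the core random-string logic such a Lambda could wrap (this is my own sketch, not the deployed function; the output key names are modeled on the attributes the template exposes):

```python
import secrets
import string

def random_strings(length: int = 6) -> dict:
    """Sketch of the core logic a randomizer Lambda handler could wrap.

    Returns a mixed-case alphanumeric string plus upper- and lower-case
    variants, defaulting to 6 characters like the template.
    """
    alphanum = string.ascii_letters + string.digits
    mixed = "".join(secrets.choice(alphanum) for _ in range(length))
    return {
        "RandomString": mixed,
        "Upper_RandomString": mixed.upper(),
        "Lower_RandomString": mixed.lower(),
    }

print(random_strings(10))
```

The deployed Lambda would additionally parse the custom resource event and send a cfn-response, which is omitted here.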
This template exports the Lambda function ARN (RandomizerLambdaArn). This ARN will be used in the ServiceToken section of the custom resource.
Calling Randomizer in your template (CreateBucketTemplate.yaml)
This is how to call the randomizer function in a template. Copy-paste this resource into any template where a random string is needed.
To get the random string, use the following attributes –
RandomizerLambda.RandomString for random string
RandomizerLambda.Lower_RandomString for lower case random string
RandomizerLambda.Upper_RandomString for upper case random string
The Lambda function is created, along with its execution role and policy (you can use an existing role if you need to reduce the role count). The Lambda function creates an AWS CloudWatch log group and log stream for its metrics and output information, which is very useful. One can pass parameters like project, stack name, or application name into the stack output, which can then be tracked for accounting or analysis purposes.
The Randomizer stack has a default input length of 6 characters, but it can be changed per stack.
The outputs include variants of the random string: alphanumeric characters, numeric characters, or just lower-case alphabets (used for S3 bucket names).
Validating the bucket creation template.
Creating a bucket using the template. I am passing a parameter value of 10, which simply means I need a 10-character string for the bucket name.
The Lambda function creates a CloudWatch log group and log stream. To delete them, go to CloudWatch -> Log groups -> filter by “randomizer” to find the appropriate log group, select the checkbox, go to Actions, and click Delete.
Use this randomizer template whenever you need a random string. It is very useful for AMI names, Auto Scaling group names, and S3 bucket names.
PS: Security is not the focus of this blog. The intention is purely to kickstart my builders.
AWS Lambda is a serverless managed service: Lambda code runs without managing any servers or provisioning any hardware. This post is not about Lambda features but about Lambda endpoint configuration, and I intend to keep it that way.
Endpoints are used to connect to AWS services programmatically. The connection uses “AWS PrivateLink“. If an organization does not want to expose its VPC to the public internet to communicate with AWS Lambda, an endpoint comes to the rescue. An endpoint is a regional service.
Endpoints come in two types –
I am using an interface endpoint in my design, and I have created two Lambda functions for testing. When the endpoint is created and configured, we need to associate subnets with it. The endpoint creates a network interface and allocates an IP in every subnet it is associated with. This interface/IP is used for invoking Lambda; both Lambda functions use the same interface IP for communication.
Note : This is only one use case for endpoints. The main use case is sharing a Lambda function (or any other service) as software as a service: clients can share the service across AWS accounts, or even across organizations.
The design above could also be implemented by defining a VPC interface while creating each Lambda function. The drawback there: if you have 10 Lambda functions to share in the VPC, you need 10 IPs from the subnet, which is overkill. With endpoints you need just 1 IP to invoke all the Lambda functions in a given AWS account.
Download the “lambda-endpoint-connection.yaml” file from the link below –
Create a user with access to run the CloudFormation template.
Install the AWS CLI and configure your AWS environment. I am using the “us-west-2” region; if you want to use a different region, select the appropriate AMI from the parameters (I have added AMIs for all regions in the parameters).
Open the file “lambda-endpoint-connection.yaml” in any text editor and change the following to your system IP. The default value allows SSH from all addresses.
Create and download an EC2 instance key pair, and update the key pair name in this field. Download the key file to the same location as the CloudFormation template.
Add the correct VPC CIDR information; if not, the default will be used.
The CloudFormation template creates the following resources (not all resources are mentioned in the list) –
VpcName : VPC for this test
Subnet1 : A subnet that is totally private except for SSH
AppInternetGateway : Internet gateway used just to connect my system to the EC2 instance.
AppSecurityGroup : Allows port 22 from my system to EC2 and allows all communication within the VPC
EC2AccessLambdaRole : This role allows the EC2 instance to invoke Lambda functions.
LambdaRole : This role allows the Lambda function to create log groups in CloudWatch, so print output can be checked there
RootInstanceProfile : Instance profile for the instance; uses EC2AccessLambdaRole for assuming permissions
EC2Instance : Instance used to invoke the Lambda functions
LambdaFunction : First lambda function
SecondLambdaFunction : Second Lambda function
LambdaVPCEndpoint : Lambda VPC endpoint
Run the following command to validate that the template is working fine
This creates the stack in the background; it takes a couple of minutes. Check that your stack was created successfully in the Events section of CloudFormation.
Ensure Stack is created successfully.
Stack outputs are saved as key-value pairs. Take note of the instance PublicIP; we need it to SSH into the EC2 instance to check Lambda access. Also take note of the FirstLambdaFunction and SecondLambdaFunction values; we need them to invoke the Lambda functions.
Ensure the two Lambda functions were created successfully, and keep a note of both function names; we need them when invoking from our EC2 instance.
The VPC endpoint configuration is created. The EC2 instance connects internally via a private DNS name, derived as <servicename>.<region_name>.amazonaws.com. In our case the service name is “lambda”.
The endpoint assigns an IP address in each associated subnet. In our case we associated just one subnet, so it assigns a single IP address. IP 10.1.1.78 is part of the subnet the endpoint is associated with.
Assign a security group to the endpoint. If you ever need to stop EC2 instances from accessing the Lambda functions, this security group can be used. An IAM policy can also be used to restrict who may invoke the Lambda functions.
Policy definition: full access allows any user or service to invoke the Lambda functions. I highly recommend restricting access from services and EC2 instances via the endpoint policy and security group.
The endpoint creates a network interface in the VPC environment, and the IP is assigned to this network interface.
The subnet IP count also shows that the available IPs are reduced for the /24 subnet.
The route table has a route through the internet gateway so my system can connect via SSH.
The security group only allows access to port 22 from the outside world; all ports are open for inbound and outbound traffic within the VPC.
Log in to the newly created instance using the same key pair created during the prerequisite phase –
Configure the AWS CLI with region “us-west-2”, or select any region you like.
Check the list of functions using “aws lambda list-functions“.
To invoke a function, use the following command. We don’t have access to any external HTTPS connection, but we are still able to invoke the Lambda function.
The Lambda endpoint is a new feature that connects to Lambda via AWS PrivateLink over the AWS internal network. Again, security is elevated, as there is no need to open your VPC to external traffic for Lambda execution. It is a great way to use Lambda as a function-as-a-service, or to use Lambda across multiple AWS accounts in an organization.
Networking is a big challenge, with growing demands, diversified environments, and datacenters across the world; the limit is just imagination. Enterprises work across different sites and geographies, but the common vein that joins those environments is the network. With growing demands, it is getting complicated to manage routes between sites. AWS Transit Gateway (TGW) was born to make network engineers’ lives easy. TGW helps with the following features –
Connect multiple VPC networks together within a given account
Connect multiple VPC networks across multiple AWS accounts
Inter-region connectivity across multiple VPCs
Connect an on-premise datacenter with VPC networks via VPN or Direct Connect
Connect multiple cloud environments via VPN using BGP.
Benefits of Transit Gateway
Easy connectivity : AWS Transit Gateway is a cloud router and makes network deployment easy. Routes can be propagated into the environment easily after adding a new network to the TGW.
Better visibility and control : AWS Transit Gateway Network Manager is used to monitor Amazon VPCs and edge locations from a central location. This helps to identify and react to network issues quickly.
Flexible multicast : TGW supports multicast, which helps send the same content to multiple destinations.
Better security : Amazon VPC and TGW traffic always remains on the Amazon private network. Data is encrypted and also protected against common network exploits.
Inter-region peering : AWS Transit Gateway inter-region peering allows customers to route traffic across AWS Regions using the AWS global network. Inter-region peering provides a simple and cost-effective way to share resources between AWS Regions or replicate data for geographic redundancy.
Transit Gateway Components
There are 4 major components of a transit gateway –
Attachments : Attach a network component to the gateway. Each attachment is added to a single route table. The following network resources can be connected to a TGW –
One or more VPCs
An AWS Direct Connect gateway
A peering connection with another transit gateway
A VPN connection to an on-prem or multi-cloud network
Transit gateway route table : A default route table is created, and a TGW can have multiple route tables. A route table defines the boundary for connections. Attachments are added to route tables: a given route table can have multiple attachments, whereas an attachment can only be added to a single route table.
A route table includes dynamic and static routes. It determines the next hop for a given destination IP.
Association : To attach an attachment to a route table we use an association. Each attachment is associated with a single route table, but a route table can have multiple attachments.
Route propagation : Every VPC and VPN associated with a route table can dynamically propagate routes to it. If a VPN is configured with the BGP protocol, routes from the VPN network are propagated to the transit gateway automatically. For a VPC, one must create static routes to send traffic to the transit gateway. Peering attachments do not dynamically add routes to the route table, so we need to add static routes.
We are going to test the following TGW scenarios. In this architecture design I am creating a “management VPC” that will be shared across the entire organization. This VPC can be used for common organization services like Active Directory, DNS, DHCP, or NTP.
Project_VPC1 and Project_VPC2 will be able to communicate with each other and with management_vpc. Private_VPC is an isolated network (a private project) and will not be able to communicate with the project VPCs, but should be able to communicate with management_vpc.
Following is architecture for this design –
AMI ID – the “ami id” depends upon the region
Instance role – we don’t need this one explicitly, as we are not accessing any services from the instances.
Instance key pair : Create an instance key pair and add the key pair name to the Parameter Store. The parameter name should be “ec2-keypair” and the value should be the name of your key pair.
I am using Terraform for the implementation. The following is the Terraform output –
A total of 47 (not 45) devices are configured.
4 VPCs created
4 subnets created. If you observe, the available IPs are 1 less, because one IP in each subnet is used by the transit gateway for data transfer and routing.
4 route tables created. Each route table uses the transit gateway as the target for the other VPC networks.
Security groups – these are the most important configuration in the real world. For DNS you would allow port 53; for an AD server, open the appropriate ports. In my case, I am using ping to check communication.
The private VPC will only be able to communicate with the management network.
The project VPCs will be able to communicate with the other project VPC, but not with the private VPC.
Note : We don’t have to explicitly block traffic in the project VPCs; it is blocked by the transit gateway because we are not going to add the propagation.
The transit gateway is created. Remember, if ASN 64512 is already used by an existing VPN, it can be changed via a parameter.
DNS support enables reaching the cloud with DNS names rather than IP addresses; certainly a useful feature.
A transit gateway can be peered with another transit gateway for inter-region data transfer between VPCs over the Amazon private network. It is advisable to disable auto-accept of shared attachments for security reasons.
A default route table is created, and any VPC not explicitly attached elsewhere is attached to the default route table.
Each VPC needs to be added to the transit gateway as an attachment.
Route tables are created as per the segregation needed in the environment. In my case I am creating 3 route tables for 4 VPCs. In an enterprise environment we generally create 5 route tables, with separate route tables for the backup and security environments.
Since Project VPC1 and Project VPC2 have the same network requirements, I added them to the same route table.
Management Route table
The management route table has the management VPC attachment. Propagations are added from every network that needs to communicate with the management VPC. In this case the management VPC should be able to communicate with all other networks, so propagations are added from all of them; all routes are then propagated automatically.
Private Network Route table
The private VPC is attached to the private network route table. The private network should be able to communicate with the management VPC, so a propagation is added for the management VPC. The route for the management VPC is also added automatically after propagation.
Project Route table
The project route table has attachments from both project VPCs. Propagations are added for the other project VPC network and the management network, and the respective routes are added.
The management server is able to ping instances in both the private and project environments.
The project VPCs can talk to the management VPC and the other project VPCs, but not to the private VPC.
The private VPC is able to talk to the management VPC but cannot communicate with any project VPCs. That keeps the private VPC private within the organization.
Delete terraform configuration
To delete the Terraform configuration, run the following and ensure all resources are destroyed
./terraform destroy --auto-approve
A transit gateway is a tool to connect multiple VPCs, VPNs, and Direct Connect networks so they communicate over a private network. A transit gateway can also be used to isolate network traffic. This makes routing comparatively easy.
SD-WAN partner solutions can be used to automate adding new remote sites into the AWS network.
Multi-cloud architecture is a smarter way to utilize public, private, and hybrid environments. All enterprises want the option to choose multiple cloud providers for their use cases. Multi-cloud is nowadays very popular with enterprise and mid-level companies. The following are benefits and considerations when selecting a multi-cloud environment.
Redundancy : Having more than one cloud provider helps with redundancy. If a particular region or service of a given cloud provider fails, we can configure redundancy through another cloud provider.
Scalability : This point may not be that important but is definitely worth considering. Sometimes it is a lengthy process to increase resource limits for a cloud account; that can be safeguarded by having multiple cloud providers.
Cost : Cost can also be viewed as competition. Some services are cheaper in one cloud environment, some in another. This helps determine the cheapest solution for the enterprise.
Features : This is the prime reason for a multi-cloud environment. Having multiple clouds gives you the flexibility to choose the environment best suited to the application’s needs rather than just whatever is available at the time.
Vendor lock-in : Some vendors have a lock-in period for specific services. Enterprises mostly wish to avoid this lock-in. With multiple clouds, we have more options for choosing the right vendor.
Nearest termination point / customer reach : Using a regional cloud provider helps the enterprise be near its datacenter or users. This improves performance and reduces latency. On top of that, each cloud provider’s global reach is different, so implement the cloud provider whose reach is better for the end user.
This procedure can be implemented for any VPN connection using the BGP protocol. I am using dynamic routing, but static routing can be used as well. Below is the architecture diagram for my VPN connectivity.
Download the Terraform software.
AWS and Google accounts should be configured for Terraform access.
I am using the “us-west-2” region for AWS and the “us-west1” region for Google. If you are planning to use a different region, select the appropriate instance image ID and update it.
Create an EC2 instance key pair and add the key pair name information to the Parameter Store.
Change the BGP IPs if needed. I am using the defaults; these IPs should work as long as they are not already used in your existing environment.
The GCP VPN gateway IP information matches the customer gateway. Forwarding rules are mandatory for tunnel creation; Terraform will create those rules automatically.
The AWS virtual private gateway gets the next ASN number, and it is advisable to use the next number. As a best practice, I use odd numbers for one provider (e.g., GCP) and even numbers for another (e.g., AWS). This configuration also works with on-premise network devices; there you define precedence, with all on-premise devices getting lower ASN numbers, and so on.
Attach your site-to-site VPN connection to the virtual private gateway and the customer gateway. This creates one VPN connection between the customer gateway (the GCP VPN gateway) and the AWS virtual private gateway. I am using “ipsec.1” as the connection type.
This also creates two tunnels. I am using dynamic routing; as of this writing, BGP has a limitation of 100 subnets that can be exchanged over the VPN. The tunnel information is as follows –
Tunnel IP address issue
Tunnels are configured properly but in down stage because corresponding GCP tunnels are not created. I tried to create those tunnel using Terraform but issue is happening that both AWS and GCP were taking own ip as first ip(169.254.1.9) from 169.254.1.8/30 subnet. And second ip will be allocated as peer ip(169.254.1.10). On contrary, we have AWS ip as first ip and second ip in subnet should be used by GCP cloud router.
The correct BGP IPs for GCP are:
Tunnel 1 – Cloud router IP 169.254.1.10 (second IP in the subnet) and BGP peer IP (from AWS) = 169.254.1.9 (which is correctly configured)
Tunnel 2 – Cloud router IP 169.254.1.14 (second IP in the subnet) and BGP peer IP (from AWS) = 169.254.1.13 (which is correctly configured)
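The pattern generalizes: in each /30 link subnet, AWS takes the first usable address and the GCP cloud router takes the second. A minimal Python sketch (the function name is mine, not from the Terraform code) computes the expected pair:

```python
import ipaddress

def tunnel_inside_ips(cidr: str):
    """Return (aws_ip, gcp_ip) for a /30 inside-tunnel subnet.

    By the convention used here, AWS takes the first usable address
    and the GCP cloud router takes the second.
    """
    hosts = list(ipaddress.ip_network(cidr).hosts())
    if len(hosts) != 2:
        raise ValueError("expected a /30 link subnet with exactly 2 usable IPs")
    return str(hosts[0]), str(hosts[1])

# Tunnel 1: AWS 169.254.1.9, GCP cloud router 169.254.1.10
print(tunnel_inside_ips("169.254.1.8/30"))
# Tunnel 2: AWS 169.254.1.13, GCP cloud router 169.254.1.14
print(tunnel_inside_ips("169.254.1.12/30"))
```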
Create Tunnels in GCP
Now create two tunnels on the GCP VPN gateway with the following configuration –
Remote Peer IP address : 188.8.131.52 – value of the Terraform output “aws_tunnel1_public_address”
IKE Version = 1
IKE pre-shared key = copy the value of the “vpn_sharedkey_aws_to_gcp_tunnel1” parameter from the AWS parameter store. Note: do not copy trailing spaces.
Cloud Router = gcp-cloud-router
BGP session Information –
bgp name = bgp1
peer ASN = 65002
Cloud router BGP IP = 169.254.1.10 value of “aws_tunnel1_inside_gcp_address” from terraform output
BGP peer IP = 169.254.1.9 value of “aws_tunnel1_inside_aws_address” from terraform output
Perform the same steps on tunnel-dynamic2 with the following details –
Remote Peer IP address : 184.108.40.206 – value of the Terraform output “aws_tunnel2_public_address”
IKE Version = 1
IKE pre-shared key = copy the value of the “vpn_sharedkey_aws_to_gcp_tunnel2” parameter from the AWS parameter store. Note: do not copy trailing spaces.
Cloud Router = gcp-cloud-router
BGP session Information –
bgp name = bgp2
peer ASN = 65002
Cloud router BGP IP = 169.254.1.14 value of “aws_tunnel2_inside_gcp_address” from terraform output
BGP peer IP = 169.254.1.13 value of “aws_tunnel2_inside_aws_address” from terraform output
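The two tunnel definitions differ only in which Terraform outputs they consume. As an illustration (the dictionary keys are mine, not gcloud or Terraform argument names), the settings can be derived programmatically:

```python
def gcp_tunnel_config(n: int, tf_outputs: dict) -> dict:
    """Assemble the settings for GCP tunnel n (1 or 2) from the
    Terraform outputs named in the steps above.

    The dict keys are illustrative labels for the console fields,
    not actual gcloud/Terraform argument names.
    """
    return {
        "remote_peer_ip": tf_outputs[f"aws_tunnel{n}_public_address"],
        "ike_version": 1,
        "shared_key_parameter": f"vpn_sharedkey_aws_to_gcp_tunnel{n}",
        "cloud_router": "gcp-cloud-router",
        "bgp_name": f"bgp{n}",
        "peer_asn": 65002,
        "cloud_router_bgp_ip": tf_outputs[f"aws_tunnel{n}_inside_gcp_address"],
        "bgp_peer_ip": tf_outputs[f"aws_tunnel{n}_inside_aws_address"],
    }
```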
Upon changing this configuration, both tunnels should be up and running in the GCP and AWS environments. Try refreshing the page if the status has not changed.
This completes our network connectivity between the AWS and GCP environments.
To test, I am going to log in to my AWS instance with the key name defined in the “parameterstore”. Use the following IP from the Terraform output.
We have allowed the ICMP protocol (for ping) and the “ssh” port from the AWS environment to the GCP environment, so we will try to ping the GCP instance's private IP from the AWS instance's private IP address.
Voila. However, I could not log in to GCP because I had not copied the instance JSON key file to the EC2 instance, which SSH needs in order to authenticate correctly.
GCP instance access on external IP
The ping test from the AWS EC2 instance to the GCP instance's public IP fails as expected, for two reasons: we do not have an internet gateway set up on the GCP VPC, and we have not allowed ICMP and ssh from the outside world in the firewall.
Test is successful.
Deletion of environment
Since we created the GCP tunnels separately, we need to delete those tunnels before destroying the infrastructure with Terraform.
Go to GCP > VPN > Cloud VPN Tunnels
Select both newly created tunnels and click “Delete”
Once the tunnels are deleted, run the following command from the Terraform environment –
./terraform destroy -auto-approve
Make sure all 25 resources are deleted.
Multi-cloud is the new normal, and private network connectivity is something everyone wants. I have given an example with compute instances, but this can be extended to a multi-tier architecture. Get the best of both worlds by implementing this solution.
Terraform is open-source software managed by HashiCorp, used for infrastructure as code.
Terraform manages external resources (such as public cloud infrastructure, private cloud infrastructure, network appliances, software as a service, and platform as a service) with “providers”. HashiCorp maintains an extensive list of official providers, and Terraform can also integrate with community-developed providers. Users interact with Terraform providers by declaring resources or by calling data sources. Rather than using imperative commands to provision resources, Terraform uses declarative configuration to describe the desired final state. Declarative configuration means you write code describing the state your system should be in after the run completes; if some resources already exist, a Terraform run will only create or modify the resources needed to reach that final state.
Once a user invokes Terraform on a given resource, Terraform performs CRUD (Create, Read, Update, Delete) actions on the user's behalf to reach the desired state. The infrastructure code can be written as modules, promoting reusability and maintainability.
We are taking pictures every day, and the image-storage industry is spreading into our lifestyle; massive numbers of images keep being added daily. In this story, I present a tool to search for images of a given object or celebrity, somewhat like Google Images. Don't get me wrong: this is nowhere near Google Images, which crawls web links. This tool only searches its own object store.
Images are copied into an S3 bucket. I am using external tools to copy the images; this can be anything, such as the S3 CLI or a simple AWS SDK call. The S3 state-change event triggers a Lambda function that performs image-recognition analysis. I perform two types of analysis: a general analysis of the environment/objects, and a celebrity analysis. Once the analysis is performed, the data is stored in DynamoDB, which uses the “keyname” from S3 as the primary key for the images. All labels generated by image recognition are stored as attributes on the newly created item.
An API gateway is used to search for images containing a given value or celebrity. It triggers a Lambda function that generates a pre-signed URL for each matching image and delivers it to the client. The pre-signed URL expires in 10 minutes if the user does not download the images.
myregion = region name where the whole environment is set up. A multi-region setup would need configuration changes with the load balancer.
imagedb = Dynamodb table name
Create a DynamoDB table with a primary key of type string named “s3key”. The “s3key” attribute will store the image's S3 key name.
s3bucket = S3 bucketname
Create the S3 bucket named in the parameter store. Create an “/image” folder where all images will be copied.
Create two IAM roles. The first IAM role grants read/write access to DynamoDB, the log stream, image recognition, and S3; the function (image_process_function.py) will be assigned this role. The policy information is below. I am using AWS managed policies for simplicity, but be sure to use a role with the minimum necessary access. Use the following AWS managed policies —
The second role, used by the second Lambda function (search_images_from_db_function.py), reads the DynamoDB database for the matching images and key names. The following AWS managed policies should be added to this role —
Create an empty DynamoDB table named “imagerek” to store all label information. The primary key for this table must be named “s3key”; if it is not, this solution will not work.
The image function is triggered after images are uploaded to S3. The function performs two image-recognition operations. The first verifies and labels all objects discovered in the image (Python function definition – “rek_labels”).
The second part of the function checks the image for any celebrities present (Python function definition – “rek_celebrities”).
After gathering this information, the function adds it to the DynamoDB table specified in the parameter store. The primary key for each image is its “keyname” from the S3 bucket.
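A simplified sketch of how the item might be assembled from the S3 event and the two Rekognition responses; the attribute names besides “s3key” are illustrative, not necessarily the exact schema used here:

```python
from urllib.parse import unquote_plus

def build_item(s3_event: dict, labels: dict, celebrities: dict) -> dict:
    """Build the DynamoDB item for one uploaded image.

    `labels` and `celebrities` mimic the shape of Rekognition's
    detect_labels / recognize_celebrities responses; the attribute
    names other than "s3key" are illustrative.
    """
    record = s3_event["Records"][0]["s3"]
    key = unquote_plus(record["object"]["key"])  # S3 keys arrive URL-encoded
    return {
        "s3key": key,  # primary key, as required by the table design
        "labels": [l["Name"] for l in labels.get("Labels", [])],
        "celebrities": [c["Name"] for c in celebrities.get("CelebrityFaces", [])],
    }
```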
Lambda function (search_images_from_db_function.py)
The second Lambda function is used to search for images; its input is provided by the API gateway. Once the inputs are received, the DynamoDB database is searched for the given keywords.
Once the matching file key names are found, the same function creates a “pre-signed” URL for each image and sends those links back to the API gateway as an HTML page.
The images' pre-signed URLs are sent back as an HTML page displayed by the API gateway. In a real-life scenario, the images would be processed and presented by an application/web layer.
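A minimal sketch of the HTML response, assuming the Lambda simply embeds each pre-signed URL in an `<img>` tag (the real function's markup may differ):

```python
import html

def build_image_page(urls):
    """Render a minimal HTML page of <img> tags for the pre-signed URLs.

    A simplified sketch of what the search Lambda might return; the
    actual markup in search_images_from_db_function.py may differ.
    """
    body = "\n".join(
        f'<img src="{html.escape(u, quote=True)}" alt="search result">'
        for u in urls
    )
    return f"<html><body>\n{body}\n</body></html>"
```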
Images uploaded to S3
Use any technique to upload images to S3 storage: the CLI, the boto SDK, the REST API, or any other custom application. Once images are uploaded, the Lambda function is triggered. Be sure to create an “image” folder in the S3 bucket and upload all images into that folder, and please ensure the Lambda functions are deployed before images are uploaded to the S3 bucket.
The idea of this design centers more on solution design than on developing an application, so I am using the API gateway to send inputs to the Lambda function. Currently, the application does not support multiple inputs, but that could certainly be added. After receiving the response from the Lambda function, the API displays the images.
API gateway configuration
The default stage will be used. For a better CI/CD process, try using the canary method for new version deployments.
The selected URL will be used to search for images.
The search link is the API URL, then “?searchfor=”, then the term to search for:
<API gateway url>/?searchfor=<things to search>
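As an illustration of the input handling, the “searchfor” value can be extracted with the standard library (in Lambda, the value would instead arrive in the API gateway event's queryStringParameters):

```python
from urllib.parse import urlsplit, parse_qs

def extract_search_term(url: str) -> str:
    """Pull the 'searchfor' query parameter out of a request URL.

    Sketch of the input handling only; the search itself then runs
    against the DynamoDB table.
    """
    params = parse_qs(urlsplit(url).query)
    values = params.get("searchfor", [])
    return values[0] if values else ""

print(extract_search_term("https://abc123.execute-api.us-west-2.amazonaws.com/?searchfor=beach"))
```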
I am going to search for some of the images that were uploaded as test images.
The images are used for educational purposes. If it is not appropriate to use any of them, please post a comment and I will remove them.
In this story, I am going to create a three-tier architecture with AWS resources: a load balancer as the first tier, web servers as the second (application logic) tier, and a database as the last tier. I am using DynamoDB for the NoSQL database.
An auto-scaling group is created with a minimum of 2 instances, with two subnets in different availability zones. This auto-scaling group is used as the target group for an application load balancer. In my configuration, instances are not reachable directly via their public addresses over port 80; only the application load balancer forwards requests to the EC2 instances. Sessions terminate at the application load balancer.
Two buckets are needed: the first S3 bucket stores the userdata and AWS DynamoDB scripts, and the second bucket stores the ALB logs. The IAM roles are listed with the resources below.
data.aws_ssm_parameter.s3bucket: S3 bucket information for storing scripts
aws_vpc.app_vpc: VPC for environment
aws_eip.lb_eip: Elastic IP address for Load balancer
aws_iam_role.app_s3_dynamodb_access_role: Role for the EC2 instance profile
data.aws_availability_zones.azs: To get list of all availability zones
data.aws_ssm_parameter.accesslogbucket: S3 bucket name for storing ALB logs
aws_iam_role_policy.app_s3_dynamodb_access_role_policy: Policy attached to the “app_s3_dynamodb_access_role” role. DynamoDB full access is granted; please grant only the access your application needs
aws_iam_instance_profile.app_instance_profile: EC2 instance profile to access S3 storage and Dynamodb table
aws_subnet.app_subnets: Multiple subnets are created in the VPC, one per availability zone in the region
aws_lb_target_group.app-lb-tg: Target group for ALB
aws_security_group.app_sg_allow_public: Security group for the LB. Port 80 is open to the world.
aws_internet_gateway.app_ig: Internet gateway
aws_lb.app-lb: Application load balancer
app_s3_dynamodb_access_role : To access the DynamoDB table and S3 account from the EC2 instances
aws_route_table.app_rt: Route table
aws_security_group.app_sg_allow_localip: Security group to allow ssh access from the “localip” in the variables file, and to allow the ALB to access EC2 instances over port 80
aws_instance.app-web: Template instance used for AMI creation; the AMI is used for the launch configuration and Auto Scaling group (ASG)
aws_lb_listener.app-lb_listner: ALB listener for the health check
aws_ami_from_instance.app-ami: Creates an AMI from the “app-web” instance; this AMI is used to create the launch configuration
aws_launch_configuration.app-launch-config: EC2 instance launch configuration used to create the Auto Scaling group
aws_autoscaling_group.app-asg: Auto Scaling group that creates two instances in different availability zones; the ALB sends requests to these instances
aws-userdata-script.sh : This file runs when the userdata is executed. It gets the instance ID, public IP, local IP, and availability-zone name from the metadata server and writes them to the “/var/www/html/index.html” file.
nps_parks.csv : Input file whose data is copied from S3 and loaded into the DynamoDB table
dynamodb.py : Uses the input file above to create a new table and insert records into it. The table is then queried, and the output is again stored in “/var/www/html/index.html” for later viewing. The objective is to ensure that instances in different availability zones can communicate with the database, our third layer.
user_data.tpl : Userdata template file used by Terraform
terraform.tfvars : Terraform variables file
main.tf : Terraform program file
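A sketch of what dynamodb.py does with the CSV input; the column names below are illustrative, not the actual columns of nps_parks.csv, and the batch-write step is shown only in comments:

```python
import csv
import io

def rows_from_csv(text: str):
    """Parse CSV text into DynamoDB-ready items (one dict per row).

    Sketch of the load step in dynamodb.py; the real column names in
    nps_parks.csv may differ.
    """
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

# With boto3 the rows would then be batch-written, e.g.:
#   table = boto3.resource("dynamodb").Table("parks")
#   with table.batch_writer() as batch:
#       for row in rows:
#           batch.put_item(Item=row)

sample = "park,state\nYosemite,CA\nZion,UT\n"
print(rows_from_csv(sample))
```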
P.S. I don’t want to use this story to create a full-blown application.
Download all files from the Github repository.
Download the “terraform” binary and copy it to the same download location.
Create an S3 bucket to store the scripts. Create a “userdata” directory at the top level of the bucket and upload the “aws-userdata-script.sh”, “nps_parks.csv” and “dynamodb.py” files to that location. The EC2 instance will copy these scripts using the userdata template file.
Create key pair for EC2 instance.
Create the following parameter —
accesslogbucket : <bucketname for ALB logs>. You can use the same bucket name as for userdata.
ec2_keyname : <Key pair name>
s3bucket : s3://<bucketname>. Please ensure the “s3://” prefix comes before the bucket name in the parameter value.
After running the Terraform template you will see the output below.
The output is the load balancer link. You can add this output to DNS records for future access. For this exercise, we will use this address directly to access our application.
Load balancer configuration: the DNS name to access your ALB endpoint, plus the VPC, availability zone, and security group configuration. The public security group is used to bring traffic from the world to the ALB on port 80. Image-5 shows the S3 location where the ALB will save its logs.
ALB target group configuration and health check details. The health check is performed on the “/” parent page; this can be changed for different application endpoints. Image-7 shows the instances registered to the target group via the auto-scaling group.
I first create a sample instance, “ya-web”, and use it to create a “golden AMI”. This AMI is used for the launch configuration and to create the Auto Scaling Group (ASG). Normally a golden AMI already exists; that AMI's information can be provided as a variable in the “terraform.tfvars” file. Image-9 shows the auto-scaling group configuration; minimum/maximum capacity can be altered as an input as well.
Instance information. “ya-web” is the template VM; the other two VMs are part of the auto-scaling group.
Accessing the application via the load balancer: the LB forwards the request to the first instance, in AZ “us-west-2a”. The instance is able to pull data from DynamoDB using the boto API because of the instance profile we created in our resource file. In image-12, the request is forwarded to a second instance in a different AZ, “us-west-2b”. I am using 20-second stickiness; this can also be managed via cookies. My idea for the application is to keep it a simple “hello world” to get the bare-minimum configuration.
Instance public IPs are not accessible from the outside world (image-13). Only ssh and ping (ICMP) are allowed, from the “localip” defined in the variables file.
Network security and identity security would need to be improved for production use.