Building Data Pipelines on Google Cloud Platform

Mahipal Nehra

In today’s digital world, vast amounts of data are generated every day. That data might include information essential for businesses to thrive, for governments to function, and for us to receive the right products and services we ordered from an online marketplace.

As an entrepreneur or business owner in this century, you might have already considered hiring a data analyst to analyze and process the collected data and transform your business.

To process this data, data analysts use data pipelines. But what do we mean by data pipelines, what are their features, and how can we use a cloud platform like Google Cloud to build them?

This article will help you understand everything about data pipelines, so without further ado, let’s get started!

What is a Data Pipeline?

“Pipeline”, in general, refers to a system of large pipes moving resources like natural gas or oil from one place to another. These pipelines are a fast means of carrying large amounts of material over long distances.

Read: What is Data Pipeline Architecture

Similarly, data pipelines act as the backbone of data ingestion, working on the same principle. A data pipeline is a set of data processing steps in which data is ingested at the start of the pipeline if it is not already stored in the data platform. The pipeline defines what data will be collected, where, and how.

Simply put, a pipeline is a series of steps in which each step produces an output that becomes the input of the next one; this continues until the pipeline is complete.

Moreover, a data pipeline includes three elements: a source, processing steps, and a destination (sink). With data pipelines, it becomes easier to move data from an application into a data warehouse, or from a data lake into an analytics database. A data pipeline can also have the same source and destination, in which case it exists purely to modify the existing data set. A data pipeline may also include filtering and built-in resilience for better performance.

Types of Data Pipeline

Data pipelines are divided into two types: batch processing and streaming data pipelines.

  • Batch Processing Data Pipelines

In batch processing data pipelines, “batches of data” are loaded into the repository at set time intervals, often scheduled during off-peak business hours. The batch is then queried by a software program or user when it is ready for processing, allowing them to explore and visualize the data.

Batch processing tasks form a workflow of sequenced commands, i.e., the output of one command becomes the input of the next. One command may filter columns and the next may aggregate the data, for instance; a short command-line sketch follows below.

Batch processing is the optimal choice when there is no immediate need to analyze a dataset.
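As a minimal illustration of that chaining idea (not GCP-specific, and assuming a hypothetical sales.csv with region, product, and amount columns), a batch step sequence can look like this on the command line:

# Each command's output becomes the next command's input:
# keep two columns, filter to one region, then aggregate the amounts.
cut -d',' -f1,3 sales.csv \
    | grep '^EMEA,' \
    | awk -F',' '{ sum += $2 } END { print "EMEA total:", sum }'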

  • Streaming Data Pipelines

Streaming data pipelines are used when there is a near real-time data processing requirement. Unlike batch processing, streaming derives insights from data within milliseconds by ingesting data sets as they are created and continuously updating reports, metrics, and summaries in response to every event.

Read: Top 5 Data Streaming Tools

It enables organizations to gain real-time analytics and act on up-to-date operational information without delay. Streaming data pipelines are better suited for social media or point-of-sale applications that must update data and information instantly.
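On Google Cloud, a common streaming source is Pub/Sub. As a hedged sketch (the topic name and message are illustrative), events can be published as they happen and consumed by a streaming pipeline downstream:

# Create a topic and publish an event; a streaming pipeline subscribed to it
# would process the message within moments of publication.
gcloud pubsub topics create sales-events
gcloud pubsub topics publish sales-events --message='{"item":"coffee","amount":3.5}'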


Data Pipeline Elements

Understanding the elements of a data pipeline will help you understand how it works. So, let’s take a brief look at these data pipeline components.

Read: What is DataOps

Source:

The source is the entry point of the data pipeline. The source can be a storage system of a company like a data lake, data warehouse, etc., or other data sources such as IoT devices, APIs, transaction processing systems, and social media.

Destination:

The destination is the final point of the data pipeline where all the data collected from the source gets stored. More often than not, a data warehouse or data lake acts as the destination.

Dataflow:

Dataflow refers to the entire movement and changes data undergoes while transferring from its source to destination.

Processing:

Processing refers to the steps involved in ingesting data from sources, transforming it, and moving it to the destination. It determines how the dataflow should be implemented.

Workflow:

In the data pipeline, workflow focuses on defining the process sequence and its dependencies.

Monitoring:

Working with a data pipeline requires continuous monitoring to ensure data integrity and to detect potential data loss. Monitoring also helps check whether the pipeline’s efficiency is affected as the data load increases.

Now that we have a better knowledge of data pipelines, it would be beneficial to understand what Google Cloud Platform (GCP) is before we move ahead to building data pipelines on GCP.

Google Cloud Platform - An Overview

Google Cloud Platform is a cloud computing services suite, running on the same infrastructure used by Google internally for its products like Google Drive, Gmail, or Google Search. GCP provides modular cloud services such as data storage, computing, machine learning, and data analytics along with its management tools.

Read: 5 Ways Cloud Computing Can Benefit Web App Development

Platform-as-a-Service (PaaS), Infrastructure-as-a-Service (IaaS), and serverless environments for computing are other examples of services that Google Cloud Platform offers.

Under the Google Cloud brand, Google has over 100 products. Some of the key services that we need to know are listed below.

  • App Engine

  • Google Kubernetes Engine

  • Cloud Functions

  • Compute Engine

  • Cloud Run

  • Cloud Storage

  • Cloud SQL

  • Cloud Bigtable

  • Cloud Spanner

  • Cloud Datastore

  • Persistent Disk

  • Cloud Memorystore

  • Local SSD

  • Filestore

  • AlloyDB

  • Cloud CDN

  • Cloud DNS

  • Cloud Interconnect

  • Cloud Armor

  • Cloud Load Balancing

  • Virtual Private Cloud

  • Network Service Tiers

  • Dataproc

  • BigQuery

  • Cloud Dataflow

  • Cloud Composer

  • Cloud Dataprep

  • Cloud DataLab

  • Cloud Data Studio

  • Cloud Shell

  • Cloud APIs

  • Cloud AutoML

  • Cloud TPU

  • Cloud Console

  • Cloud Identity

  • Edge TPU

Methods to Build Data Pipelines on the Google Cloud Platform

Before creating data pipelines, make sure to grant the necessary IAM roles, such as roles/datapipelines.admin, roles/datapipelines.invoker, and roles/datapipelines.viewer, so that users can perform the corresponding operations.
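For example, the roles can be granted with the gcloud CLI; in this sketch, PROJECT_ID and USER_EMAIL are placeholders:

# Grant the Data Pipelines admin role to a user; repeat with
# roles/datapipelines.invoker or roles/datapipelines.viewer as needed.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/datapipelines.admin"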

To create a data pipeline on Google Cloud Platform, open the data pipelines feature from the Cloud console. A setup page then appears where you can enable the listed APIs before creating data pipelines. You can then either import a job or create a data pipeline.
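If you prefer the command line, the same APIs can also be enabled with gcloud; a sketch is below (the exact API list shown on the setup page is authoritative):

# Enable the APIs commonly needed for Dataflow data pipelines.
gcloud services enable dataflow.googleapis.com \
    datapipelines.googleapis.com \
    cloudscheduler.googleapis.com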

Read: Principles of Web API Design

How to Build Data Pipelines on Google Cloud Platform?

To create a data pipeline on the Google Cloud platform, follow these steps:

  • In the Google Cloud console, go to the Dataflow Pipelines page and select ‘Create Data Pipeline’.

  • Provide a name for the data pipeline and fill in the other parameters and template selections in the pipeline template.

  • For a batch job, you can provide a recurrence schedule for the pipeline.

Now, to create a batch data pipeline, give your project access to a Cloud Storage bucket and a BigQuery dataset for storing the input and output data, creating the required tables along the way.
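As an illustrative sketch (the bucket name, region, and dataset name are placeholders you would replace with your own), the bucket and dataset can be created from the command line:

# Create a Cloud Storage bucket for the pipeline files (BUCKET_ID is a placeholder).
gsutil mb -l us-central1 gs://BUCKET_ID

# Create the BigQuery dataset that will hold the destination table.
bq mk --dataset attendance_data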

Let’s take an example pipeline that reads CSV files from Cloud Storage (source), runs a transformation, and then stores the values in a three-column BigQuery table (destination).

Now, create the files mentioned below on your local drive:

A big-query-column-table.json file that contains the destination table schema, in the format the “Text Files on Cloud Storage to BigQuery” Dataflow template expects:

{
  "BigQuery Schema": [
    {"name": "employee_name", "type": "STRING"},
    {"name": "employee_id", "type": "STRING"},
    {"name": "attendance_count", "type": "INTEGER"}
  ]
}

A transformation.js JavaScript file that implements a simple data transformation:

function transform(line) {
  // Split one CSV line into its three fields.
  var values = line.split(',');
  // Build an object whose field names match the BigQuery table columns.
  var obj = new Object();
  obj.employee_name = values[0];
  obj.employee_id = values[1];
  obj.attendance_count = values[2];
  // The Dataflow template expects each transformed record as a JSON string.
  var jsonString = JSON.stringify(obj);
  return jsonString;
}

A record01.csv CSV file with the records to be inserted into the BigQuery table:

Kayling,65487,30
Scarlet,65878,31
Frank,45781,28
Tyler,63679,29
Elena,54876,25
Stefan,54845,30
Markus,69324,28
Adelyn,54751,31
Jonas,54875,27
Blaze,48721,31

Use gsutil to copy the JSON and JS files to the attendance-record folder of your project’s Cloud Storage bucket, and the CSV file to the inputs folder:

gsutil cp big-query-column-table.json gs://BUCKET_ID/attendance-record/
gsutil cp transformation.js gs://BUCKET_ID/attendance-record/

gsutil cp record01.csv gs://BUCKET_ID/inputs/
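To confirm the files landed where the pipeline expects them, you can list the folders:

# Verify that the schema, transform, and input files are in place.
gsutil ls gs://BUCKET_ID/attendance-record/
gsutil ls gs://BUCKET_ID/inputs/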

After the files are in Cloud Storage, create the attendance-record pipeline: enter the pipeline name, the source path, and the destination table (attendance_data.current_attendance in this example), select “Text Files on Cloud Storage to BigQuery” under Process Data in Bulk (batch), and schedule the pipeline based on your needs.
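Once a batch run completes, one way to check that the rows arrived in the destination table (using the dataset and table names from this example) is a quick bq query:

# Inspect the rows loaded into the destination table.
bq query --nouse_legacy_sql \
  'SELECT employee_name, employee_id, attendance_count
   FROM attendance_data.current_attendance
   ORDER BY employee_name'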

In addition to the batch data pipeline, you can also create a streaming data pipeline by following the same instructions, keeping the differences given below in mind:

  • Streaming data pipelines do not have a value specified for Pipeline schedule, as the Dataflow streaming job begins immediately.

  • When selecting the Dataflow template, go to Process Data Continuously (stream) and then Text Files on Cloud Storage to BigQuery.

  • For the Worker machine type, note that the pipeline processes the files you upload to the inputs/ folder that match the configured input pattern (gs://BUCKET_ID/inputs/record01.csv in this example). To avoid out-of-memory errors when the CSV files exceed several gigabytes, select a machine type with more memory than the default n1-standard-4.
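Once the streaming pipeline is running, new files that match the input pattern are ingested automatically without any schedule. As a sketch, assuming the input file pattern is widened to gs://BUCKET_ID/inputs/*.csv and a hypothetical record02.csv in the same format:

# Upload another CSV file; the running streaming pipeline picks it up automatically.
gsutil cp record02.csv gs://BUCKET_ID/inputs/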

Conclusion

So that was all about data pipelines and Google Cloud Platform, and this is how you can easily create a simple yet functional data pipeline using GCP. Remember, no exception handling is included above, so when working in an organization you will need to add it yourself.

Posted by Mahipal Nehra | Posted at 07 Nov, 2022