Deep dive into Data, Types, and Processing methods

By: Microtek Learning

What is Data

Data is a collection of raw details such as numbers, words, or sentences. This raw data is processed into meaningful information, which is then used to make decisions. Data is broadly classified into three types: structured, semi-structured, and unstructured.

Structured Data:

Structured data, as the name implies, has a defined construct or data model and is represented as columns and tuples (rows) in a tabular format. All the data entered in these tables follows the same order and is consistent, which makes it easier to access. Mathematically, this is described as a set of inter-related data stored in tables, and databases of this form are categorized as relational databases. Structured Query Language (SQL) is the main language used to manipulate data in these databases. Each table contains a unique column, or a combination of columns that is unique, that can be used to find any row. This is called the primary key.

Popular examples of structured data are: databases, Microsoft Excel spreadsheets, OLTP systems, program logs, etc.

In this article we’ll use a patient database as an illustration. Below is a sample Patients table in which every row has the same set of columns. There are 7 rows describing 7 different patients and 9 columns detailing the patients and their treatment visits. Here, Patient ID and Pat_number are unique columns, so either can be used to find any stored patient record.

| Patient ID | Pat_number | Centre | Country | Age | Category | visit_number | Visit_Type | Visit_Date |
|---|---|---|---|---|---|---|---|---|
| 1001 | 1241001 | 1241 | AUS | 26 | Adult | 1 | Screening | 13-Dec-2020 |
| 1002 | 1241002 | 1241 | AUS | 54 | Senior | 3 | Visit 3 | 1-Apr-2020 |
| 1003 | 1241003 | 1241 | AUS | 13 | Pediatrics | 2 | Rand | 26-Aug-2020 |
| 1004 | 1241004 | 1241 | AUS | 33 | Adult | 5 | Visit 5 | 16-Nov-2020 |
| 1005 | 1241005 | 1241 | AUS | 61 | Senior | 10 | Completion | 7-Jun-2020 |
| 1006 | 1241006 | 1241 | AUS | 20 | Adult | 4 | Visit 4 | 18-Feb-2020 |
| 1007 | 1241007 | 1241 | AUS | 7 | Pediatrics | 6 | Visit 6 | 19-May-2020 |
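As a rough illustration of how a primary key is used to find a row, here is a minimal sketch using Python's built-in sqlite3 module. The lowercase table and column names, the in-memory database, and the `find_patient` helper are assumptions made for this example, and only the first three sample rows are loaded.

```python
import sqlite3

# In-memory database; names are adapted from the sample Patients table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patients (
        patient_id   INTEGER PRIMARY KEY,
        pat_number   INTEGER UNIQUE,
        centre       INTEGER,
        country      TEXT,
        age          INTEGER,
        category     TEXT,
        visit_number INTEGER,
        visit_type   TEXT,
        visit_date   TEXT
    )
    """
)

rows = [
    (1001, 1241001, 1241, "AUS", 26, "Adult", 1, "Screening", "13-Dec-2020"),
    (1002, 1241002, 1241, "AUS", 54, "Senior", 3, "Visit 3", "1-Apr-2020"),
    (1003, 1241003, 1241, "AUS", 13, "Pediatrics", 2, "Rand", "26-Aug-2020"),
]
conn.executemany("INSERT INTO patients VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)


def find_patient(patient_id):
    """Look up a single row by its primary key (hypothetical helper)."""
    cur = conn.execute(
        "SELECT patient_id, age, category, visit_type "
        "FROM patients WHERE patient_id = ?",
        (patient_id,),
    )
    return cur.fetchone()


print(find_patient(1002))  # (1002, 54, 'Senior', 'Visit 3')
```

Because `patient_id` is declared as the primary key, the database rejects a second row with the same value, which is what makes the lookup unambiguous.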

Semi-Structured Data:

Semi-structured data does not have a fixed construct or data model into which the data must fit, but it still has enough structure to make it reasonably easy to store and access. Unlike structured data, it cannot be stored directly as rows and columns in tables. Instead, data entities are grouped together and managed as a hierarchy. The data within a group may or may not share the same attributes or properties. This variability is one of the main reasons semi-structured data is harder to process programmatically than structured data. Popular examples of semi-structured data are: e-mails, XML files, web forms, HTML and other markup languages, JSON files, etc.

Below is a sample JSON file containing two child documents. Both documents follow some structure and can be uniquely identified by ‘PatientID’, although their other data fields are completely different.

 

## Document 1 ##

{
    "PatientID": "1001",
    "DOB": {
        "day": "26",
        "month": "October",
        "year": "1995"
    }
}

## Document 2 ##

{
    "PatientID": "1002",
    "Centre": "1241",
    "Country": "AUS",
    "Phone": {
        "Personal": "+41-09786",
        "Mobile": "72563416"
    }
}
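To show how a program might work with documents whose fields differ, here is a minimal sketch using Python's built-in json module. The two documents are wrapped in a JSON array for convenience, and the `by_patient` index and `birth_year` helper are assumptions made for this example.

```python
import json

# The two sample documents, wrapped in a JSON array so they parse together.
raw = """
[
  {"PatientID": "1001",
   "DOB": {"day": "26", "month": "October", "year": "1995"}},
  {"PatientID": "1002",
   "Centre": "1241",
   "Country": "AUS",
   "Phone": {"Personal": "+41-09786", "Mobile": "72563416"}}
]
"""

documents = json.loads(raw)

# Index the documents by the one field they have in common.
by_patient = {doc["PatientID"]: doc for doc in documents}


def birth_year(patient_id):
    """Return the birth year if the document has a DOB group, else None."""
    doc = by_patient[patient_id]
    return doc.get("DOB", {}).get("year")


print(birth_year("1001"))  # 1995
print(birth_year("1002"))  # None
```

The code has to check whether a field exists before using it, which is the practical cost of the flexible structure.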

Another type of semi-structured data is key-value pairs. This closely resembles the structured row-column format, except that each row can hold any number of columns.
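A key-value layout can be sketched with a plain Python dictionary. The `store` variable and the `put` helper are hypothetical names used only for illustration, and the values are borrowed from the sample Patients table.

```python
# Each key maps to a "row" whose set of columns is free to vary.
store = {}


def put(key, **columns):
    """Store or update the columns held under a key (hypothetical helper)."""
    store.setdefault(key, {}).update(columns)


put("1001", Country="AUS", Age=26)
put("1002", Country="AUS", Age=54, Category="Senior", Centre="1241")
put("1001", Visit_Type="Screening")  # adds a column to one row only

print(sorted(store["1001"]))  # ['Age', 'Country', 'Visit_Type']
print(len(store["1002"]))     # 4
```

No schema change is needed to add `Visit_Type` to one key, whereas a relational table would need the column to exist for every row.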

Unstructured Data:

This data does not have any format or data model, which makes it impractical to confine to a row-and-column format. This type of data has its own pros and cons. As it does not have a defined format, it is very flexible to store. It is also easily scalable and need not be formatted to fit into tables. However, unstructured data is rarely used directly in computer programs: once stored, it is very difficult to retrieve specific information from it unless data cleansing or data transformation is carried out first. This type of data falls under the NoSQL database category.

Popular examples of unstructured data are: audio files, video files, image files, etc.

Data Processing methods

Data processing is the method of transforming raw data into a more meaningful format.

Processing makes data storage, retrieval, and manipulation much simpler. Based on the system, the requirements, and how the data arrives, processing is broadly classified into two methods: batch processing and data streaming. Scheduling the processing also helps with effective CPU utilization. For instance, processing can be configured to run during idle CPU time, so that it completes in a comparatively shorter amount of time and CPU load stays manageable.

Batch Processing:

In this data processing method, data entities are collected into groups (batches). Each group is processed later, at a time determined in one of the following ways:

    • Processing of a group can be scheduled for a fixed time period. For instance, accumulated data can be processed every hour or once every 5 hours.

    • Processing can be triggered by another event.

    • Processing can also be triggered when a specific amount of data has accumulated.

Although batch processing helps in processing large amounts of data at a convenient time, it also has certain drawbacks. There will always be a time gap between data being ingested and the transformed data becoming available, so batch processing does not handle real-time data. The input data must also be monitored carefully, as even a minor error such as a typographical mistake can cause the entire job to stop. It is therefore always important to validate the input data fed into batch processing.
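The size-based trigger and the scheduled trigger described above can be sketched as follows. The `BatchProcessor` class and its method names are hypothetical, the averaging function is only a stand-in for real processing, and the input values are the ages from the sample Patients table.

```python
class BatchProcessor:
    """Collects records and processes them as a group once enough accumulate."""

    def __init__(self, batch_size, process):
        self.batch_size = batch_size
        self.process = process  # function applied to a whole batch at once
        self.pending = []       # records ingested but not yet processed
        self.results = []

    def ingest(self, record):
        """Accumulate a record; process the group when it reaches batch_size."""
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Process whatever has accumulated, e.g. when called by a scheduler."""
        if self.pending:
            self.results.append(self.process(self.pending))
            self.pending = []


ages = [26, 54, 13, 33, 61, 20, 7]
batcher = BatchProcessor(batch_size=3, process=lambda batch: sum(batch) / len(batch))
for age in ages:
    batcher.ingest(age)

print(batcher.results)  # [31.0, 38.0]
print(batcher.pending)  # [7]
batcher.flush()         # stands in for a scheduled or event-driven trigger
print(batcher.results)  # [31.0, 38.0, 7.0]
```

The last record stays unprocessed until the next trigger fires, which is the time gap between ingestion and transformed output mentioned above.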

Data Streaming:

Unlike batch processing, which works on groups, stream processing handles every new piece of data at the time of its arrival. Data ingestion and transformation happen simultaneously, so the transformed data can be accessed almost instantaneously. This is what lets streaming handle real-time data.

Some popular examples include real-time gaming, Netflix or any other movie streaming service, audio streaming, etc.
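Per-record processing can be sketched with a Python generator, which transforms each record as soon as it is received instead of waiting for a group. The `categorize` rule and its age thresholds are assumptions chosen to match the sample Patients table, and `arriving_records` merely simulates a live source.

```python
def categorize(age):
    """Hypothetical transformation; the age thresholds are assumptions."""
    if age < 18:
        return "Pediatrics"
    if age < 50:
        return "Adult"
    return "Senior"


def stream(records):
    """Transform each record as it arrives and emit the result immediately."""
    for record in records:
        yield {**record, "Category": categorize(record["Age"])}


def arriving_records():
    """Simulated source; a real one might be a socket or a message queue."""
    yield {"PatientID": "1001", "Age": 26}
    yield {"PatientID": "1002", "Age": 54}
    yield {"PatientID": "1003", "Age": 13}


for result in stream(arriving_records()):
    print(result["PatientID"], result["Category"])
```

Each result is available before the next record is even read, in contrast to batch processing, where output appears only once a whole group has been processed.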

Major differences between Batch Processing and Data Streaming

| | Batch Processing | Streaming |
|---|---|---|
| Data Size | Batch is intended for large datasets | Streaming is intended for small chunks of data at a time |
| Data Scope | Batch processing can process all the data in a dataset at once | Streaming can process only the most recent data, typically within a small time window (such as 30 seconds at most) |
| Performance | Typically a few hours of data latency | A few seconds to minutes of data latency, as the amount of data is small |
| Analysis | Mainly used for complex analytics | Generally used for simpler functions and calculations |
| Data Type | Batch processing does not process real-time data | Streaming always works on real-time data |
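The difference in data scope can be illustrated with a small sketch: the batch function sees the whole dataset, while the streaming function only keeps a short window of recent values. For simplicity the window here is a count of recent values, not a time span such as 30 seconds, and the function names and readings are made up for the example.

```python
from collections import deque


def batch_average(values):
    """Batch view: the whole dataset is available at once."""
    return sum(values) / len(values)


def windowed_averages(values, window_size):
    """Streaming view: only the most recent `window_size` values are kept."""
    window = deque(maxlen=window_size)  # older values drop out automatically
    for value in values:
        window.append(value)
        yield sum(window) / len(window)


readings = [10, 20, 30, 40, 50]
print(batch_average(readings))               # 30.0
print(list(windowed_averages(readings, 2)))  # [10.0, 15.0, 25.0, 35.0, 45.0]
```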

 
