Deep dive into Data, Types, and Processing methods


What is Data

Data is a collection of raw details such as numbers, words, or sentences. This raw data is processed into meaningful information that is then used to make decisions. Data is broadly classified into three types: structured, semi-structured, and unstructured.

Structured Data:

Structured data, as the name implies, follows a defined data model and is represented in tabular form as rows (tuples) and columns. Every row in a table follows the same column order and format, which keeps the data consistent and easy to access. Mathematically, this is described as a set of inter-related data stored in tables, and databases built on this model are called relational databases. Structured Query Language (SQL) is the main language used to manipulate data in these databases. Each table contains a unique column, or a combination of columns that is unique, which can be used to find any row. This is called the primary key.

Popular examples of structured data are: databases, Microsoft Excel spreadsheets, OLTP systems, program logs, etc.

In this article we’ll use a patient database as an illustration. Below is a sample Patients table in which each row holds a set of columns for one patient. There are 7 rows describing 7 different patients, and 9 columns detailing the patients and their treatment visits. Here, Patient ID and Pat_number are unique columns that can be used to find any stored patient record.

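| Patient ID | Pat_number | Centre | Country | Age | Category | visit_number | Visit_Type | Visit_Date |
|---|---|---|---|---|---|---|---|---|
| 1001 | 1241001 | 1241 | AUS | 26 | Adult | 1 | Screening | 13-Dec-2020 |
| 1002 | 1241002 | 1241 | AUS | 54 | Senior | 3 | Visit 3 | 1-Apr-2020 |
| 1003 | 1241003 | 1241 | AUS | 13 | Pediatrics | 2 | Rand | 26-Aug-2020 |
| 1004 | 1241004 | 1241 | AUS | 33 | Adult | 5 | Visit 5 | 16-Nov-2020 |
| 1005 | 1241005 | 1241 | AUS | 61 | Senior | 10 | Completion | 7-Jun-2020 |
| 1006 | 1241006 | 1241 | AUS | 20 | Adult | 4 | Visit 4 | 18-Feb-2020 |
| 1007 | 1241007 | 1241 | AUS | 7 | Pediatrics | 6 | Visit 6 | 19-May-2020 |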
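As a rough illustration of how a primary key is used to find a row, the sketch below loads the first three rows of the sample Patients table into an in-memory SQLite database and looks one of them up. The snake_case column names and the `find_patient` helper are assumptions made for this example, not part of the original table.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patients (
        patient_id   INTEGER PRIMARY KEY,
        pat_number   INTEGER UNIQUE,
        centre       INTEGER,
        country      TEXT,
        age          INTEGER,
        category     TEXT,
        visit_number INTEGER,
        visit_type   TEXT,
        visit_date   TEXT
    )
    """
)

# First three rows of the sample table.
rows = [
    (1001, 1241001, 1241, "AUS", 26, "Adult", 1, "Screening", "13-Dec-2020"),
    (1002, 1241002, 1241, "AUS", 54, "Senior", 3, "Visit 3", "1-Apr-2020"),
    (1003, 1241003, 1241, "AUS", 13, "Pediatrics", 2, "Rand", "26-Aug-2020"),
]
conn.executemany("INSERT INTO patients VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)


def find_patient(patient_id):
    """Return the single row whose primary key matches, or None."""
    cur = conn.execute(
        "SELECT patient_id, age, category, visit_type "
        "FROM patients WHERE patient_id = ?",
        (patient_id,),
    )
    return cur.fetchone()


print(find_patient(1002))  # (1002, 54, 'Senior', 'Visit 3')
```

Because every row has the same columns in the same order, a single SQL query with a fixed column list works for any patient.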

Semi-Structured Data:

Semi-structured data does not have a fixed data model that the data must fit into, but it still has some structure that makes it easier to store and access. Unlike structured data, it cannot be stored directly as rows and columns in tables. Instead, data entities are grouped together and managed as a hierarchy. The data within a group may or may not share the same attributes or properties, which is one of the main reasons semi-structured data is harder to process programmatically than structured data. Popular examples of semi-structured data are: e-mails, XML files, web forms, HTML and other markup languages, JSON files, etc.

Below is a sample JSON file which has two child documents. The two child documents still follow some structure and can be uniquely identified with ‘PatientID’ although the data fields are completely different.

## Document 1 ##

{
  "PatientID": "1001",
  "DOB": {
    "day": "26",
    "month": "October",
    "year": "1995"
  }
}

## Document 2 ##

{
  "PatientID": "1002",
  "Centre": "1241",
  "Country": "AUS",
  "Phone": {
    "Personal": "+41-09786",
    "Mobile": "72563416"
  }
}
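To make the point about differing fields concrete, here is a minimal Python sketch that parses the two documents and reads nested values from them. Wrapping both documents in one JSON array and the `get_field` helper are assumptions made for this example.

```python
import json

# The two sample documents, wrapped in a JSON array for convenience.
raw = """
[
  {"PatientID": "1001",
   "DOB": {"day": "26", "month": "October", "year": "1995"}},
  {"PatientID": "1002", "Centre": "1241", "Country": "AUS",
   "Phone": {"Personal": "+41-09786", "Mobile": "72563416"}}
]
"""

documents = json.loads(raw)

# Index the documents by the one field they share.
by_id = {doc["PatientID"]: doc for doc in documents}


def get_field(patient_id, *path, default=None):
    """Walk a nested key path, returning `default` if any key is absent."""
    node = by_id.get(patient_id, {})
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


print(get_field("1001", "DOB", "year"))      # 1995
print(get_field("1002", "DOB", "year"))      # None (this document has no DOB)
print(get_field("1002", "Phone", "Mobile"))  # 72563416
```

Unlike the table lookup, the code cannot assume a field exists, so every access has to allow for a missing key.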

Another type of semi-structured data is key-value pairs. This is very similar to the structured row-column format, except that each row can hold any number of columns.
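A key-value layout can be sketched with plain Python dictionaries. The `store` dictionary and the `put` helper below are hypothetical names used only for illustration; real key-value databases expose their own APIs.

```python
# Each row key maps to a row; each row is itself a set of key-value pairs.
store = {}


def put(row_key, **columns):
    """Add or update columns for a row; rows need not share columns."""
    store.setdefault(row_key, {}).update(columns)


put("1001", Age=26, Category="Adult")
put("1002", Age=54, Category="Senior", Country="AUS", Centre=1241)
put("1001", Visit_Type="Screening")  # adds a column to an existing row

# Number of columns held by each row.
column_counts = {key: len(row) for key, row in store.items()}
print(column_counts)  # {'1001': 3, '1002': 4}
```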

Unstructured Data:

This data does not have any fixed format or data model, which makes it impractical to confine it to a row-and-column format. This type of data has its own pros and cons. Because it has no defined format, it is very flexible to store. It is also easily scalable and need not be formatted to fit into tables. However, unstructured data is rarely used directly in computer programs, because once stored it is very difficult to retrieve information from it unless data cleansing or data transformation is carried out first. This type of data usually falls under the NoSQL database category.

Popular examples of unstructured data are: audio files, video files, image files, etc.

Data Processing methods

Data processing is the method of transforming raw data into a more meaningful format.

Processing makes data storage, retrieval, and manipulation much simpler. Based on the system, the requirements, and how the data arrives, processing is broadly classified into two methods: batch processing and data streaming. Scheduling the processing also helps with effective CPU utilization. For instance, we can configure processing to run during idle CPU time, so that it completes in a comparatively shorter amount of time and CPU utilization stays manageable.

Batch Processing:

In this data processing method, data entities are collected into groups (batches). Each batch is processed later, at a point determined in one of several ways:

  • Processing can be scheduled for a fixed time period. For instance, accumulated data can be processed every hour or once every 5 hours.
  • Processing can be triggered by some other event.
  • Processing can be triggered when a specific amount of data has been accumulated.
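The three triggers above can be sketched in a few lines of Python. The `BatchProcessor` class, its parameter names, and the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow) are all assumptions made for this illustration, not a production scheduler.

```python
class BatchProcessor:
    """Collects records and hands them to `process` one batch at a time."""

    def __init__(self, process, max_size=3, max_wait_seconds=3600.0):
        self.process = process                    # called with a list of records
        self.max_size = max_size                  # size-based trigger
        self.max_wait_seconds = max_wait_seconds  # time-based trigger
        self.buffer = []
        self.last_flush = 0.0

    def add(self, record, now):
        """Buffer a record; flush once enough data has accumulated."""
        self.buffer.append(record)
        if len(self.buffer) >= self.max_size:
            self.flush(now)

    def tick(self, now):
        """Call periodically; flushes when the scheduled period has elapsed."""
        if self.buffer and now - self.last_flush >= self.max_wait_seconds:
            self.flush(now)

    def flush(self, now):
        """Process whatever is buffered; can also be called by any other event."""
        if self.buffer:
            self.process(self.buffer)
        self.buffer = []
        self.last_flush = now


results = []
bp = BatchProcessor(lambda batch: results.append(sum(batch)), max_size=3)

for t, value in enumerate([10, 20, 30, 40]):
    bp.add(value, now=float(t))   # third record triggers a size-based flush

bp.tick(now=100.0)                # too early, nothing happens
bp.tick(now=3700.0)               # an hour has passed, leftover record is flushed
print(results)                    # [60, 40]
```

Note that the record `40` waits in the buffer until the time trigger fires, which is the gap between ingestion and output that batch processing always carries.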

Although batch processing lets us process large amounts of data at a convenient time, it also has certain cons. There will always be a time gap between data being ingested and the transformed data becoming available. The input data must be monitored carefully, as even a minor error such as a typographical mistake can cause the entire batch to stop, so it is always important to feed the input data carefully. Because of the time gap, batch processing is not suited to real-time data.

Data Streaming:

Unlike the groups in batch processing, in stream processing every new piece of data is processed at the time of arrival. Data ingestion and transformation happen simultaneously, so the transformed data can be accessed almost instantaneously. This lets streaming handle real-time data.
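A Python generator gives a small-scale picture of this record-at-a-time behaviour. The `sensor_feed` source, the readings, and the alert threshold are hypothetical stand-ins for a live feed.

```python
def sensor_feed():
    """Stand-in for a live source; yields one reading at a time."""
    for reading in [72, 75, 130, 74]:
        yield reading


def stream_process(source, threshold=120):
    """Transform each record the moment it arrives, without buffering."""
    for reading in source:
        yield {"value": reading, "alert": reading > threshold}


for result in stream_process(sensor_feed()):
    print(result)
```

Each transformed record is available as soon as its input arrives; nothing waits for a batch to fill up.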

Some popular examples include real-time gaming and movie or audio streaming services such as Netflix.

Major differences between batch processing and data streaming

| | Batch Processing | Streaming |
|---|---|---|
| Data Size | Intended for large datasets | Intended for small chunks of data at a time |
| Data Scope | Can process all the data in a dataset at once | Can process only the most recent data, typically within a small time window (for example, 30 seconds) |
| Performance | Data latency is typically a few hours | Data latency is a few seconds to minutes, as the amount of data is small |
| Analysis | Mostly used for complex analytics | Generally used for simpler functions and calculations |
| Data Type | Does not process real-time data | Always works on real-time data |
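The difference in data scope can be illustrated with a sliding-window average: a streaming calculation only sees the events inside its window, whereas a batch calculation sees the whole dataset. The `SlidingWindowAverage` class and the 30-second window below are assumptions for this sketch, and it expects timestamps to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the events of the last `window_seconds`."""

    def __init__(self, window_seconds=30.0):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record a new event and return the average over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window.
        while timestamp - self.events[0][0] > self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


events = [(0, 10), (10, 20), (35, 30)]  # (seconds, value)

window = SlidingWindowAverage(window_seconds=30.0)
streaming_averages = [window.add(ts, value) for ts, value in events]
batch_average = sum(value for _, value in events) / len(events)

print(streaming_averages)  # [10.0, 15.0, 25.0]
print(batch_average)       # 20.0
```

By the third event, the reading at second 0 has left the window, so the streaming result (25.0) differs from the batch result over the full dataset (20.0).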

