What is Data?
Data is a collection of raw details such as numbers, words, or sentences. This data is then processed into meaningful information that is used in making decisions. Data is broadly classified into three types: structured, semi-structured, and unstructured data.
Structured data, as the name implies, has a defined construct or data model and is represented as columns and tuples (rows) in a tabular format. All the data entered in these tables follows the same order and is consistent, which makes it easier to access. Mathematically, this is described as a set of inter-related data stored as tables, and databases of this form are called relational databases. Structured Query Language (SQL) is mainly used to manipulate the data in these databases. Each table has a column, or a combination of columns, whose values are unique and can be used to find any row. This is called the primary key.
Popular examples of structured data are relational databases, Microsoft Excel spreadsheets, OLTP systems, and program logs.
In this article we’ll use a Patient database as an illustration. Below is a sample Patients table in which each row has a set of related columns. There are 7 rows in total, describing 7 different patients, and 9 columns detailing the patients and their treatment visits. Here, Patient ID and Pat_number are unique columns that can be used to find any stored patient record.
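To make the idea of a primary key concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, column names other than the patient ID and patient number, and the two sample rows are assumptions made up for this example; they are not the article's actual Patients table.

```python
import sqlite3

# In-memory database. The schema and rows are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY,
        pat_number TEXT UNIQUE NOT NULL,
        name       TEXT NOT NULL,
        visit_date TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO patients (patient_id, pat_number, name, visit_date) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "P-1001", "Alice", "2021-03-01"),
        (2, "P-1002", "Bob", "2021-03-04"),
    ],
)


def find_patient(patient_id):
    """Return the single row whose primary key matches, or None if absent."""
    cur = conn.execute(
        "SELECT patient_id, pat_number, name, visit_date "
        "FROM patients WHERE patient_id = ?",
        (patient_id,),
    )
    return cur.fetchone()


print(find_patient(2))
```

Because every row follows the same column order, a lookup by the unique column returns at most one row with a predictable shape.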
Semi-structured data does not have a definite construct or data model into which the data can be stored. However, it does have some structure, which still makes it easier to access or store than data with no structure at all. Unlike structured data, it cannot be stored as fixed columns and rows in tables. Instead, data entities are grouped together and managed as a hierarchy. The data within a group may or may not have similar attributes or properties, which is one of the main reasons why semi-structured data is harder to query and validate than structured data. Popular examples of semi-structured data are e-mails, XML files, web forms, HTML and other markup languages, and JSON files.
Below is a sample JSON file which has two child documents. The two child documents still follow some structure and can each be uniquely identified by ‘PatientID’, although their data fields are completely different.
## Document 1 ##
## Document 2 ##
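As a rough stand-in for the kind of file described above, the sketch below parses a small JSON array with two documents. Only the `PatientID` field name comes from the text; every other field and value is an invented placeholder.

```python
import json

# Two hypothetical documents: both carry PatientID, but the other fields differ.
raw = """
[
  {"PatientID": 101, "Name": "Alice", "Allergies": ["penicillin"]},
  {"PatientID": 102, "Visit": {"Date": "2021-03-04", "Ward": "B2"}}
]
"""

documents = json.loads(raw)

# Index the documents by their shared identifier.
by_id = {doc["PatientID"]: doc for doc in documents}

for patient_id, doc in by_id.items():
    other_fields = sorted(key for key in doc if key != "PatientID")
    print(patient_id, other_fields)
```

The nested `Visit` object shows the hierarchical grouping, and the differing field sets show why the two documents would not fit a single fixed table.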
Another type of semi-structured data is key-value pairs. This is very similar to the structured row-column format, except that each row can hold any number of columns.
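A key-value layout can be sketched with plain Python dictionaries. The keys, column names, and the `get_column` helper below are hypothetical and exist only to show rows with different numbers of columns.

```python
# Each key maps to one row; each row is a set of column-name/value pairs.
# The two rows deliberately hold different columns.
rows = {
    "patient:101": {"name": "Alice", "age": 42},
    "patient:102": {"name": "Bob", "ward": "B2", "visit_date": "2021-03-04"},
}


def get_column(key, column, default=None):
    """Look up one column of one row, tolerating missing rows and columns."""
    row = rows.get(key, {})
    return row.get(column, default)


print(get_column("patient:102", "ward"))
print(get_column("patient:101", "ward"))
```

Since a column may be absent from any given row, readers of such data need a default for missing values, which a fixed table schema would not require.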
Unstructured data does not have any sort of format or data model, which makes it impractical to confine in a row or column format. This type of data has its own pros and cons. As it does not have a defined format, it is very flexible to store. It is also easily scalable and need not be formatted to fit into tables. However, unstructured data is rarely used directly in computer programs, because once stored it is very difficult to retrieve unless data cleansing or data transformation is carried out on it first. This type of data is typically stored in NoSQL databases.
Popular examples of unstructured data are audio files, video files, and image files.
Data Processing methods
Data processing is the method of transforming raw data into a more meaningful format.
This process makes data storage, retrieval, and manipulation much simpler. Based on the system, the requirements, and how the data arrives, processing is broadly classified into two methods: batch processing and stream processing. Batch processing helps in effective CPU utilization. For instance, we can configure it so that the data is processed during idle CPU time. That way, processing finishes in a comparatively shorter amount of time and CPU utilization is kept under control.
In batch processing, data entities are collected into groups. These groups are processed later, at a time that can be determined in several ways:
- Processing of a group can be scheduled for a fixed time period. For instance, accumulated data can be processed every hour or once every 5 hours.
- Processing can be triggered by some other event.
- Processing can also be triggered when a specific amount of data has been accumulated or collected.
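The three triggers listed above can be sketched in one small class. `BatchBuffer`, its parameters, and the summing step are hypothetical names chosen for this illustration, not part of any particular batch framework.

```python
import time


class BatchBuffer:
    """Collects items and hands them to `process` as one group."""

    def __init__(self, process, max_items=3, max_age_seconds=3600.0,
                 clock=time.monotonic):
        self.process = process
        self.max_items = max_items
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.items = []
        self.started_at = None

    def add(self, item):
        if not self.items:
            self.started_at = self.clock()
        self.items.append(item)
        # Size trigger: enough data has accumulated.
        if len(self.items) >= self.max_items:
            self.flush()

    def tick(self):
        # Schedule trigger: call this periodically, e.g. from a timer.
        if self.items and self.clock() - self.started_at >= self.max_age_seconds:
            self.flush()

    def flush(self):
        # Event trigger: any external event may call this directly.
        if self.items:
            batch, self.items = self.items, []
            self.process(batch)


results = []
buffer = BatchBuffer(process=lambda batch: results.append(sum(batch)), max_items=3)
for value in [1, 2, 3, 4]:
    buffer.add(value)
buffer.flush()  # an external event forces out the leftover item
print(results)  # [6, 4]
```

The first three values are processed together once the size threshold is reached, and the fourth waits in the buffer until another trigger fires, which is the time gap between ingestion and output that batch processing implies.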
Although batch processing helps in processing large amounts of data at a convenient time, it also has certain drawbacks. There will always be a time gap between the data being ingested and the transformed data becoming available. The input data must also be monitored carefully, as even a minor error such as a typographical mistake can cause the entire processing run to stop, so it is always important to feed the input data for batch processing carefully. Because of the time gap, batch processing does not handle real-time data.
Unlike batch processing, which works on groups, stream processing handles every new piece of data at the time of its arrival. Data ingestion and transformation happen simultaneously, so the transformed data can be accessed almost instantaneously. This allows streaming to handle real-time data.
Some popular examples include real-time gaming and movie or audio streaming services such as Netflix.
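A minimal sketch of the per-record behaviour, using Python generators: each reading updates the result the moment it arrives, instead of waiting for a group to fill. The `sensor_readings` source and its values are invented stand-ins for a real live feed.

```python
def running_average(stream):
    """Yield the updated average as soon as each reading arrives."""
    count = 0
    total = 0.0
    for reading in stream:
        count += 1
        total += reading
        yield total / count


def sensor_readings():
    # Hypothetical source; a real stream would come from a socket or queue.
    for value in [10.0, 20.0, 30.0]:
        yield value


for average in running_average(sensor_readings()):
    print(average)
```

Only a running count and total are kept in memory, so the transformed value is available after every record without storing the full dataset.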
Major differences between Batch processing and Stream processing
| Batch processing | Stream processing |
|---|---|
| Intended for large datasets | Intended for small chunks of data at a time |
| Can process all the data in a dataset at once | Processes only the most recent data, typically within a small time window (for example, 30 seconds at most) |
| Typically has a data latency of a few hours | Has a data latency of a few seconds to minutes, as the amount of data is small |
| Mainly used for complex analytics | Generally used for simpler functions and calculations |
| Does not process real-time data | Always works on real-time data |