
Top 5 Techniques to Clean the Data and Make it useful for Analysis





Data science is one of the most popular and fastest-growing fields of the 21st century. To become a skilled data scientist or data analyst, you need proficiency in statistics and mathematics, plus knowledge of data analytics tools.
Data is everywhere: the internet, multimedia, images, and more. But managing that data and extracting insight from it is a tedious task that requires proper analysis and experience in the field. The most challenging part is cleaning the data and putting it into a structured format.
















There are various ways to clean data, and mastering them is part of becoming a proficient data analyst. By a commonly quoted estimate, about 80% of an analyst's time is spent cleaning data, so it is both a challenging and an important task.


Sources of missing Values in data

  1. Programming error.
  2. The user forgot to fill in the field.
  3. Data was lost while being transferred manually from a legacy database.
  4. The user chose not to fill in the field, leaving it for later correction and evaluation.
These are some basic but important reasons why data goes missing. From a statistical point of view, missing data is especially troublesome when analyzing larger datasets (organizational data, social media data, etc.), which makes data cleaning a major task. To clean data well, you first have to analyze and visualize the dataset properly.


1. With the pandas library for Python, unfilled places in your dataset are easily represented as NaN (Not a Number). But be careful not to leave many missing entries as NaN across columns, because that makes your data noisy and predictions may go wrong.

# Import libraries
import pandas as pd
import numpy as np

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("filepath")

# Take a look at the first few rows
print(df.head())


The syntax below checks whether each value in a column is null, where df is a pandas DataFrame.
print(df['COLUMN_NAME'].isnull())
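
A boolean mask is hard to read on a large dataset, so it often helps to count the missing values instead. The following is a minimal sketch of that idea; the sample DataFrame and its column names ("age", "city") are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

# isnull() returns a boolean mask; summing it counts missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# The mean of the mask gives the fraction of missing values per column
missing_ratio = df.isnull().mean()
print(missing_ratio)
```

Columns with a high missing ratio are the ones to look at first when deciding whether to fill or drop.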

 2. Sometimes you delete a complete row that has many missing values; at other times you keep the values as NaN or replace them with an approximate value (int or float), for example when a smaller dataset has roughly 10 to 40 missing values.

# Replace missing values with a number

df['COLUMN_NAME'] = df['COLUMN_NAME'].fillna(125)
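
Dropping rows and filling values can be combined. Here is a rough sketch, assuming a small made-up DataFrame with columns "a", "b" and "c"; the threshold of 2 non-missing values per row is an arbitrary choice for the example, not a recommended setting.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; row 1 is missing everywhere
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [10.0, np.nan, 30.0, 40.0],
    "c": [5.0, np.nan, np.nan, 8.0],
})

# Keep only rows with at least 2 non-missing values (threshold is an assumption)
trimmed = df.dropna(thresh=2)

# Fill the remaining gaps in column "c" with a fixed number
trimmed = trimmed.assign(c=trimmed["c"].fillna(125))
print(trimmed)
```

Only the row that is empty in every column is removed here; the other gaps in "c" are filled, while column "a" is left as it is.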

3. Summarize the column before filling values whose absence would affect the analysis result: calculate the mean, the 25th and 75th percentiles, and the standard deviation of the column, and choose a replacement value that approximately matches them.
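
A possible way to get these statistics in pandas is describe(), which skips NaN values. The sketch below uses a made-up "score" series, and picking the median as the replacement is just one option among several, shown here as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
scores = pd.Series([10.0, 20.0, np.nan, 40.0, 50.0], name="score")

# describe() reports count, mean, std, min, 25%, 50%, 75% and max, ignoring NaN
summary = scores.describe()
print(summary)

# One possible choice: replace the gap with the median of the observed values
filled = scores.fillna(scores.median())
print(filled)
```

Comparing the summary before and after filling is a quick check that the replacement did not distort the column.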

4. If the column holds continuous values, fill dummy values in place of the missing ones.
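
One way to read this tip is sketched below: fill a continuous column with a placeholder and keep a flag so the dummy values can still be told apart later. The "income" column, the flag column name, and the use of the column mean as the dummy value are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous column with gaps
df = pd.DataFrame({"income": [52000.0, np.nan, 61000.0, np.nan]})

# Record which rows were missing before the values are overwritten
df["income_was_missing"] = df["income"].isnull()

# The placeholder used here is the column mean; any dummy value is an assumption
dummy_value = df["income"].mean()
df["income"] = df["income"].fillna(dummy_value)
print(df)
```

Keeping the flag column makes it possible to exclude or re-examine the filled rows during analysis.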


Conclusion

Data cleaning is a challenging task, and dealing with messy data is tedious. But it is essential for better analysis.
In this article, we covered some ways and methodologies to deal with messy data. I think these techniques will help you spend less time on data cleaning and more on analysis.

If you have any suggestions or techniques you like, please write them in the comment section!







