Top 5 Ways to Clean Data and Make It Useful for Analysis
Data Science is one of the most popular and fastest-growing technologies of the 21st century. To become a skillful data scientist or data analyst, proficiency in statistics and mathematics and knowledge of data analytics tools are required.
Data is everywhere: the internet, multimedia, images, and much more. But managing that data and gaining insightful information from it is a tedious task that requires proper analysis and experience in the field. The most challenging part is cleaning the data and putting it into a structured format.
There are various ways to clean data, and mastering them is part of becoming a proficient data analyst, since as much as 80% of an analyst's time can be spent cleaning data. It is therefore a challenging and important task.
Sources of Missing Values in Data
- Programming error.
- The user forgot to fill it in.
- Data was lost while being transferred manually from a legacy database.
- The user chose not to fill it in, leaving it for later review and evaluation.
These are some basic but important reasons for missing data. From a statistics point of view, handling missing values is tedious yet essential, especially when analyzing larger datasets (organizational data, social-media data, etc.). So from a statistical standpoint, data cleaning is a major task. To clean data properly, you have to analyze and visualize the dataset first.
1. Using the Pandas Python library you can easily fill unfilled places in your dataset with NaN (Not a Number). But be careful not to leave too many missing columns as NaN, because that makes your data noisy and predictions may go wrong.
# Importing libraries
import pandas as pd
import numpy as np
# Read csv file into a pandas dataframe
df = pd.read_csv("filepath")
# Take a look at the first few rows
print(df.head())
The syntax to check whether a value is null or not (df is the pandas DataFrame):
print(df['COLUMN_NAME'].isnull())
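You can also count the missing values in every column at once. A minimal sketch, assuming the same df as above:
# Count missing values per column across the whole DataFrame
print(df.isnull().sum())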
2. Sometimes you delete the complete row when it has many missing values, but sometimes you replace the missing values with NaN or an approximate value (int or float), for example when only about 10 to 40 values are missing in a smaller dataset.
# Replace missing values with a number
df['COLUMN_NAME'].fillna(125, inplace=True)
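Point 2 also mentions deleting rows. A minimal sketch of that option, assuming the same df and an illustrative threshold:
# Keep only rows that have at least 3 non-missing values
df.dropna(thresh=3, inplace=True)
# Or drop any row that contains even one missing value
# df.dropna(inplace=True)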
3. Summarize the standard missing values. If something important is missing and its absence would affect the analysis result, calculate the mean, the 25th and 75th percentiles, and the standard deviation of the column, and fill in an approximate value that matches those statistics, as sketched below.
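A minimal sketch of this idea, assuming the same df and an illustrative COLUMN_NAME: summarize the column first, then fill the gaps with the column mean as the approximation.
# Summary statistics: count, mean, std, min, 25%, 50%, 75%, max
print(df['COLUMN_NAME'].describe())
# Fill missing values with the column mean
df['COLUMN_NAME'].fillna(df['COLUMN_NAME'].mean(), inplace=True)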
4. Fill dummy values in place of missing values if the column holds continuous values, as sketched below.
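One possible sketch of this step; the dummy value and the indicator column are illustrative assumptions, not from the original article. It records where the values were missing, then fills them with a dummy placeholder or interpolates between neighbouring values.
# Record where the continuous column was originally missing
df['COLUMN_NAME_missing'] = df['COLUMN_NAME'].isnull()
# Fill the gaps with a dummy placeholder value...
df['COLUMN_NAME'].fillna(-999, inplace=True)
# ...or interpolate between the surrounding values instead
# df['COLUMN_NAME'] = df['COLUMN_NAME'].interpolate()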
Conclusion
Data cleaning is a challenging task and dealing with messy data is tedious, but it is essential for better analysis.
In this article, we covered some ways and methodologies to deal with messy data. Using these techniques should help you spend less time on data cleaning and more on analysis.
If you have any suggestions or techniques you like, please write them in the comments section!