
Top 5 Techniques to Clean the Data and Make it useful for Analysis





Data Science is one of the most popular and fastest-growing fields of the 21st century. To become a skilled data scientist or data analyst, you need proficiency in statistics and mathematics and knowledge of data analytics tools.
Data is everywhere: the internet, multimedia, images, and much more. But managing that data and gaining insight from it is a tedious task that requires proper analysis and experience in the field. The most challenging part is cleaning the data and putting it into a structured format.
















There are various ways to clean data, and learning them is part of becoming a proficient data analyst, because cleaning is commonly said to take up about 80% of the time spent on an analysis. So it is a challenging and important task.


Sources of Missing Values in Data

  1. A programming error.
  2. The user forgot to fill in the value.
  3. Data was lost while being transferred manually from a legacy database.
  4. The user chose not to fill in the value, leaving it for later correction and evaluation.
These are some basic but important reasons why data goes missing. From a statistical point of view, missing values matter most, and are most tedious to handle, when analysing larger datasets (organizational data, social media data, etc.), so data cleaning is a major task. To clean data well, you first have to analyse and visualize the dataset properly.


1. Using the Pandas library in Python, you can easily mark the unfilled places in your dataset as NaN (Not a Number). Be careful, though: if too many columns end up mostly NaN, the data becomes noisy and your predictions may go wrong.

# Importing libraries
import pandas as pd
import numpy as np

# Read csv file into a pandas dataframe
df = pd.read_csv("filepath")

# Take a look at the first few rows
print(df.head())


To check whether each value in a column is null, use isnull() (df is the Pandas data frame):
print(df['COLUMN_NAME'].isnull())
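
On a large dataset it is usually easier to look at totals than at a long column of True/False values. Here is a minimal sketch of that idea; the small data frame and its column names ("age", "city") are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are made up for illustration
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

# Number of missing values in each column
missing_counts = df.isnull().sum()

# Share of missing values in each column, as a fraction between 0 and 1
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```

Columns with a high share of missing values are the ones to look at first when deciding whether to fill or drop.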

 2. Sometimes you delete a whole row when it has many missing values. Other times you replace the missing values with NaN or with an approximate value (int or float), for example when a smaller dataset has only around 10 to 40 missing entries.

# Replace missing values with a number
df['COLUMN_NAME'] = df['COLUMN_NAME'].fillna(125)
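
The other half of this tip, deleting rows that have too many missing values, can be sketched with dropna. This is only one possible approach: the sample data, the threshold of 2, and the fill value 0.0 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the second row is missing everything
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, np.nan, 9.0],
})

# Keep only the rows that have at least 2 non-missing values
cleaned = df.dropna(thresh=2)

# Fill what is still missing in column "b" with an assumed value
filled = cleaned.fillna({"b": 0.0})

print(filled)
```

The right threshold depends on how many columns you have and how much data you can afford to lose.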

3. Summarize the column before filling in missing values. If something important is missing and its absence would affect the analysis result, calculate the mean, the 25th and 75th percentiles, and the standard deviation of the column, then choose a fill value that approximately matches them.
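
A rough sketch of this summarizing step is shown here, assuming a made-up "score" column. Filling with the median is just one choice; the mean or another value that fits the summary could work as well.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
scores = pd.Series([10.0, 20.0, np.nan, 40.0, 50.0], name="score")

# describe() skips NaN and reports count, mean, std, min, 25%, 50%, 75%, max
summary = scores.describe()
print(summary)

# Use the median of the known values as the approximate fill value
median = scores.median()
filled = scores.fillna(median)
print(filled)
```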

4. If the column holds continuous values, fill the missing entries with a dummy (placeholder) value.
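
One way to read this tip is sketched here. The "income" column and the placeholder -1.0 are assumptions for illustration, and the extra flag column is an addition of this sketch (not part of the tip itself) so that the dummy values can still be told apart from real ones later.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous column with missing entries
df = pd.DataFrame({"income": [52000.0, np.nan, 61000.0, np.nan]})

# Record which rows were missing before overwriting them
df["income_was_missing"] = df["income"].isnull()

# Fill the gaps with a placeholder; -1.0 is an arbitrary choice
df["income"] = df["income"].fillna(-1.0)

print(df)
```

Remember to exclude the placeholder rows when you calculate statistics such as the mean, otherwise the dummy values will distort the result.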


Conclusion

Data cleaning is a challenging task, and dealing with messy data is tedious. But it is important for better analysis.
In this article, we covered some ways to deal with messy data. I think these techniques will help you spend less time on data cleaning and more on analysis.

If you have any suggestions or techniques you like, please write them in the comment section!







