Learning Data Science: Day 6 - Data Exploration on Titanic Datasets
Today, we are going to take a step backward (sorry for not covering things in the right order) and learn about data exploration. The dataset we are going to use is still the same Titanic dataset. So here’s what I learned about data exploration on the Titanic dataset.
Definition
Data exploration, sometimes also called descriptive analysis, is basically the first step you work through when you pick up a new dataset. For most people, especially beginners (just like me!), data exploration takes up most of the time spent on a dataset. So, is data exploration that important? Yes. One of the main reasons is that we have to make sure there is no odd data in the dataset, because it might cause bugs or inaccurate predictions. It is also important to ask questions about the data, to identify the features that are already visible, and to understand more about the dataset itself. Some people already automate these steps to save time. So here are a few steps for doing data exploration.
Variables Exploration
1. Shape
It is always a good idea to check the shape of the dataset. The shape in this context means the number of rows and columns in the dataset. In pandas, you can easily check it by reading the ‘shape’ property. It returns a tuple with the number of rows first and the number of columns second. For example, reading the shape property on a pandas data frame returns something like this: (889, 12). That means we have 889 rows of data with 12 columns.
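Here is a minimal sketch of that check. The tiny data frame is made up by me so the snippet runs on its own (the column names and values are assumptions, not the real Titanic file), so its shape is much smaller than the (889, 12) mentioned above.

```python
import pandas as pd

# Tiny made-up sample standing in for the Titanic data (illustrative values only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# shape is a property (no parentheses) and returns (rows, columns).
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking the tuple into two names makes it harder to mix up which number is the rows and which is the columns.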
2. Data Types
After getting information about the shape of the dataset, we can check the data type of each column. In pandas, you can easily fetch it by reading the ‘dtypes’ property. It returns the data type of every column in the dataset. The most common data types are int, float, and object. In pandas, a column of strings is usually reported as object.
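A small sketch of checking the data types, again on a made-up sample (column names and values are my assumptions). Besides printing ‘dtypes’, it uses ‘select_dtypes’ to split the columns into numeric and non-numeric ones, which is one possible way to get a first overview.

```python
import pandas as pd

# Made-up sample; the missing Age value is there on purpose.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

# One data type per column. Text columns such as Sex usually show up as object.
print(df.dtypes)

# Split the column names by whether pandas treats them as numeric.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```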
3. Check several rows of data
In the previous step you might find some columns with odd data types. If that’s the case, you have to look at several samples of the data. In pandas, we can use the ‘head(n)’ method, which returns the first n rows, where n is the integer we pass as the parameter. Make sure you find where the weird data is. Most of the time, it is due to ‘NaN’ values or outlier values. We should handle them later in the data treatment. In this step you should also understand which columns represent a categorical variable and which represent a continuous variable. A simple way to identify them: when a column’s possible values are, for example, either ‘female’ or ‘male’, it is a categorical variable. On the other hand, when the values are numerical, such as ‘176’ or ‘29.3’, it is possibly a continuous variable.
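The checks in this step can be sketched like this. The sample data is made up (names and values are assumptions). Counting missing values with ‘isna().sum()’ and distinct values with ‘nunique()’ is my own suggestion for spotting ‘NaN’ values and likely categorical columns, not the only way to do it.

```python
import pandas as pd

# Made-up sample with one missing Age value.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, 29.5],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Look at the first 3 rows.
first_rows = df.head(3)
print(first_rows)

# Count the NaN values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Count distinct non-null values per column. A column with only a few
# distinct values (like Sex) is a good candidate for a categorical variable.
unique_counts = df.nunique()
print(unique_counts)
```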
4. Identify important statistical information
By now, we already know whether a variable is categorical or continuous. It is important to know statistical information such as the mean, median, mode, standard deviation, etc. of a certain column. In pandas, we can call the ‘describe()’ method to fetch most of this at once: for numeric columns it reports the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum. It does not include the mode, so if you want that, or any other specific value, you can call ‘mean()’, ‘median()’, ‘mode()’, or whichever method you need.
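A short sketch of those calls on a made-up sample (column names and values are assumptions). It shows ‘describe()’ for the numeric column and the individual methods for single values, including ‘mode()’ on a categorical column.

```python
import pandas as pd

# Made-up sample: one categorical column and one continuous column.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 29.0],
})

# Summary of the numeric columns: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)

# Individual statistics for one column.
age_mean = df["Age"].mean()
age_median = df["Age"].median()
print(age_mean, age_median)

# mode() returns a Series because there can be ties; take the first entry.
sex_mode = df["Sex"].mode().iloc[0]
print(sex_mode)

# For a categorical column, value_counts() shows how often each value occurs.
print(df["Sex"].value_counts())
```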
5. Identify Input and Target features
Now you already know a bit about your data. It is time to identify the input features and the target feature that we want. In the Titanic case, it’s pretty easy to identify the target feature, which is whether a given passenger survived or not. The input features are a bit trickier: you have to use the information from the previous steps to decide which features can be used as predictors. Analyses such as univariate, bivariate, or multivariate analysis can be used here.
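As a rough sketch of this step, the snippet below separates a target column from the input features and runs one very simple univariate check and one bivariate check. The data is made up, and the names ‘Survived’, ‘Sex’, and ‘Age’ are assumptions about how the columns are called. The survival rate per group is just one possible bivariate check, not the only one.

```python
import pandas as pd

# Made-up sample; Survived is 1 for survived and 0 otherwise.
df = pd.DataFrame({
    "Survived": [0, 1, 0, 0, 1, 1],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 29.0, 54.0],
})

# Separate the target from the candidate input features.
target = "Survived"
features = [col for col in df.columns if col != target]
X = df[features]
y = df[target]

# Univariate: how is the target itself distributed?
print(y.value_counts())

# Bivariate: mean of the 0/1 target per category, i.e. the survival rate per sex.
rate_by_sex = df.groupby("Sex")[target].mean()
print(rate_by_sex)
```

A feature whose groups show clearly different rates is a candidate predictor, but on a real dataset you would want to look at more than one such comparison before deciding.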
What comes after that?
After all of those steps you should be in good shape and know some useful things about the dataset. Based on that information, we can now move on to data treatment, which is covered in the previous story.
So that’s the minimum I think we should do when doing data exploration. There might be steps I left out because I didn’t consider them important, so let me know your opinions, suggestions, and any other information about crucial steps I missed or about data exploration as a whole.