Learning Data Science: Day 4 - Probability, R, and Kaggle
Some of you might wonder why I didn’t release a story yesterday. One of the reasons is that I realized I still lack knowledge in statistics. So, for the last 2 days, I’ve been trying to understand the basic concepts of statistics while learning about a few other things as well.
Probability
I decided to search for books on statistics for data science, especially ones written for people with a computer science background, like me. I couldn’t find any, so I broadened the search to “statistics for data science”. This time the search results were more compelling. The one I found most interesting is at the link below.
Free statistical textbooks
Are there any free statistical textbooks available?
stats.stackexchange.com
The interesting part of the link I just provided is that all of the books are free. The book I’m reading is Michael Lavine’s Introduction to Statistical Thought. In my opinion, the explanations are simple and straightforward, and the code snippets in R are helpful for understanding how each concept and its implementation work. The downside is that sometimes the explanation is too brief and I can’t figure out what it means, so I still need to search online to understand it better. So far I have learned about probability: the basics of probability, probability densities, parametric families of distributions, joint probability, marginal probability, and conditional probability. There are still a few topics in the probability chapter that I haven’t covered yet; you can look up the remaining ones yourself.
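To make joint, marginal, and conditional probability a bit more concrete, here is a minimal sketch in Python (the book itself uses R). The two-dice example and the helper name prob are my own illustration, not something taken from the book, so treat it as one possible way to check the definitions by brute-force counting.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered pair (first die, second die), all equally likely.
outcomes = list(product(range(1, 7), repeat=2))
p_each = Fraction(1, len(outcomes))  # each of the 36 outcomes has probability 1/36


def prob(event):
    """Probability that an outcome satisfies the predicate `event` (hypothetical helper)."""
    return sum(p_each for o in outcomes if event(o))


# Marginal probability: the first die shows 6, whatever the second die shows.
p_first_six = prob(lambda o: o[0] == 6)

# Joint probability: the first die shows 6 AND the total is at least 10.
p_joint = prob(lambda o: o[0] == 6 and sum(o) >= 10)

# Conditional probability: P(total >= 10 | first die is 6) = joint / marginal.
p_conditional = p_joint / p_first_six

print(p_first_six, p_joint, p_conditional)  # 1/6 1/12 1/2
```

Using Fraction instead of floats keeps the results exact, which makes it easier to compare them with what you work out by hand.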
R
It would be pointless to learn statistics without knowing how to implement it, so I recommend trying things out as you learn, using R or even Python. For installing R you have two options: you can either install R and RStudio, or install r-essentials with conda to enable the R kernel in Jupyter Notebook.
R itself is available at the link below, and most people would recommend using RStudio. However, I can’t really say whether RStudio is more beneficial, since I chose to use the R kernel in Jupyter Notebook.
For people who prefer Jupyter Notebook and installed it using Anaconda, the installation is pretty easy. You can just run conda install r-essentials
and conda will install it for you. Pretty easy, right?
Kaggle
Some people say that the best way to learn is by doing. So I’m trying to work on one of the Kaggle data sets, the one about the Titanic. Why this data set? Some people say that the Titanic data set is the best place to start. One of its advantages is the availability of tutorials using Excel, Python, and R. Yes, you read that right, there is a tutorial for the Titanic data set using Excel. So it makes an ideal starter. The Titanic data set itself is available at the link below.
Final Words
So that’s what I’ve been learning for the last 2 days. I hope you can learn a thing or two from what I’ve learned. I’m always open to discussion on this topic, so if you have suggestions, responses, or comments about how you learn probability or R, or about where you find your data sets for learning, I’m really looking forward to hearing them.
Update #1: This story has been updated to improve the content and some illustrations. Happy new year everyone!