How can I become a data scientist?

Pronojit Saha

SELF STARTER WAY
For a self-starting novice, here is an outline to begin with (reproduced from my blog post, How to acquire the “Essential Skill Set”?- the Self Starter way). The idea is to pick one or two resources (links) from each subgroup and learn from them.

0. Basic Pre-requisites:

Mathematics, Algorithms & Databases: Mathispower4u-Calculus, Coursera-Linear Algebra, Coursera-Analysis of Algorithms, Coursera-Introduction to Databases
Statistics: Probability and Statistics for Programmers, Statistical Formulas For Programmers, Coursera-Data Analysis, Coursera-Statistics One
Programming: Google Developers R Programming Lectures, Introduction to R – DataCamp, Scientific Python Lectures, How to Think Like a Computer Scientist

1. Acquire & Scrub Data:

2. Filter & Mine data:

Data Analysis in R: Data science in R, Coursera-Computing for Data Analysis in R
Data Analysis in Python (numpy, scipy, pandas, scikit): Getting Started With Python For Data Science, Introduction to NumPy - SciPyConf 2015, Statistical Data Analysis in Python, Pandas (1st Video Below), SciPy 2013 - Introduction to SciKit Learn Tutorial I & II (2nd & 3rd Video Below)

Exploratory Data Analysis: Exploratory Data Analysis in R, Exploratory Data Analysis in Python, UC Berkeley: Descriptive Statistics, Basic Unix Shell Commands for the Data Scientist
Data Mining, Machine Learning:

Data Mining Map, Coursera – Machine Learning, Stanford – Statistical Learning, MITx: The Analytics Edge, STATS 202 Data Mining & Analysis, Learning From Data – CalTech, Coursera – Web Intelligence & Big Data

Big Data Machine Learning – AMP Camp Berkeley Spark Introduction & Exercises, EdX- Big Data Analysis with Apache Spark, Mining Massive Data Sets – Stanford

3. Represent & Refine Data: Tableau-Training & Tutorials, Data visualisation in R with ggplot2 and plyr, Predictive Analytics: Overview and Data visualization, Flowing Data-Tutorials, UC Berkeley-Data Visualization, D3.js Tutorial

4. Domain Knowledge: This skill is developed through experience working in an industry. Each dataset is different and comes with certain assumptions and industry knowledge. For example, a data analyst specializing in stock market data would need time to develop knowledge in analyzing transactional data for restaurants.

Combining all the above:
Data Literacy Course — IAP
Coursera – Introduction to Data Science
Coursera – Data Science Specialization

Books:
Elements of Statistical Learning
Python Machine Learning

Apply the knowledge:
Harvard Data Science Course Homework
Kaggle: The Home of Data Science
Analyzing Big Data with Twitter
Analyzing Twitter Data with Apache Hadoop

FORMAL WAY
For a more formal route to becoming a data scientist, see this post (reproduced below): How to acquire the “Essential Skill Set”?- the Formal way.
The Essential Skill Set consists of the fundamental skills that every data scientist is expected to know. Traditionally, these can be acquired through a computer science or statistics degree from an institution. The Stanford Computer Science courses & Statistics courses provide a good reference list of courses to undertake. Some of these courses are relevant, while many others are not. For example, in Computer Science one would do well to learn about large-scale distributed databases & algorithms, but there is no need to learn HCI and UX, or pure-play storage, operating systems, networking, etc. Similarly, some statistics courses focus too much on, let’s say, “old school statistics”, including thousands of ways of hypothesis testing, instead of on machine learning (clustering, regression, classification, etc.). So both streams have many nice-to-have courses and must-have courses for a data scientist (I dare to claim that at present the percentage of must-have courses seems to be greater in a traditional Statistics stream than in a Computer Science stream). As such, one needs to pick the courses wisely.

Alternatively, one can look into the new Data Science programs that some universities are offering, which build on the points mentioned above. They combine the must-have courses from both the traditional statistics and computer science programs to impart the 4 Essential Skills, and also include courses to develop the Differentiator Skills in students. The MS in Data Science at NYU & MS in Analytics at USF are good examples of such an amalgamation of the requisite courses. A complete list of such programs is presented here: Colleges with Data Science Degrees.

The right program obviously depends on the individual’s goal. One of the recent O’Reilly publications, titled ‘Analyzing the Analyzers’, does a very good job of aggregating the various data scientist roles into 4 main categories according to their skills. An individual may therefore select a program according to the category of data scientist they most identify with, as described below.

Data Businesspeople are the product and profit-focused data scientists. They’re leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA or the new Data Science programs as mentioned above.
Data Creatives are eclectic jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open source technologies. They are expected to have an engineering degree (mostly in statistics or economics) but not much in the way of business skills.
Data Developers are focused on writing software to do analytic, statistical, and machine learning tasks, often in production environments. They often have computer science degrees, and often work with so-called “big data”.
Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have an MS or PhD in statistics, economics, physics, etc., and their creative applications of mathematical tools yield valuable insights and products.

The skills associated with the 4 main categories, which justify the above mentioned program recommendation, are as below:

William Chen

Here are some amazing and completely free resources online that you can use to teach yourself data science.

Besides this page, I would highly recommend the Official Quora Data Science FAQ as your comprehensive guide to data science! It includes resources similar to this one, as well as advice on preparing for data science interviews. Additionally, follow the Quora Data Science topic if you haven’t already to get updates on new questions and answers!

Fulfill your prerequisites

Before you begin, you need Multivariable Calculus, Linear Algebra, and Python. If your math background is up to multivariable calculus and linear algebra, you’ll have enough background to understand almost all of the probability / statistics / machine learning for the job.

Multivariate Calculus: What are the best resources for mastering multivariable calculus?

Numerical Linear Algebra / Computational Linear Algebra / Matrix Algebra: Linear Algebra, Coursera (starts 2/2/2015)

Multivariate calculus is useful for some parts of machine learning and a lot of probability. Linear / Matrix algebra is absolutely necessary for a lot of concepts in machine learning.

You also need some programming background to begin, preferably in Python. Most other things on this guide can be learned on the job (like random forests, pandas, A/B testing), but you can’t get away without knowing how to program!

Python is the most important language for a data scientist to learn. To learn to code, more about Python, and why Python is so important, check out

If you’re currently in school, take statistics and computer science classes. Check out What classes should I take if I want to become a data scientist?

Plug Yourself Into the Community

Check out Meetup to find data science meetups that interest you! Attend an interesting talk, learn about data science live, and meet data scientists and other aspiring data scientists. Start reading data science blogs and following influential data scientists:

Set up and Learn to use your tools

Python

Install Python, IPython, and related libraries (guide)
How do I learn Python?

Install R and RStudio (I would say that R is the second most important language. It’s good to know both Python and R)
Learn R with swirl

Sublime Text

SQL

How do I learn SQL? (You can practice it using the sqlite3 module in Python)
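
As a minimal sketch of practicing SQL from Python, the snippet below uses the standard-library sqlite3 module with an in-memory database. The table name, columns, and rows are made up purely for illustration; swap in any data you like.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented only for practice.
conn.execute("CREATE TABLE visits (user_id INTEGER, page TEXT, seconds REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "home", 12.0),
        (1, "pricing", 30.5),
        (2, "home", 8.0),
        (3, "home", 20.0),
        (3, "signup", 45.5),
    ],
)

# Aggregate per page: number of visits and average time spent.
query = """
    SELECT page, COUNT(*) AS n_visits, AVG(seconds) AS avg_seconds
    FROM visits
    GROUP BY page
    ORDER BY n_visits DESC, page ASC
"""
rows = conn.execute(query).fetchall()
for page, n_visits, avg_seconds in rows:
    print(page, n_visits, round(avg_seconds, 2))

conn.close()
```

From here you can practice joins, subqueries, and window functions by adding a second made-up table and extending the query.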

Learn Probability and Statistics

Be sure to go through a course that involves heavy application in R or Python. Knowing probability and statistics will only really be helpful if you can implement what you learn.

Python Application: Think Stats (free pdf) (Python focus)
R Applications: An Introduction to Statistical Learning (free pdf)(MOOC) (R focus)
Print out a copy of Probability Cheatsheet

Complete Harvard’s Data Science Course

As of Fall 2015, the course is in its third year and strives to be as applicable and helpful as possible for students who are interested in becoming data scientists. An example of this is the introduction of Spark and SQL starting this year.

I’d recommend doing the labs and lectures from 2015 (since they’re the most current material), and the homeworks from 2013 (2015 homeworks are not available to the public, and the 2014 homeworks are written under a different instructor than the original instructors).

This course is developed in part by a fellow Quora user, Professor Joe Blitzstein. Here are all of the materials!

Intro to the class

Course Materials

Class main page: CS109 Data Science
Lectures, Slides, and Labs: Class Material

Assignments

Intro to Python, Numpy, Matplotlib (Homework 0) (Solutions)
Poll Aggregation, Web Scraping, Plotting, Model Evaluation, and Forecasting (Homework 1) (Solutions)
Data Prediction, Manipulation, and Evaluation (Homework 2) (Solutions)
Predictive Modeling, Model Calibration, Sentiment Analysis (Homework 3) (Solutions)
Recommendation Engines, Using MapReduce (Homework 4) (Solutions)
Network Visualization and Analysis (Homework 5) (Solutions)

Labs

(these are the 2013 labs. For the 2015 labs, check out Class Material)

Do most of Kaggle’s Getting Started and Playground Competitions

I would NOT recommend doing any of the prize-money competitions. They usually have datasets that are too large, complicated, or annoying, and are not good for learning (Kaggle.com)

Start by learning scikit-learn: play around and read through the tutorials and forums at Data Science London + Scikit-learn, a simple, synthetic, binary classification task. Next, play around some more and check out the tutorials for Titanic: Machine Learning from Disaster, a slightly more complicated binary classification task (with categorical variables, missing values, etc.).

Afterwards, try some multi-class classification with Forest Cover Type Prediction. Now, try a regression task, Bike Sharing Demand, that involves incorporating timestamps. Try out some natural language processing with Sentiment Analysis on Movie Reviews. Finally, try out any of the other knowledge-based competitions that interest you!
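
For a regression task that involves timestamps, a common first step is to turn each timestamp into numeric features a model can use. Here is a minimal sketch using only the standard library. The `YYYY-MM-DD HH:MM:SS` string format and the `timestamp_features` helper are assumptions for illustration, not something prescribed by any particular competition.

```python
from datetime import datetime


def timestamp_features(ts: str) -> dict:
    """Turn a 'YYYY-MM-DD HH:MM:SS' string into simple numeric features."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": dt.hour,
        "weekday": dt.weekday(),  # Monday = 0 ... Sunday = 6
        "month": dt.month,
        "is_weekend": int(dt.weekday() >= 5),
    }


# Hypothetical timestamps, chosen only to show the output shape.
for ts in ["2011-01-01 08:00:00", "2011-01-03 17:30:00"]:
    print(ts, timestamp_features(ts))
```

Features like hour and weekday often capture daily and weekly demand cycles that a raw timestamp hides from the model.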

Learn Some Data Science Electives

Product Metrics will teach you about what companies track, what metrics they find important, and how companies measure their success: The 27 Metrics in Pinterest’s Internal Growth Dashboard
Optimization will help you with understanding statistics and machine learning: Convex Optimization – Boyd and Vandenberghe
A/B Testing is just a rebranded version of what pharmaceutical companies have been doing for decades. Learn more about A/B testing here: How do I learn about A/B testing?
Visualization – I would recommend picking up ggplot2 in R to make simple yet beautiful graphics and just browsing DataIsBeautiful • /r/dataisbeautiful and FlowingData for ideas and inspiration.
User Behavior – This set of blog posts looks useful and interesting – This Explains Everything » User Behavior
Feature Engineering – Check out What are some best practices in Feature Engineering? and this great example: http://nbviewer.ipython.org/gith…
Big Data Technologies – These are tools and frameworks developed specifically to deal with massive amounts of data. How do I learn big data technologies?
Machine Learning – How do I learn machine learning? This is an extremely rich area with massive amounts of potential. Andrew Ng’s Machine Learning course on Coursera is one of the most popular MOOCs, and a great way to start! Andrew Ng’s Machine Learning MOOC
Natural Language Processing – This is the practice of turning text data into numerical data whilst still preserving the “meaning”. Learning this will let you analyze new, exciting forms of data. How do I learn Natural Language Processing (NLP)?
Time Series Analysis – How do I learn about time series analysis?
Building a Data Culture – http://www.oreilly.com/data/free…
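
To make the Natural Language Processing item above a little more concrete, here is a minimal bag-of-words sketch, one of the simplest ways to turn text into numbers. It keeps word counts but discards word order, so it preserves only part of the meaning. The helper names and example sentences are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, vectors): one word-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


# Hypothetical two-document corpus.
docs = ["The movie was great, great fun", "The movie was dull"]
vocab, vectors = bag_of_words(docs)
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a fixed-length vector over the shared vocabulary, which is the kind of input most classifiers expect.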

Do a Capstone Product / Side Project

Use your new data science and software engineering skills to build something that will make other people say wow! This can be a website, new way of looking at a dataset, cool visualization, or anything!

Create public GitHub repositories, make a blog, and post your work, side projects, Kaggle solutions, insights, and thoughts! This helps you gain visibility, build a portfolio for your resume, and connect with other people working on the same tasks.

Get a Data Science Internship or Job

Check out The Official Quora Data Science FAQ for more discussion on internships, jobs, and data science interview processes! The data science FAQ also links to more specific versions of this question, like How do I become a data scientist without a PhD? or the counterpart, How do I become a data scientist as a PhD student?

Think like a Data Scientist

In addition to the concrete steps I listed above to develop the skill set of a data scientist, I include seven challenges below so you can learn to think like a data scientist and develop the right attitude to become one.

(1) Satiate your curiosity through data

As a data scientist you write your own questions and answers. Data scientists are naturally curious about the data that they’re looking at, and are creative with ways to approach and solve whatever problem needs to be solved.

Much of data science is not the analysis itself, but discovering an interesting question and figuring out how to answer it.

Here are two great examples:

Challenge: Think of a problem or topic you’re interested in and answer it with data!

(2) Read news with a skeptical eye

Much of the contribution of a data scientist (and why it’s really hard to replace a data scientist with a machine) is that a data scientist will tell you what’s important and what’s spurious. This persistent skepticism is healthy in all sciences, and is especially necessary in a fast-paced environment where it’s too easy to let a spurious result be misinterpreted.

You can adopt this mindset yourself by reading news with a critical eye. Many news articles have inherently flawed main premises. Try these two articles. Sample answers are available in the comments.

Easier: You Love Your iPhone. Literally.

Harder: Who predicted Russia’s military intervention?

Challenge: Do this every day when you encounter a news article. Comment on the article and point out the flaws.

(3) See data as a tool to improve consumer products

Visit a consumer internet product (preferably one that you know doesn’t do extensive A/B testing already), and then think about its main funnel. Does it have a checkout funnel? Does it have a signup funnel? Does it have a virality mechanism? Does it have an engagement funnel?

Go through the funnel multiple times and hypothesize about different ways it could do better to increase a core metric (conversion rate, shares, signups, etc.). Design an experiment to verify if your suggested change can actually change the core metric.
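
One common way to analyze such an experiment is a two-proportion z-test on the conversion rates of the control and the variant. The sketch below is an illustration under stated assumptions, not a prescribed method: the visitor and conversion counts are made up, `two_proportion_z_test` is a hypothetical helper, and the normal approximation only holds for reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: control (A) versus the suggested change (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

Decide the sample size and the significance threshold before running the experiment, so that the result is not shaped by peeking at the data.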

Challenge: Share it with the feedback email for the consumer internet site!

(4) Think like a Bayesian

To think like a Bayesian, avoid the Base rate fallacy. This means that to form new beliefs, you must incorporate both newly observed information AND prior information formed through intuition and experience.

You check your dashboard, and user engagement numbers are significantly down today. Which of the following is most likely?

1. Users are suddenly less engaged
2. Feature of site broke
3. Logging feature broke

Even though explanation #1 completely explains the drop, #2 and #3 should be more likely because they have a much higher prior probability.
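
The reasoning above can be sketched with Bayes’ rule. Every number below is an invented assumption, chosen only to show the mechanics: the priors are rough guesses at how often each event happens on a given day, and the likelihoods are guesses at how probable a large drop on the dashboard is under each explanation.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule over mutually exclusive, exhaustive hypotheses."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


# Assumed prior probability of each event on any given day (sums to 1).
priors = {
    "users suddenly less engaged": 0.001,
    "feature of site broke": 0.010,
    "logging feature broke": 0.020,
    "nothing unusual": 0.969,
}

# Assumed probability of seeing a significant drop under each hypothesis.
likelihoods = {
    "users suddenly less engaged": 1.0,
    "feature of site broke": 0.9,
    "logging feature broke": 0.9,
    "nothing unusual": 0.001,
}

post = posterior(priors, likelihoods)
for hypothesis, prob in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {prob:.3f}")
```

With these made-up inputs, the explanation that fits the data perfectly still ends up far less probable than the two breakage explanations, because its prior is so much smaller.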

You’re in senior management at Tesla, and five of Tesla’s Model S’s have caught fire in the last five months. Which is more likely?

1. Manufacturing quality has decreased and Teslas should now be deemed unsafe.
2. Safety has not changed and fires in Tesla Model S’s are still much rarer than their counterparts in gasoline cars.

While #1 is an easy explanation (and great for media coverage), your prior should be strong on #2 because of your regular quality testing. However, you should still be seeking information that can update your beliefs on #1 versus #2 (and still find ways to improve safety). Question for thought: what information should you seek?

Challenge: Identify the last time you committed the Base Rate Fallacy. Avoid committing the fallacy from now on.

(5) Know the limitations of your tools

“Knowledge is knowing that a tomato is a fruit, wisdom is not putting it in a fruit salad.” – Miles Kington

Knowledge is knowing how to perform an ordinary linear regression, wisdom is realizing how rarely it applies cleanly in practice.

Knowledge is knowing five different variations of K-means clustering, wisdom is realizing how rarely actual data can be cleanly clustered, and how poorly K-means clustering can work with too many features.

Knowledge is knowing a vast range of sophisticated techniques, but wisdom is being able to choose the one that will provide the most impact for the company in a reasonable amount of time.

You may develop a vast range of tools while you go through your Coursera or EdX courses, but your toolbox is not useful until you know which tools to use.
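
As an illustration of the K-means remark above: one reason distance-based methods struggle with many features is that distances between points become hard to tell apart as the number of dimensions grows. The sketch below uses uniformly random points (an assumption made for simplicity, not a real dataset), and `relative_contrast` is a hypothetical helper name.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# The contrast between nearest and farthest neighbors shrinks with dimension.
for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(200, dims), 3))
```

When the nearest and farthest points are almost equally far away, cluster assignments based on distance carry little information, which is why feature selection or dimensionality reduction often comes first.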

Challenge: Apply several tools to a real dataset and discover the tradeoffs and limitations of each tool. Which tools worked best, and can you figure out why?

(6) Teach a complicated concept

How does Richard Feynman distinguish which concepts he understands and which concepts he doesn’t?

Feynman was a truly great teacher. He prided himself on being able to devise ways to explain even the most profound ideas to beginning students. Once, I said to him, “Dick, explain to me, so that I can understand it, why spin one-half particles obey Fermi-Dirac statistics.” Sizing up his audience perfectly, Feynman said, “I’ll prepare a freshman lecture on it.” But he came back a few days later to say, “I couldn’t do it. I couldn’t reduce it to the freshman level. That means we don’t really understand it.” – David L. Goodstein, Feynman’s Lost Lecture: The Motion of Planets Around the Sun

What distinguished Richard Feynman was his ability to distill complex concepts into comprehensible ideas. Similarly, what distinguishes top data scientists is their ability to cogently share their ideas and explain their analyses.

Check out https://www.quora.com/Edwin-Chen… for examples of cogently-explained technical concepts.

Challenge: Teach a technical concept to a friend or on a public forum, like Quora or YouTube.

(7) Convince others about what’s important

Perhaps even more important than a data scientist’s ability to explain their analysis is their ability to communicate the value and potential impact of the actionable insights.

Certain tasks of data science will be commoditized as data science tools become better and better. New tools will make obsolete certain tasks such as writing dashboards, unnecessary data wrangling, and even specific kinds of predictive modeling.

However, the need for a data scientist to extract and communicate what’s important will never be made obsolete. With increasing amounts of data and potential insights, companies will always need data scientists (or people in data science-like roles) to triage all that can be done and prioritize tasks based on impact.

The data scientist’s role in the company is to serve as the ambassador between the data and the company. The success of a data scientist is measured by how well he/she can tell a story and make an impact. Every other skill is amplified by this ability.

Challenge: Tell a story with statistics. Communicate the important findings in a dataset. Make a convincing presentation that your audience cares about.

Good luck and best wishes on your journey to becoming a data scientist!

https://www.quora.com/How-can-I-become-a-data-scientist-1
