Data-munging-TDSM

From The Data Science Design Manual Wikia
Data Munging


3-1. Spend two hours getting familiar with one of the following programming languages: Python, R, MATLAB, or Wolfram Alpha/Language. Then write a brief paper with your impressions of its characteristics:

  • Expressibility.
  • Runtime speed.
  • Breadth of library functions.
  • Programming environment.
  • Suitability for algorithmically-intensive tasks.
  • Suitability for general data munging tasks.

(Solution 3.1)


3-3. Play around for a little while with Python, R, and MATLAB. Which do you like best? What are the strengths and weaknesses of each?

(Solution 3.3)


Data Sources


3-5. A table of storage prices over time is available at http://www.jcmit.net/diskprice.htm. Analyze this data, and make a projection about the cost/volume of data storage five years from now. What will disk prices be in 25 or 50 years?

(Solution 3.5)


Data Cleaning


3-7. Find out what was weird about September 1752. What special steps might the data scientists of the day have had to take to normalize annual statistics?

(Solution 3.7)


3-9. A health sensor produces a stream of twenty different values, including blood pressure, heart rate, and body temperature. Describe two or more techniques you could use to check whether the stream of data coming from the sensor is valid.

(Solution 3.9)
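
As a starting point for this exercise, two common checks are a plausibility range for each measurement and a limit on how much a reading may change between consecutive samples. The sketch below is only an illustration: the field names, the limits in `PLAUSIBLE_RANGE`, the `max_step` threshold, and the sample readings are all assumptions invented for the example, not values from any real sensor specification.

```python
from typing import Dict, List, Tuple

# Hypothetical plausibility limits; real limits would come from the sensor spec.
PLAUSIBLE_RANGE: Dict[str, Tuple[float, float]] = {
    "heart_rate": (20.0, 250.0),  # beats per minute
    "body_temp": (30.0, 45.0),    # degrees Celsius
}


def out_of_range(field: str, values: List[float]) -> List[int]:
    """Return indices of readings outside the plausible range for this field."""
    lo, hi = PLAUSIBLE_RANGE[field]
    return [i for i, v in enumerate(values) if not (lo <= v <= hi)]


def sudden_jumps(values: List[float], max_step: float) -> List[int]:
    """Return indices where a reading differs from the previous one by more than max_step."""
    return [
        i for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) > max_step
    ]


# Made-up heart-rate stream with a dropout (0) and a spike (140).
heart_rate = [72, 74, 73, 0, 75, 76, 140, 77]
print(out_of_range("heart_rate", heart_rate))
print(sudden_jumps(heart_rate, max_step=30))
```

The range check only catches the dropout, while the jump check flags both the dropout and the spike along with the readings that immediately follow them. The two tests are therefore complementary rather than redundant.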


Implementation Projects


3-11. The laws governing voter registration records differ from state to state in the United States. Identify one or more states with very lax rules, and see what you must do to get your hands on the data. Hint: Florida.

(Solution 3.11)


Crowd Sourcing


3-13. Suppose you are paying Turkers to read texts and annotate them based on the underlying sentiment (positive or negative) that each passage conveys. This is an opinion task, but how can we algorithmically judge whether the Turker was answering in a random or arbitrary manner instead of doing their job seriously?

(Solution 3.13)
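
One possible approach, sketched here under stated assumptions, is to give the same passages to several workers and measure how often each worker agrees with the majority label. On a two-class task, a worker whose agreement rate sits near 50% is hard to distinguish from random guessing. The worker names, item identifiers, and labels below are invented for illustration, and the helper names `majority_labels` and `agreement_rate` are hypothetical.

```python
from collections import Counter
from typing import Dict


def majority_labels(annotations: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Compute the most common label per item; annotations[worker][item] = label."""
    votes: Dict[str, Counter] = {}
    for labels in annotations.values():
        for item, label in labels.items():
            votes.setdefault(item, Counter())[label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


def agreement_rate(worker_labels: Dict[str, str], consensus: Dict[str, str]) -> float:
    """Fraction of this worker's items whose label matches the consensus label."""
    matches = sum(1 for item, label in worker_labels.items() if consensus[item] == label)
    return matches / len(worker_labels)


# Made-up annotations: four workers, five passages each.
annotations = {
    "A": {"t1": "pos", "t2": "neg", "t3": "pos", "t4": "neg", "t5": "pos"},
    "B": {"t1": "pos", "t2": "neg", "t3": "pos", "t4": "neg", "t5": "neg"},
    "C": {"t1": "pos", "t2": "neg", "t3": "pos", "t4": "pos", "t5": "pos"},
    "D": {"t1": "neg", "t2": "pos", "t3": "neg", "t4": "neg", "t5": "pos"},
}

consensus = majority_labels(annotations)
rates = {w: agreement_rate(labels, consensus) for w, labels in annotations.items()}
print(rates)
```

Note that each worker's own vote is counted in the consensus here, which slightly inflates every agreement rate. A stricter variant would leave the worker out when computing the majority, or score workers against a small set of passages with known labels.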


Interview Questions


3-15. In general, how would you screen for outliers, and what should you do if you find one?

(Solution 3.15)
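
One common screening technique, shown here as a minimal sketch rather than a complete answer, is the modified z-score based on the median and the median absolute deviation (MAD). Unlike a mean-and-standard-deviation test, it is not distorted by the outliers it is trying to find. The 3.5 cutoff is a conventional rule of thumb, and the sample data and the function name `mad_outliers` are assumptions made for the example.

```python
from statistics import median
from typing import List


def mad_outliers(values: List[float], threshold: float = 3.5) -> List[float]:
    """Return values whose modified z-score exceeds the threshold in absolute value."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    # when the data are roughly normal.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


# Made-up measurements with one obviously suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
print(mad_outliers(readings))
```

Flagging a point is only the first step. Whether to correct it, drop it, or keep it depends on whether it turns out to be a recording error or a genuine extreme observation.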


3-17. During analysis, how do you treat missing values?

(Solution 3.17)
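
Two of the simplest treatments are dropping incomplete records and imputing a typical value such as the median. The sketch below contrasts them on a made-up column where `None` marks a missing entry. The helper names `drop_missing` and `impute_median` are hypothetical, and neither approach is appropriate in every case, since both can bias results when values are not missing at random.

```python
from statistics import median
from typing import List, Optional


def drop_missing(values: List[Optional[float]]) -> List[float]:
    """Keep only the observed values."""
    return [v for v in values if v is not None]


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace each missing entry with the median of the observed values."""
    fill = median(drop_missing(values))
    return [fill if v is None else v for v in values]


# Made-up column with two missing entries.
column = [4.0, None, 6.0, 8.0, None]
print(drop_missing(column))
print(impute_median(column))
```

Dropping shrinks the sample, while imputation keeps its size but understates the variance. It is usually worth recording which entries were filled in so later analysis can account for them.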


3-19. How do you efficiently scrape web data?

(Solution 3.19)


Kaggle Challenges


3-21. Predict end-of-day stock returns, without being deceived by noise. https://www.kaggle.com/c/the-winton-stock-market-challenge

(Solution 3.21)