Data-science-TDSM

From The Data Science Design Manual Wikia

What is Data Science?

Identifying Data Sets


1-1. Identify where interesting data sets relevant to the following domains can be found on the web:

  1. books.
  2. horse racing.
  3. stock prices.
  4. risks of diseases.
  5. colleges and universities.
  6. crime rates.
  7. bird watching.

For each of these data sources, explain what you must do to turn this data into a usable format on your computer for analysis.
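
As one illustration of that last step, many such sources offer downloads as CSV text. The following is a minimal sketch, assuming a hypothetical file with `title`, `year`, and `price` columns (the sample rows are made up for illustration and do not come from any real source). It shows how raw text might be turned into typed records with Python's standard `csv` module:

```python
import csv
import io

# Made-up sample standing in for a downloaded file; real sources will differ.
SAMPLE = """title,year,price
Example Book A,2008,59.99
Example Book B,2017,
"""

def load_records(text):
    """Parse CSV text into a list of dicts with typed fields."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "title": row["title"].strip(),
            "year": int(row["year"]),
            # A missing price becomes None instead of crashing the parse.
            "price": float(row["price"]) if row["price"] else None,
        })
    return records

records = load_records(SAMPLE)
```

Other sources may need more work first (for example, scraping HTML tables or calling an API), but the goal is the same: consistent field names, correct types, and an explicit representation of missing values.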

(Solution 1.1)


1-3. Visit data.gov, and identify five data sets that sound interesting to you. For each, write a brief description and propose three interesting things you might do with it.

(Solution 1.3)


Asking Questions


1-5. Visit Entrez, the National Center for Biotechnology Information (NCBI) portal. Investigate what data sources are available, particularly the Pubmed and Genome resources. Propose three interesting projects to explore with each of them.

(Solution 1.5)


1-7. You would like to conduct an experiment to see whether students learn better if they study without any music, with instrumental music, or with songs that have lyrics. Briefly outline the design for such a study.
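
One ingredient such a design would likely need is random assignment of students to the three conditions, so that differences between groups are not explained by who chose which condition. A minimal sketch, assuming a hypothetical roster of 30 student identifiers and a fixed seed for reproducibility (both are illustrative choices, not part of the exercise):

```python
import random

CONDITIONS = ["no music", "instrumental music", "music with lyrics"]

def assign_conditions(participants, seed=0):
    """Randomly assign participants to conditions in near-equal groups."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in CONDITIONS}
    # Deal shuffled participants round-robin so group sizes differ by at most one.
    for i, person in enumerate(shuffled):
        groups[CONDITIONS[i % len(CONDITIONS)]].append(person)
    return groups

# Hypothetical roster used only for illustration.
groups = assign_conditions([f"student_{i}" for i in range(30)])
```

A full design would also have to specify the study material, the test used to measure learning, and how the groups' scores are compared.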

(Solution 1.7)


Implementation Projects


1-9. Write a program to scrape the best-seller rank for a book on Amazon.com. Use this to plot the rank of all of Skiena's books over time. Which one of these books should be the next item that you purchase? Do you have friends for whom they would make a welcome and appropriate gift? :-)
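
The parsing step of such a scraper can be sketched as follows. This is only an illustration: it assumes the rank appears in the page text in a form like "#12,345 in Books", which is a guess about the page layout and not a documented format, and the sample snippet is made up. Fetching the page and recording ranks over time are left out.

```python
import re

# Assumed pattern: text such as "#12,345 in Books". The real page layout may differ.
RANK_PATTERN = re.compile(r"#([\d,]+)\s+in\s+Books")

def parse_rank(html):
    """Return the best-seller rank as an int, or None if no rank is found."""
    match = RANK_PATTERN.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))

# Made-up snippet standing in for downloaded page content.
sample = "<span>Best Sellers Rank: #12,345 in Books</span>"
rank = parse_rank(sample)
```

Returning `None` when nothing matches lets the caller distinguish "no rank on the page" from a parsed value, which matters when the program runs repeatedly and the layout changes.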

(Solution 1.9)


Interview Questions


1-11. For each of the following questions: (1) produce a quick guess based only on your understanding of the world, and then (2) use Google to find supportable numbers from which to produce a more principled estimate. By how much did your two estimates differ?

  1. How many piano tuners are there in the entire world?
  2. How much does the ice in a hockey rink weigh?
  3. How many gas stations are there in the United States?
  4. How many people fly in and out of LaGuardia Airport every day?
  5. How many gallons of ice cream are sold in the U.S. each year?
  6. How many basketballs are purchased by the National Basketball Association (NBA) each year?
  7. How many fish are there in all the world's oceans?
  8. How many people are flying in the air right now, all over the world?
  9. How many ping-pong balls can fit in a large commercial jet?
  10. How many miles of paved road are there in your favorite country?
  11. How many dollar bills are sitting in the wallets of all people at Stony Brook University?
  12. How many gallons of gasoline does a typical gas station sell per day?
  13. How many words are there in this book?
  14. How many cats live in New York City?
  15. How much would it cost to fill a typical car's gas tank with Starbucks coffee?
  16. How much tea is there in China?
  17. How many checking accounts are there in the United States?
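
A principled estimate of this kind usually chains a few quantities together. The sketch below does this for the piano tuner question. Every input except the rough world population is a placeholder guess chosen for illustration, not a researched figure, and should be replaced with numbers you can support.

```python
def estimate_piano_tuners(population, people_per_piano,
                          tunings_per_piano_per_year,
                          tunings_per_tuner_per_year):
    """Estimate the number of tuners from demand and per-tuner capacity."""
    pianos = population / people_per_piano
    tunings_needed = pianos * tunings_per_piano_per_year
    return tunings_needed / tunings_per_tuner_per_year

estimate = estimate_piano_tuners(
    population=8e9,                   # rough world population
    people_per_piano=500,             # placeholder guess
    tunings_per_piano_per_year=1,     # placeholder guess
    tunings_per_tuner_per_year=1000,  # placeholder guess: about 4 a day, 250 days
)
```

With these placeholder inputs the result is 16,000 tuners. Writing the estimate as a function makes it easy to see how sensitive the answer is to each guess.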

(Solution 1.11)


1-13. How would you build a data-driven recommendation system? What are the limitations of this approach?
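
One common data-driven approach is user-based collaborative filtering: score the items a user has not seen by the ratings of other users, weighted by how similar those users are. The sketch below is one possible illustration, not a reference answer. The ratings are made up, and cosine similarity over shared items is just one of several reasonable similarity choices.

```python
import math

# Made-up ratings for illustration: user -> {item: rating}.
RATINGS = {
    "ann": {"A": 5, "B": 4, "C": 5},
    "bob": {"A": 4, "B": 5},
    "cat": {"C": 5, "D": 4},
    "dan": {"A": 5, "D": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    dot = sum(u[item] * v[item] for item in set(u) & set(v))
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def recommend(user, ratings, k=1):
    """Return the top-k unseen items, scored by similarity-weighted ratings."""
    seen = ratings[user]
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(seen, their_ratings)
        for item, rating in their_ratings.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

top_for_bob = recommend("bob", RATINGS)
```

The sketch also hints at a limitation: a user with no ratings has zero similarity to everyone, so every candidate item scores zero. This is the cold-start problem, and the same issue applies to items that nobody has rated yet.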

(Solution 1.13)


1-15. Do you think data science is an art or a science?

(Solution 1.15)


Kaggle Challenges


1-17. Where is a particular taxi cab going? https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i

(Solution 1.17)