Scale-TDSM

From The Data Science Design Manual Wikia

Big Data: Achieving Scale

Parallel and Distributed Processing


12-1. What is the difference between parallel processing and distributed processing?

(Solution 12.1)


12-3. Design MapReduce algorithms to take large files of integers and compute:

  • The largest integer.
  • The average of all the integers.
  • The number of distinct integers in the input.
  • The mode of the integers.
  • The median of the integers.

(Solution 12.3)
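
As a starting point for the first three parts, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It is not a real Hadoop or Spark job: `run_mapreduce`, the mapper and reducer names, and the sample `chunks` are all hypothetical, and each input chunk is assumed to be non-empty. The mode and median are left open, since they need more thought about what each reducer must see.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy MapReduce: apply mapper, group values by key, apply reducer."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)          # the "shuffle" step
    return {key: reducer(key, values) for key, values in groups.items()}

# Largest integer: each mapper emits its local maximum under one key.
def max_mapper(chunk):
    yield "max", max(chunk)

def max_reducer(key, values):
    return max(values)

# Average: emit (sum, count) pairs, because averages of averages are wrong
# when chunks differ in size.
def avg_mapper(chunk):
    yield "avg", (sum(chunk), len(chunk))

def avg_reducer(key, values):
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return total / count

# Distinct count: use the integer itself as the key, then count the keys.
def distinct_mapper(chunk):
    for x in set(chunk):                       # local de-duplication
        yield x, 1

def distinct_reducer(key, values):
    return 1

# Hypothetical input: each inner list stands for one file split.
chunks = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5]]

largest = run_mapreduce(chunks, max_mapper, max_reducer)["max"]
average = run_mapreduce(chunks, avg_mapper, avg_reducer)["avg"]
distinct = len(run_mapreduce(chunks, distinct_mapper, distinct_reducer))
```

Note that the maximum and average funnel everything through a single key, which is fine here because each mapper emits only one small record per chunk.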


12-5. Would we expect the problem of map skew to increase or decrease when we combine counts from each file before emitting them?

(Solution 12.5)


Ethics


12-7. What are five practical ways one can go about protecting privacy in big data?

(Solution 12.7)


12-9. Give examples of decision making where you would trust an algorithm to make decisions as good as or better than those of a person. For what tasks would you trust human judgment more than an algorithm? Why?

(Solution 12.9)


Implementation Projects


12-11. Set up a Hadoop or Spark cluster that spans two or more machines. Run a basic task like word counting. Does it really run faster than a simple job on one machine? How many machines/cores do you need in order to win?

(Solution 12.11)


Interview Questions


12-13. What is your definition of big data?

(Solution 12.13)


12-15. Give five predictions about what will happen in the world over the next 20 years.

(Solution 12.15)


12-17. How might you detect bogus reviews, or bogus Facebook accounts used for bad purposes?

(Solution 12.17)


12-19. Do you think that the typed login/password will eventually disappear? How might it be replaced?

(Solution 12.19)


12-21. What are hash table collisions? How can they be avoided? How frequently do they occur?

(Solution 12.21)
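
To get a feel for how often collisions occur, the sketch below counts collisions for a simple modular hash and computes the birthday-problem probability of at least one collision. It assumes an idealized hash that spreads keys uniformly over the buckets; the function names and sample keys are hypothetical.

```python
def count_collisions(keys, num_buckets):
    """Count keys that land in a bucket already occupied by an earlier key."""
    seen = set()
    collisions = 0
    for key in keys:
        bucket = key % num_buckets     # simple modular hash for integer keys
        if bucket in seen:
            collisions += 1
        else:
            seen.add(bucket)
    return collisions

def collision_probability(num_keys, num_buckets):
    """P(at least one collision) for num_keys uniform random hashes."""
    p_no_collision = 1.0
    for i in range(num_keys):
        p_no_collision *= (num_buckets - i) / num_buckets
    return 1.0 - p_no_collision

# 0, 7, and 14 all map to bucket 0 when there are 7 buckets.
collisions = count_collisions([0, 7, 14, 3], 7)

# Classic birthday setting: 23 keys, 365 buckets.
p = collision_probability(23, 365)
```

Under the uniformity assumption, a collision becomes more likely than not once the number of keys reaches roughly the square root of the number of buckets, so collisions cannot be avoided in general, only handled (for example by chaining or open addressing).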


Kaggle Challenges


12-23. Which customers are worth sending junk mail to? https://www.kaggle.com/c/springleaf-marketing-response

(Solution 12.23)