Big Data: Achieving Scale

Parallel and Distributed Processing

12-1. What is the difference between parallel processing and distributed processing?

12-3. Design MapReduce algorithms to take large files of integers and compute:

The largest integer.
The average of all the integers.
The number of distinct integers in the input.
The mode of the integers.
The median of the integers.
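
One way to approach the first two parts is sketched below. It is a single-process simulation, not a Hadoop or Spark job: the names `map_chunk`, `combine`, and `largest_and_average` are hypothetical, and the in-memory lists stand in for file splits. Each mapper summarizes its chunk as a (max, sum, count) triple. The reducer merges triples with an associative, commutative operation, so partial results can be combined in any order. Carrying the sum and count separately matters because averaging per-chunk averages gives the wrong answer when chunks differ in size.

```python
from functools import reduce

def map_chunk(chunk):
    """Map step: summarize one non-empty chunk as (max, sum, count)."""
    return (max(chunk), sum(chunk), len(chunk))

def combine(a, b):
    """Reduce step: merge two partial summaries (order does not matter)."""
    return (max(a[0], b[0]), a[1] + b[1], a[2] + b[2])

def largest_and_average(chunks):
    """Simulate the job over a list of chunks; assumes at least one non-empty chunk."""
    partials = [map_chunk(c) for c in chunks if c]  # skip empty chunks
    largest, total, count = reduce(combine, partials)
    return largest, total / count

# Illustrative data only: three "file splits" of different sizes.
chunks = [[3, 9, 4], [7, 7], [1, 12, 5, 2]]
largest, average = largest_and_average(chunks)
print(largest, average)
```

The distinct count, mode, and median do not reduce to a fixed-size summary in the same way, so this sketch deliberately leaves them open.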

12-5. Would we expect the problem of map skew to increase or decrease when we combine counts from each file before emitting them?

(Solution 12.5)

Ethics

12-7. What are five practical ways one can go about protecting privacy in big data?

(Solution 12.7)

12-9. Give examples of decisions where you would trust an algorithm to do as well as or better than a person. For what tasks would you trust human judgment more than an algorithm? Why?

(Solution 12.9)

Implementation Projects

12-11. Set up a Hadoop or Spark cluster that spans two or more machines. Run a basic task like word counting. Does it really run faster than a simple job on one machine? How many machines/cores do you need in order to win?

(Solution 12.11)

Interview Questions

12-13. What is your definition of big data?

(Solution 12.13)

12-15. Give five predictions about what will happen in the world over the next 20 years.

(Solution 12.15)

12-17. How might you detect bogus reviews, or bogus Facebook accounts used for bad purposes?

(Solution 12.17)

12-19. Do you think that the typed login/password will eventually disappear? How might it be replaced?

(Solution 12.19)

12-21. What are hash table collisions? How can they be avoided? How frequently do they occur?

(Solution 12.21)
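
To get a feel for the "how frequently" part, the sketch below estimates the chance that at least one collision occurs when keys are hashed into a table. It assumes an idealized hash function that sends each key to a uniformly random bucket, independently of the others, which real hash functions only approximate. The function name `collision_probability` is hypothetical. This is the birthday-problem calculation.

```python
def collision_probability(n_keys, n_buckets):
    """Probability that at least two of n_keys land in the same bucket,
    assuming each key is hashed uniformly and independently."""
    if n_keys > n_buckets:
        return 1.0  # pigeonhole: more keys than buckets forces a collision
    p_no_collision = 1.0
    for i in range(n_keys):
        # The (i+1)-th key must avoid the i buckets already occupied.
        p_no_collision *= (n_buckets - i) / n_buckets
    return 1.0 - p_no_collision

# Classic birthday setting: 23 keys, 365 buckets.
p = collision_probability(23, 365)
print(round(p, 4))
```

Under this assumption, collisions become likely once the number of keys is on the order of the square root of the number of buckets, well before the table is full.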

Kaggle Challenges

12-23. Which customers are worth sending junk mail to? https://www.kaggle.com/c/springleaf-marketing-response

(Solution 12.23)
