Revision as of 22:16, 31 March 2017
Big Data: Achieving Scale
Parallel and Distributed Processing
12-1.
What is the difference between parallel processing and distributed processing?
12-3.
Design MapReduce algorithms that take large files of integers and compute:
- The largest integer.
- The average of all the integers.
- The number of distinct integers in the input.
- The mode of the integers.
- The median of the integers.
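As a possible starting point for 12-3, the sketch below simulates the map and reduce steps in a single Python process rather than on a real cluster. The names `map_chunk`, `combine`, and `summarize` are hypothetical, each "file" is assumed to be a non-empty list of integers, and only the first three quantities are covered. Collecting the distinct values in one set is an assumption that they fit in memory.

```python
from functools import reduce

def map_chunk(chunk):
    """Map step: summarise one non-empty chunk as (max, sum, count, distinct set)."""
    return (max(chunk), sum(chunk), len(chunk), set(chunk))

def combine(a, b):
    """Reduce step: merge two partial summaries (associative and commutative)."""
    return (max(a[0], b[0]), a[1] + b[1], a[2] + b[2], a[3] | b[3])

def summarize(chunks):
    """Run the simulated job and report largest, average, and distinct count."""
    largest, total, count, distinct = reduce(combine, map(map_chunk, chunks))
    return {"largest": largest, "average": total / count, "distinct": len(distinct)}

# Three small stand-ins for large input files (illustrative data only).
chunks = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5]]
result = summarize(chunks)
print(result)
```

The average is carried as a (sum, count) pair because averages of averages are not correct when chunks differ in size. The mode and median do not reduce to a fixed-size summary in this way and need a different design.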
12-5.
Would we expect the problem of map skew to increase or decrease when
we combine counts from each file before emitting them?
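To make "combining counts from each file before emitting them" concrete, here is a minimal single-process sketch, assuming a word-count job. The function names and the two tiny input strings are hypothetical. It only shows how per-file combining changes what each mapper emits, and leaves the question about skew open.

```python
from collections import Counter

def map_naive(text):
    """Emit one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in text.split()]

def map_combined(text):
    """Count within the file first, then emit one (word, count) pair per distinct word."""
    return list(Counter(text.split()).items())

def reduce_counts(pairs):
    """Sum the counts emitted for each word."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return totals

# Two stand-in "files" (illustrative data only).
files = ["a b a a", "b b c"]
naive = [pair for f in files for pair in map_naive(f)]
combined = [pair for f in files for pair in map_combined(f)]

print(len(naive), len(combined))
print(reduce_counts(naive) == reduce_counts(combined))
```

Both variants produce the same final totals; they differ in how many pairs leave each mapper.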
Ethics
12-7.
What are five practical ways one can go about protecting privacy in big data?
12-9.
Give examples of decision making where you would trust an algorithm to make decisions as good as or better than a person's. For what tasks would you trust human judgment more than an algorithm? Why?
Implementation Projects
12-11.
Set up a Hadoop or Spark cluster that spans two or more machines. Run a basic task like word counting. Does it really run faster than a simple job on one machine? How many machines/cores do you need in order to win?
Interview Questions
12-13.
What is your definition of big data?
12-15.
Give five predictions about what will happen in the world over the next 20 years.
12-17.
How might you detect bogus reviews, or bogus Facebook accounts used for bad purposes?
12-19.
Do you think that the typed login/password will eventually disappear? How might it be replaced?
12-21.
What are hash table collisions? How can they be avoided? How frequently do they occur?
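For the "how frequently" part of 12-21, a rough estimate is possible under the idealised assumption that a hash function spreads keys uniformly and independently over the buckets, which is the birthday problem. The helper name `collision_probability` is hypothetical, and real hash functions and key distributions may behave differently.

```python
def collision_probability(num_keys, num_buckets):
    """Probability that at least two of num_keys keys share a bucket,
    assuming each key lands in a uniformly random bucket."""
    if num_keys > num_buckets:
        return 1.0  # pigeonhole principle: a collision is certain
    p_no_collision = 1.0
    for i in range(num_keys):
        # The (i+1)-th key must avoid the i buckets already occupied.
        p_no_collision *= (num_buckets - i) / num_buckets
    return 1.0 - p_no_collision

# Classic birthday setting: 23 keys into 365 buckets.
p = collision_probability(23, 365)
print(round(p, 3))
```

Under this model a collision becomes more likely than not once the number of keys reaches roughly the square root of the number of buckets, so tables are usually designed to handle collisions rather than to avoid them entirely.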
Kaggle Challenges
12-23.
Which customers are worth sending junk mail to?
https://www.kaggle.com/c/springleaf-marketing-response