TDSM 12.3

From The Data Science Design Manual Wikia

1. Largest Number

map(file_id,iterator numbers){

   max=INTEGER.MIN_VALUE
   while(numbers.hasNext()):
        num=numbers.next()
        if(num>max):
           max=num
   end while
   emit('max',max)

}


reduce(key, iterator max_values){

   max=INTEGER.MIN_VALUE
   while(max_values.hasNext()):
        num=max_values.next()
        if(num>max):
           max=num
   end while
   emit('overall_max',max)   

}


We are given a list of files, each containing a list of numbers. In MapReduce, the nodes work in parallel: each node picks a file and runs the map function, passing file_id as the key and an iterator over the integers in that file as the value. The map function finds the maximum within its file and emits it under the constant key 'max'. The MapReduce framework then groups the emitted key-value pairs by key and routes them across the network of nodes to the reducers. Because every mapper uses the same key, all per-file maxima arrive at a single reduce call.

The reduce function receives the single key 'max' together with an iterator, max_values, whose elements are the maxima of the individual files. It scans them for the largest value and emits it as 'overall_max'. This is correct because the maximum of the per-file maxima equals the maximum over all the numbers.
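The two functions above can be tried out in a single process. The following is a minimal Python sketch, not a real MapReduce deployment: run_job is a hypothetical stand-in for the framework (map every file, group the emitted pairs by key, reduce each group), and the file names and numbers are made-up sample data. Unlike the pseudocode, this sketch assumes an empty file should emit nothing rather than INTEGER.MIN_VALUE.

```python
from collections import defaultdict


def map_max(file_id, numbers):
    """Emit one ('max', local_max) pair for a single file."""
    local_max = None
    for num in numbers:
        if local_max is None or num > local_max:
            local_max = num
    if local_max is not None:  # assumption: an empty file emits nothing
        yield ("max", local_max)


def reduce_max(key, max_values):
    """Emit the largest of the per-file maxima."""
    yield ("overall_max", max(max_values))


def run_job(files, mapper, reducer):
    """Hypothetical single-process stand-in for the framework:
    run the mapper on every file, group pairs by key, run the reducer per key."""
    grouped = defaultdict(list)
    for file_id, numbers in files.items():
        for key, value in mapper(file_id, iter(numbers)):
            grouped[key].append(value)
    results = []
    for key, values in grouped.items():
        results.extend(reducer(key, iter(values)))
    return results


# Made-up sample input: file_id -> numbers in that file
files = {"f1": [3, 9, 2], "f2": [14, -5], "f3": [7]}
result = run_job(files, map_max, reduce_max)
print(result)  # prints [('overall_max', 14)]
```

Because generators can yield any number of pairs, yield plays the role of emit here: it hands a pair to the caller without ending the function.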


2. Average

map(file_id,iterator numbers){

   sum=0
   count=0
   while(numbers.hasNext()):
        num=numbers.next()
        sum+=num
        count+=1
   end while
   emit('avg',(sum,count))

}

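Note: the value emitted by map is the tuple (sum, count), not the per-file average. The reducer needs both parts because files can differ in length, so averaging the per-file averages would weight short files too heavily and give the wrong result.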

reduce(key, iterator sum_count_tuples){

   sum=0
   count=0
   while(sum_count_tuples.hasNext()):
        sum_i,count_i=sum_count_tuples.next()
        sum=sum+sum_i
        count=count+count_i
   end while
   if(count>0):
        emit('overall_avg',(sum/count))

}
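A short Python sketch of the same idea, under the assumption that the framework delivers every (sum, count) tuple to one reduce call. The file names and numbers are made-up sample data, and the grouping step is done by hand since all pairs share the key 'avg'. The naive value at the end is only there for comparison: it shows what goes wrong if per-file averages are averaged instead.

```python
def map_avg(file_id, numbers):
    """Emit ('avg', (sum, count)) for a single file."""
    total, count = 0, 0
    for num in numbers:
        total += num
        count += 1
    yield ("avg", (total, count))


def reduce_avg(key, sum_count_tuples):
    """Combine the partial sums and counts, then divide once."""
    total, count = 0, 0
    for partial_sum, partial_count in sum_count_tuples:
        total += partial_sum
        count += partial_count
    if count > 0:  # no numbers at all: nothing to emit
        yield ("overall_avg", total / count)


# Made-up sample input: files of different lengths
files = {"f1": [1, 2, 3], "f2": [10]}

# Map phase; every pair has the same key, so grouping is trivial
pairs = [pair for fid, nums in files.items() for pair in map_avg(fid, iter(nums))]
partials = [value for _, value in pairs]

result = list(reduce_avg("avg", iter(partials)))
print(result)  # prints [('overall_avg', 4.0)]

# For comparison only: averaging the per-file averages gives (2 + 10) / 2
naive = sum(s / c for s, c in partials) / len(partials)
print(naive)  # prints 6.0
```

The true mean of 1, 2, 3, 10 is 16/4 = 4, which is what the (sum, count) approach returns.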

3. Distinct

map(file_id,iterator numbers){

   while(numbers.hasNext()):
        num=numbers.next()
        emit(num,1)
   end while

}

Note: in MapReduce, emit is not a return statement. It sends a key-value pair to the framework and execution continues, so it can be called repeatedly inside a loop. Here the number itself is the key.

reduce(uniq_num, iterator values){

   emit(uniq_num,1)   

}

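Note: values is the list of 1s, [1,1,1,...,1], that the mappers emitted for this key, one per occurrence of the number. The framework calls reduce once per distinct key, so emitting uniq_num once per call outputs each distinct number exactly once. The contents of values are not used.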
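The grouping step is what removes the duplicates, which a small Python sketch can make visible. It assumes a single-process stand-in for the framework's group-by-key step (the grouped dictionary below is hypothetical scaffolding, not part of any real MapReduce API), and the input files are made-up sample data.

```python
from collections import defaultdict


def map_distinct(file_id, numbers):
    """Emit (num, 1) for every number, duplicates included."""
    for num in numbers:
        yield (num, 1)


def reduce_distinct(uniq_num, values):
    """Called once per distinct key; the list of 1s is ignored."""
    yield (uniq_num, 1)


# Made-up sample input with duplicates within and across files
files = {"f1": [4, 4, 7], "f2": [7, 1]}

# Stand-in for the framework's shuffle: collect values under their key
grouped = defaultdict(list)
for file_id, numbers in files.items():
    for key, value in map_distinct(file_id, iter(numbers)):
        grouped[key].append(value)

# One reduce call per distinct key
distinct = []
for key, values in grouped.items():
    for uniq_num, _ in reduce_distinct(key, iter(values)):
        distinct.append(uniq_num)
distinct.sort()

print(dict(grouped))  # prints {4: [1, 1], 7: [1, 1], 1: [1]}
print(distinct)       # prints [1, 4, 7]
```

Five numbers go into the map phase, but only three keys exist after grouping, so reduce runs three times.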