Demystifying MapReduce: Python MR Solution

The following Python code, mapper.py, will take the text file as input and tokenize it to create a set of <key, value> pairs. The key will be the number of characters in each word, and the value will be the word itself.

#!/usr/bin/python

import re
import sys

import config

PATTERN = r'[^a-zA-Z]+'
OUTPUT_FILE = '{0}/mapper.output'.format(config.OUTPUT_DIR)


def mapper():
    ret = True
    try:
        with open(config.TARGET_FILE, 'r') as f:
            words = f.read()
        # replace every run of non-letters with a single space
        clean_words = re.sub(PATTERN, ' ', words)

        output = open(OUTPUT_FILE, 'a+')
        for word in clean_words.split(' '):
            if len(word) > 0:
                # one "<length>,<word>" pair per line
                string = '{0},{1}\n'.format(len(word), word)
                output.write(string)
        output.close()
    except Exception as e:
        print(e)
        ret = False
    return ret


if __name__ == '__main__':
    # sys.exit(True) exits with status 1; application.sh treats 1 as success
    sys.exit(mapper())


 

Make this mapper.py executable by using the command: chmod u+x mapper.py

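Run the mapper.py code from the terminal: cat DemystifyMapReduce.txt | python ./mapper.py

Note that mapper.py, as written, reads its input from config.TARGET_FILE rather than from standard input and writes its results to mapper.output inside config.OUTPUT_DIR, so that directory must already exist.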

The output will look like:

4,Ever
5,since
6,Google
9,published
3,its
8,research
5,paper
2,on
9,MapReduce
…etc.
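Since the command above pipes the text in with cat, it may help to see what a mapper that actually consumes standard input could look like. This is only a sketch and not the original mapper.py: the names map_words and run are made up for illustration, Python 3 is assumed, and the pairs go to an output stream instead of mapper.output.

```python
import re

# Same pattern as mapper.py: any run of non-letters acts as a separator.
PATTERN = r'[^a-zA-Z]+'


def map_words(text):
    """Return a list of (length, word) pairs for every word in text."""
    clean = re.sub(PATTERN, ' ', text)
    return [(len(word), word) for word in clean.split(' ') if word]


def run(stdin, stdout):
    """Hypothetical streaming entry point: read everything, emit 'length,word' lines."""
    for length, word in map_words(stdin.read()):
        stdout.write('{0},{1}\n'.format(length, word))
```

To use it in a pipe, a script would call run(sys.stdin, sys.stdout) after importing sys. Keeping the tokenizing logic in map_words also makes it easy to try on a short string without touching any files.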



Now let's write grouper.py in Python. The following grouper.py code will take the <key, value> pairs output by mapper.py and generate a number of output files, each holding the list of one-letter words, two-letter words, three-letter words, etc. For example, sheet-1.output will have all one-letter words, sheet-2.output will hold all two-letter words, and sheet-10.output will hold all ten-letter words.


 

#!/usr/bin/python

import sys

import config
import mapper

OUTPUT_DIR = '{0}/grouper'.format(config.OUTPUT_DIR)


def grouper():
    ret = True
    try:
        with open(mapper.OUTPUT_FILE, 'r') as f:
            for line in f:
                words = line.split(',')
                length = words[0]
                # remove newline, inserted by mapper
                word = words[1].replace('\n', '')

                # append the word to the sheet for its length
                output_file = '{0}/sheet-{1}.output'.format(OUTPUT_DIR, length)
                sheet = open(output_file, 'a+')
                sheet.write(word + ',')
                sheet.close()
    except Exception as e:
        print(e)
        ret = False
    return ret


if __name__ == '__main__':
    sys.exit(grouper())


Make this grouper.py executable by using the command: chmod u+x grouper.py

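Run the grouper.py code from the terminal after mapper.py has finished: python ./grouper.py

The grouper reads mapper.output directly rather than standard input, so the mapper does not need to be piped into it. Chaining the two scripts with | would start them at the same time, and the grouper could read mapper.output before it is complete. The grouper subdirectory of config.OUTPUT_DIR must exist beforehand.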

The output would be:

sheet-1.output: a,a,a,s,a,a,a,I,I,a,I,I,…etc.

sheet-2.output: on,it,If,to,be,it,it,is,in,we,…etc.

sheet-10.output: mysterious,statistics,Everything,Occurrence,Occurrence,colleagues,…etc.
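The grouping step can also be pictured without any files. The sketch below assumes the mapper's pairs are already in memory as (length, word) tuples; group_by_length is a hypothetical helper, not part of the original scripts, and the sample pairs are made up.

```python
from collections import defaultdict


def group_by_length(pairs):
    """Group (length, word) pairs into {length: [word, ...]}, keeping input order."""
    groups = defaultdict(list)
    for length, word in pairs:
        groups[length].append(word)
    return dict(groups)


pairs = [(4, 'Ever'), (5, 'since'), (2, 'on'), (2, 'it'), (1, 'a'), (1, 'I')]
groups = group_by_length(pairs)
for length in sorted(groups):
    # Mirrors the "sheet-<length>.output" naming used by grouper.py.
    print('sheet-{0}.output: {1}'.format(length, ','.join(groups[length])))
```

Each dictionary key plays the role of one sheet file: every word with the same length ends up in the same bucket, which is what lets the next stage count them independently.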



 

The following reducer.py code will count the occurrences of one-letter words, two-letter words, and so on, using the corresponding output files from grouper.py.


 

#!/usr/bin/python

import os
import sys

import config
import grouper

OUTPUT_FILE = '{0}/reducer.output'.format(config.OUTPUT_DIR)


def reducer():
    ret = True
    try:
        output = open(OUTPUT_FILE, 'a+')
        for fl in os.listdir(grouper.OUTPUT_DIR):
            if fl.endswith('.output'):
                total = 0
                with open(grouper.OUTPUT_DIR + '/' + fl) as f:
                    words = f.read()
                words_split = words.split(',')

                # drop the empty string left by the trailing comma before counting;
                # a list comprehension works on both Python 2 and Python 3
                total = len([w for w in words_split if w])
                # print(fl + ' has ' + str(total))
                output.write(
                    '({0}): There are {1} word(s) that have {2} characters.\n'.format(
                        fl, total, len(words_split[0])))
        output.close()
    except Exception as e:
        print(e)
        ret = False
    return ret


if __name__ == '__main__':
    sys.exit(reducer())


 

Make this reducer.py executable by using the command: chmod u+x reducer.py

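Run the reducer.py code from the terminal after grouper.py has finished: python ./reducer.py

The reducer reads the sheet files written by grouper.py from disk rather than standard input, so the earlier stages only need to have completed; they do not need to be piped into it.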

The output would be,



(sheet-1.output): There are 77 word(s) that have 1 characters.

(sheet-2.output): There are 246 word(s) that have 2 characters.

(sheet-13.output): There are 3 word(s) that have 13 characters.

…and so on.
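A quick way to cross-check counts like these is to compute the same histogram in memory. The following is a sketch assuming Python 3 and a text small enough to hold in a string; length_histogram is a hypothetical helper, and the sample sentence is taken from the mapper output shown earlier.

```python
import re
from collections import Counter


def length_histogram(text):
    """Count how many words of each length appear in text."""
    # [a-zA-Z]+ matches the same runs of letters that the mapper keeps as words.
    return Counter(len(word) for word in re.findall(r'[a-zA-Z]+', text))


sample = 'Ever since Google published its research paper on MapReduce'
for length, total in sorted(length_histogram(sample).items()):
    print('There are {0} word(s) that have {1} characters.'.format(total, length))
```

This collapses the map, group, and reduce stages into a single pass, which is fine for one small file but gives up the separation of stages that the three scripts illustrate.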



 

This is the desired output for the blogging use case discussed in Demystifying MapReduce.

Additionally, all of the above scripts can be run at once from a shell script. The following application.sh script recreates the output directory and then runs the mapper, grouper, and reducer in sequence; the Python scripts themselves read the configurable input and output locations from config.py.


#!/usr/bin/env bash

echo -n 'Remove and create output directory.. '
rm -rf ./output && mkdir -p ./output/grouper
ok=$?

if test ${ok} -eq 0 ; then
    echo ' Done.'
    echo -n 'Run mapper..'
    python ./mapper.py
    ok=$?
else
    echo ' Fail.'
    exit
fi

# The Python scripts exit via sys.exit(True), i.e. status 1, when they succeed.
if test ${ok} -eq 1 ; then
    echo ' Done'
    echo -n 'Run grouper..'
    python ./grouper.py
    ok=$?
else
    echo ' Fail.'
    exit
fi

if test ${ok} -eq 1 ; then
    echo ' Done'
    echo -n 'Run reducer..'
    python ./reducer.py
    # capture the reducer's status so the final check reflects it
    ok=$?
else
    echo 'Fail.'
    exit
fi

if test ${ok} -eq 1 ; then
    echo ' Done'
    echo 'Process done'
    echo 'Check statistic on ./output/reducer.output'
else
    echo 'Got error'
fi

exit $ok

 

 

 


Specify the input and output directories in config.py as follows:

TARGET_FILE = './target/DemystifyingMapReduce.txt'
OUTPUT_DIR = './output'

 


Finally, run application.sh from the terminal to generate the frequency of one-letter, two-letter, and so on up to n-letter words.

$ chmod u+x application.sh

$ ./application.sh

 

Great job! Now you can think about making this application more useful by passing multiple URLs as inputs instead of text files.
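As a starting point for that idea, here is a rough sketch of how several URLs might be fed through the same word-length count. Everything in it is an assumption for illustration: it uses Python 3's urllib, treats each response as UTF-8 text, does not strip HTML tags, and the helper names fetch_text and histogram_for_urls are made up.

```python
import re
from collections import Counter
from urllib.request import urlopen


def fetch_text(url):
    """Download url and decode it as UTF-8 (assumed encoding; HTML is not stripped)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def histogram_for_urls(urls, fetch=fetch_text):
    """Sum word-length counts over several sources; fetch can be swapped out for testing."""
    totals = Counter()
    for url in urls:
        words = re.findall(r'[a-zA-Z]+', fetch(url))
        totals.update(len(word) for word in words)
    return totals
```

Passing the fetch function as a parameter keeps the counting logic testable without network access, for example by supplying a dictionary's get method that maps made-up URLs to short strings.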

Stay tuned to learn more interesting stuff!

 

PS: Wanna play with the code? Come and grab it here.