
Demystifying MapReduce: Python MR Solution

The following Python script, mapper.py, takes the text file as input and tokenizes it into a set of <key, value> pairs. The key is the number of characters in each word, and the value is the word itself.

#!/usr/bin/python

import re, sys
import config

# Any run of non-letter characters is treated as a word separator.
PATTERN = r'[^a-zA-Z]+'
OUTPUT_FILE = '{0}/mapper.output'.format(config.OUTPUT_DIR)

def mapper():
    ret = True

    try:
        with open(config.TARGET_FILE, 'r') as f:
            words = f.read()
            clean_words = re.sub(PATTERN, ' ', words)

        # Append one "<length>,<word>" line per word.
        output = open(OUTPUT_FILE, 'a+')
        for word in clean_words.split(' '):
            if len(word) > 0:
                string = '{0},{1}\n'.format(len(word), word)
                output.write(string)

        output.close()
    except Exception as e:
        print(e)
        ret = False

    return ret

if __name__ == '__main__':
    # sys.exit(True) exits with status 1, which application.sh treats as success.
    sys.exit(mapper())


 

Make this mapper.py executable by using the command: chmod u+x mapper.py

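Run mapper.py from the terminal: cat DemystifyMapReduce.txt | python ./mapper.py

Note that mapper.py as listed reads its input from config.TARGET_FILE and appends its results to mapper.output inside config.OUTPUT_DIR (which must already exist), so the piped text is not actually consumed.

The contents of mapper.output will look like: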



4,Ever
5,since
6,Google
9,published
3,its
8,research
5,paper
2,on
9,MapReduce
...etc.
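The map step can also be sketched as a small function that works on a string in memory, which makes it easy to try out without config.py or an output directory. This is only an illustration under assumptions: the name map_words is hypothetical and not part of the original scripts, and it reuses the same pattern and <length, word> format as mapper.py.

```python
import re

# Same pattern as mapper.py: any run of non-letters is a separator.
PATTERN = r'[^a-zA-Z]+'


def map_words(text):
    """Yield (length, word) pairs for every word in text (hypothetical helper)."""
    # split() with no argument also discards the empty strings that
    # leading or trailing separators would otherwise produce.
    for word in re.sub(PATTERN, ' ', text).split():
        yield len(word), word


pairs = list(map_words("Ever since Google published its research paper"))
for length, word in pairs:
    print('{0},{1}'.format(length, word))
```

Feeding it sys.stdin.read() instead of a literal string would be one way to make the piped command above consume its input.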



Now let's write grouper.py in Python. It takes the <key, value> pairs produced by mapper.py and generates a set of output files, each holding the words of one length: one-letter words, two-letter words, three-letter words, and so on. For example, sheet-1.output holds all one-letter words, sheet-2.output holds all two-letter words, and sheet-10.output holds all ten-letter words.


 

#!/usr/bin/python

import config
import mapper
import sys

OUTPUT_DIR = '{0}/grouper'.format(config.OUTPUT_DIR)

def grouper():
    ret = True
    try:
        with open(mapper.OUTPUT_FILE, 'r') as f:
            for line in f:
                words = line.split(',')
                length = words[0]
                # remove newline, inserted by mapper
                word = words[1].replace('\n', '')

                # append the word to the sheet for its length
                output_file = '{0}/sheet-{1}.output'.format(OUTPUT_DIR, length)
                sheet = open(output_file, 'a+')
                sheet.write(word + ',')
                sheet.close()
    except Exception as e:
        print(e)
        ret = False
    return ret

if __name__ == '__main__':
    # As in mapper.py, True becomes exit status 1 (success for application.sh).
    sys.exit(grouper())


Make this grouper.py executable by using the command: chmod u+x grouper.py

Run the grouper.py code from the terminal: cat DemystifyMapReduce.txt | python ./mapper.py | python ./grouper.py

The output would be:



sheet-1.output: a,a,a,s,a,a,a,I,I,a,I,I,...etc.

sheet-2.output: on,it,If,to,be,it,it,is,in,we,...etc.

sheet-10.output: mysterious,statistics,Everything,Occurrence,Occurrence,colleagues,...etc.
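The grouping step is what MapReduce frameworks usually call the shuffle: all values that share a key end up together. A minimal in-memory sketch of that idea, assuming the pairs fit in memory; group_by_length is a hypothetical name, and a dictionary of lists stands in for the sheet files that grouper.py writes.

```python
from collections import defaultdict


def group_by_length(pairs):
    """Collect words into lists keyed by their length (hypothetical helper)."""
    groups = defaultdict(list)
    for length, word in pairs:
        groups[length].append(word)
    return dict(groups)


pairs = [(4, 'Ever'), (5, 'since'), (2, 'on'), (2, 'it'), (5, 'paper')]
groups = group_by_length(pairs)
# groups[2] plays the role of the sheet file for two-letter words.
print(groups[2])
```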



 

The following reducer.py script counts the words in each output file produced by grouper.py, giving the number of occurrences of one-letter words, two-letter words, ten-letter words, and so on.


 

#!/usr/bin/python

import grouper, config
import os, sys

OUTPUT_FILE = '{0}/reducer.output'.format(config.OUTPUT_DIR)

def reducer():
    ret = True

    try:
        output = open(OUTPUT_FILE, 'a+')

        for fl in os.listdir(grouper.OUTPUT_DIR):
            if fl.endswith('.output'):
                total = 0
                with open(grouper.OUTPUT_DIR + '/' + fl) as f:
                    words = f.read()
                    words_split = words.split(',')

                    # drop empty strings (the trailing comma leaves one);
                    # a list comprehension works on both Python 2 and 3
                    total = len([w for w in words_split if w])
                    # print(fl + ' has ' + str(total))
                    output.write('({0}): There are {1} word(s) that have {2} characters.\n'.format(fl, total, len(words_split[0])))

        output.close()
    except Exception as e:
        print(e)
        ret = False

    return ret

if __name__ == '__main__':
    # As in the other scripts, True becomes exit status 1.
    sys.exit(reducer())


 

Make this reducer.py executable by using the command: chmod u+x reducer.py

Run the reducer.py code from the terminal: cat DemystifyMapReduce.txt | python ./mapper.py | python ./grouper.py | python ./reducer.py

The output would be:



(sheet-1.output): There are 77 word(s) that have 1 characters.

(sheet-2.output): There are 246 word(s) that have 2 characters.

(sheet-13.output): There are 3 word(s) that have 13 characters.

...and so on.
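The reduce step boils each group down to a single number. As a rough sketch, assuming the groups are held in a dictionary rather than in sheet files, the counting and reporting could look like this; reduce_counts and format_report are hypothetical names, and the report wording simply mirrors the lines above without the file name.

```python
def reduce_counts(groups):
    """Map each word length to the number of words in its group."""
    return {length: len(words) for length, words in groups.items()}


def format_report(counts):
    """Build one report line per word length, shortest words first."""
    template = 'There are {0} word(s) that have {1} characters.'
    return [template.format(counts[length], length) for length in sorted(counts)]


groups = {2: ['on', 'it', 'is'], 1: ['a', 'I']}
counts = reduce_counts(groups)
report = format_report(counts)
for line in report:
    print(line)
```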



 

This is the desired output for the blogging use case discussed in Demystifying MapReduce.

Additionally, all of the above scripts can be run at once from a shell script. The following application.sh script runs them in sequence; the Python scripts themselves read the configurable input and output locations from config.py.


#!/usr/bin/env bash

# Note: the Python scripts call sys.exit(True) on success, so for them an
# exit status of 1 means success, unlike the usual shell convention.

echo -n 'Remove and create output directory.. '
rm -rf ./output && mkdir -p ./output/grouper
ok=$?

if test ${ok} -eq 0 ; then
    echo ' Done.'
    echo -n 'Run mapper..'
    python ./mapper.py
    ok=$?
else
    echo ' Fail.'
    exit
fi

if test ${ok} -eq 1 ; then
    echo ' Done'
    echo -n 'Run grouper..'
    python ./grouper.py
    ok=$?
else
    echo ' Fail.'
    exit
fi

if test ${ok} -eq 1 ; then
    echo ' Done'
    echo -n 'Run reducer..'
    python ./reducer.py
    # capture the reducer's status; otherwise the grouper's status is reused
    ok=$?
else
    echo ' Fail.'
    exit
fi

if test ${ok} -eq 1 ; then
    echo ' Done'
    echo 'Process done'
    echo 'Check statistic on ./output/reducer.output'
else
    echo 'Got error'
fi

exit $ok

 

 

 


Specify the input file and output directory in config.py as follows:


TARGET_FILE = './target/DemystifyingMapReduce.txt'

OUTPUT_DIR = './output'

 


Finally, run application.sh from the terminal to generate the frequencies of one-letter, two-letter, and in general n-letter words.

$ chmod u+x application.sh

$ ./application.sh

 

Great job! Now you can think about making this application more useful by passing multiple URLs as inputs instead of text files.
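One possible starting point for the URL idea is sketched below. It is not part of the original scripts and rests on several assumptions: Python 3, pages that are plain UTF-8 text (HTML markup would be counted as words too), and the hypothetical helper names fetch_text and count_word_lengths. The fetch function is passed in as a parameter so the counting logic can be tried without network access.

```python
import re
from collections import Counter
from urllib.request import urlopen

# Same pattern as mapper.py: any run of non-letters is a separator.
PATTERN = r'[^a-zA-Z]+'


def fetch_text(url):
    """Download a URL and decode it as UTF-8 (assumes plain-text pages)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def count_word_lengths(urls, fetch=fetch_text):
    """Return a Counter mapping word length to word count across all URLs."""
    counts = Counter()
    for url in urls:
        words = re.sub(PATTERN, ' ', fetch(url)).split()
        counts.update(len(word) for word in words)
    return counts
```

Calling count_word_lengths with a list of URLs performs the map, group, and reduce steps in one pass; swapping fetch for a function that returns canned text is a simple way to experiment offline.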

Stay tuned to learn more interesting stuff!

 
