The following Python script, mapper.py, takes the text file as input and tokenizes it into a set of <key, value> pairs. The key is the number of characters in each word, and the value is the word itself.
#!/usr/bin/python
import re, sys
import config

# Any run of non-letter characters is treated as a word separator.
PATTERN = r'[^a-zA-Z]+'
OUTPUT_FILE = '{0}/mapper.output'.format(config.OUTPUT_DIR)

def mapper():
    ret = True
    try:
        with open(config.TARGET_FILE, 'r') as f:
            words = f.read()
        clean_words = re.sub(PATTERN, ' ', words)
        output = open(OUTPUT_FILE, 'a+')
        for word in clean_words.split(' '):
            if len(word) > 0:
                # One "<length>,<word>" pair per line.
                string = '{0},{1}\n'.format(len(word), word)
                output.write(string)
        output.close()
    except Exception as e:
        print(e)
        ret = False
    # sys.exit(True) yields exit status 1, which application.sh treats as success.
    return ret

if __name__ == '__main__':
    sys.exit(mapper())
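Make mapper.py executable with the command: chmod u+x mapper.py
Run mapper.py from the terminal: cat DemystifyMapReduce.txt | python ./mapper.py
Note that mapper.py, as written, reads the file named by config.TARGET_FILE rather than standard input, and appends its pairs to mapper.output in config.OUTPUT_DIR (which must already exist) instead of printing them.
The contents of mapper.output will look like: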
4,Ever
5,since
6,Google
9,published
3,its
8,research
5,paper
2,on
9,MapReduce
... and so on.
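To see the map step in isolation, here is a minimal in-memory sketch. It reuses the same pattern as mapper.py but works on a string instead of files; the function name map_words and the sample sentence (assembled from the words in the output above) are illustrative assumptions, not part of the original scripts.

```python
import re

# Same pattern as mapper.py: any run of non-letters is a separator.
PATTERN = r'[^a-zA-Z]+'

def map_words(text):
    """Yield (length, word) pairs for every word in text (hypothetical helper)."""
    # split() with no argument drops empty strings, so no length check is needed.
    for word in re.sub(PATTERN, ' ', text).split():
        yield len(word), word

# Sample sentence assembled from the words shown in the output above.
sample = 'Ever since Google published its research paper on MapReduce'
for length, word in map_words(sample):
    print('{0},{1}'.format(length, word))
```

Each printed line has the same "<length>,<word>" shape that mapper.py writes to mapper.output.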
Now let's write grouper.py in Python. The following grouper.py script takes the <key, value> pairs produced by mapper.py and generates a number of output files, each holding the list of one-letter words, two-letter words, three-letter words, and so on. For example, sheet-1.output will hold all one-letter words, sheet-2.output all two-letter words, and sheet-10.output all ten-letter words.
#!/usr/bin/python
import config
import mapper
import sys

OUTPUT_DIR = '{0}/grouper'.format(config.OUTPUT_DIR)

def grouper():
    ret = True
    try:
        with open(mapper.OUTPUT_FILE, 'r') as f:
            for line in f:
                words = line.split(',')
                length = words[0]
                # remove newline, inserted by mapper
                word = words[1].replace('\n', '')
                # Append the word to the sheet for its length.
                output_file = '{0}/sheet-{1}.output'.format(OUTPUT_DIR, length)
                sheet = open(output_file, 'a+')
                sheet.write(word + ',')
                sheet.close()
    except Exception as e:
        print(e)
        ret = False
    return ret

if __name__ == '__main__':
    sys.exit(grouper())
Make grouper.py executable with the command: chmod u+x grouper.py
Run grouper.py from the terminal: cat DemystifyMapReduce.txt | python ./mapper.py | python ./grouper.py
The files written to the grouper directory would look like:
sheet-1.output: a,a,a,s,a,a,a,I,I,a,I,I,... etc.
sheet-2.output: on,it,If,to,be,it,it,is,in,we,... etc.
sheet-10.output: mysterious,statistics,Everything,Occurrence,Occurrence,colleagues,... etc.
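The grouping step can also be sketched without touching the file system. The snippet below is an assumption-laden illustration: group_by_length is a hypothetical helper, and a dictionary of lists stands in for the sheet files that grouper.py writes.

```python
from collections import defaultdict

def group_by_length(pairs):
    """Collect words under their length key (hypothetical helper)."""
    groups = defaultdict(list)
    for length, word in pairs:
        groups[length].append(word)
    return dict(groups)

# A few (length, word) pairs in the shape produced by the map step.
pairs = [(4, 'Ever'), (5, 'since'), (2, 'on'), (5, 'paper'), (2, 'it')]
for length, words in sorted(group_by_length(pairs).items()):
    # Mirrors the sheet-<length>.output naming used by grouper.py.
    print('sheet-{0}.output: {1}'.format(length, ','.join(words)))
```

Keeping the groups in memory is fine for a single article; writing one file per key, as grouper.py does, lets the reduce step run later as a separate process.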
The following reducer.py script counts the words in each of the output files from grouper.py, giving the number of one-letter words, two-letter words, ten-letter words, and so on.
#!/usr/bin/python
import grouper, config
import os, sys

OUTPUT_FILE = '{0}/reducer.output'.format(config.OUTPUT_DIR)

def reducer():
    ret = True
    try:
        output = open(OUTPUT_FILE, 'a+')
        for fl in os.listdir(grouper.OUTPUT_DIR):
            if fl.endswith('.output'):
                total = 0
                with open(grouper.OUTPUT_DIR + '/' + fl) as f:
                    words = f.read()
                words_split = words.split(',')
                # Drop the empty string left by the trailing comma.
                # A list comprehension works on Python 2 and 3;
                # len(filter(...)) fails on Python 3.
                total = len([w for w in words_split if w])
                # print(fl + ' has ' + str(total))
                output.write('({0}): There are {1} word(s) that have {2} characters.\n'.format(fl, total, len(words_split[0])))
        output.close()
    except Exception as e:
        print(e)
        ret = False
    return ret

if __name__ == '__main__':
    sys.exit(reducer())
Make reducer.py executable with the command: chmod u+x reducer.py
Run reducer.py from the terminal: cat DemystifyMapReduce.txt | python ./mapper.py | python ./grouper.py | python ./reducer.py
The contents of reducer.output would look like:
(sheet-1.output): There are 77 word(s) that have 1 characters.
(sheet-2.output): There are 246 word(s) that have 2 characters.
(sheet-13.output): There are 3 word(s) that have 13 characters.
... and so on.
This is the desired output for the blogging use case discussed in Demystifying MapReduce.
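The reduce step itself boils down to counting the members of each group. Here is a minimal sketch under the assumption that the groups are held in a dictionary keyed by word length; reduce_counts and the sample data are hypothetical and do not come from the article being analyzed.

```python
def reduce_counts(groups):
    """Map each word length to the number of words of that length (hypothetical helper)."""
    return {length: len(words) for length, words in groups.items()}

# Made-up groups, in the shape a grouping step might produce.
groups = {1: ['a', 'I', 'a'], 2: ['on', 'it']}
for length, total in sorted(reduce_counts(groups).items()):
    print('There are {0} word(s) that have {1} character(s).'.format(total, length))
```

Because each key is reduced independently of the others, the per-length counts could in principle be computed in parallel, which is the property MapReduce relies on.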
Additionally, all of the above scripts can be run at once from a shell script. The following application.sh script runs them in sequence; the Python scripts read their input and output locations from config.py.
#!/usr/bin/env bash
# Note: the Python scripts call sys.exit(True) on success, so a
# successful stage reports exit status 1 and a failed one reports 0.

echo -n 'Remove and create output directory.. '
rm -rf ./output && mkdir -p ./output/grouper
ok=$?

if test ${ok} -eq 0 ; then
    echo ' Done.'
    echo -n 'Run mapper..'
    python ./mapper.py
    ok=$?
else
    echo ' Fail.'
    exit
fi

if test ${ok} -eq 1 ; then
    echo ' Done'
    echo -n 'Run grouper..'
    python ./grouper.py
    ok=$?
else
    echo ' Fail.'
    exit
fi

if test ${ok} -eq 1 ; then
    echo ' Done'
    echo -n 'Run reducer..'
    python ./reducer.py
    # Capture the reducer's status so the final check reflects it.
    ok=$?
else
    echo 'Fail.'
    exit
fi

if test ${ok} -eq 1 ; then
    echo ' Done'
    echo 'Process done'
    echo 'Check statistic on ./output/reducer.output'
else
    echo 'Got error'
fi

exit $ok
Specify the input file and output directory in config.py as follows:
TARGET_FILE = './target/DemystifyingMapReduce.txt'
OUTPUT_DIR = './output'
Finally, run application.sh from the terminal to generate the frequency of one-letter, two-letter, and in general n-letter words.
$ chmod u+x application.sh
$ ./application.sh
Great job! Now you can think about making this application more useful by passing multiple URLs as inputs instead of text files.
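As a starting point for that idea, here is a rough Python 3 sketch. Everything in it is an assumption for illustration: the helper names fetch and word_length_counts are hypothetical, the page is decoded as UTF-8, and no attempt is made to strip HTML markup, so tag names would be counted as words too.

```python
import re
from collections import Counter
from urllib.request import urlopen

def fetch(url):
    """Download a page and return its body as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')

def word_length_counts(urls, fetch=fetch):
    """Count words by length across all sources (hypothetical helper).

    The fetch argument is injectable so the logic can be tried offline.
    """
    counts = Counter()
    for url in urls:
        # Same tokenization as mapper.py: non-letters separate words.
        for word in re.sub(r'[^a-zA-Z]+', ' ', fetch(url)).split():
            counts[len(word)] += 1
    return counts

# Offline demo: a dict lookup stands in for real network access.
pages = {'page-a': 'Ever since Google', 'page-b': 'on it'}
print(word_length_counts(pages, fetch=pages.get))
```

Passing a real list of URLs and leaving fetch at its default would download each page in turn; a fuller version would likely extract the visible text from the HTML before counting.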
Stay tuned to learn more interesting stuff!