Analyse Hadoop fsimage using the Offline Image Viewer (OIV) Tool
The Hadoop fsimage is a binary “image” file, so its contents cannot be read with ordinary Unix tools such as cat or more. It is often useful, though, to read a clear-text version of the fsimage, which holds the metadata of the file system. With it you can analyse the namespace, check the health of your fsimage, and explore usage patterns.
The Offline Image Viewer is a tool that dumps the contents of HDFS fsimage files to human-readable formats, allowing offline analysis and examination of a Hadoop cluster’s namespace. The tool is able to process very large image files relatively quickly, converting them to one of several output formats. If the tool is not able to process an image file, it will exit cleanly. The Offline Image Viewer does not require a Hadoop cluster to be running; it is entirely offline in its operation.
Let's now read and analyse the fsimage using the OIV tool.
STEP 1: Download the latest fsimage copy.
$ hdfs dfsadmin -fetchImage /tmp
14/07/08 07:27:49 INFO namenode.TransferFsImage: Opening connection to http://<nn_hostname>:50070/getimage?getimage=1&txid=latest
14/07/08 07:27:49 INFO namenode.TransferFsImage: Transfer took 0.23s at 89.74 KB/s
$ ls -ltr /tmp | grep -i fsimage
-rw-r--r-- 1 root root 22164 Jul 8 07:27 fsimage_0000000000000001386
STEP 2: Convert the fsimage into text format and view the directory structure. The simplest usage of the Offline Image Viewer is to provide just an input file and an output file, via the -i and -o command-line switches. The “-o” option names the file that the converted output is written to.
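A minimal sketch of that invocation, assuming the image fetched in step 1 is at /tmp/fsimage_0000000000000001386 and using /tmp/fsimage.txt as a purely illustrative output name. Older Hadoop 2.x releases default to the Ls processor, which prints a directory listing; newer releases use a different default, so check hdfs oiv -h for your version.

```shell
# Paths are assumptions for illustration; adjust them to your environment.
FSIMAGE=/tmp/fsimage_0000000000000001386   # image fetched in step 1
OUTPUT=/tmp/fsimage.txt                    # hypothetical output file name

# Run the conversion only where the hdfs CLI and the image are present.
if command -v hdfs >/dev/null 2>&1 && [ -f "$FSIMAGE" ]; then
    hdfs oiv -i "$FSIMAGE" -o "$OUTPUT"   # -i input image, -o output file
    head "$OUTPUT"                        # peek at the first few lines
else
    echo "hdfs CLI or $FSIMAGE not found; nothing converted" >&2
fi
```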
STEP 3: You can select an output processor with the -p command-line switch. For instance, if you want to read all of the metadata and not just the directory structure, use the “Indented” processor.
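A sketch of choosing a processor with -p, under the same assumed paths as before. The output file name is made up for illustration. The set of processor names differs between Hadoop releases, so treat Indented as valid only if hdfs oiv -h lists it on your installation.

```shell
# Assumed locations; the output name is hypothetical.
FSIMAGE=/tmp/fsimage_0000000000000001386
PROCESSOR=Indented                         # e.g. Delimited for tab-separated output
OUTPUT="/tmp/fsimage_${PROCESSOR}.txt"

# Run only where the hdfs CLI and the image are present.
if command -v hdfs >/dev/null 2>&1 && [ -f "$FSIMAGE" ]; then
    hdfs oiv -p "$PROCESSOR" -i "$FSIMAGE" -o "$OUTPUT"
    head -n 40 "$OUTPUT"                   # first lines of the metadata dump
else
    echo "hdfs CLI or $FSIMAGE not found; nothing converted" >&2
fi
```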
The Offline Image Viewer makes it easy to gather large amounts of data about the HDFS namespace. This information can then be used to explore file system usage patterns or find specific files that match arbitrary criteria, along with other types of namespace analysis. The Delimited Image Processor in particular creates output that is amenable to further processing by tools such as Apache Pig. Pig is a particularly good choice for analyzing the data, as it can handle the output generated from a small fsimage but also scales up to consume data from extremely large file systems.
The Delimited Image Processor generates lines of text whose fields are separated, by default, by tabs. It includes all of the fields that are common between constructed files and files that were still under construction when the fsimage was generated. We can use this output to accomplish three tasks:
Determine the number of files each user has created on the file system
Find files that were created but have not been accessed
Find probable duplicates of large files by comparing the size of each file.
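For a small image, the three tasks above can also be sketched with awk instead of Pig. This is an illustration under stated assumptions, not the tool's own method: the function names are hypothetical, the input is assumed to be Delimited output with no header row (drop the header first if your release prints one), and the column order is assumed to be the one noted in the comment.

```shell
# Assumed tab-separated column order of the Delimited output:
#  1 path          2 replication      3 modification time  4 access time
#  5 block size    6 block count      7 file size          8 namespace quota
#  9 diskspace quota  10 permissions  11 user name         12 group name

# Task 1: number of namespace entries owned by each user.
# Directories, if present in the dump, are counted as well.
files_per_user() {
    awk -F'\t' '{ count[$11]++ }
                END { for (u in count) print u "\t" count[u] }' "$1" | sort
}

# Task 2: entries whose access time still equals the modification time,
# used here as a heuristic for "created but never accessed".
never_accessed() {
    awk -F'\t' '$3 == $4 { print $1 }' "$1"
}

# Task 3: files of at least min_bytes ($2) that share an identical size.
# Equal size only suggests a duplicate; it does not prove one.
probable_duplicates() {
    awk -F'\t' -v min="$2" '
        $7 + 0 >= min + 0 { n[$7]++; paths[$7] = paths[$7] "\t" $1 }
        END { for (s in n) if (n[s] > 1) print s paths[s] }
    ' "$1" | sort -n
}
```

Each function takes the path of the Delimited dump as its first argument; probable_duplicates also takes a minimum size in bytes, so passing 104857600 restricts the comparison to files of 100 MB or more.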