Building and Running The Framework
IMPORTANT! This document assumes that you already have Hadoop, HDFS, and HBase running properly on whatever machine you intend to run the Hadoop Sleuth Kit framework on. It is intended to help you get the framework running on a pseudo-distributed Hadoop setup.
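As a quick sanity check before you start, you can confirm that the required daemons are up. This is a minimal sketch assuming a standard pseudo-distributed install; the exact daemon list depends on your Hadoop and HBase versions.
% jps    # expect to see NameNode, DataNode, JobTracker/TaskTracker (or ResourceManager/NodeManager on YARN), HMaster, and friends
% hadoop fs -ls /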
The pipeline code does end-to-end processing of a directory of documents (text extraction, document vectorization, cluster generation, etc.). To build it, do the following:
- Run Maven in the pre-build/ folder. (A consolidated shell sketch of these build steps appears after this list.)
- Run Maven in the root project folder. This builds all of the subprojects and then builds the pipeline jar, writing it to the pipeline/target folder. This is the final jar you will run the framework from.
- Check out fsrip (git clone https://github.com/jonstewart/fsrip.git) and build it with 'scons'. You may need to install some dependencies for fsrip to build successfully; the build process should reveal which dependencies you are missing.
- Add FSRIP_ROOT/deps/lib to LD_LIBRARY_PATH and FSRIP_ROOT/build/src/ to your PATH.
- Set the HADOOP_HOME environment variable.
- Copy in the report template:
% rm -Rf reports/data
% hadoop fs -copyFromLocal reports /texaspete/template/reports
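Taken together, the build steps above look roughly like the following. This is only a sketch: it assumes 'mvn install' is the goal you want for both Maven runs, uses FSRIP_ROOT as a placeholder for wherever you cloned fsrip, and uses a made-up HADOOP_HOME path, so adjust to your environment.
% cd pre-build && mvn install && cd ..
% mvn install
% git clone https://github.com/jonstewart/fsrip.git
% cd fsrip && scons && cd ..
% export LD_LIBRARY_PATH=$FSRIP_ROOT/deps/lib:$LD_LIBRARY_PATH
% export PATH=$FSRIP_ROOT/build/src:$PATH
% export HADOOP_HOME=/usr/lib/hadoop    # placeholder; point this at your Hadoop install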
If you want to run the grep search, you need to put a file of Java regexes on HDFS. (You will want to do this if you're running the pipeline.) A file with a few (uninteresting) regexes is in the match project folder:
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk% hadoop fs -put match/src/main/resources/regexes /texaspete/regexes
You can, of course, supply your own regexes. Any standard Java regex will work, with one regex per line. If a regex uses grouping, the first group should be taken as the "match", but this functionality is still experimental.
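For illustration only, a regexes file might look like this (these patterns are hypothetical, not the ones shipped in match/src/main/resources/regexes); the last line uses grouping:
\d{3}-\d{2}-\d{4}
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
(4\d{3})[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}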
Now that all the code has been built, have a look at the output in the pipeline/target directory.
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk% cd pipeline/target
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk/pipeline/target% ls
archive-tmp/ maven-archiver/ *sleuthkit-pipeline-1-SNAPSHOT-job.jar*
classes/ sleuthkit-pipeline-1-SNAPSHOT.jar surefire/
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk/pipeline/target%
The job jar is the one you should use with Hadoop. First, run fsrip on an image to create a JSON metadata file for it. Then copy both the JSON metadata file and the image file onto HDFS (the usual directory for this is /texaspete/img, though you can put them wherever you like).
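As a rough illustration, ingest preparation might look like the lines below. The image path and output names are hypothetical, and the exact fsrip subcommand for dumping metadata may differ in your build, so check fsrip's usage output first.
% fsrip dumpfs /data/images/sample.dd > sample.json    # hypothetical subcommand and path
% hadoop fs -mkdir /texaspete/img
% hadoop fs -put /data/images/sample.dd /texaspete/img/sample.dd
% hadoop fs -put sample.json /texaspete/img/sample.json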
The recommended way to run the full pipeline is the tpkickoff.sh script located in the bin folder of the project directory. It runs the entire ingest/analysis/reporting cycle of MapReduce jobs on a single hard drive image. The script takes three parameters: a friendly name for the image (any alphanumeric name that is a valid HDFS file name), the path to the image on the local file system, and the path to the directory containing the job jar you built previously.
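For example, with a hypothetical image at /data/images/sample.dd and the job jar left in pipeline/target/, the call would look something like:
% bin/tpkickoff.sh sample /data/images/sample.dd pipeline/target/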
Note that if you wish to run the individual components of the pipeline separately, you should be able to do so from this jar by invoking their Java classes directly. Most have usage/help lines which may be of use.
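A minimal sketch, where <FullyQualifiedClassName> is a placeholder for whichever pipeline class you want to invoke (most print a usage line when run without arguments):
% hadoop jar pipeline/target/sleuthkit-pipeline-1-SNAPSHOT-job.jar <FullyQualifiedClassName>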