Framework Overview
Processing documents has three phases: ingest, analysis, and reporting. This page describes what happens in each phase and where it occurs in the code.
The entire process is driven by the tpkickoff.sh script, located in the bin/ folder; reading that script is a good way to get a handle on how the process works.
We use both HDFS and HBase to store file information.
The folder structure for HDFS is:
- texaspete
  - data
    - $IMG_ID
      - crossimg (cross-image scoring calculation data)
      - extents (volume extents information)
      - grep (file text search results)
      - reports (raw report json data)
      - text (text extracted from files)
      - reports.zip (contains the final report)
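For reference, here is a small hedged sketch of listing one image's directory with the standard Hadoop FileSystem client. The assumption that the texaspete tree hangs off the HDFS root is mine, not the document's.

```java
// Lists the per-image HDFS directory described above. Only the directory names
// come from this page; the base path "/texaspete/data" is an assumption.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListImageDirsSketch {
  public static void main(String[] args) throws Exception {
    final String imgId = args[0]; // the image hash used as $IMG_ID

    FileSystem fs = FileSystem.get(new Configuration());
    Path imageRoot = new Path("/texaspete/data/" + imgId);

    // Expect to see crossimg/, extents/, grep/, reports/, text/ and reports.zip
    // once the corresponding pipeline phases have run.
    for (FileStatus status : fs.listStatus(imageRoot)) {
      System.out.println((status.isDir() ? "d " : "- ") + status.getPath().getName());
    }
  }
}
```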
The HBase tables are:
- `entries` - contains information about all of the files on HDFS. The `FSEntry` class wraps around the entries in this table, using `FSEntryHBaseInputFormat` and `FSEntryHBaseOutputFormat` to provide a map-reduce-friendly way to read and write data about files (see the sketch after this list).
- `images` - contains information about the disk images that have been uploaded.
- `hashes` - contains hashes for seen, known-good, and known-bad files. This table is populated as images are added and also by the NSRL uploading tools. If a file was seen on a particular drive (as is the case with files added here during the ingest process), this table contains information about the hash of that drive, so there is a link between the file and the drive it was seen on.
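To make the "map-reduce friendly" claim concrete, here is a minimal, hypothetical sketch of a map-only job wired up against `FSEntryHBaseInputFormat`. The mapper's key/value types, the absence of any extra input-format configuration, and `FSEntry`'s accessors are assumptions on my part; only the class names come from the framework.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.lightboxtechnologies.spectrum.FSEntry;
import com.lightboxtechnologies.spectrum.FSEntryHBaseInputFormat;

public class EntriesDumpSketch {

  // Assumed: the input format hands each entries-table row to the mapper as an
  // (ImmutableBytesWritable rowKey, FSEntry entry) pair.
  public static class DumpMapper
      extends Mapper<ImmutableBytesWritable, FSEntry, Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, FSEntry entry, Context ctx)
        throws IOException, InterruptedException {
      // FSEntry carries the file metadata for this row; toString() is used here
      // only to avoid guessing at its accessor methods.
      ctx.write(new Text(entry.toString()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "entries-dump-sketch");
    job.setJarByClass(EntriesDumpSketch.class);

    // Read straight from the entries table; any table/scan configuration the
    // input format needs beyond this is omitted here.
    job.setInputFormatClass(FSEntryHBaseInputFormat.class);
    job.setMapperClass(DumpMapper.class);
    job.setNumReduceTasks(0);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```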
Ingest has two parts:
In the first part, we run the fsrip utility on a hard disk image twice: once to get information about the file system and once to get information about the files on the drive, using fsrip dumpfs and fsrip dumpimg, respectively. We dump both sets of records onto HDFS using com.lightboxtechnologies.spectrum.Uploader for disk volume information and com.lightboxtechnologies.spectrum.InfoPutter for file data (the raw file data is actually just stuffed into HBase here and sorted out later). Note that we have two identifiers for each image: we take a hash of the image file as we begin the ingest process, and we also supply a "friendly name" for the image.
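The shell script drives this step, but the flow is easy to picture in code. Below is a hedged sketch of the two fsrip passes being streamed onto HDFS; in the real pipeline the records are handed to Uploader and InfoPutter (and the raw file data ends up in HBase), and the flat output paths used here are invented purely for illustration.

```java
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsripToHdfsSketch {
  public static void main(String[] args) throws Exception {
    final String image = args[0]; // path to the disk image
    final String imgId = args[1]; // hash of the image, computed at the start of ingest

    FileSystem fs = FileSystem.get(new Configuration());

    // Pass 1: file system information.
    copy(new ProcessBuilder("fsrip", "dumpfs", image).start().getInputStream(),
         fs.create(new Path("/texaspete/data/" + imgId + "/fsdump.json")));

    // Pass 2: per-file information.
    copy(new ProcessBuilder("fsrip", "dumpimg", image).start().getInputStream(),
         fs.create(new Path("/texaspete/data/" + imgId + "/imgdump.json")));
  }

  private static void copy(InputStream in, OutputStream out) throws Exception {
    final byte[] buf = new byte[8192];
    for (int n; (n = in.read(buf)) != -1; ) {
      out.write(buf, 0, n);
    }
    in.close();
    out.close();
  }
}
```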
We then kick off the rest of the ingest process by invoking org.sleuthkit.hadoop.pipeline.Ingest. This in turn kicks off three separate map-reduce jobs, which are described below:
- `com.lightboxtechnologies.spectrum.JsonImport` - kicks off a map-reduce job which populates the HBase entries table with information from the hard drive files.
- `com.lightboxtechnologies.spectrum.ExtentsExtractor` and `com.lightboxtechnologies.spectrum.ExtractData` - these two steps put the raw file data into HBase, or links to the file data in the case where a file is large enough that we do not want to store its data in HBase (see the sketch after this list).
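The inline-versus-link decision in that last step can be illustrated with a small hypothetical sketch using the plain HBase client. The column family, qualifiers, and size threshold below are invented for illustration; only the idea that small content lives in the entries table while large content stays on HDFS behind a pointer comes from the description above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreContentSketch {
  static final long INLINE_LIMIT = 1 << 20; // hypothetical 1 MB cutoff

  public static void storeContent(Configuration conf, byte[] rowKey,
                                  byte[] data, String hdfsPath) throws Exception {
    HTable entries = new HTable(conf, "entries");
    try {
      Put put = new Put(rowKey);
      if (data.length <= INLINE_LIMIT) {
        // Small file: keep the raw bytes in the HBase cell itself.
        put.add(Bytes.toBytes("core"), Bytes.toBytes("content"), data);
      } else {
        // Large file: store only a pointer to where the data lives on HDFS.
        put.add(Bytes.toBytes("core"), Bytes.toBytes("hdfs_path"),
                Bytes.toBytes(hdfsPath));
      }
      entries.put(put);
    } finally {
      entries.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    storeContent(conf, Bytes.toBytes("example-row"),
                 "tiny file body".getBytes("UTF-8"), "/texaspete/data/example");
  }
}
```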
At the end of the ingest process, we have rows in the HBase "entries" table describing all of the files on the hard drive, an entry in the "images" table containing a bit of information about the hard drive image we have just uploaded, and some rows in the "hashes" table indicating that we know about some additional files that belong to this hard drive.
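A quick way to picture that end state is to query the tables directly. The sketch below uses the plain HBase client; the table names come from this page, but the assumption that the "images" table is keyed by the image hash is mine.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PostIngestCheckSketch {
  public static void main(String[] args) throws Exception {
    final String imageHash = args[0]; // the hash computed at the start of ingest
    Configuration conf = HBaseConfiguration.create();

    // One row per uploaded image (assuming the hash is the row key).
    HTable images = new HTable(conf, "images");
    Result imageRow = images.get(new Get(Bytes.toBytes(imageHash)));
    System.out.println("images row present: " + !imageRow.isEmpty());
    images.close();

    // Many rows in "entries", one per file found on the image.
    HTable entries = new HTable(conf, "entries");
    ResultScanner scanner = entries.getScanner(new Scan());
    long count = 0;
    for (Result r : scanner) {
      count++;
    }
    scanner.close();
    entries.close();
    System.out.println("entries rows: " + count);
  }
}
```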