Harvest Overview

Harvest is a command-line tool for extracting metadata from PDS4 products (labels). It parses PDS4 files and stores the extracted metadata in an intermediate XML file. This intermediate file can be loaded into Solr by Registry Manager or by standard Solr tools, such as the Solr post tool.

Harvest executable scripts for Windows (harvest.bat) and Linux / Mac (harvest) are located in the bin subfolder of the installation directory (e.g., /home/pds/harvest/).

To see the basic usage information (shown below), run Harvest without any parameters.

usage: harvest <options>
 -c  <file>    Harvest configuration file.
 -l  <file>    Log file. Default is /tmp/harvest/harvest.log.
 -o  <dir>     Output directory for Solr documents. Default is /tmp/harvest/solr
 -v  <level>   Logger verbosity: 0=Debug, 1=Info (default), 2=Warning, 3=Error.

Quick Start

To run Harvest, you need an XML configuration file. For example, to process all XML files in the /data/LADEE/ldex_20161118 folder and all of its subfolders, create the following configuration file and save it as /tmp/ladee.cfg.

<?xml version="1.0" encoding="UTF-8"?>

<harvest>
  <directories>
    <path>/data/LADEE/ldex_20161118</path>
    <fileFilter>
      <include>*.xml</include>
    </fileFilter>
  </directories>
</harvest>

Then run Harvest:

harvest -c /tmp/ladee.cfg

The tool will print some log messages and create an intermediate file, solr-docs.xml, in the default output directory /tmp/harvest/solr/. The file contains a list of Solr documents in XML format, which can be loaded into Solr by Registry Manager or by standard Solr tools. An example Solr document is shown below.

<doc>
  <field name="lid">urn:nasa:pds:ladee_ldex</field>
  <field name="vid">1.2</field>
  <field name="lidvid">urn:nasa:pds:ladee_ldex::1.2</field>
  <field name="title">LADEE LUNAR DUST EXPERIMENT</field>
  <field name="product_class">Product_Bundle</field>
  <field name="_file_name">bundle_ladee_ldex.xml</field>
  <field name="_file_type">application/xml</field>
  <field name="_file_size">5735</field>
  <field name="_file_md5">zlYAt05W/Ag6Qy4HlNYy+g==</field>
  <field name="_xml_root_element">Product_Bundle</field>
  <field name="_package_id">8627271a-01f5-49ad-8ce5-69f78fd6b5f4</field>
  <field name="instrument_host_ref">urn:nasa:pds:context:instrument_host:spacecraft.ladee</field>
  <field name="instrument_ref">urn:nasa:pds:context:instrument:instrument.ldex__ladee</field>
  <field name="investigation_ref">urn:nasa:pds:context:investigation:mission.ladee</field>
  <field name="target_ref">urn:nasa:pds:context:target:dust.dust</field>
  <field name="target_ref">urn:nasa:pds:context:target:satellite.moon</field>
</doc>
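
To inspect the intermediate file outside of Solr, a few lines of Python are enough. The sketch below is illustrative only: it relies on nothing beyond the structure shown above (<doc> elements with <field name="..."> children). The wrapper element around the documents is not described here, so the <add> root in the sample is an assumption, and the code looks for <doc> at any depth. The name read_solr_docs is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Inline sample mimicking the document shown above. The <add> wrapper is an
# assumption; the code below does not depend on it.
SAMPLE = """<add>
<doc>
  <field name="lid">urn:nasa:pds:ladee_ldex</field>
  <field name="vid">1.2</field>
  <field name="target_ref">urn:nasa:pds:context:target:dust.dust</field>
  <field name="target_ref">urn:nasa:pds:context:target:satellite.moon</field>
</doc>
</add>"""


def read_solr_docs(xml_text):
    """Return one dict per <doc>, mapping each field name to a list of values."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.iter("doc"):
        fields = defaultdict(list)
        for field in doc.findall("field"):
            fields[field.get("name")].append(field.text or "")
        docs.append(dict(fields))
    return docs


docs = read_solr_docs(SAMPLE)
print(docs[0]["lid"])
print(docs[0]["target_ref"])
```

Values are kept as lists because a field such as target_ref can occur more than once in a document.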

By default, Harvest extracts lid, vid, title, product_class, all internal references, and basic file information, such as file name, type, size, and MD5 hash.
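
The _file_md5 value in the example is 24 characters long and ends in "==", which is consistent with a 16-byte MD5 digest encoded as Base64 rather than the more familiar hexadecimal form. Assuming that encoding (it is inferred from the example, not stated by this guide), a checksum could be recomputed for comparison along these lines. The helper name md5_base64 is hypothetical.

```python
import base64
import hashlib


def md5_base64(data):
    """Return the MD5 digest of `data` (bytes) as a Base64 string."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


# The example value decodes to 16 bytes, the size of an MD5 digest.
example = "zlYAt05W/Ag6Qy4HlNYy+g=="
print(len(base64.b64decode(example)))

# To check a label on disk, pass its raw bytes, e.g.:
#   with open("bundle_ladee_ldex.xml", "rb") as f:
#       print(md5_base64(f.read()))
```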

Package ID

Each Harvest run generates a unique package ID, stored in the _package_id field. After the extracted metadata is loaded into Solr, all documents from a particular Harvest run can be deleted by this package ID.
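
One way to delete by package ID with standard Solr tooling is a delete-by-query update message. The sketch below only builds that message; it is not a Registry Manager feature. The message would still have to be sent to the update handler of your Solr collection, whose name and URL are not given in this guide. The function name build_delete_by_package is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_delete_by_package(package_id):
    """Build a Solr XML delete-by-query message for one Harvest run."""
    delete = ET.Element("delete")
    query = ET.SubElement(delete, "query")
    # Quote the value so the whole ID is matched as a single phrase.
    query.text = '_package_id:"{}"'.format(package_id)
    return ET.tostring(delete, encoding="unicode")


payload = build_delete_by_package("8627271a-01f5-49ad-8ce5-69f78fd6b5f4")
print(payload)
```

Remember that deletions in Solr become visible only after a commit.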

Extracting More Metadata

To extract additional metadata, for example start_date_time and stop_date_time from observational products, you have to define an XPath-to-field-name map. First, create the XML file shown below and save it as /home/pds/harvest/conf/observational.xml. You can use another file name or directory if you want.

<?xml version="1.0" encoding="UTF-8"?>

<xpaths>
  <xpath fieldName="start_date_time">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
  <xpath fieldName="stop_date_time">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
</xpaths>

These XPaths will be used to extract the start and stop date values, which will be saved in the intermediate XML file as the start_date_time and stop_date_time fields.

Next, add the following section to the Harvest configuration file.

<harvest>
...
  <xpathMaps baseDir="/home/pds/harvest/conf">
    <xpathMap rootElement="Product_Observational" filePath="observational.xml" />
  </xpathMaps>
</harvest>

Now, if you run Harvest, Solr documents for observational products will contain start and stop dates.

<doc>
  <field name="lid">urn:nasa:pds:ladee_ldex:data_derived:derived_ldex_ltden_pds_derived_tab</field>
  <field name="vid">1.2</field>
...
  <field name="start_date_time">2013-10-25T00:00:00Z</field>
  <field name="stop_date_time">2014-04-18T04:30:00Z</field>
</doc>
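
To make the mapping concrete, here is a rough Python model of what an XPath map does: each absolute path is evaluated against a label, and matching values are stored under the given field name. This is an illustration, not Harvest's implementation. Real PDS4 labels declare an XML namespace, and how Harvest handles it is not stated here; the sketch simply strips namespaces, which is an assumption. The names extract_fields and local_name and the inline label are hypothetical.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for an observational product label.
LABEL = """<Product_Observational>
  <Observation_Area>
    <Time_Coordinates>
      <start_date_time>2013-10-25T00:00:00Z</start_date_time>
      <stop_date_time>2014-04-18T04:30:00Z</stop_date_time>
    </Time_Coordinates>
  </Observation_Area>
</Product_Observational>"""

# Same content as observational.xml above: field name -> absolute path.
XPATH_MAP = {
    "start_date_time": "/Product_Observational/Observation_Area/Time_Coordinates/start_date_time",
    "stop_date_time": "/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time",
}


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def extract_fields(label_xml, xpath_map):
    """Evaluate simple absolute element paths and collect non-empty values."""
    root = ET.fromstring(label_xml)
    for elem in root.iter():
        elem.tag = local_name(elem.tag)  # assumption: namespaces are ignored
    fields = {}
    for field_name, path in xpath_map.items():
        steps = path.strip("/").split("/")
        if steps[0] != root.tag:
            continue  # path targets a different product type
        matches = [root] if len(steps) == 1 else root.findall("/".join(steps[1:]))
        values = [m.text.strip() for m in matches if m.text and m.text.strip()]
        if values:
            fields[field_name] = values
    return fields


print(extract_fields(LABEL, XPATH_MAP))
```

The sketch handles only plain element paths like the ones in this example; predicates, attributes, and other XPath features are out of scope.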

Note that the baseDir attribute is optional. You can also provide the full path in the filePath attribute, as shown below:

  <xpathMaps>
    <xpathMap rootElement="Product_Observational" filePath="/home/pds/harvest/conf/observational.xml" />
  </xpathMaps>

The rootElement attribute is also optional. In the above example, the XPath queries will run against observational products only. If you remove the rootElement attribute, the same XPath queries will run against all products, such as Product_Collection, Product_Document, etc.

Finally, you can have multiple <xpathMap> entries, even for the same rootElement.

BLOB Storage

You can store whole PDS product labels as BLOBs (Binary Large OBjects). To enable this feature, add the following section to the Harvest configuration file.

<blobStorage type="embedded" />

After running Harvest, the intermediate solr-docs.xml file will have a _file_blob field containing the zipped product label. You can expect a compression ratio of up to about 9:1 (900%) for some files. For example, many LADEE housekeeping labels are about 45 KB, and the corresponding BLOB is about 5 KB. For smaller files, such as collection labels, the ratio is about 3.5:1 (350%): a 5.5 KB file is compressed to 1.6 KB.
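
The percentages above are ratios of original size to compressed size, not the fraction of space saved. The short Python calculation below reproduces them from the quoted sizes and also shows the equivalent space savings. The function names are illustrative.

```python
def compression_ratio(original_kb, compressed_kb):
    """Original size divided by compressed size (9.0 means 9:1)."""
    return original_kb / compressed_kb


def space_saved(original_kb, compressed_kb):
    """Fraction of the original size that compression removes."""
    return 1 - compressed_kb / original_kb


# Sizes quoted above: housekeeping label and collection label.
for name, original, compressed in [("housekeeping", 45, 5), ("collection", 5.5, 1.6)]:
    print("{}: ratio {:.0%}, space saved {:.0%}".format(
        name,
        compression_ratio(original, compressed),
        space_saved(original, compressed),
    ))
```

For the housekeeping label this gives a ratio of 900% and a saving of about 89%; for the collection label the ratio is about 344%, which the text rounds to 350%.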

After loading the data into Solr, you can extract BLOBs by running the Registry Manager tool:

registry-manager export-file -lidvid urn:nasa:pds:ladee_ldex:data_calibrated::1.2 -filePath /tmp/data_calibrated.xml

File Reference / Access URL

To store the full path of a product label file, add the following section to the Harvest configuration file.

<fileRef/>

After running Harvest, you should see a _file_ref field added to each Solr document:

<doc>
...
  <field name="_file_ref">/C:/data/LADEE/ldex_20161118/bundle_ladee_ldex.xml</field>
...
</doc>

Note that on Windows, backslashes are replaced with forward slashes and the drive letter is included.

To replace the file path prefix with another value, change the <fileRef/> tag in the Harvest configuration file:

<fileRef>
  <replace prefix="/C:/data/LADEE/" replacement="https://pds.nasa.gov/data/pds4/" />
</fileRef>

Now, after running Harvest, you should see a different _file_ref value:

<doc>
...
  <field name="_file_ref">https://pds.nasa.gov/data/pds4/ldex_20161118/bundle_ladee_ldex.xml</field>
...
</doc>
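
The two _file_ref examples above can be modeled with a small Python function: normalize the path to forward slashes with a leading slash, then swap the prefix if it matches. This is a sketch of the observable behavior, not Harvest's code. What Harvest does when the prefix does not match is not stated here; the sketch leaves the value unchanged, which is an assumption. The name to_file_ref is hypothetical.

```python
def to_file_ref(path, prefix=None, replacement=None):
    """Illustrative model of the _file_ref value for a label path."""
    ref = path.replace("\\", "/")      # Windows backslashes -> forward slashes
    if not ref.startswith("/"):
        ref = "/" + ref                # "C:/data/..." -> "/C:/data/..."
    if prefix and replacement is not None and ref.startswith(prefix):
        ref = replacement + ref[len(prefix):]
    return ref


label = r"C:\data\LADEE\ldex_20161118\bundle_ladee_ldex.xml"
print(to_file_ref(label))
print(to_file_ref(label, "/C:/data/LADEE/", "https://pds.nasa.gov/data/pds4/"))
```

Note that the prefix in the <replace> element is written against the normalized form (/C:/data/LADEE/), not against the native Windows path.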

Directories and Filters

Crawl Products from Multiple Directories

To process products from multiple directories, specify multiple <path> entries in the Harvest configuration file:

<harvest>
  <directories>
    <path>/data/LADEE/ldex_20161118/data_calibrated</path>
    <path>/data/LADEE/ldex_20161118/data_derived</path>
...
</harvest>

Filtering Files

This feature has limited use, since all PDS4 product labels are XML files and Harvest cannot process other file types. Most Harvest configuration files include the following file filter:

<harvest>
  <directories>
...
    <fileFilter>
      <include>*.xml</include>
    </fileFilter>
  </directories>
</harvest>

If you don't provide any filters, Harvest will try to process every file, and for non-XML files you will see "Unsupported MIME type" log messages.

You can use one or more <include> filters or one or more <exclude> filters, but not both <include> and <exclude> at the same time.
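
The include/exclude rule can be summarized in a few lines of Python. The sketch below is a model of the rule as described, not Harvest's implementation: it assumes patterns are shell-style globs (as *.xml suggests) matched case-sensitively against the file name, and neither detail is confirmed by this guide. The name accept_file is hypothetical.

```python
from fnmatch import fnmatchcase


def accept_file(name, includes=(), excludes=()):
    """Decide whether a file name passes the configured file filter."""
    if includes and excludes:
        # Mirrors the rule above: the two filter kinds cannot be mixed.
        raise ValueError("use <include> or <exclude> filters, not both")
    if includes:
        return any(fnmatchcase(name, pattern) for pattern in includes)
    if excludes:
        return not any(fnmatchcase(name, pattern) for pattern in excludes)
    return True  # no filters: every file is processed


print(accept_file("bundle_ladee_ldex.xml", includes=["*.xml"]))
print(accept_file("ldex_data.tab", includes=["*.xml"]))
```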

Excluding Sub-Directories

For example, to exclude the xml_schema subfolder, add a directory filter to the Harvest configuration file:

<harvest>
  <directories>
...
    <directoryFilter>
      <exclude>xml_schema</exclude>
    </directoryFilter>
  </directories>
</harvest>

There is no include option for sub-directories.

Filtering Products

You can include or exclude products. For example, to process only documents, add the following product filter to the Harvest configuration file:

<harvest>
  <directories>
...
    <productFilter>
      <include>Product_Document</include>
    </productFilter>
  </directories>
</harvest>