Harvest

Overview

Harvest is a command-line tool for extracting metadata from PDS4 product labels. It parses PDS4 label files and stores the extracted metadata in either a newline-delimited JSON file or an XML file. The JSON file can be loaded into Elasticsearch with Registry Manager. The XML file can be loaded into Apache Solr with the Solr "post" tool or from the Solr admin UI.

The Harvest executable scripts for Windows (harvest.bat) and Linux / Mac (harvest) are located in the bin sub-folder of the installation directory (e.g., /home/pds/harvest/).

To see the basic usage information (shown below), run Harvest without any parameters.

Usage: harvest <options>

Required parameters:
  -c <file>     Harvest configuration file
Optional parameters:
  -f <format>   Output format ('json' or 'xml'). Default is 'json'
  -o <dir>      Output directory. Default is /tmp/harvest/out
  -l <file>     Log file. Default is /tmp/harvest/harvest.log
  -v <level>    Logger verbosity: 0=Debug, 1=Info (default), 2=Warning, 3=Error

Quick Start

To run Harvest, you need an XML configuration file. The configuration file has several sections that control which folders and files Harvest will crawl and what data it will extract. A very basic configuration is shown below.

<?xml version="1.0" encoding="UTF-8"?>

<harvest>
  <directories>
    <path>/data/LADEE/ldex_20161118</path>
    <fileFilter>
      <include>*.xml</include>
    </fileFilter>
  </directories>
</harvest>

If you save this file as /tmp/ladee.cfg and run Harvest

harvest -c /tmp/ladee.cfg

all XML files in the /data/LADEE/ldex_20161118 folder and its subfolders will be processed. By default, Harvest extracts only a few metadata fields, such as lid, vid, title, and product class. The extracted metadata is saved in the es-docs.json file in the output directory, which defaults to /tmp/harvest/out/.
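If you launch Harvest from a script, the command line can be assembled from the options listed in the usage message. The following Python sketch only builds the argument list; the function name and its keyword arguments are made up for this illustration, and only the -c, -f, -o, -l and -v flags come from the usage text above.

```python
import shlex


def build_harvest_command(config, fmt=None, out_dir=None, log_file=None,
                          verbosity=None, executable="harvest"):
    """Build an argument list for Harvest (hypothetical helper, not part of Harvest)."""
    command = [executable, "-c", config]
    if fmt is not None:
        if fmt not in ("json", "xml"):
            raise ValueError("format must be 'json' or 'xml'")
        command += ["-f", fmt]
    if out_dir is not None:
        command += ["-o", out_dir]
    if log_file is not None:
        command += ["-l", log_file]
    if verbosity is not None:
        if verbosity not in (0, 1, 2, 3):
            raise ValueError("verbosity must be 0, 1, 2 or 3")
        command += ["-v", str(verbosity)]
    return command


command = build_harvest_command("/tmp/ladee.cfg", fmt="json", verbosity=2)
command_line = shlex.join(command)  # printable form of the same command
```

Passing the list to subprocess.run(command, check=True) would start Harvest, assuming the harvest script is on your PATH.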

The JSON file has two lines per record (one record per PDS label). The first line contains the document ID, for example,

{"index":{"_id":"urn:nasa:pds:ladee_ldex::1.2"}}

The second line contains all metadata extracted from a PDS label. In the example below we split the JSON into multiple lines to make it more readable.

{
  "lid":"urn:nasa:pds:ladee_ldex",
  "vid":"1.2",
  "lidvid":"urn:nasa:pds:ladee_ldex::1.2",
  "title":"LADEE LUNAR DUST EXPERIMENT",
  "product_class":"Product_Bundle",
  "_package_id":"9e7c65ec-cc82-4e2f-b2b7-365dc4d028e0"
}

The JSON file generated by Harvest can be loaded into Elasticsearch by the Registry Manager tool or by calling the Elasticsearch bulk API.
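The two-line record layout can be read back with a few lines of Python, and the same bytes can be posted to the Elasticsearch bulk endpoint. This is a minimal sketch: the helper names, the base URL http://localhost:9200 and the index name "registry" are assumptions for illustration, not values defined by Harvest.

```python
import json
import urllib.request


def read_records(lines):
    """Yield (document ID, metadata) pairs from Harvest's two-line records."""
    rows = [line for line in lines if line.strip()]
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    for action, source in zip(rows[0::2], rows[1::2]):
        yield json.loads(action)["index"]["_id"], json.loads(source)


def build_bulk_request(ndjson_bytes, base_url="http://localhost:9200", index="registry"):
    """Build (but do not send) a POST request for the Elasticsearch bulk API."""
    if not ndjson_bytes.endswith(b"\n"):
        ndjson_bytes += b"\n"  # the bulk API expects a trailing newline
    return urllib.request.Request(
        f"{base_url}/{index}/_bulk",
        data=ndjson_bytes,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )


sample = [
    '{"index":{"_id":"urn:nasa:pds:ladee_ldex::1.2"}}\n',
    '{"lid":"urn:nasa:pds:ladee_ldex","vid":"1.2"}\n',
]
records = list(read_records(sample))
request = build_bulk_request("".join(sample).encode("utf-8"))
```

To read a real file, pass an open file handle for es-docs.json to read_records. Calling urllib.request.urlopen(request) would send the request to a running Elasticsearch server.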

Package ID

Each Harvest run generates a unique package ID, which is stored in the _package_id field. After the extracted metadata is loaded into Elasticsearch, all documents from a particular Harvest run can be deleted by this package ID.
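One possible way to delete a package is the Elasticsearch delete-by-query API. The sketch below only constructs the request; the base URL, the index name "registry" and the helper name are assumptions, and the term query assumes that _package_id is indexed as an exact-match (keyword) field.

```python
import json
import urllib.request


def build_delete_request(package_id, base_url="http://localhost:9200", index="registry"):
    """Build (but do not send) a delete-by-query request for one Harvest run."""
    query = {"query": {"term": {"_package_id": package_id}}}
    return urllib.request.Request(
        f"{base_url}/{index}/_delete_by_query",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


delete_request = build_delete_request("9e7c65ec-cc82-4e2f-b2b7-365dc4d028e0")
```

Sending the request with urllib.request.urlopen(delete_request) would remove every document whose _package_id matches the given value.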

Input Directories and Filters

Crawl Products from Multiple Directories

To process products from multiple directories, specify multiple <path> entries in the Harvest configuration file:

<harvest>
  <directories>
    <path>/data/LADEE/ldex_20161118/data_calibrated</path>
    <path>/data/LADEE/ldex_20161118/data_derived</path>
...
</harvest>
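If the list of directories is long or changes often, the configuration file can be generated instead of edited by hand. The following Python sketch writes only the elements shown in this guide (harvest, directories, path, fileFilter, include); the function name is hypothetical and ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_config(paths, include_patterns=("*.xml",)):
    """Return Harvest configuration XML with one <path> entry per directory."""
    harvest = ET.Element("harvest")
    directories = ET.SubElement(harvest, "directories")
    for path in paths:
        ET.SubElement(directories, "path").text = path
    if include_patterns:
        file_filter = ET.SubElement(directories, "fileFilter")
        for pattern in include_patterns:
            ET.SubElement(file_filter, "include").text = pattern
    ET.indent(harvest)  # pretty-print with two-space indentation
    body = ET.tostring(harvest, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


config_text = build_config([
    "/data/LADEE/ldex_20161118/data_calibrated",
    "/data/LADEE/ldex_20161118/data_derived",
])
```

Writing config_text to a file such as /tmp/ladee.cfg gives a configuration that can be passed to Harvest with the -c option.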

Filtering Files

This feature is of limited use, since all PDS4 product labels are XML files and Harvest cannot process other file types. Most Harvest configuration files include the following file filter:

<harvest>
  <directories>
...
    <fileFilter>
      <include>*.xml</include>
    </fileFilter>
  </directories>
</harvest>

If you don't provide any filters, Harvest will try to process every file, and for non-XML files you will see "Unsupported MIME type" log messages.

You can use one or more <include> filters or one or more <exclude> filters, but not both.
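The include / exclude rule can be pictured with the following Python sketch. It assumes that filter patterns behave like shell-style wildcards applied to the file name, which this guide does not state explicitly; the function is an illustration, not Harvest's implementation.

```python
from fnmatch import fnmatchcase


def accept_file(name, include=(), exclude=()):
    """Decide whether a file name passes the filters (illustrative only)."""
    if include and exclude:
        raise ValueError("use include filters or exclude filters, not both")
    if include:
        return any(fnmatchcase(name, pattern) for pattern in include)
    if exclude:
        return not any(fnmatchcase(name, pattern) for pattern in exclude)
    return True  # no filters: every file is a candidate


label_ok = accept_file("naif0012.xml", include=["*.xml"])
kernel_ok = accept_file("naif0012.tls", include=["*.xml"])
```

With the include filter "*.xml", the label file is accepted and the .tls data file is skipped.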

Excluding Sub-Directories

For example, to exclude the xml_schema sub-folder, add a directory filter to the Harvest configuration file:

<harvest>
  <directories>
...
    <directoryFilter>
      <exclude>xml_schema</exclude>
    </directoryFilter>
  </directories>
</harvest>

There is no include option for sub-directories.

Filtering Products

You can include or exclude products. For example, to process only documents, add the following product filter to the Harvest configuration file:

<harvest>
  <directories>
...
    <productFilter>
      <include>Product_Document</include>
    </productFilter>
  </directories>
</harvest>

Extracting More Metadata

Label and Data File Information

To extract label and data file information, such as file name, MIME type, size, and MD5 hash, include the following section in the configuration file.

<fileInfo />

Now if you run Harvest, both the label file information

"ops/Label_File_Info/ops/creation_date_time":"2020-11-18T22:25:05Z",
"ops/Label_File_Info/ops/file_name":"naif0012.xml",
"ops/Label_File_Info/ops/file_ref":"/C:/tmp/d5/naif0012.xml",
"ops/Label_File_Info/ops/file_size":"3398",
"ops/Label_File_Info/ops/md5_checksum":"69ea2974a93854d90399b8b8fc3d1334"

and data file information

"ops/Data_File_Info/ops/creation_date_time":"2020-11-18T22:25:17Z",
"ops/Data_File_Info/ops/file_name":"naif0012.tls",
"ops/Data_File_Info/ops/file_ref":"/C:/tmp/d5/naif0012.tls",
"ops/Data_File_Info/ops/file_size":"5257",
"ops/Data_File_Info/ops/md5_checksum":"25a2fff30b0dedb4d76c06727b1895b1",
"ops/Data_File_Info/ops/mime_type":"text/plain",

will be extracted.
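For a rough idea of how such values can be derived from a file on disk, here is a Python sketch that computes the file name, absolute path, size and MD5 checksum. The function name is hypothetical, the field names are copied from the output above, and details such as creation time, MIME type detection and the exact file_ref form on Windows are left out because they are not described here.

```python
import hashlib
from pathlib import Path


def file_info(path, prefix="ops/Label_File_Info/ops/"):
    """Return file metadata keyed like Harvest's output (illustrative only)."""
    resolved = Path(path).resolve()
    digest = hashlib.md5()
    with resolved.open("rb") as handle:
        # Read in chunks so that large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return {
        prefix + "file_name": resolved.name,
        prefix + "file_ref": resolved.as_posix(),
        prefix + "file_size": str(resolved.stat().st_size),
        prefix + "md5_checksum": digest.hexdigest(),
    }
```

Calling file_info(path, prefix="ops/Data_File_Info/ops/") produces the same keys for a data file.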

If you don't want to process data files, add the following flag to the Harvest configuration file.

<fileInfo processDataFiles="false" />

BLOB Storage

You can store whole PDS product labels as BLOBs (Binary Large OBjects). To enable this feature, modify the fileInfo section in the Harvest configuration file.

<fileInfo storeLabels="true" />

After running Harvest, the es-docs.json file will have an "ops/Label_File_Info/ops/blob" field containing the compressed product label. Some files shrink by a factor of up to about 9. For example, many LADEE housekeeping labels are about 45KB, and their compressed BLOBs are about 5KB. Smaller files, such as collection labels, shrink by a factor of about 3.4 (a 5.5KB file is compressed to 1.6KB).
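The sizes quoted above correspond to compression ratios of roughly 9:1 and 3.4:1. The Python sketch below reproduces that arithmetic and shows why XML labels compress well, using zlib and Base64 as stand-ins; the encoding Harvest actually uses for the blob field is not specified here, so treat compress_label as an illustration only.

```python
import base64
import zlib


def compression_ratio(original_size, compressed_size):
    """Return how many times smaller the compressed data is."""
    return original_size / compressed_size


def compress_label(xml_text):
    """Compress label text and encode it as ASCII (assumed encoding, see above)."""
    packed = zlib.compress(xml_text.encode("utf-8"), 9)
    return base64.b64encode(packed).decode("ascii")


housekeeping_ratio = compression_ratio(45, 5)               # 45KB label, 5KB BLOB
collection_ratio = round(compression_ratio(5.5, 1.6), 1)    # 5.5KB label, 1.6KB BLOB

# Repetitive markup, typical of XML labels, compresses very well.
label = "<Product_Collection>" + "<field>value</field>" * 200 + "</Product_Collection>"
blob = compress_label(label)
```

The original text can be restored with zlib.decompress(base64.b64decode(blob)).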

After loading data into Elasticsearch, you can extract BLOBs by running Registry Manager tool:

registry-manager export-file -lidvid urn:nasa:pds:ladee_ldex:data_calibrated::1.2 -file /tmp/data_calibrated.xml

File Reference / Access URL

Harvest extracts the absolute paths of label and data files, such as

"ops/Label_File_Info/ops/file_ref":"/tmp/d5/naif0012.xml",
"ops/Data_File_Info/ops/file_ref":"/tmp/d5/naif0012.tls",

Note that on Windows, backslashes are replaced with forward slashes and the drive letter is included.

"ops/Label_File_Info/ops/file_ref":"/C:/tmp/d4/bundle_orex_spice_v009.xml",

To replace a file path prefix with another value, such as a URL, add a <fileRef/> tag to the Harvest configuration file:

<fileInfo>
  <fileRef replacePrefix="/C:/tmp/d4/" with="https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/" />
</fileInfo>

After running Harvest, you should get a different file_ref value:

"ops/Label_File_Info/ops/file_ref":
    "https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/bundle_orex_spice_v009.xml"

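The effect of the replacePrefix and with attributes can be expressed as a simple string operation. In this Python sketch, the function name is made up, and leaving non-matching paths unchanged is an assumption about Harvest's behavior.

```python
def replace_prefix(file_ref, prefix, replacement):
    """Swap a leading path prefix for another value, such as a URL."""
    if file_ref.startswith(prefix):
        return replacement + file_ref[len(prefix):]
    return file_ref  # assumption: paths without the prefix are left as they are


url = replace_prefix(
    "/C:/tmp/d4/bundle_orex_spice_v009.xml",
    "/C:/tmp/d4/",
    "https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/",
)
```

Note that the prefix must match the beginning of the extracted path exactly, including the trailing slash.
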
Internal References

To extract all internal references, add the following section to the Harvest configuration file.

<internalRefs prefix="ref_">
  <lidvid convertToLid="true" keep="true" />
</internalRefs>

Example output is shown below.

"ref_lid_document":"urn:nasa:pds:ladee_uvs:document:DPSIS",
"ref_lid_instrument":"urn:nasa:pds:context:instrument:instrument.uvs__ladee",
"ref_lid_investigation":"urn:nasa:pds:context:investigation:mission.ladee",
"ref_lid_product":"urn:nasa:pds:ladee_uvs:calibration:wavelength",
"ref_lid_product":"urn:nasa:pds:ladee_uvs:raw:0016o_0000",
"ref_lid_target":"urn:nasa:pds:context:target:satellite.moon",
"ref_lidvid_product":"urn:nasa:pds:ladee_uvs:calibration:wavelength::1.0",
"ref_lidvid_product":"urn:nasa:pds:ladee_uvs:raw:0016o_0000::1.0",

The format of generated field names is as follows:

<prefix><lid or lidvid>_<reference type>

The prefix is configurable. Lidvids can be converted to lids. If the keep attribute is true, as in the example above, both the original lidvid reference and the generated lid reference are saved. If the keep attribute is false, only the lid reference is saved.
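The naming rule and the effect of the convertToLid and keep attributes can be summarized in a short Python sketch. The function is hypothetical; the reference type ("product", "document" and so on) is taken as an input because this guide does not describe how it is derived, and the behavior when convertToLid is false is an assumption.

```python
def reference_fields(reference, ref_type, prefix="ref_", convert_to_lid=True, keep=True):
    """Return (field name, value) pairs for one internal reference."""
    lid, separator, _version = reference.partition("::")
    if not separator:
        # Plain lid reference: there is no version to strip.
        return [(f"{prefix}lid_{ref_type}", reference)]
    fields = []
    if convert_to_lid:
        fields.append((f"{prefix}lid_{ref_type}", lid))
    if keep or not convert_to_lid:
        fields.append((f"{prefix}lidvid_{ref_type}", reference))
    return fields


both = reference_fields("urn:nasa:pds:ladee_uvs:raw:0016o_0000::1.0", "product")
lid_only = reference_fields("urn:nasa:pds:ladee_uvs:raw:0016o_0000::1.0", "product", keep=False)
```

Here both holds the ref_lid_product and ref_lidvid_product entries, while lid_only holds just the ref_lid_product entry.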

Extract Metadata by XPath

To extract metadata by XPath, you have to create one or more mapping files and list them in the Harvest configuration file as shown below.

<harvest>
...
  <xpathMaps baseDir="/home/pds/harvest/conf">
    <xpathMap filePath="common.xml" />
    <xpathMap rootElement="Product_Observational" filePath="observational.xml" />
  </xpathMaps>
</harvest>

In the example above, there are two xpathMap entries. Each entry must have a filePath attribute pointing to a mapping file. The path can be either absolute or relative to the baseDir attribute of the xpathMaps tag. The baseDir attribute is optional. The same example with absolute paths is shown below.

  <xpathMaps>
    <xpathMap filePath="/home/pds/harvest/conf/common.xml" />
    <xpathMap rootElement="Product_Observational" filePath="/home/pds/harvest/conf/observational.xml" />
  </xpathMaps>

An xpathMap entry can have an optional rootElement attribute. Without this attribute, the XPath queries defined in a mapping file (common.xml in the example above) run against every XML document processed by Harvest. With the rootElement attribute, the queries run only against XML documents with that root element.
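The selection rule can be written as a small Python function. This is an illustrative sketch only: the dictionaries mirror the xpathMap attributes from the example above, and the function name is hypothetical.

```python
def applicable_maps(root_element, xpath_maps):
    """Return the mapping files whose queries apply to a label with this root element."""
    return [
        entry["filePath"]
        for entry in xpath_maps
        if entry.get("rootElement") in (None, root_element)
    ]


xpath_maps = [
    {"filePath": "common.xml"},
    {"rootElement": "Product_Observational", "filePath": "observational.xml"},
]
for_observational = applicable_maps("Product_Observational", xpath_maps)
for_document = applicable_maps("Product_Document", xpath_maps)
```

An observational product is matched by both mapping files, while a document product is matched only by common.xml.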

Mapping Files

A mapping file has one or more entries which map an output field name to an XPath query. For example, to extract start_date_time and stop_date_time from observational products, you can use the following file.

<?xml version="1.0" encoding="UTF-8"?>
<xpaths>
  <xpath fieldName="start_date_time">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
  <xpath fieldName="stop_date_time">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
</xpaths>

You can use the optional dataType="date" attribute to convert valid PDS dates to ISO-8601 "instant" format (e.g., "2013-10-24T00:49:37.457Z").

<xpaths>
  <xpath fieldName="start_date_time" dataType="date">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
  <xpath fieldName="stop_date_time" dataType="date">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
</xpaths>

XML Namespaces

Harvest ignores namespaces when extracting metadata by XPath. Below is a fragment of a LADEE UVS product label that uses the "ladee" namespace for mission area fields.

  <Observation_Area>
    <Mission_Area>
      <ladee:latitude>17.2367925372247</ladee:latitude>
      <ladee:longitude>194.054477731391</ladee:longitude> 
      ...

To extract latitude and longitude you can use the following XPaths without namespaces.

<xpaths>
  <xpath fieldName="latitude">//Mission_Area/latitude</xpath>
  <xpath fieldName="longitude">//Mission_Area/longitude</xpath>
</xpaths>
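One way to get the same namespace-agnostic matching in your own scripts is to strip namespaces from element names before running a query. The Python sketch below does this with the standard library; it is not necessarily how Harvest implements the feature, and the namespace URI in the sample is a placeholder chosen for this example.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Rewrite '{uri}name' tags to plain 'name' so that queries can ignore namespaces."""
    for element in root.iter():
        if isinstance(element.tag, str) and "}" in element.tag:
            element.tag = element.tag.split("}", 1)[1]
    return root


# Placeholder namespace URI, used only to make the sample well-formed.
SAMPLE = """<Product_Observational xmlns:ladee="http://example.org/ladee">
  <Observation_Area>
    <Mission_Area>
      <ladee:latitude>17.2367925372247</ladee:latitude>
      <ladee:longitude>194.054477731391</ladee:longitude>
    </Mission_Area>
  </Observation_Area>
</Product_Observational>"""

root = strip_namespaces(ET.fromstring(SAMPLE))
latitude = root.findtext(".//Mission_Area/latitude")
longitude = root.findtext(".//Mission_Area/longitude")
```

ElementTree supports only a subset of XPath, so the queries are written relative to the root element (".//Mission_Area/latitude") rather than with a leading "//".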

Extract Metadata by Data Dictionary Class / Extract All Fields

Harvest can automatically "flatten out" PDS label files and generate field names using the following naming convention:

<namespace>/<class name>/<namespace>/<attribute name>

For example

"pds/Investigation_Area/pds/name":"LADEE",
"pds/Investigation_Area/pds/type":"Mission",
"pds/Mission_Area/ladee/activity_number":"16",
"pds/Mission_Area/ladee/activity_type":"Occultation",
"pds/Mission_Area/ladee/altitude":"245.731087458064",

To extract all fields, add the following section to the Harvest configuration file.

  <autogenFields />

To extract a subset of fields, add a classFilter section, which can contain a list of either include or exclude filters (but not both). The following example extracts all fields from the mission area.

  <autogenFields>
    <classFilter>
      <include>pds.Mission_Area</include>
    </classFilter>
  </autogenFields>
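The <namespace>/<class name>/<namespace>/<attribute name> convention can be illustrated with a short Python sketch that walks a label and builds a field name for every leaf element. This is an approximation for explanation only: the helper names and the namespace URI are made up, elements without a namespace are treated as "pds" for the purpose of the sample, and values are collected in lists because a field can occur more than once.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from namespace URI to prefix, used only for this sample.
PREFIXES = {"http://example.org/ladee": "ladee"}


def split_tag(tag, prefixes, default="pds"):
    """Return (namespace prefix, local name) for an ElementTree tag."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return prefixes.get(uri, default), local
    return default, tag


def flatten(root, prefixes):
    """Map '<ns>/<class>/<ns>/<attribute>' names to lists of leaf values."""
    fields = {}
    for parent in root.iter():
        parent_ns, class_name = split_tag(parent.tag, prefixes)
        for child in parent:
            text = (child.text or "").strip()
            if len(child) == 0 and text:  # leaf element with a value
                ns, attribute = split_tag(child.tag, prefixes)
                key = f"{parent_ns}/{class_name}/{ns}/{attribute}"
                fields.setdefault(key, []).append(text)
    return fields


SAMPLE = """<Product_Observational xmlns:ladee="http://example.org/ladee">
  <Investigation_Area><name>LADEE</name><type>Mission</type></Investigation_Area>
  <Mission_Area><ladee:activity_number>16</ladee:activity_number></Mission_Area>
</Product_Observational>"""

fields = flatten(ET.fromstring(SAMPLE), PREFIXES)
```

For the sample label, the resulting keys are pds/Investigation_Area/pds/name, pds/Investigation_Area/pds/type and pds/Mission_Area/ladee/activity_number.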

Date Fields

Harvest will try to convert all fields whose names contain the string "date" to ISO-8601 "instant" format (e.g., "2013-10-24T00:49:37.457Z"). If a field value cannot be converted to ISO-8601 format, a warning message is printed and the original value is saved.
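The convert-or-keep behavior can be sketched in Python as follows. The list of accepted input formats is an assumption made for this example (PDS4 allows more date forms than are handled here), all values are treated as UTC, and the function names are hypothetical.

```python
import logging
from datetime import datetime, timezone

# Input formats accepted by this sketch; not an exhaustive list of PDS date forms.
FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%fZ",
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d",
)


def to_instant(value):
    """Return the value in ISO-8601 instant format, or None if it cannot be parsed."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
        parsed = parsed.replace(tzinfo=timezone.utc)
        if parsed.microsecond:
            millis = parsed.microsecond // 1000
            return parsed.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"
        return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
    return None


def convert_date_fields(fields):
    """Convert fields with 'date' in their names; keep the original value on failure."""
    converted = {}
    for name, value in fields.items():
        if "date" in name:
            instant = to_instant(value)
            if instant is None:
                logging.warning("Could not convert %s value %r", name, value)
            else:
                value = instant
        converted[name] = value
    return converted


result = convert_date_fields({
    "start_date_time": "2013-10-24T00:49:37.457Z",
    "modification_date": "2013-10-24",
    "stop_date_time": "not a date",
    "title": "LADEE LUNAR DUST EXPERIMENT",
})
```

In result, the first two values are converted, the unparseable stop_date_time keeps its original value with a logged warning, and title is left alone because its name does not contain "date".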