Harvest

Overview

Harvest is a command-line tool for extracting metadata from PDS4 products (labels). It parses PDS4 files and stores the extracted metadata in a newline-delimited JSON file or an XML file. The JSON file can be loaded into Elasticsearch by Registry Manager. The XML file can be loaded into Apache Solr with the Solr "post" tool or from the Solr admin UI.

Harvest executable scripts for Windows (harvest.bat) and Linux / Mac (harvest) are located in the bin sub-folder of the installation directory (e.g., /home/pds/harvest/).

To see the basic usage information (shown below), run Harvest without any parameters.

Usage: harvest <options>

Required parameters:
  -c <file>     Harvest configuration file
Optional parameters:
  -f <format>   Output format ('json' or 'xml'). Default is 'json'
  -o <dir>      Output directory. Default is /tmp/harvest/out
  -l <file>     Log file. Default is /tmp/harvest/harvest.log
  -v <level>    Logger verbosity: DEBUG, INFO (default), WARNING, ERROR

Quick Start

To run Harvest you need an XML configuration file. The configuration file has several sections that control which files Harvest crawls and what data it extracts. The recommended starting points are the example configurations conf/examples/bundles.xml and conf/examples/directories.xml. In either file, you will want to update the nodeName:

<harvest nodeName="PDS_ATM">
Registry (Elasticsearch) configuration:
<registry url="http://localhost:9200" index="registry" auth="/path/to/auth.cfg" />
The path to the data:
  <directories>
    <path>/data/LADEE/ldex_20161118</path>
  </directories>
And the URL prefix for the data:
  <fileInfo processDataFiles="true" storeLabels="true">
    <!-- UPDATE with your own local path and base url where pds4 archive are published -->
    <fileRef replacePrefix="/data" with="https://pds-atmospheres.nmsu.edu/" />
  </fileInfo>

If you save this file as /tmp/ladee.cfg and run Harvest

harvest -c /tmp/ladee.cfg

all XML files in the /data/LADEE/ldex_20161118 folder and its subfolders will be processed. Leaving the `autogenFields` flag enabled in the configuration will ingest all metadata from the PDS4 labels. Extracted metadata is saved in the registry-docs.json file. The default output directory is /tmp/harvest/out/.

The JSON file has two lines per record (PDS label). The first line contains the document ID, for example,

{"index":{"_id":"urn:nasa:pds:ladee_ldex::1.2"}}

The second line contains all metadata extracted from a PDS label. In the example below we split the JSON into multiple lines to make it more readable.

{
  "lid": "urn:nasa:pds:ladee_ldex",
  "vid": 1.2,
  "lidvid": "urn:nasa:pds:ladee_ldex::1.2",
  "title": "LADEE LUNAR DUST EXPERIMENT",
  "product_class": "Product_Bundle",
  "_package_id": "9e7c65ec-cc82-4e2f-b2b7-365dc4d028e0"
}

The JSON file generated by Harvest can be loaded into Elasticsearch by Registry Manager tool.
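
Because each record occupies two lines, the file can be read back in pairs for a quick sanity check before loading. The Python sketch below is only an illustration and is not part of Harvest. The function name `read_records` is hypothetical, and the sketch assumes that every first line has the {"index":{"_id":...}} shape shown above.

```python
import json


def read_records(lines):
    """Yield (document_id, metadata) pairs from newline-delimited JSON lines."""
    # Drop blank lines, then walk the remaining lines two at a time.
    rows = [line for line in lines if line.strip()]
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of non-blank lines")
    for id_line, doc_line in zip(rows[0::2], rows[1::2]):
        # First line of a pair: document ID. Second line: extracted metadata.
        document_id = json.loads(id_line)["index"]["_id"]
        yield document_id, json.loads(doc_line)


# Sample data copied from the example above (metadata shortened).
sample = [
    '{"index":{"_id":"urn:nasa:pds:ladee_ldex::1.2"}}',
    '{"lid":"urn:nasa:pds:ladee_ldex","vid":1.2,"lidvid":"urn:nasa:pds:ladee_ldex::1.2"}',
]

for doc_id, metadata in read_records(sample):
    print(doc_id, metadata["lid"])
```

To check a real run, pass an open file handle instead of the sample list, for example the registry-docs.json file in the output directory.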

Package ID

Each Harvest run generates a unique package ID, which is stored in the _package_id field. After the extracted metadata is loaded into Elasticsearch, all documents from a particular Harvest run can be deleted by this package ID.

Node Name

Node name is a required parameter. It is used to tag ingested data with the node that ingested it.

<harvest nodeName="PDS_SBN">
...

One of the following values can be used:

  • PDS_ATM - Planetary Data System: Atmospheres Node
  • PDS_ENG - Planetary Data System: Engineering Node
  • PDS_GEO - Planetary Data System: Geosciences Node
  • PDS_IMG - Planetary Data System: Imaging Node
  • PDS_NAIF - Planetary Data System: NAIF Node
  • PDS_RMS - Planetary Data System: Rings Node
  • PDS_SBN - Planetary Data System: Small Bodies Node at University of Maryland
  • PSA - Planetary Science Archive
  • JAXA - Japan Aerospace Exploration Agency
  • ROSCOSMOS - Russian State Corporation for Space Activities

This value is saved in the "ops:Harvest_Info/ops:node_name" field:

{
  "lidvid": "urn:nasa:pds:ladee_ldex::1.2",
  "title": "LADEE LUNAR DUST EXPERIMENT",
  "ops:Harvest_Info/ops:node_name": "PDS_SBN",
...
}

Registry Integration

Harvest requires the PDS Registry (Elasticsearch) to get LDD (schema) information and to find out whether a product is already registered.

<harvest nodeName="PDS_SBN">
  ...
  <registry url="http://localhost:9200" index="registry" auth="/path/to/auth.cfg" />
  ...
</harvest>

<registry> attributes:

  • url - Registry (Elasticsearch) URL
  • index - Elasticsearch index name. This is an optional parameter. Default value is 'registry'.
  • auth - Registry (Elasticsearch) authentication configuration file. This is an optional parameter. The Registry security configuration is described in the following section.

Input Directories and Filters

Crawl Directories

To process products from one or more directories, add the following section to the Harvest configuration file:

<harvest nodeName="PDS_SBN">
  ...
  <directories>
    <path>/some-directory/sub-dir-1/</path>
    <path>/some-directory/sub-dir-2/</path>
  </directories>
  ...
</harvest>

NOTE: You cannot have both <directories> and <bundles> sections at the same time.

Crawl Bundles

To process products from one or more bundles, add the following section to the Harvest configuration file:

<harvest nodeName="PDS_SBN">
  ...
  <bundles>
    <bundle dir="/data/geo/urn-nasa-pds-kaguya_grs_spectra" />
    <bundle dir="/data/geo/urn-nasa-pds-trang2020_moon_space_weathering" />
  </bundles>
  ...
</harvest>

NOTE: You cannot have both <directories> and <bundles> sections at the same time.

Filtering Bundle Versions

Use the "versions" attribute of the <bundle> tag to list the bundle versions to process. Versions can be separated by commas, semicolons, or spaces.

<harvest nodeName="PDS_SBN">
  ...
  <bundles>
    <bundle dir="/data/OREX/orex_spice" versions="7.0;8.0" />
  </bundles>
  ...
</harvest>

To process all versions, either set versions="all" or omit the versions attribute.

<harvest nodeName="PDS_SBN">
  ...
  <bundles>
    <bundle dir="/data/OREX/orex_spice" versions="all" />
  </bundles>
  ...
</harvest>
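
The separator rules for the "versions" attribute can be illustrated with a short Python sketch. This is not Harvest's own parser; the function name `parse_versions` is hypothetical, and returning None to mean "all versions" is a convention chosen for this example only.

```python
import re


def parse_versions(attribute):
    """Return None for "all versions", otherwise a list of version strings."""
    # A missing attribute or the literal value "all" selects every version.
    if attribute is None or attribute.strip() == "all":
        return None
    # Versions may be separated by commas, semicolons, or spaces.
    return [item for item in re.split(r"[,;\s]+", attribute.strip()) if item]


print(parse_versions("7.0;8.0"))
print(parse_versions("7.0, 8.0 9.0"))
print(parse_versions("all"))
```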

Filtering Bundle's Collections

By default, Harvest processes all collections listed in the <Bundle_Member_Entry> section of a bundle. To process a subset of collections, provide a list of LIDs or LIDVIDs as shown below.

<!-- Filter by collection LID -->
<bundle dir="/data/OREX/orex_spice" versions="8.0" >
   <collection lid="urn:nasa:pds:orex.spice:spice_kernels" />
</bundle>

<!-- Filter by collection LIDVID -->
<bundle dir="/data/OREX/orex_spice" versions="8.0;7.0" >
   <collection lidvid="urn:nasa:pds:orex.spice:spice_kernels::8.0" />
   <collection lidvid="urn:nasa:pds:orex.spice:spice_kernels::7.0" />
</bundle>

Filtering Bundle's Product Directories

By default, Harvest processes all products listed in the collection inventory file. To process a subset of products, provide a list of directories.

<bundle dir="/data/OREX/orex_spice" versions="8.0" >
    <!-- Specify a substring in a relative (to the bundle root) directory name.  -->
    <product dir="/fk/" />
</bundle>

Filtering Products by Class

You can include or exclude products of a particular class. For example, to process only documents, add the following product filter to the Harvest configuration file:

<harvest nodeName="PDS_SBN">
  ...
  <productFilter>
    <includeClass>Product_Document</includeClass>
  </productFilter>
  ...
</harvest>

To exclude documents, add the following product filter:

<harvest nodeName="PDS_SBN">
  ...
  <productFilter>
    <excludeClass>Product_Document</excludeClass>
  </productFilter>
  ...
</harvest>

NOTE: You cannot have both include and exclude filters at the same time.
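
The include / exclude rule can be pictured with a small Python sketch. It only illustrates the behavior described above and is not Harvest's implementation; the function name `accept_product` and its parameters are hypothetical.

```python
def accept_product(product_class, include=None, exclude=None):
    """Decide whether a product of the given class should be processed."""
    include = set(include or [])
    exclude = set(exclude or [])
    if include and exclude:
        # Mirrors the NOTE above: only one kind of filter may be configured.
        raise ValueError("use either include or exclude filters, not both")
    if include:
        return product_class in include
    if exclude:
        return product_class not in exclude
    # No filter configured: every product class is processed.
    return True


print(accept_product("Product_Document", include=["Product_Document"]))
print(accept_product("Product_Document", exclude=["Product_Document"]))
```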

Extracting More Metadata

Label and Data File Information

By default, Harvest extracts label and data file information, such as file name, mime type, size, and MD5 hash.

Label:

"ops:Label_File_Info/ops:creation_date_time":"2020-11-18T22:25:05Z",
"ops:Label_File_Info/ops:file_name":"naif0012.xml",
"ops:Label_File_Info/ops:file_ref":"/C:/tmp/d5/naif0012.xml",
"ops:Label_File_Info/ops:file_size":"3398",
"ops:Label_File_Info/ops:md5_checksum":"69ea2974a93854d90399b8b8fc3d1334"

Data file:

"ops:Data_File_Info/ops:creation_date_time":"2020-11-18T22:25:17Z",
"ops:Data_File_Info/ops:file_name":"naif0012.tls",
"ops:Data_File_Info/ops:file_ref":"/C:/tmp/d5/naif0012.tls",
"ops:Data_File_Info/ops:file_size":"5257",
"ops:Data_File_Info/ops:md5_checksum":"25a2fff30b0dedb4d76c06727b1895b1",
"ops:Data_File_Info/ops:mime_type":"text/plain",

If you don't want to process data files, add the following flag to the Harvest configuration file.

<fileInfo processDataFiles="false" />

BLOB Storage

By default, Harvest stores PDS product labels as BLOBs (Binary Large Objects). Both the original PDS product labels in XML format and the labels converted to JSON are stored. The data is compressed and stored in the following fields: "ops:Label_File_Info/ops:blob" and "ops:Label_File_Info/ops:json_blob".

You can expect a compression ratio of up to about 9:1 for some files. For example, many LADEE housekeeping labels are about 45 KB, and the compressed BLOB is about 5 KB. For smaller files, such as collection labels, the ratio is about 3.4:1 (a 5.5 KB file is compressed to 1.6 KB).
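
The figures quoted above can be checked with simple arithmetic. The Python sketch below uses only the sizes given in this section; the function names are hypothetical, and the sketch says nothing about the compression algorithm Harvest uses.

```python
def compression_ratio(original_kb, compressed_kb):
    """How many times smaller the compressed BLOB is than the original label."""
    return original_kb / compressed_kb


def space_saved_percent(original_kb, compressed_kb):
    """Share of the original size that compression removes, in percent."""
    return 100.0 * (1.0 - compressed_kb / original_kb)


# Sizes taken from the examples in this section (KB).
examples = [("housekeeping label", 45.0, 5.0), ("collection label", 5.5, 1.6)]

for name, original, compressed in examples:
    ratio = compression_ratio(original, compressed)
    saved = space_saved_percent(original, compressed)
    print(f"{name}: {ratio:.1f}x smaller, {saved:.0f}% space saved")
```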

After loading data into Elasticsearch, you can extract original labels by running Registry Manager tool:

registry-manager export-file -lidvid urn:nasa:pds:ladee_ldex:data_calibrated::1.2 -file /tmp/data_calibrated.xml

To disable BLOB storage, modify the fileInfo section in the Harvest configuration file.

<fileInfo storeLabels="false" storeJsonLabels="false" />

File Reference / Access URL

Harvest extracts absolute paths of product and label files, such as

"ops:Label_File_Info/ops:file_ref":"/tmp/d5/naif0012.xml",
"ops:Data_File_Info/ops:file_ref":"/tmp/d5/naif0012.tls",

Note that on Windows, backslashes are replaced with forward slashes and the drive letter is included.

"ops:Label_File_Info/ops:file_ref":"/C:/tmp/d4/bundle_orex_spice_v009.xml",

To replace a file path prefix with another value, such as a URL, add a <fileRef/> tag to the Harvest configuration file:

<fileInfo>
  <fileRef replacePrefix="/C:/tmp/d4/" with="https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/" />
</fileInfo>

After running Harvest, you should get a different file_ref value:

"ops:Label_File_Info/ops:file_ref":
    "https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/bundle_orex_spice_v009.xml"

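The effect of the <fileRef/> tag can be sketched in a few lines of Python. This is an illustration of a literal prefix replacement, not Harvest's code; the function name `replace_prefix` is hypothetical, and the sample values are copied from the configuration above.

```python
def replace_prefix(file_ref, prefix, replacement):
    """Swap a leading path prefix for another value, such as a base URL."""
    if file_ref.startswith(prefix):
        return replacement + file_ref[len(prefix):]
    # Paths that do not start with the prefix are left unchanged.
    return file_ref


url = replace_prefix(
    "/C:/tmp/d4/bundle_orex_spice_v009.xml",
    "/C:/tmp/d4/",
    "https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/",
)
print(url)
```

In this sketch the replacement is purely textual, so the prefix and the replacement should either both end with a slash or both omit it.
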
Date Fields

To store date fields in Elasticsearch, the values have to be converted to ISO-8601 "instant" format (e.g., "2013-10-24T00:49:37.457Z"). Harvest tries to convert all fields that contain the string "date" in their names. If a field value cannot be converted to ISO-8601 format, a warning message is printed and the original value is saved.

To convert fields that don't have the string "date" in their names, list the field names in the main Harvest configuration file as shown below.

  <autogenFields>
    <dateFields>
      <field>cassini:VIMS_Specific_Attributes/cassini:earth_received_start_time</field>
      <field>cassini:VIMS_Specific_Attributes/cassini:earth_received_stop_time</field>
      <field>cassini:VIMS_Specific_Attributes/cassini:start_time_doy</field>
      <field>cassini:VIMS_Specific_Attributes/cassini:stop_time_doy</field>
      <field>cassini:VIMS_Specific_Attributes/cassini:pds3_product_creation_time</field>
    </dateFields>
  </autogenFields>
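
The conversion rule described in this section (convert when possible, otherwise warn and keep the original value) can be sketched in Python as follows. This is not Harvest's converter. The function name `to_instant` and the list of accepted input formats are assumptions made for this example, and all input values are treated as UTC.

```python
import warnings
from datetime import datetime, timezone

# Input formats accepted by this sketch (an assumption, not Harvest's list).
FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f",  # calendar date with fractional seconds
    "%Y-%m-%dT%H:%M:%S",     # calendar date
    "%Y-%m-%d",              # date only
    "%Y-%jT%H:%M:%S.%f",     # day-of-year with fractional seconds
    "%Y-%jT%H:%M:%S",        # day-of-year
)


def to_instant(value):
    """Convert a date string to ISO-8601 "instant" format, if possible."""
    text = value.strip()
    if text.endswith("Z"):
        text = text[:-1]
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    # Same fallback as described above: warn and keep the original value.
    warnings.warn(f"could not convert date value: {value!r}")
    return value


print(to_instant("2013-10-24T00:49:37.457"))
print(to_instant("2013-297T00:49:37"))
```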

Extract Metadata by XPath

To extract metadata by XPath, you have to create one or more mapping files and list them in Harvest configuration file as shown below.

<harvest nodeName="PDS_SBN">
...
  <xpathMaps baseDir="/home/pds/harvest/conf">
    <xpathMap filePath="common.xml" />
    <xpathMap rootElement="Product_Observational" filePath="observational.xml" />
  </xpathMaps>
</harvest>

In the example above there are two xpathMap entries. Each entry must have a filePath attribute pointing to a mapping file. A path can be either absolute or relative to the baseDir attribute of the xpathMaps tag. The baseDir attribute is optional. The same example with absolute paths is shown below.

  <xpathMaps>
    <xpathMap filePath="/home/pds/harvest/conf/common.xml" />
    <xpathMap rootElement="Product_Observational" filePath="/home/pds/harvest/conf/observational.xml" />
  </xpathMaps>

An xpathMap entry can have an optional rootElement attribute. Without this attribute, the XPath queries defined in a mapping file (common.xml in the example above) run against every XML document processed by Harvest. With the rootElement attribute, the queries run only against XML documents that have that root element.

Mapping Files

A mapping file has one or more entries which map an output field name to an XPath query. For example, to extract start_date_time and stop_date_time from observational products, you can use the following file.

<?xml version="1.0" encoding="UTF-8"?>
<xpaths>
  <xpath fieldName="start_date_time">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
  <xpath fieldName="stop_date_time">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
</xpaths>

You can use optional dataType="date" attribute to convert valid PDS dates to ISO-8601 "instant" format (e.g., "2013-10-24T00:49:37.457Z").

<xpaths>
  <xpath fieldName="start_date_time" dataType="date">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
  <xpath fieldName="stop_date_time" dataType="date">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
</xpaths>

XML Namespaces

Harvest ignores namespaces when extracting metadata by XPath. Below is a fragment of a LADEE UVS product label which uses the "ladee" namespace for mission area fields.

  <Observation_Area>
    <Mission_Area>
      <ladee:latitude>17.2367925372247</ladee:latitude>
      <ladee:longitude>194.054477731391</ladee:longitude> 
      ...

To extract latitude and longitude you can use the following XPaths without namespaces.

<xpaths>
  <xpath fieldName="latitude">//Mission_Area/latitude</xpath>
  <xpath fieldName="longitude">//Mission_Area/longitude</xpath>
</xpaths>
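
Namespace-free matching of this kind can be reproduced with Python's standard library for experimentation. The sketch below is an assumption-laden illustration, not Harvest's implementation: the namespace URIs in the sample label are placeholders, the helper names are hypothetical, and ElementTree needs the relative form ".//" where the mapping file uses "//".

```python
import xml.etree.ElementTree as ET

# Sample label fragment; both namespace URIs are placeholders.
LABEL = """<Product_Observational xmlns="urn:example:pds"
                       xmlns:ladee="urn:example:ladee">
  <Observation_Area>
    <Mission_Area>
      <ladee:latitude>17.2367925372247</ladee:latitude>
      <ladee:longitude>194.054477731391</ladee:longitude>
    </Mission_Area>
  </Observation_Area>
</Product_Observational>"""


def strip_namespaces(root):
    """Remove the "{namespace}" part from every element tag, in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and "}" in element.tag:
            element.tag = element.tag.split("}", 1)[1]
    return root


def extract(root, xpaths):
    """Map each output field name to the text values found by its path."""
    result = {}
    for field_name, path in xpaths.items():
        values = [e.text.strip() for e in root.findall(path) if e.text]
        if values:
            result[field_name] = values
    return result


root = strip_namespaces(ET.fromstring(LABEL))
fields = extract(root, {
    "latitude": ".//Mission_Area/latitude",
    "longitude": ".//Mission_Area/longitude",
})
print(fields)
```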

Internal References

Harvest extracts lid and lidvid references from

  • <Internal_Reference> elements (references are stored in registry-docs.json)
  • <Bundle_Member_Entry> elements (references are stored in registry-docs.json)
  • <File_Area_Inventory> inventory files (references are stored in refs-docs.json)

The following naming convention is used for reference fields:

ref_<lid or lidvid>_<reference type>[_secondary]

For example,

"ref_lid_document":"urn:nasa:pds:ladee_uvs:document:DPSIS",
"ref_lid_target":"urn:nasa:pds:context:target:satellite.moon",
"ref_lidvid_product":"urn:nasa:pds:ladee_uvs:raw:0016o_0000::1.0",

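The naming convention can be expressed as a small Python helper. This is a sketch for illustration only: the function name `ref_field_name` is hypothetical, and it assumes a reference is a lidvid exactly when it contains the "::" version separator.

```python
def ref_field_name(reference, reference_type, secondary=False):
    """Build a field name of the form ref_<lid or lidvid>_<type>[_secondary]."""
    # A lidvid carries a version after "::"; a plain lid does not.
    kind = "lidvid" if "::" in reference else "lid"
    suffix = "_secondary" if secondary else ""
    return f"ref_{kind}_{reference_type}{suffix}"


print(ref_field_name("urn:nasa:pds:ladee_uvs:document:DPSIS", "document"))
print(ref_field_name("urn:nasa:pds:ladee_uvs:raw:0016o_0000::1.0", "product"))
```
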
By default, both primary and secondary references are extracted. To extract only primary references, add the following section to the Harvest configuration file:

<harvest nodeName="PDS_SBN">
  ...
  <references primaryOnly="true" />
  ...
</harvest>