Harvest
- Overview
- Quick Start
- Package ID
- Node Name
- Registry Integration
- Input Directories and Filters
- Crawl Directories
- Crawl Bundles
- Filtering Bundle Versions
- Filtering Bundle's Collections
- Filtering Bundle's Product Directories
- Filtering Products by Class
- Extracting More Metadata
- Label and Data File Information
- BLOB Storage
- File Reference / Access URL
- Extract Metadata by XPath
- Extract Metadata by Data Dictionary Class
- Internal References
Overview
Harvest is a command-line tool for extracting metadata from PDS4 products (labels). It parses PDS4 files and stores the extracted metadata in a "newline delimited JSON" or XML file. The JSON file can be loaded into Elasticsearch by Registry Manager. The XML file can be loaded into Apache Solr by the Solr "post" tool or from the Solr admin UI.
Harvest executable scripts for Windows (harvest.bat) and Linux / Mac (harvest) are located in the bin sub-folder of the installation directory (e.g., /home/pds/harvest/).
To see the basic usage information (shown below), run Harvest without any parameters.
Usage: harvest <options>

Required parameters:
  -c <file>     Harvest configuration file
Optional parameters:
  -f <format>   Output format ('json' or 'xml'). Default is 'json'
  -o <dir>      Output directory. Default is /tmp/harvest/out
  -l <file>     Log file. Default is /tmp/harvest/harvest.log
  -v <level>    Logger verbosity: 0=Debug, 1=Info (default), 2=Warning, 3=Error
Quick Start
To run Harvest you need an XML configuration file. The configuration file has several sections that control which files Harvest crawls and what data it extracts. A very basic configuration is shown below.
<?xml version="1.0" encoding="UTF-8"?>
<harvest nodeName="PDS_SBN">
  <directories>
    <path>/data/LADEE/ldex_20161118</path>
  </directories>
</harvest>
If you save this file as /tmp/ladee.cfg and run Harvest
harvest -c /tmp/ladee.cfg
all XML files in the /data/LADEE/ldex_20161118 folder and its subfolders will be processed. By default, Harvest extracts only a few metadata fields, such as lid, vid, title, and product class. Extracted metadata is saved in the registry-docs.json file. The default output directory is /tmp/harvest/out/.
The JSON file has two lines per record / PDS label. The first line contains the document ID, for example,
{"index":{"_id":"urn:nasa:pds:ladee_ldex::1.2"}}
The second line contains all metadata extracted from a PDS label. In the example below we split the JSON into multiple lines to make it more readable.
{
  "lid": "urn:nasa:pds:ladee_ldex",
  "vid": 1.2,
  "lidvid": "urn:nasa:pds:ladee_ldex::1.2",
  "title": "LADEE LUNAR DUST EXPERIMENT",
  "product_class": "Product_Bundle",
  "_package_id": "9e7c65ec-cc82-4e2f-b2b7-365dc4d028e0"
}
The JSON file generated by Harvest can be loaded into Elasticsearch by the Registry Manager tool or by calling the Elasticsearch bulk load API.
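The two-lines-per-record layout described above can be read back with a few lines of Python. This is a minimal sketch for inspecting the output, not part of Harvest; the function name read_bulk_records and the sample lines are illustrative assumptions.

```python
import json
from typing import Iterable, Iterator, Tuple


def read_bulk_records(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (document ID, metadata) pairs from bulk-format lines."""
    # Drop blank lines so a trailing newline does not break the pairing.
    content = [line for line in lines if line.strip()]
    # Records come in pairs: an action line followed by a document line.
    for action_line, doc_line in zip(content[0::2], content[1::2]):
        action = json.loads(action_line)
        doc = json.loads(doc_line)
        yield action["index"]["_id"], doc


# Hypothetical sample mirroring the record shown above.
sample = [
    '{"index":{"_id":"urn:nasa:pds:ladee_ldex::1.2"}}',
    '{"lid":"urn:nasa:pds:ladee_ldex","vid":1.2,"title":"LADEE LUNAR DUST EXPERIMENT"}',
]

for doc_id, doc in read_bulk_records(sample):
    print(doc_id, doc["title"])
```

To inspect a real run, pass an open file handle for registry-docs.json instead of the sample list.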
Package ID
Each Harvest run generates a unique package ID, stored in the _package_id field. After loading extracted metadata into Elasticsearch, all documents from a particular Harvest run can be deleted by this package ID.
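One way to delete a run's documents is the Elasticsearch delete-by-query API. The sketch below only builds the request URL and body and does not send anything. It assumes that _package_id is indexed as an exact-match (keyword) field, which depends on your index mapping; the function name is hypothetical.

```python
import json
from typing import Tuple


def delete_by_package_request(base_url: str, index: str, package_id: str) -> Tuple[str, str]:
    """Return (url, JSON body) for a delete-by-query call on one package ID."""
    url = f"{base_url.rstrip('/')}/{index}/_delete_by_query"
    # Assumption: _package_id is a keyword field, so a term query matches
    # the whole ID exactly rather than individual tokens.
    body = {"query": {"term": {"_package_id": package_id}}}
    return url, json.dumps(body)


url, body = delete_by_package_request(
    "http://localhost:9200", "registry", "9e7c65ec-cc82-4e2f-b2b7-365dc4d028e0"
)
print("POST", url)
print(body)
```

Running the same query as a search first is a cheap way to confirm that it selects only the intended documents.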
Node Name
Node name is a required parameter which is used to tag ingested data with the node it is ingested by.
<harvest nodeName="PDS_SBN"> ...
One of the following values can be used:
- PDS_ATM - Planetary Data System: Atmospheres Node
- PDS_ENG - Planetary Data System: Engineering Node
- PDS_GEO - Planetary Data System: Geosciences Node
- PDS_IMG - Planetary Data System: Imaging Node
- PDS_NAIF - Planetary Data System: NAIF Node
- PDS_RMS - Planetary Data System: Rings Node
- PDS_SBN - Planetary Data System: Small Bodies Node at University of Maryland
- PSA - Planetary Science Archive
- JAXA - Japan Aerospace Exploration Agency
- ROSCOSMOS - Russian State Corporation for Space Activities
This value is saved in "ops:Harvest_Info/ops:node_name" field:
{
  "lidvid": "urn:nasa:pds:ladee_ldex::1.2",
  "title": "LADEE LUNAR DUST EXPERIMENT",
  "ops:Harvest_Info/ops:node_name": "PDS_SBN",
  ...
}
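Because nodeName is required and limited to the values listed above, a configuration file can be checked before a run. The following is a hypothetical pre-flight check, not something Harvest provides; the function name and error messages are assumptions.

```python
import xml.etree.ElementTree as ET

# Node names copied from the list in this section.
ALLOWED_NODE_NAMES = {
    "PDS_ATM", "PDS_ENG", "PDS_GEO", "PDS_IMG", "PDS_NAIF",
    "PDS_RMS", "PDS_SBN", "PSA", "JAXA", "ROSCOSMOS",
}


def check_node_name(config_xml: str) -> str:
    """Return the nodeName of a Harvest configuration or raise ValueError."""
    root = ET.fromstring(config_xml)
    node_name = root.get("nodeName")
    if node_name is None:
        raise ValueError("missing required nodeName attribute")
    if node_name not in ALLOWED_NODE_NAMES:
        raise ValueError(f"unknown nodeName: {node_name}")
    return node_name


print(check_node_name('<harvest nodeName="PDS_SBN"></harvest>'))
```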
Registry Integration
Harvest can query the PDS Registry to find out whether a product is already registered.
To point Harvest at a Registry so that only non-registered products are processed, add the following optional configuration section.
<harvest nodeName="PDS_SBN">
  ...
  <registry url="http://localhost:9200" index="registry" auth="/path/to/auth.cfg" />
  ...
</harvest>
<registry> attributes:
- url - Registry (Elasticsearch) URL
- index - Elasticsearch index name. This is an optional parameter. Default value is 'registry'.
- auth - Registry (Elasticsearch) authentication configuration file. This is an optional parameter. The Registry security configuration is described in the following section.
Input Directories and Filters
Crawl Directories
To process products from one or more directories, add the following section to the Harvest configuration file:
<harvest nodeName="PDS_SBN">
  ...
  <directories>
    <path>/some-directory/sub-dir-1/</path>
    <path>/some-directory/sub-dir-2/</path>
  </directories>
  ...
</harvest>
NOTE: You cannot have both <directories> and <bundles> sections at the same time.
Crawl Bundles
To process products from one or more bundles, add the following section to the Harvest configuration file:
<harvest nodeName="PDS_SBN">
  ...
  <bundles>
    <bundle dir="/data/geo/urn-nasa-pds-kaguya_grs_spectra" />
    <bundle dir="/data/geo/urn-nasa-pds-trang2020_moon_space_weathering" />
  </bundles>
  ...
</harvest>
NOTE: You cannot have both <directories> and <bundles> sections at the same time.
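The rule in the note above is easy to check before starting a long crawl. The sketch below is an illustrative helper, not Harvest's own validation; the function name is an assumption, and it returns None when neither section is present because this guide does not say how Harvest treats that case.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def input_section(config_xml: str) -> Optional[str]:
    """Return 'directories', 'bundles', or None; raise if both are present."""
    root = ET.fromstring(config_xml)
    has_directories = root.find("directories") is not None
    has_bundles = root.find("bundles") is not None
    if has_directories and has_bundles:
        raise ValueError("use either <directories> or <bundles>, not both")
    if has_directories:
        return "directories"
    if has_bundles:
        return "bundles"
    return None


config = """<harvest nodeName="PDS_SBN">
  <bundles>
    <bundle dir="/data/OREX/orex_spice" />
  </bundles>
</harvest>"""
print(input_section(config))
```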
Filtering Bundle Versions
Use the "versions" attribute of the <bundle> tag to list the bundle versions to process. You can separate versions with a comma, semicolon, or space.
<harvest nodeName="PDS_SBN">
  ...
  <bundles>
    <bundle dir="/data/OREX/orex_spice" versions="7.0;8.0" />
  </bundles>
  ...
</harvest>
To process all versions, either use versions="all" or omit the versions attribute.
<harvest nodeName="PDS_SBN">
  ...
  <bundles>
    <bundle dir="/data/OREX/orex_spice" versions="all" />
  </bundles>
  ...
</harvest>
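The accepted separators can be illustrated with a small parser. This is a sketch of the rule as described in this section, not Harvest's actual implementation; the function name and the convention of returning None for "all versions" are assumptions.

```python
import re
from typing import List, Optional


def parse_versions(value: Optional[str]) -> Optional[List[str]]:
    """Return the listed versions, or None when all versions are wanted."""
    # A missing attribute and versions="all" both mean "process everything".
    if value is None or value.strip() == "all":
        return None
    # Split on any run of commas, semicolons, or whitespace.
    return [v for v in re.split(r"[,;\s]+", value.strip()) if v]


print(parse_versions("7.0;8.0"))
print(parse_versions("7.0, 8.0 9.0"))
print(parse_versions("all"))
```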
Filtering Bundle's Collections
By default, Harvest processes all collections listed in the <Bundle_Member_Entry> section of a bundle. To process a subset of collections, provide a list of lids or lidvids as shown below.
<!-- Filter by collection LID -->
<bundle dir="/data/OREX/orex_spice" versions="8.0">
  <collection lid="urn:nasa:pds:orex.spice:spice_kernels" />
</bundle>

<!-- Filter by collection LIDVID -->
<bundle dir="/data/OREX/orex_spice" versions="8.0;7.0">
  <collection lidvid="urn:nasa:pds:orex.spice:spice_kernels::8.0" />
  <collection lidvid="urn:nasa:pds:orex.spice:spice_kernels::7.0" />
</bundle>
Filtering Bundle's Product Directories
By default, Harvest processes all products listed in the collection inventory file. To process a subset of products, provide a list of directories.
<bundle dir="/data/OREX/orex_spice" versions="8.0">
  <!-- Specify a substring in a relative (to the bundle root) directory name. -->
  <product dir="/fk/" />
</bundle>
Filtering Products by Class
You can include or exclude products of a particular class. For example, to process only documents, add the following product filter to the Harvest configuration file:
<harvest nodeName="PDS_SBN">
  ...
  <productFilter>
    <include>Product_Document</include>
  </productFilter>
  ...
</harvest>
Extracting More Metadata
Label and Data File Information
To extract label and data file information, such as file name, MIME type, size, and MD5 hash, include the following section in the configuration file.
<fileInfo />
Now if you run Harvest, both the label file information
"ops:Label_File_Info/ops:creation_date_time":"2020-11-18T22:25:05Z",
"ops:Label_File_Info/ops:file_name":"naif0012.xml",
"ops:Label_File_Info/ops:file_ref":"/C:/tmp/d5/naif0012.xml",
"ops:Label_File_Info/ops:file_size":"3398",
"ops:Label_File_Info/ops:md5_checksum":"69ea2974a93854d90399b8b8fc3d1334"
and data file information
"ops:Data_File_Info/ops:creation_date_time":"2020-11-18T22:25:17Z",
"ops:Data_File_Info/ops:file_name":"naif0012.tls",
"ops:Data_File_Info/ops:file_ref":"/C:/tmp/d5/naif0012.tls",
"ops:Data_File_Info/ops:file_size":"5257",
"ops:Data_File_Info/ops:md5_checksum":"25a2fff30b0dedb4d76c06727b1895b1",
"ops:Data_File_Info/ops:mime_type":"text/plain",
will be extracted.
If you don't want to process data files, add the following flag in Harvest configuration file.
<fileInfo processDataFiles="false" />
BLOB Storage
You can store whole PDS product labels as BLOBs (Binary Large OBjects). To enable this feature, modify the fileInfo section in the Harvest configuration file.
<fileInfo storeLabels="true" />
After running Harvest, the registry-docs.json file will have an "ops:Label_File_Info/ops:blob" field containing the compressed product label. You can expect a compression ratio of up to about 9:1 for some files. For example, many LADEE housekeeping labels are about 45KB, and the compressed BLOB is about 5KB. For smaller files, such as collection labels, the ratio is about 3.5:1 (a 5.5KB file is compressed to 1.6KB).
After loading data into Elasticsearch, you can extract BLOBs by running Registry Manager tool:
registry-manager export-file -lidvid urn:nasa:pds:ladee_ldex:data_calibrated::1.2 -file /tmp/data_calibrated.xml
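To see why XML labels compress as well as the figures above suggest, the sketch below compresses a synthetic, highly repetitive label and restores it. It assumes DEFLATE (zlib) compression with Base64 text encoding; this guide does not state which scheme Harvest uses, so treat both the scheme and the function names as illustrative.

```python
import base64
import zlib


def pack_label(xml_text: str) -> str:
    """Compress a label and encode it as text suitable for a JSON field."""
    compressed = zlib.compress(xml_text.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


def unpack_label(blob: str) -> str:
    """Reverse pack_label and return the original label text."""
    return zlib.decompress(base64.b64decode(blob)).decode("utf-8")


# Synthetic label: real labels repeat tag names in much the same way.
label = "<Product>" + "<field>value</field>" * 500 + "</Product>"
blob = pack_label(label)

original_size = len(label.encode("utf-8"))
compressed_size = len(base64.b64decode(blob))
ratio = original_size / compressed_size
print(f"{original_size} bytes -> {compressed_size} bytes, ratio about {ratio:.0f}:1")
```

The ratio for this synthetic input is much higher than for real labels, which contain far more varied content.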
File Reference / Access URL
Harvest extracts absolute paths of product and label files, such as
"ops:Label_File_Info/ops:file_ref":"/tmp/d5/naif0012.xml",
"ops:Data_File_Info/ops:file_ref":"/tmp/d5/naif0012.tls",
Note that on Windows, backslashes are replaced with forward slashes and the drive letter is included.
"ops:Label_File_Info/ops:file_ref":"/C:/tmp/d4/bundle_orex_spice_v009.xml",
To replace a file path prefix with another value, such as a URL, add a <fileRef/> tag to the Harvest configuration file:
<fileInfo>
  <fileRef replacePrefix="/C:/tmp/d4/" with="https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/" />
</fileInfo>
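Conceptually, this is a plain prefix substitution on the file_ref value. The sketch below shows the idea in Python; it is an illustration under that assumption, not Harvest's code, and the function name is hypothetical.

```python
def replace_prefix(file_ref: str, prefix: str, replacement: str) -> str:
    """Swap a leading path prefix for another value, such as a base URL."""
    if file_ref.startswith(prefix):
        return replacement + file_ref[len(prefix):]
    # Paths that do not start with the prefix are left unchanged.
    return file_ref


new_ref = replace_prefix(
    "/C:/tmp/d4/bundle_orex_spice_v009.xml",
    "/C:/tmp/d4/",
    "https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/",
)
print(new_ref)
```

Note that the trailing slash on both the prefix and the replacement keeps the resulting URL well formed.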
After running Harvest, you should get a different file_ref value:
"ops:Label_File_Info/ops:file_ref": "https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/bundle_orex_spice_v009.xml"
Extract Metadata by XPath
To extract metadata by XPath, create one or more mapping files and list them in the Harvest configuration file as shown below.
<harvest nodeName="PDS_SBN">
  ...
  <xpathMaps baseDir="/home/pds/harvest/conf">
    <xpathMap filePath="common.xml" />
    <xpathMap rootElement="Product_Observational" filePath="observational.xml" />
  </xpathMaps>
</harvest>
In the example above there are two xpathMap entries. Each entry must have a filePath attribute pointing to a mapping file. A path can be either absolute or relative to the baseDir attribute of the xpathMaps tag. The baseDir attribute is optional. The same example with absolute paths is shown below.
<xpathMaps>
  <xpathMap filePath="/home/pds/harvest/conf/common.xml" />
  <xpathMap rootElement="Product_Observational" filePath="/home/pds/harvest/conf/observational.xml" />
</xpathMaps>
An xpathMap entry can have an optional rootElement attribute. Without this attribute, the XPath queries defined in a mapping file (common.xml in this example) run against every XML document processed by Harvest. With the rootElement attribute, the queries run only against XML documents with that root element.
Mapping Files
A mapping file has one or more entries, each of which maps an output field name to an XPath query. For example, to extract start_date_time and stop_date_time from observational products, you can use the following file.
<?xml version="1.0" encoding="UTF-8"?>
<xpaths>
  <xpath fieldName="start_date_time">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
  <xpath fieldName="stop_date_time">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
</xpaths>
You can use the optional dataType="date" attribute to convert valid PDS dates to ISO-8601 "instant" format (e.g., "2013-10-24T00:49:37.457Z").
<xpaths>
  <xpath fieldName="start_date_time" dataType="date">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
  <xpath fieldName="stop_date_time" dataType="date">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
</xpaths>
XML Namespaces
Harvest ignores namespaces when extracting metadata by XPath. Below is a fragment of a LADEE UVS product label which uses the "ladee" namespace for mission area fields.
<Observation_Area>
  <Mission_Area>
    <ladee:latitude>17.2367925372247</ladee:latitude>
    <ladee:longitude>194.054477731391</ladee:longitude>
    ...
To extract latitude and longitude, you can use the following XPaths without namespaces.
<xpaths>
  <xpath fieldName="latitude">//Mission_Area/latitude</xpath>
  <xpath fieldName="longitude">//Mission_Area/longitude</xpath>
</xpaths>
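The namespace-free lookup can be reproduced outside Harvest to try out a query. The sketch below uses Python's standard ElementTree, strips namespaces from every tag, and then runs the same paths. The namespace URIs in the sample label are assumptions for illustration, and ElementTree requires the relative form ".//" where the mapping file uses "//".

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root: ET.Element) -> ET.Element:
    """Remove the '{namespace-uri}' part from every element tag, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
    return root


# Sample label; both namespace URIs are assumed for this example.
label = """<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1"
    xmlns:ladee="http://pds.nasa.gov/pds4/mission/ladee/v1">
  <Observation_Area>
    <Mission_Area>
      <ladee:latitude>17.2367925372247</ladee:latitude>
      <ladee:longitude>194.054477731391</ladee:longitude>
    </Mission_Area>
  </Observation_Area>
</Product_Observational>"""

root = strip_namespaces(ET.fromstring(label))
latitude = root.findtext(".//Mission_Area/latitude")
longitude = root.findtext(".//Mission_Area/longitude")
print(latitude, longitude)
```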
Extract Metadata by Data Dictionary Class / Extract All Fields
Harvest can automatically "flatten out" PDS label files and generate field names using the following naming convention:
<namespace>:<class name>/<namespace>:<attribute name>
For example
"pds:Investigation_Area/pds:name":"LADEE",
"pds:Investigation_Area/pds:type":"Mission",
"pds:Mission_Area/ladee:activity_number":"16",
"pds:Mission_Area/ladee:activity_type":"Occultation",
"pds:Mission_Area/ladee:altitude":"245.731087458064",
To extract all fields, add the following section to the Harvest configuration file.
<autogenFields />
To extract a subset of fields, add a classFilter section, which can contain a list of either include or exclude filters (but not both). The following example extracts all fields from the mission area.
<autogenFields>
  <classFilter>
    <include>pds:Mission_Area</include>
  </classFilter>
</autogenFields>
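The field naming convention in this section can be illustrated by flattening a small label by hand. The sketch below names each leaf element after its parent class and itself. It is an approximation for illustration, not Harvest's implementation: the namespace URIs, the PREFIXES table, and the choice to collect values in lists are all assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumed mapping from namespace URI to the prefix used in field names.
PREFIXES = {
    "http://pds.nasa.gov/pds4/pds/v1": "pds",
    "http://pds.nasa.gov/pds4/mission/ladee/v1": "ladee",
}


def qualified(tag: str) -> str:
    """Turn ElementTree's '{uri}name' tag into 'prefix:name'."""
    if not tag.startswith("{"):
        return tag
    uri, _, local = tag[1:].partition("}")
    return f"{PREFIXES[uri]}:{local}"


def flatten(root: ET.Element) -> Dict[str, List[str]]:
    """Map '<ns>:<class>/<ns>:<attribute>' names to the values found."""
    fields: Dict[str, List[str]] = {}

    def visit(parent: ET.Element) -> None:
        for child in parent:
            if len(child) == 0:
                # Leaf element: name it after its parent class and itself.
                name = f"{qualified(parent.tag)}/{qualified(child.tag)}"
                fields.setdefault(name, []).append((child.text or "").strip())
            else:
                visit(child)

    visit(root)
    return fields


label = """<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1"
    xmlns:ladee="http://pds.nasa.gov/pds4/mission/ladee/v1">
  <Observation_Area>
    <Investigation_Area><name>LADEE</name><type>Mission</type></Investigation_Area>
    <Mission_Area><ladee:activity_number>16</ladee:activity_number></Mission_Area>
  </Observation_Area>
</Product_Observational>"""

fields = flatten(ET.fromstring(label))
for name, values in fields.items():
    print(name, values)
```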
Date Fields
Harvest tries to convert every field whose name contains the string "date" to ISO-8601 "instant" format (e.g., "2013-10-24T00:49:37.457Z"). If a field value cannot be converted to ISO-8601 format, a warning message is printed and the original value is saved.
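The convert-or-keep behavior described above can be sketched as follows. This is not Harvest's converter: the list of accepted input formats is a small assumed subset of PDS date formats, values are assumed to be UTC, and the function name is hypothetical.

```python
import logging
from datetime import datetime, timezone

# Assumed subset of input formats, tried in order.
_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M",
    "%Y-%m-%d",
)


def to_iso_instant(value: str) -> str:
    """Convert a date string to an ISO-8601 instant, or return it unchanged."""
    text = value.strip()
    if text.endswith("Z"):
        text = text[:-1]
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Assumption: values without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    # Mirror the documented fallback: warn and keep the original value.
    logging.warning("Could not convert date value: %s", value)
    return value


print(to_iso_instant("2013-10-24T00:49:37.457Z"))
print(to_iso_instant("2013-10-24T00:49:37"))
print(to_iso_instant("not a date"))
```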
Internal References
Harvest extracts lid and lidvid references from
- <Internal_Reference> elements (references are stored in registry-docs.json)
- <Bundle_Member_Entry> elements (references are stored in registry-docs.json)
- <File_Area_Inventory> inventory files (references are stored in refs-docs.json)
The following naming convention is used for reference fields:
ref_<lid or lidvid>_<reference type>[_secondary]
For example,
"ref_lid_document":"urn:nasa:pds:ladee_uvs:document:DPSIS",
"ref_lid_target":"urn:nasa:pds:context:target:satellite.moon",
"ref_lidvid_product":"urn:nasa:pds:ladee_uvs:raw:0016o_0000::1.0",
By default, both primary and secondary references are extracted. To extract only primary references, add the following section to the Harvest configuration file:
<harvest nodeName="PDS_SBN">
  ...
  <references primaryOnly="true" />
  ...
</harvest>
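The reference field naming convention described in this section can be expressed as a small helper. This is a sketch only: the function name is hypothetical, the reference type is passed in as-is because this guide does not describe how Harvest derives it, and a value is treated as a lidvid when it contains the "::" version separator.

```python
def reference_field_name(reference: str, ref_type: str, secondary: bool = False) -> str:
    """Build a field name of the form ref_<lid or lidvid>_<type>[_secondary]."""
    # A lidvid carries a version after "::"; a lid does not.
    kind = "lidvid" if "::" in reference else "lid"
    name = f"ref_{kind}_{ref_type}"
    if secondary:
        name += "_secondary"
    return name


print(reference_field_name("urn:nasa:pds:ladee_uvs:document:DPSIS", "document"))
print(reference_field_name("urn:nasa:pds:ladee_uvs:raw:0016o_0000::1.0", "product"))
print(reference_field_name("urn:nasa:pds:context:target:satellite.moon", "target", secondary=True))
```

With primaryOnly="true", fields carrying the _secondary suffix would not appear in the output.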