Harvest Overview
Harvest is a command-line tool for extracting metadata from PDS4 products (labels). It parses PDS4 files and stores the extracted metadata in an intermediate XML file. This intermediate file can be loaded into Solr by Registry Manager or by standard Solr tools, such as the Solr post tool.
The Harvest executable scripts for Windows (harvest.bat) and Linux / Mac (harvest) are located in the bin sub-folder of the installation directory (e.g., /home/pds/harvest/).
To see the basic usage information (shown below), run Harvest without any parameters.
    usage: harvest <options>
      -c <file>    Harvest configuration file.
      -l <file>    Log file. Default is /tmp/harvest/harvest.log.
      -o <dir>     Output directory for Solr documents. Default is /tmp/harvest/solr
      -v <level>   Logger verbosity: 0=Debug, 1=Info (default), 2=Warning, 3=Error.
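When Harvest is driven from a script, the options listed in the usage text can be assembled programmatically. The following is a minimal Python sketch, not part of Harvest itself: the helper name build_harvest_command and the example paths are hypothetical, and only the -c, -l, -o, and -v options from the usage text are used.

```python
def build_harvest_command(config, log_file=None, output_dir=None,
                          verbosity=None, executable="harvest"):
    """Assemble an argument list from the options shown in the usage text."""
    cmd = [executable, "-c", config]
    if log_file is not None:
        cmd += ["-l", log_file]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if verbosity is not None:
        # 0=Debug, 1=Info, 2=Warning, 3=Error, as listed in the usage text.
        if verbosity not in (0, 1, 2, 3):
            raise ValueError("verbosity must be an integer from 0 to 3")
        cmd += ["-v", str(verbosity)]
    return cmd


# Example paths are placeholders chosen for illustration.
cmd = build_harvest_command("/tmp/my-harvest.cfg",
                            output_dir="/tmp/my-harvest/solr",
                            verbosity=0)
print(" ".join(cmd))
# prints: harvest -c /tmp/my-harvest.cfg -o /tmp/my-harvest/solr -v 0

# To launch the tool (requires the harvest script on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.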
Quick Start
To run Harvest, you need an XML configuration file. For example, to process all XML files in the /data/LADEE/ldex_20161118 folder and all of its subfolders, create the following configuration file and save it as /tmp/ladee.cfg.
    <?xml version="1.0" encoding="UTF-8"?>
    <harvest>
      <directories>
        <path>/data/LADEE/ldex_20161118</path>
        <fileFilter>
          <include>*.xml</include>
        </fileFilter>
      </directories>
    </harvest>
Then run Harvest:
harvest -c /tmp/ladee.cfg
The tool will print some log messages and create an intermediate file, solr-docs.xml, in the default output directory /tmp/harvest/solr/. The file contains a list of Solr documents in XML format, which can be loaded into Solr by Registry Manager or standard Solr tools. An example Solr document is shown below.
    <doc>
      <field name="lid">urn:nasa:pds:ladee_ldex</field>
      <field name="vid">1.2</field>
      <field name="lidvid">urn:nasa:pds:ladee_ldex::1.2</field>
      <field name="title">LADEE LUNAR DUST EXPERIMENT</field>
      <field name="product_class">Product_Bundle</field>
      <field name="_file_name">bundle_ladee_ldex.xml</field>
      <field name="_file_type">application/xml</field>
      <field name="_file_size">5735</field>
      <field name="_file_md5">zlYAt05W/Ag6Qy4HlNYy+g==</field>
      <field name="_xml_root_element">Product_Bundle</field>
      <field name="_package_id">8627271a-01f5-49ad-8ce5-69f78fd6b5f4</field>
      <field name="instrument_host_ref">urn:nasa:pds:context:instrument_host:spacecraft.ladee</field>
      <field name="instrument_ref">urn:nasa:pds:context:instrument:instrument.ldex__ladee</field>
      <field name="investigation_ref">urn:nasa:pds:context:investigation:mission.ladee</field>
      <field name="target_ref">urn:nasa:pds:context:target:dust.dust</field>
      <field name="target_ref">urn:nasa:pds:context:target:satellite.moon</field>
    </doc>
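To inspect the intermediate file before loading it into Solr, the documents can be read back with a few lines of Python. This is only a sketch: the function name read_solr_docs is hypothetical, and the `<add>` wrapper in the sample is an assumption borrowed from Solr's XML update format (the code does not depend on it, since it looks for `<doc>` elements at any depth). Field values are collected into lists because a field such as target_ref can appear more than once.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def read_solr_docs(xml_text):
    """Return one dict per <doc>, mapping each field name to a list of values."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.iter("doc"):  # matches <doc> at any depth, including the root
        fields = defaultdict(list)
        for field in doc.findall("field"):
            fields[field.get("name")].append(field.text or "")
        docs.append(dict(fields))
    return docs


# Shortened sample; the <add> wrapper element is an assumption.
SAMPLE = """<add>
  <doc>
    <field name="lidvid">urn:nasa:pds:ladee_ldex::1.2</field>
    <field name="product_class">Product_Bundle</field>
    <field name="target_ref">urn:nasa:pds:context:target:dust.dust</field>
    <field name="target_ref">urn:nasa:pds:context:target:satellite.moon</field>
  </doc>
</add>"""

for doc in read_solr_docs(SAMPLE):
    print(doc["lidvid"][0], len(doc["target_ref"]))
# prints: urn:nasa:pds:ladee_ldex::1.2 2
```

To read a real run, the text of /tmp/harvest/solr/solr-docs.xml (or the file in your chosen output directory) can be passed to the same function.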
By default, Harvest extracts lid, vid, title, product_class, all internal references, and basic file information, such as file name, type, size, and MD5 hash.
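The _file_md5 value in the example document (zlYAt05W/Ag6Qy4HlNYy+g==) has the shape of a Base64-encoded 16-byte MD5 digest rather than the more familiar hexadecimal form. Assuming that encoding, which is inferred from the sample and not confirmed here, a label's checksum could be recomputed for comparison roughly as follows. The function name file_md5_base64 is hypothetical.

```python
import base64
import hashlib


def file_md5_base64(data: bytes) -> str:
    """Base64-encode the raw MD5 digest of the given bytes."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


# The sample value from the example document decodes to 16 bytes,
# which is the length of an MD5 digest.
sample = "zlYAt05W/Ag6Qy4HlNYy+g=="
print(len(base64.b64decode(sample)))
# prints: 16

# To check a file on disk (path is an example):
#   with open("bundle_ladee_ldex.xml", "rb") as f:
#       print(file_md5_base64(f.read()))
```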
Package ID
Each Harvest run generates a unique package ID, stored in the _package_id field. After the extracted metadata is loaded into Solr, all documents from a particular Harvest run can be deleted by this package ID.
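One way to remove a run's documents is Solr's standard XML delete-by-query command. The sketch below only builds the request body; it is an illustration using plain Solr, and Registry Manager may offer its own command for the same purpose. The function name is hypothetical, and the Solr URL and collection name mentioned in the comments are placeholders.

```python
import xml.etree.ElementTree as ET


def delete_by_package_payload(package_id):
    """Build a Solr XML delete-by-query body for one Harvest package ID."""
    # Escape characters that would break the quoted phrase in the query.
    safe_id = package_id.replace("\\", "\\\\").replace('"', '\\"')
    delete = ET.Element("delete")
    query = ET.SubElement(delete, "query")
    query.text = f'_package_id:"{safe_id}"'
    return ET.tostring(delete, encoding="unicode")


payload = delete_by_package_payload("8627271a-01f5-49ad-8ce5-69f78fd6b5f4")
print(payload)
# prints: <delete><query>_package_id:"8627271a-01f5-49ad-8ce5-69f78fd6b5f4"</query></delete>

# The payload would be POSTed with Content-Type text/xml to the update handler
# of your collection, for example (host and collection name are placeholders):
#   http://localhost:8983/solr/<collection>/update?commit=true
```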
Extracting More Metadata
To extract additional metadata, define a map from XPaths to field names. For example, to extract start_date_time and stop_date_time from observational products, first create the XML file shown below and save it as /home/pds/harvest/conf/observational.xml. You can use a different file name or directory if you prefer.
    <?xml version="1.0" encoding="UTF-8"?>
    <xpaths>
      <xpath fieldName="start_date_time">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
      <xpath fieldName="stop_date_time">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
    </xpaths>
These XPaths will be used to extract the start and stop date values, which will be saved in the intermediate XML file as the start_date_time and stop_date_time fields.
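Before running Harvest over a large archive, it can help to check an XPath against a single label. The following Python sketch approximates that check with the standard library. It rests on two assumptions that are not confirmed by this guide: that element names are matched without regard to XML namespaces (the XPaths above carry no prefixes), and that a simple slash-separated path is enough. The function names and the trimmed sample label are hypothetical. Python's ElementTree does not accept absolute paths, so the sketch matches the first step against the root element itself.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def extract_values(label_xml, xpath):
    """Return the text of every element matched by a simple absolute path."""
    root = ET.fromstring(label_xml)
    # Assumption: namespaces are ignored, so strip them from every tag.
    for element in root.iter():
        element.tag = local_name(element.tag)
    steps = xpath.strip("/").split("/")
    if steps[0] != root.tag:
        return []  # the path targets a different kind of product
    if len(steps) == 1:
        return [root.text or ""]
    return [el.text or "" for el in root.findall("/".join(steps[1:]))]


# Trimmed, illustrative label; real labels contain many more elements.
LABEL = """<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <Observation_Area>
    <Time_Coordinates>
      <start_date_time>2013-10-25T00:00:00Z</start_date_time>
      <stop_date_time>2014-04-18T04:30:00Z</stop_date_time>
    </Time_Coordinates>
  </Observation_Area>
</Product_Observational>"""

path = "/Product_Observational/Observation_Area/Time_Coordinates/start_date_time"
print(extract_values(LABEL, path))
# prints: ['2013-10-25T00:00:00Z']
```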
Next, add the following section to the Harvest configuration file.
    <harvest>
      ...
      <xpathMaps baseDir="/home/pds/harvest/conf">
        <xpathMap rootElement="Product_Observational" filePath="observational.xml" />
      </xpathMaps>
    </harvest>
Now, if you run Harvest, Solr documents for observational products will contain start and stop dates.
    <doc>
      <field name="lid">urn:nasa:pds:ladee_ldex:data_derived:derived_ldex_ltden_pds_derived_tab</field>
      <field name="vid">1.2</field>
      ...
      <field name="start_date_time">2013-10-25T00:00:00Z</field>
      <field name="stop_date_time">2014-04-18T04:30:00Z</field>
    </doc>
Note that the baseDir attribute is optional. You can also provide the full path in the filePath attribute, as shown below:
    <xpathMaps>
      <xpathMap rootElement="Product_Observational" filePath="/home/pds/harvest/conf/observational.xml" />
    </xpathMaps>
The rootElement attribute is also optional. In the above example, the XPath queries run against observational products only. If you remove the rootElement attribute, the same XPath queries run against all products, such as Product_Collection, Product_Document, etc.
Finally, you can have multiple <xpathMap> entries, even for the same rootElement.
BLOB Storage
You can store whole PDS product labels as BLOBs (Binary Large OBjects). To enable this feature, add the following section to the Harvest configuration file.
<blobStorage type="embedded" />
After running Harvest, the intermediate solr-docs.xml file will have a _file_blob field containing the zipped product label. You can expect compression ratios of up to about 9:1 for some files. For example, many LADEE housekeeping labels are about 45 KB, and the resulting BLOB is about 5 KB. For smaller files, such as collection labels, the ratio is about 3.5:1 (a 5.5 KB file is compressed to 1.6 KB).
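The figures quoted above can be checked with simple arithmetic, and the general effect can be reproduced on any repetitive XML text. The sketch below uses Python's zlib module purely as a stand-in; the compression format Harvest actually uses for _file_blob is not specified in this guide, and the helper names and sample text are hypothetical.

```python
import zlib


def compression_ratio(original_size, compressed_size):
    """Original size divided by compressed size (9.0 means 9:1)."""
    return original_size / compressed_size


def space_saving_percent(original_size, compressed_size):
    """Percentage of the original size removed by compression."""
    return 100.0 * (1 - compressed_size / original_size)


# Sizes quoted in the text, in KB: a 45 KB label stored as a 5 KB BLOB.
print(compression_ratio(45, 5))
# prints: 9.0
print(round(space_saving_percent(45, 5), 1))
# prints: 88.9

# XML labels repeat the same tag names many times, which is why they
# compress well. zlib here is only a stand-in for the real BLOB format.
label = b"<Field_Character><name>x</name><data_type>ASCII_Real</data_type></Field_Character>\n" * 200
blob = zlib.compress(label)
print(f"{len(label)} bytes -> {len(blob)} bytes")
```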
After loading data into Solr, you can extract BLOBs by running Registry Manager tool:
registry-manager export-file -lidvid urn:nasa:pds:ladee_ldex:data_calibrated::1.2 -filePath /tmp/data_calibrated.xml
File Reference / Access URL
To store the full path of a product label file, add the following section to the Harvest configuration file.
<fileRef/>
After running Harvest, you should see _file_ref field added to each Solr document:
    <doc>
      ...
      <field name="_file_ref">/C:/data/LADEE/ldex_20161118/bundle_ladee_ldex.xml</field>
      ...
    </doc>
To replace the file path prefix with another value, change the <fileRef/> tag in the Harvest configuration file:
    <fileRef>
      <replace prefix="/C:/data/LADEE/" replacement="https://pds.nasa.gov/data/pds4/" />
    </fileRef>
Now, after running Harvest, you should see a different _file_ref value:
    <doc>
      ...
      <field name="_file_ref">https://pds.nasa.gov/data/pds4/ldex_20161118/bundle_ladee_ldex.xml</field>
      ...
    </doc>
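The effect of the replace rule on the two _file_ref values shown in this section can be modeled as a plain prefix substitution. This Python sketch illustrates the observed behavior only and is not Harvest's implementation; the function name is hypothetical, and leaving non-matching paths unchanged is an assumption.

```python
def apply_file_ref_replace(path, prefix, replacement):
    """Swap a leading prefix for a replacement; leave other paths unchanged."""
    if path.startswith(prefix):
        return replacement + path[len(prefix):]
    return path  # assumption: paths without the prefix are stored as-is


print(apply_file_ref_replace(
    "/C:/data/LADEE/ldex_20161118/bundle_ladee_ldex.xml",
    prefix="/C:/data/LADEE/",
    replacement="https://pds.nasa.gov/data/pds4/",
))
# prints: https://pds.nasa.gov/data/pds4/ldex_20161118/bundle_ladee_ldex.xml
```

Note that the trailing slash appears in both the prefix and the replacement, so the resulting URL keeps exactly one separator at the join.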
Directories and Filters
Crawl Products from Multiple Directories
To process products from multiple directories, specify multiple <path> entries in the Harvest configuration file:
    <harvest>
      <directories>
        <path>/data/LADEE/ldex_20161118/data_calibrated</path>
        <path>/data/LADEE/ldex_20161118/data_derived</path>
        ...
      </directories>
    </harvest>
Filtering Files
This feature has limited use, since all PDS4 product labels are XML files and Harvest cannot process other files. Most Harvest configuration files include the following file filter:
    <harvest>
      <directories>
        ...
        <fileFilter>
          <include>*.xml</include>
        </fileFilter>
      </directories>
    </harvest>
You can use one or more <include> filters or one or more <exclude> filters, but not both <include> and <exclude> at the same time.
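The include-or-exclude rule can be modeled in a few lines of Python. This sketch assumes the patterns are shell-style globs matched against the file name, as *.xml suggests; the exact matching rules Harvest applies are not described here, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def make_file_filter(include=(), exclude=()):
    """Return a predicate that accepts or rejects a file name."""
    if include and exclude:
        # Mirrors the rule that the two filter kinds cannot be combined.
        raise ValueError("use <include> or <exclude> filters, not both")

    def accept(file_name):
        if include:
            return any(fnmatchcase(file_name, p) for p in include)
        if exclude:
            return not any(fnmatchcase(file_name, p) for p in exclude)
        return True  # no filter configured: accept everything

    return accept


only_xml = make_file_filter(include=["*.xml"])
print(only_xml("bundle_ladee_ldex.xml"), only_xml("readme.txt"))
# prints: True False
```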
Excluding Sub-Directories
To skip a sub-directory, add a directory filter to the Harvest configuration file. For example, to exclude the xml_schema sub-folder:
    <harvest>
      <directories>
        ...
        <directoryFilter>
          <exclude>xml_schema</exclude>
        </directoryFilter>
      </directories>
    </harvest>
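To preview which files a crawl would reach with a directory excluded, the traversal can be imitated in Python. This is a rough sketch: it assumes the exclude value is compared with the directory name exactly, which may differ from how Harvest matches it, and the function name is hypothetical.

```python
import os


def crawl_files(root, excluded_dirs=("xml_schema",)):
    """Yield file paths under root, skipping excluded sub-directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if d not in excluded_dirs)
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


# Example usage (the directory is the one used earlier in this guide):
#   for path in crawl_files("/data/LADEE/ldex_20161118"):
#       print(path)
```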
Filtering Products
You can include or exclude products. For example, to process only documents, add the following product filter to the Harvest configuration file:
    <harvest>
      <directories>
        ...
        <productFilter>
          <include>Product_Document</include>
        </productFilter>
      </directories>
    </harvest>
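A product filter of this kind can be pictured as a check on the label's root element, which in the example Solr document matches the product_class value. The Python sketch below assumes that is how products are identified, which this guide does not state explicitly; the function names are hypothetical. It reads only the first start tag, so large labels are not parsed in full.

```python
import io
import xml.etree.ElementTree as ET


def root_element_name(label_source):
    """Return the root element's name without its namespace."""
    # label_source may be a file path or a binary file object.
    for _event, element in ET.iterparse(label_source, events=("start",)):
        return element.tag.rsplit("}", 1)[-1]  # first start event is the root
    return None


def accept_product(label_source, include=("Product_Document",)):
    """True when the label's root element is one of the included products."""
    return root_element_name(label_source) in include


document = io.BytesIO(b'<Product_Document xmlns="http://pds.nasa.gov/pds4/pds/v1"/>')
bundle = io.BytesIO(b'<Product_Bundle xmlns="http://pds.nasa.gov/pds4/pds/v1"/>')
print(accept_product(document), accept_product(bundle))
# prints: True False
```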