Common Operations

Create, Delete, Customize Registry

You must create the following registry indices in Elasticsearch before running the Harvest tool or loading data generated by Harvest.

  • registry - this index stores metadata extracted from PDS4 labels, one ES document per PDS label.
  • registry-dd - this index stores the data dictionary: a list of searchable fields and their data types. When the registry is created, the data dictionary is populated with fields (attributes) from the PDS common dictionary and a few discipline dictionaries. You can add more fields as described in the Registry Customization / Data Dictionary section.
  • registry-refs - this index stores product references extracted from collection inventory files. There can be one or more ES documents per inventory file.

Create Registry

To create the registry indices in a local Elasticsearch instance (running at http://localhost:9200) with 1 shard and 0 replicas, run the following Registry Manager command:

registry-manager create-registry

You can customize the create-registry command by passing parameters such as the Elasticsearch URL, the number of shards and replicas, and authentication parameters. To see the list of available parameters and basic usage, run:

registry-manager create-registry -help

To check that the registry indices were created, open the following URL in a browser: http://localhost:9200/_cat/indices?v or use curl:

curl "http://localhost:9200/_cat/indices?v"

The response should look similar to this. Make sure that the health of every index is "green".

health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   registry      PY6ObzELRlSx9gHOWbR8dw   1   0          0            0       208b           208b
green  open   registry-dd   CuJ-nqg1SbKI9hejHrISWA   1   0       2505            0      625kb          625kb
green  open   registry-refs 1cJLc-9cQj2D_MAYo7gOpw   1   0          0            0       208b           208b
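The health check can also be scripted. The following is a minimal sketch, not part of Registry Manager: the function name all_indices_green is hypothetical, and it assumes that the first column of the _cat/indices?v output is the health value, as in the response above.

```shell
# Return 0 only if every index row in `_cat/indices?v` output has health "green".
# Reads the table from stdin; the first line is the column header and is skipped.
all_indices_green() {
  awk 'NR > 1 && NF > 0 { seen = 1; if ($1 != "green") bad = 1 }
       END { exit (seen && !bad) ? 0 : 1 }'
}

# Sample rows copied from the response above, so no live Elasticsearch is needed.
sample='health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   registry      PY6ObzELRlSx9gHOWbR8dw   1   0          0            0       208b           208b
green  open   registry-dd   CuJ-nqg1SbKI9hejHrISWA   1   0       2505            0      625kb          625kb
green  open   registry-refs 1cJLc-9cQj2D_MAYo7gOpw   1   0          0            0       208b           208b'

if printf '%s\n' "$sample" | all_indices_green; then
  echo "all indices green"
else
  echo "at least one index is not green"
fi
```

Against a running server, the same function could be fed from curl: curl -s "http://localhost:9200/_cat/indices?v" | all_indices_green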

Delete Registry

To delete the registry indices from a local Elasticsearch instance, run the following command:

registry-manager delete-registry

You can customize the delete-registry command by passing parameters such as the Elasticsearch URL and authentication parameters.

Registry Customization

The main customization you are likely to need is updating the registry data dictionary of searchable fields and their data types. More information is available in the Registry Customization / Data Dictionary section.

Extract PDS4 Product Metadata

Run the Harvest tool to crawl PDS4 products and extract metadata from PDS4 labels. The extracted metadata is stored in JSON-formatted data files.

After each Harvest run, the following files are created in the output folder (the default output folder is /tmp/harvest/out/):

  • registry-docs.json - metadata extracted from PDS4 labels, stored in newline-delimited JSON (NDJSON) format.
  • refs-docs.json - product references extracted from collection inventory files.
  • missing_fields.txt - a list of field names extracted from PDS4 labels to be added to the Registry.
  • missing_xsds.txt - a list of XSDs corresponding to the fields listed in the missing_fields.txt file.
  • supplemental.txt - a list of supplemental products (file paths).
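Before loading, it can be useful to confirm that a Harvest output folder contains the files listed above. The following is a minimal sketch; the function name missing_harvest_files is hypothetical and the check is not part of Harvest or Registry Manager.

```shell
# Print the name of each documented Harvest output file that is missing
# from the given folder (default: /tmp/harvest/out). Prints nothing if all exist.
missing_harvest_files() {
  for f in registry-docs.json refs-docs.json missing_fields.txt missing_xsds.txt supplemental.txt; do
    [ -f "${1:-/tmp/harvest/out}/$f" ] || echo "$f"
  done
}

# Example usage (the path is the one used in the load-data example):
#   missing_harvest_files /home/pds/harvest/out
```

An empty result means all five files are present. Any names printed identify files that were not written to that folder.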

See the Harvest documentation for more information.

Load Metadata

JSON files generated by Harvest can be loaded into Elasticsearch by Registry Manager as shown below.

registry-manager load-data -dir /home/pds/harvest/out/

Automatic Schema Update and Common Errors

By default, Registry Manager tries to update the registry fields and data dictionaries listed in the missing_fields.txt and missing_xsds.txt files. Those files are created by the Harvest tool in the output directory.

You might see the following errors if you try to load data generated by older versions of Harvest:

[ERROR] /my-folder/missing_fields.txt (The system cannot find the file specified)
[ERROR] /my-folder/missing_xsds.txt (The system cannot find the file specified)

When the registry is created, the registry data dictionary is populated with field definitions (field name to data type mappings) from the PDS common dictionary and a few discipline dictionaries. If you try to load labels with fields that are not defined in the registry data dictionary, you will get the following error:

[ERROR] Could not find datatype for field '...'

You have to update the registry data dictionary as described in the Registry Customization section before you can load the data.

If you have a non-standard registry configuration and know what you are doing, you can disable the schema update by passing the updateSchema parameter to the load-data command:

registry-manager load-data -dir /home/pds/harvest/out/ -updateSchema n

Accidental Update of Existing Documents

The registry index uses lidvid as the primary key. If you load data with the same lidvids multiple times, the old documents are replaced by the new documents. We plan to implement an additional check to prevent accidental updates of existing documents in the next release.

View / Search Metadata

Elasticsearch Search API

You can either use simple Lucene queries, passed in the URL:

curl "http://localhost:9200/registry/_search?q=product_class:Product_Collection&pretty"

Or more advanced Elasticsearch queries defined in JSON and passed in request body:

curl -X GET "localhost:9200/registry/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": {
      "product_class": "Product_Collection"
    }
  }
}
'

You can find more about the Elasticsearch Search API on the Elasticsearch website.

Delete Metadata

You can use the Registry Manager tool to delete metadata by lidvid, lid, or package id (Harvest run id), or to delete all data. A few examples are shown below.

registry-manager delete-data -lidvid urn:nasa:pds:context:target:asteroid.4_vesta::1.1
registry-manager delete-data -lid urn:nasa:pds:context:target:asteroid.4_vesta
registry-manager delete-data -packageId 8d8ae96d-044e-473d-a278-62635b1c5977
registry-manager delete-data -all
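The first two commands differ only in whether the version is included: a lidvid is a lid followed by "::" and a version id. The following is a minimal sketch of splitting one into its parts; the helper names lid_of and vid_of are hypothetical and are not Registry Manager commands.

```shell
# Split a lidvid of the form "<lid>::<vid>" into its two parts.
# If the input contains no "::", both helpers print the input unchanged.
lid_of() { printf '%s\n' "${1%%::*}"; }   # drop everything from the first "::"
vid_of() { printf '%s\n' "${1##*::}"; }   # keep everything after the last "::"

lidvid='urn:nasa:pds:context:target:asteroid.4_vesta::1.1'
lid_of "$lidvid"   # prints urn:nasa:pds:context:target:asteroid.4_vesta
vid_of "$lidvid"   # prints 1.1
```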

You can also use the Elasticsearch delete by query API.
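As a rough illustration of the delete by query API, the sketch below builds a term query and posts it to the _delete_by_query endpoint of the registry index. The function names are hypothetical, and it assumes a local Elasticsearch at http://localhost:9200 and that the documents carry a field named lidvid that matches exact values. Check both against your own registry before running anything destructive.

```shell
# Build the request body: a term query selecting the documents to delete.
# Values are inserted as-is, so they must not contain double quotes or backslashes.
delete_query() {
  printf '{"query":{"term":{"%s":"%s"}}}\n' "$1" "$2"
}

# Hypothetical wrapper (not part of Registry Manager): delete matching documents
# from the "registry" index only. This is destructive and cannot be undone.
delete_by_term() {
  curl -s -X POST "http://localhost:9200/registry/_delete_by_query?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(delete_query "$1" "$2")"
}

# Inspect the body before sending anything. The field name "lidvid" is an assumption.
delete_query lidvid 'urn:nasa:pds:context:target:asteroid.4_vesta::1.1'
```

Printing the body first makes it possible to try the same query with the _search endpoint and review the matching documents before calling delete_by_term.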

Export Metadata

You can use the Registry Manager tool to export metadata by lidvid or package id (Harvest run id), or to export all data. A few examples are shown below.

registry-manager export-data -file /tmp/mydata.json -lidvid urn:nasa:pds:context:target:asteroid.4_vesta::1.1
registry-manager export-data -file /tmp/mydata.json -packageId 8d8ae96d-044e-473d-a278-62635b1c5977
registry-manager export-data -file /tmp/mydata.json -all

Data is saved in a newline-delimited JSON file, which can be loaded into Elasticsearch by the load-data command. The same file format is used by Harvest and the Elasticsearch bulk API.

Export Files (BLOBs)

If PDS product label BLOBs (Binary Large OBjects) were generated by Harvest, they can be exported by the Registry Manager tool as shown below.

registry-manager export-file -lidvid urn:nasa:pds:context:target:asteroid.4_vesta::1.1 -file /tmp/4_vesta.xml