Common Operations
- Extract Metadata (Harvest)
- Create, Delete, Customize Registry
- Load Metadata
- View, Search Metadata
- Delete Metadata
- Export Metadata
- Export Files (BLOBs)
Extract PDS4 Product Metadata
Run the Harvest tool to crawl PDS4 products and extract metadata in newline-delimited JSON (NJSON) format. In addition to basic information such as lid, vid, product class, internal references, file name, and file size, you can configure additional fields to export. Optionally, whole PDS product labels can be stored as BLOBs (Binary Large OBjects).
After running Harvest, the output folder (default is /tmp/harvest/out/) will contain two files:
- es-docs.json - metadata extracted from PDS4 labels, stored in Newline-delimited JSON (NJSON) format.
- fields.txt - a list of field names extracted from PDS4 labels.
See Harvest Documentation for more information.
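As a quick sanity check on the Harvest output, both files can be read with a few lines of Python. This is a minimal sketch and not part of Harvest or Registry Manager. The function names are hypothetical. It assumes only that es-docs.json holds one JSON object per non-empty line, and that fields.txt holds one field name per line (the latter is an assumption about the file layout).

```python
import json
from pathlib import Path


def read_ndjson(path):
    """Parse a newline-delimited JSON file into a list of objects."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def read_field_names(path):
    """Read field names, assuming one name per line (an assumption)."""
    text = Path(path).read_text(encoding="utf-8")
    return [name.strip() for name in text.splitlines() if name.strip()]
```

For example, len(read_ndjson("/tmp/harvest/out/es-docs.json")) gives the number of JSON lines Harvest wrote.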
Create, Delete, Customize Registry
You must create registry indices in Elasticsearch before loading data generated by the Harvest tool. See Registry Installation and Registry Manager for more information.
You may want to add more fields to the data dictionary or change the default configuration. See Registry Customization for more information.
Load Metadata
The newline-delimited JSON file generated by Harvest can be loaded into Elasticsearch by Registry Manager as shown below.
registry-manager load-data -file /home/pds/harvest/out/es-docs.json
Automatic Schema Update and Common Errors
By default, Registry Manager tries to update the registry schema (add more fields) from the fields.txt file (generated by Harvest) located in the same directory as es-docs.json.
You might see the following error if you copy es-docs.json from the Harvest output folder to another location and forget to copy fields.txt.
[ERROR] /my-folder/fields.txt (The system cannot find the file specified)
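One way to avoid this error is to check for fields.txt before calling load-data. The helper below is a hypothetical sketch: the function names are assumptions, not a Registry Manager feature. It relies only on the behavior described above, namely that fields.txt is looked up in the same directory as es-docs.json.

```python
from pathlib import Path


def fields_file_for(docs_path):
    """Return the fields.txt path expected next to the given data file."""
    return Path(docs_path).parent / "fields.txt"


def ready_for_load(docs_path):
    """True if both the data file and its sibling fields.txt exist."""
    docs = Path(docs_path)
    return docs.is_file() and fields_file_for(docs).is_file()
```

If ready_for_load returns False after you have moved es-docs.json, copy fields.txt alongside it before loading.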
When a registry is created, the registry data dictionary is populated with field definitions (field name to data type mappings) from the PDS common dictionary and a few discipline dictionaries. If you try to load labels with fields that are not defined in the registry data dictionary, you will get the following error:
[ERROR] Could not find datatype for field '...'
You have to update the registry data dictionary as described in the Registry Customization section before you can load the data.
If you have a non-standard registry configuration and know what you are doing, you can disable the schema update by passing the updateSchema parameter to the load-data command.
registry-manager load-data -file /home/pds/harvest/out/es-docs.json -updateSchema n
Accidental Update of Existing Documents
The registry index uses lidvid as a primary key. If you load data with the same lidvids multiple times, old documents will be replaced by new documents. We plan to implement an additional check to prevent accidental updates of existing documents in the next release.
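Until such a check exists, you can look for collisions yourself before loading. The sketch below builds an Elasticsearch ids query body for a batch of lidvids. It assumes that the document _id in the registry index is the lidvid, which is one reading of the primary-key statement above and is not confirmed here. The function name is hypothetical. Sending the body to the index's _search endpoint would return the ids of any documents that a load would replace.

```python
import json


def existing_lidvids_query(lidvids):
    """Build a search body that matches documents whose _id is in lidvids.

    Assumes the registry stores each lidvid as the document _id.
    """
    unique = sorted(set(lidvids))
    return {
        "query": {"ids": {"values": unique}},
        "_source": False,  # only the matching ids are needed
        # Elasticsearch's default result window is 10,000 hits,
        # so keep each batch below that size.
        "size": len(unique),
    }


if __name__ == "__main__":
    batch = ["urn:nasa:pds:context:target:asteroid.4_vesta::1.1"]
    print(json.dumps(existing_lidvids_query(batch), indent=2))
```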
View / Search Metadata
Elasticsearch Search API
You can either use simple Lucene queries, passed in the URL:
curl "http://localhost:9200/registry/_search?q=product_class:Product_Collection&pretty"
Or more advanced Elasticsearch queries defined in JSON and passed in request body:
curl -X GET "localhost:9200/registry/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "term": { "product_class": "Product_Collection" } } } '
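The same term query can be issued from a script. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the host, port, and index name are taken from the curl examples above. It sends the request with POST, which the Elasticsearch _search endpoint also accepts for queries with a body.

```python
import json
import urllib.request


def build_search_request(field, value,
                         base_url="http://localhost:9200",
                         index="registry"):
    """Build (but do not send) a term query against the _search endpoint."""
    body = {"query": {"term": {field: value}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(field, value):
    """Send the query and return the decoded JSON response."""
    request = build_search_request(field, value)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling search("product_class", "Product_Collection") against a running registry should return the same hits as the curl command above.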
You can find more about the Elasticsearch Search API on the Elasticsearch web site.
Delete Metadata
You can use the Registry Manager tool to delete metadata by lidvid, lid, package id (Harvest run id), or to delete all data. A few examples are shown below.
registry-manager delete-data -lidvid urn:nasa:pds:context:target:asteroid.4_vesta::1.1
registry-manager delete-data -lid urn:nasa:pds:context:target:asteroid.4_vesta
registry-manager delete-data -packageId 8d8ae96d-044e-473d-a278-62635b1c5977
registry-manager delete-data -all
You can also use the Elasticsearch delete by query API.
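A delete-by-query request takes the same kind of query body as a search, but is sent with POST to the index's _delete_by_query endpoint. The sketch below only builds the request and does not send it. The function name is hypothetical, and the field you filter on (for example lid) is an assumption, so check the field names in your own index before using it.

```python
import json
import urllib.request


def build_delete_by_query_request(field, value,
                                  base_url="http://localhost:9200",
                                  index="registry"):
    """Build (but do not send) a POST to the _delete_by_query endpoint."""
    body = {"query": {"term": {field: value}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_delete_by_query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would delete every matching document, so it is worth running the same query against _search first to see what it matches.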
Export Metadata
You can use the Registry Manager tool to export metadata by lidvid, package id (Harvest run id), or to export all data. A few examples are shown below.
registry-manager export-data -file /tmp/mydata.json -lidvid urn:nasa:pds:context:target:asteroid.4_vesta::1.1
registry-manager export-data -file /tmp/mydata.json -packageId 8d8ae96d-044e-473d-a278-62635b1c5977
registry-manager export-data -file /tmp/mydata.json -all
Data is saved in a newline-delimited JSON file, which can be loaded into Elasticsearch by the load-data command. The same file format is used by Harvest and the Elasticsearch bulk API.
Export Files (BLOBs)
If PDS product label BLOBs (Binary Large OBjects) were generated by Harvest, they can be exported by the Registry Manager tool as shown below.
registry-manager export-file -lidvid urn:nasa:pds:context:target:asteroid.4_vesta::1.1 -file /tmp/4_vesta.xml