Common Operations
- Extract Metadata (Harvest)
- Create, Delete, Customize Registry
- Load Metadata
- View, Search Metadata
- Delete Metadata
- Export Metadata
- Export Files (BLOBs)
Extract PDS4 Product Metadata
Run the Harvest tool to crawl PDS4 products and extract metadata in newline-delimited JSON (NJSON) format. In addition to basic information such as lid, vid, product class, internal references, file name, and file size, you can configure additional fields to export. Optionally, entire PDS product labels can be stored as BLOBs (Binary Large OBjects).
After running Harvest, the output folder (default is /tmp/harvest/out/) contains several files:
- registry-docs.json - metadata extracted from PDS4 labels, stored in Newline-delimited JSON (NJSON) format.
- refs-docs.json - product references extracted from collection inventory files.
- fields.txt - a list of field names extracted from PDS4 labels.
See the Harvest documentation for more information.
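As a quick sanity check of Harvest output, the newline-delimited files can be read one line at a time. The following is a minimal Python sketch, not part of Harvest or Registry Manager. It assumes only that every non-empty line is a standalone JSON object; the helper name read_njson and the sample records are hypothetical.

```python
import json
from typing import Iterable, Iterator


def read_njson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical sample standing in for a file such as registry-docs.json.
sample = [
    '{"lid": "urn:nasa:pds:context:target:asteroid.4_vesta", "vid": "1.1"}',
    "",
    '{"lid": "urn:nasa:pds:context:target:asteroid.4_vesta", "vid": "1.0"}',
]
records = list(read_njson(sample))
print(len(records), "records parsed")
```

For a real file, pass an open file handle instead of the list, since iterating over a text file also yields one line at a time.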
Create, Delete, Customize Registry
You must create the following registry indices in Elasticsearch before loading data generated by the Harvest tool.
- registry - this index stores metadata extracted from PDS4 labels, one ES document per PDS label.
- registry-dd - this index stores the data dictionary: a list of searchable fields and their data types. When the registry is created, the data dictionary is populated with fields (attributes) from the PDS common dictionary and a few discipline dictionaries. You can add more fields as described in the Registry Customization / Data Dictionary section.
- registry-refs - this index stores product references extracted from collection inventory files. There can be one or more ES documents per inventory file.
Create Registry
To create registry indices in a local Elasticsearch instance (running at http://localhost:9200) with 1 shard and 0 replicas, run the following Registry Manager command:
registry-manager create-registry
You can customize the create-registry command by passing parameters such as the Elasticsearch URL, the number of shards and replicas, and authentication parameters. To see the list of available parameters and basic usage, run:
registry-manager create-registry -help
To check that the registry indices were created, open the following URL in a browser: http://localhost:9200/_cat/indices?v or use curl:
curl "http://localhost:9200/_cat/indices?v"
The response should look similar to this. Make sure that the health of each index is "green".
health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   registry      PY6ObzELRlSx9gHOWbR8dw   1   0          0            0       208b           208b
green  open   registry-dd   CuJ-nqg1SbKI9hejHrISWA   1   0       2505            0      625kb          625kb
green  open   registry-refs 1cJLc-9cQj2D_MAYo7gOpw   1   0          0            0       208b           208b
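The health check can also be scripted. The sketch below is one possible approach in Python, assuming the text returned by _cat/indices?v has a header row and that every data row carries all columns (rows for closed indices may not). The function name non_green_indices and the sample values are illustrative only.

```python
def non_green_indices(cat_output: str) -> list:
    """Return the names of indices whose health column is not 'green'."""
    rows = [r.split() for r in cat_output.strip().splitlines() if r.strip()]
    header, data = rows[0], rows[1:]
    health_col = header.index("health")  # locate columns by header name
    index_col = header.index("index")
    return [row[index_col] for row in data if row[health_col] != "green"]


# Made-up sample in the shape of a _cat/indices?v response.
sample = """\
health status index       uuid  pri rep docs.count docs.deleted store.size pri.store.size
green  open   registry    uuid1   1   0          0            0       208b           208b
yellow open   registry-dd uuid2   1   1       2505            0      625kb          625kb
"""
print(non_green_indices(sample))
```

An empty list means every listed index reported green health.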
Delete Registry
To delete registry indices from local Elasticsearch run the following command:
registry-manager delete-registry
You can customize the delete-registry command by passing parameters such as the Elasticsearch URL and authentication parameters.
Registry Customization
The main customization you are likely to need is updating the registry data dictionary of searchable fields and their data types. More information is available in the Registry Customization / Data Dictionary section.
Load Metadata
JSON files generated by Harvest can be loaded into Elasticsearch by Registry Manager as shown below.
registry-manager load-data -dir /home/pds/harvest/out/
Automatic Schema Update and Common Errors
By default, Registry Manager tries to update the registry schema (add more fields) from the fields.txt file (generated by Harvest) located in the same directory as registry-docs.json.
You might see the following error if you copy registry-docs.json from the Harvest output folder to another location and forget to copy fields.txt.
[ERROR] /my-folder/fields.txt (The system cannot find the file specified)
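One way to avoid this error is to check the folder before running load-data. The following Python sketch is only an illustration: the function name missing_harvest_files is hypothetical, and the required file names are taken from the Harvest output list given earlier on this page.

```python
from pathlib import Path
import tempfile


def missing_harvest_files(directory, required=("registry-docs.json", "fields.txt")):
    """Return the required file names that are absent from the directory."""
    folder = Path(directory)
    return [name for name in required if not (folder / name).is_file()]


# Demonstration with a temporary folder that lacks fields.txt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "registry-docs.json").write_text("{}\n")
    missing = missing_harvest_files(tmp)
print(missing)
```

If the returned list is not empty, copy the missing files from the Harvest output folder before loading.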
When the registry is created, the registry data dictionary is populated with field definitions (field name to data type mappings) from the PDS common dictionary and a few discipline dictionaries. If you try loading labels with fields that are not defined in the registry data dictionary, you will get the following error:
[ERROR] Could not find datatype for field '...'
You have to update the registry data dictionary as described in the Registry Customization section before you can load the data.
If you have a non-standard registry configuration and know what you are doing, you can disable the schema update by passing the updateSchema parameter to the load-data command.
registry-manager load-data -dir /home/pds/harvest/out/ -updateSchema n
Accidental Update of Existing Documents
The registry index uses the lidvid as its primary key. If you load data with the same lidvids multiple times, old documents are replaced by the new ones. We plan to implement an additional check to prevent accidental updates of existing documents in the next release.
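Until such a check exists, a comparison can be done outside the registry. The Python sketch below assumes you already have two lists of lidvids, one for documents in the registry (for example, taken from an export) and one for the data about to be loaded. The helper names and the version 1.2 value are hypothetical.

```python
def split_lidvid(lidvid: str):
    """Split 'lid::vid' into its (lid, vid) parts."""
    lid, sep, vid = lidvid.rpartition("::")
    if not sep or not lid or not vid:
        raise ValueError(f"not a lidvid: {lidvid!r}")
    return lid, vid


def already_loaded(new_lidvids, existing_lidvids):
    """Return the new lidvids that would replace existing documents."""
    existing = set(existing_lidvids)
    return [lv for lv in new_lidvids if lv in existing]


existing = ["urn:nasa:pds:context:target:asteroid.4_vesta::1.1"]
incoming = [
    "urn:nasa:pds:context:target:asteroid.4_vesta::1.1",
    "urn:nasa:pds:context:target:asteroid.4_vesta::1.2",  # hypothetical version
]
clashes = already_loaded(incoming, existing)
print(clashes)
print(split_lidvid(clashes[0]))
```

Any lidvid reported here would overwrite a document that is already in the registry.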
View / Search Metadata
Elasticsearch Search API
You can either use simple Lucene queries, passed in the URL:
curl "http://localhost:9200/registry/_search?q=product_class:Product_Collection&pretty"
Or more advanced Elasticsearch queries defined in JSON and passed in request body:
curl -X GET "localhost:9200/registry/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "product_class": "Product_Collection" }
  }
}
'
You can find more information about the Elasticsearch Search API on the Elasticsearch website.
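The two query styles above can also be assembled programmatically. This is a small Python sketch using only the standard library; the function names are illustrative, and nothing is sent to Elasticsearch.

```python
import json
from urllib.parse import urlencode


def lucene_search_url(base_url: str, index: str, field: str, value: str) -> str:
    """Build a _search URL carrying a Lucene 'field:value' query string."""
    return f"{base_url}/{index}/_search?" + urlencode({"q": f"{field}:{value}"})


def term_query_body(field: str, value: str) -> str:
    """Build the JSON request body for an exact-match term query."""
    return json.dumps({"query": {"term": {field: value}}})


url = lucene_search_url(
    "http://localhost:9200", "registry", "product_class", "Product_Collection"
)
body = term_query_body("product_class", "Product_Collection")
print(url)
print(body)
```

Note that urlencode percent-encodes the colon in the Lucene query; the server decodes it back when it parses the URL.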
Delete Metadata
You can use the Registry Manager tool to delete metadata by lidvid, lid, or package id (Harvest run id), or to delete all data. A few examples are shown below.
registry-manager delete-data -lidvid urn:nasa:pds:context:target:asteroid.4_vesta::1.1
registry-manager delete-data -lid urn:nasa:pds:context:target:asteroid.4_vesta
registry-manager delete-data -packageId 8d8ae96d-044e-473d-a278-62635b1c5977
registry-manager delete-data -all
You can also use the Elasticsearch delete-by-query API.
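As a rough illustration of the delete-by-query route, the Python sketch below prepares (but does not send) a POST request to the _delete_by_query endpoint of the registry index. The field name lid used in the term query is an assumption about how documents are indexed, and the function name is hypothetical; verify the query with a search first, since deletions cannot be undone.

```python
import json
from urllib.request import Request


def delete_by_query_request(base_url: str, index: str, field: str, value: str) -> Request:
    """Prepare a POST to <index>/_delete_by_query with a term query body."""
    body = json.dumps({"query": {"term": {field: value}}}).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_delete_by_query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 'lid' is an assumed field name; the request is only built, never sent.
req = delete_by_query_request(
    "http://localhost:9200",
    "registry",
    "lid",
    "urn:nasa:pds:context:target:asteroid.4_vesta",
)
print(req.get_method(), req.full_url)
```

Sending the request would require passing it to urllib.request.urlopen against a running Elasticsearch instance.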
Export Metadata
You can use the Registry Manager tool to export metadata by lidvid or package id (Harvest run id), or to export all data. A few examples are shown below.
registry-manager export-data -file /tmp/mydata.json -lidvid urn:nasa:pds:context:target:asteroid.4_vesta::1.1
registry-manager export-data -file /tmp/mydata.json -packageId 8d8ae96d-044e-473d-a278-62635b1c5977
registry-manager export-data -file /tmp/mydata.json -all
Data is saved in a newline-delimited JSON file, which can be loaded into Elasticsearch with the load-data command. The same file format is used by Harvest and the Elasticsearch bulk API.
Export Files (BLOBs)
If PDS product label BLOBs (Binary Large OBjects) were generated by Harvest, they can be exported with the Registry Manager tool as shown below.
registry-manager export-file -lidvid urn:nasa:pds:context:target:asteroid.4_vesta::1.1 -file /tmp/4_vesta.xml