Registry Customization

Overview

The Registry uses the following Elasticsearch indices:

  • registry - this index stores metadata extracted from PDS4 labels, one ES document per PDS label.
  • registry-dd - this index stores the data dictionary: a list of searchable fields and their data types. When a registry is created, the data dictionary is populated with fields (attributes) from the PDS common dictionary and a few discipline dictionaries. You can add more fields as described in the Data Dictionary section.
  • registry-refs - this index stores product references extracted from collection inventory files. There can be one or more ES documents per inventory file.

The default Elasticsearch schemas for these indices (registry.json, data-dic.json, and refs.json) are located in the REGISTRY_HOME/elastic/ directory, where REGISTRY_HOME is the directory in which you installed Registry Manager, for example /home/pds/registry.

The default registry schema defines a few common fields such as lid, vid, title, internal references, and basic file information. There is also a binary field that stores the whole PDS label as a BLOB. The lidvid is the primary key.

Dynamic field mapping is disabled to prevent creation of fields not in the data dictionary.

...
"mappings": {
    "dynamic": false, 
...
}

After you run Harvest, the output folder (default is /tmp/harvest/out/) contains several files:

  • registry-docs.json - metadata extracted from PDS4 labels, stored in Newline-delimited JSON (NJSON) format.
  • refs-docs.json - product references extracted from collection inventory files.
  • missing_fields.txt - a list of field names extracted from PDS4 labels to be added to the Registry.
  • missing_xsds.txt - a list of XSDs corresponding to fields listed in missing_fields.txt file.
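
Before loading data, it can help to see which namespaces the missing fields belong to, since that hints at which data dictionaries still need to be loaded. The sketch below is only an illustration: it assumes missing_fields.txt holds one field name per line and that names follow the namespace:Class_Name/namespace:attribute_name convention described in the Data Dictionary section. The helper name and the sample field names are hypothetical.

```python
from collections import Counter

def namespaces_in_missing_fields(lines):
    """Count missing fields per attribute namespace.

    Assumes one field name per line, shaped like
    'namespace:Class_Name/namespace:attribute_name'.
    """
    counts = Counter()
    for raw in lines:
        name = raw.strip()
        if not name:
            continue  # skip blank lines
        # The last path segment is expected to be 'namespace:attribute_name'.
        attr_part = name.rsplit("/", 1)[-1]
        namespace, sep, _ = attr_part.partition(":")
        counts[namespace if sep else "(no namespace)"] += 1
    return counts

# Hypothetical content of missing_fields.txt, for illustration only.
sample_lines = [
    "my_namespace:My_Class/my_namespace:parameter_a\n",
    "my_namespace:My_Class/my_namespace:parameter_b\n",
    "other_ns:Other_Class/other_ns:value\n",
    "\n",
]

counts = namespaces_in_missing_fields(sample_lines)
for namespace, total in sorted(counts.items()):
    print(f"{namespace}: {total} missing field(s)")
```

With a real file, the same function can be called on an open file handle, for example `namespaces_in_missing_fields(open(path))`, where `path` points at the Harvest output.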

When you load data

registry-manager load-data -dir /home/pds/harvest/out/

Registry Manager will try to load the LDDs (JSON-formatted data dictionaries) listed in the missing_xsds.txt file and to add the fields listed in the missing_fields.txt file. You might get the following error if a field is not in the data dictionary.

[ERROR] Could not find datatype for field '...'

Try to fix these errors by updating the data dictionary as described in the Data Dictionary section.
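
If the load produces many such errors, collecting the affected field names in one place makes it easier to decide which dictionary entries to add. The following sketch assumes the command output was saved as text and that each error line has exactly the shape shown above. The function name and the sample log (including its [INFO] line and field names) are hypothetical.

```python
import re

# Matches the error line shown above; the field name sits between single quotes.
ERROR_RE = re.compile(r"\[ERROR\] Could not find datatype for field '([^']+)'")

def fields_without_datatype(log_text):
    """Return unique field names from datatype errors, in order of first appearance."""
    seen = []
    for match in ERROR_RE.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen

# Hypothetical log excerpt; the field names are made up for illustration.
sample_log = """\
[INFO] Loading data...
[ERROR] Could not find datatype for field 'my_namespace:My_Class/my_namespace:parameter_a'
[ERROR] Could not find datatype for field 'my_namespace:My_Class/my_namespace:parameter_b'
[ERROR] Could not find datatype for field 'my_namespace:My_Class/my_namespace:parameter_a'
"""

missing = fields_without_datatype(sample_log)
for name in missing:
    print(name)
```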

You can also pass the -force parameter to create missing fields with the "keyword" (string) data type.

registry-manager load-data -dir /home/pds/harvest/out -force

It is possible to disable schema updates completely (not recommended).

registry-manager load-data -dir /home/pds/harvest/out -updateSchema n

If you disable schema updates, fields that are not in the Elasticsearch schema are not indexed, but they are still available in the "_source" field.

The next section describes how to manage the registry data dictionary.

Data Dictionary

When a registry is created, the data dictionary is populated with fields (attributes) from the PDS common dictionary and a few discipline dictionaries. The latest versions of the PDS4 data dictionaries are available in different formats on the PDS website.

The following naming convention is used for Elasticsearch fields:

namespace:Class_Name/namespace:attribute_name

For example,

disp:Display_Direction/disp:vertical_display_direction
geom:Articulation_Device_Parameters/geom:device_id
pds:XML_Schema/pds:name
proc:Software_Program/proc:name
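
The convention above can be expressed as a pair of small helpers that build a field name from its four components and split it back apart. This is a minimal sketch, not part of Registry Manager: the function and class names are hypothetical, and the component names (class_ns, class_name, attr_ns, attr_name) are borrowed from the data dictionary schema fields.

```python
from typing import NamedTuple

class FieldName(NamedTuple):
    class_ns: str
    class_name: str
    attr_ns: str
    attr_name: str

def build_field_name(class_ns, class_name, attr_ns, attr_name):
    """Join the four components as namespace:Class_Name/namespace:attribute_name."""
    return f"{class_ns}:{class_name}/{attr_ns}:{attr_name}"

def parse_field_name(es_field_name):
    """Split a field name into its components; raise ValueError if it does not fit."""
    class_part, slash, attr_part = es_field_name.partition("/")
    class_ns, colon1, class_name = class_part.partition(":")
    attr_ns, colon2, attr_name = attr_part.partition(":")
    parts = (class_ns, class_name, attr_ns, attr_name)
    if not (slash and colon1 and colon2 and all(parts)):
        raise ValueError(f"not a namespaced field name: {es_field_name!r}")
    return FieldName(*parts)

parsed = parse_field_name("pds:XML_Schema/pds:name")
print(parsed)
print(build_field_name(*parsed))
```

Fields that do not follow the convention, such as the common lid and vid fields, are rejected by parse_field_name with a ValueError.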

Listing Installed Data Dictionaries

To list all data dictionaries in the Registry, run the following command:

registry-manager list-dd
Namespace            File                                        Version   Date
-----------------------------------------------------------------------------------------------
cart                 PDS4_CART_1F00_1950.JSON                   1.15.0.0   2020-12-21T21:48:19Z
disp                 PDS4_DISP_1F00_1500.JSON                   1.15.0.0   2020-12-15T22:09:58Z
geom                 PDS4_GEOM_1F00_1910.JSON                   1.15.0.0   2021-01-12T00:37:40Z
img                  PDS4_IMG_1F00_1810.JSON                    1.15.0.0   2020-10-14T02:55:04Z
img_surface          PDS4_IMG_SURFACE_1F00_1240.JSON            1.15.0.0   2021-01-12T00:56:39Z
msn                  PDS4_MSN_1F00_1300.JSON                    1.15.0.0   2020-10-14T02:55:21Z
msn_surface          PDS4_MSN_SURFACE_1F00_1200.JSON            1.15.0.0   2020-10-14T02:55:29Z
particle             PDS4_PARTICLE_1G00_2010.JSON               1.16.0.0   2021-08-05T21:40:47Z
pds                  PDS4_PDS_1F00.JSON                         1.15.0.0   2020-12-23T15:16:28Z
proc                 PDS4_PROC_1F00_1210.JSON                   1.15.0.0   2020-12-09T03:22:22Z
rings                PDS4_RINGS_1F00_1A00.JSON                  1.15.0.0   2020-12-02T19:08:01Z
sp                   PDS4_SP_1F00_1300.JSON                     1.15.0.0   2020-11-03T19:47:46Z

Loading Data Dictionaries

You can load additional data dictionaries in the following formats:

  • Standard PDS4 data dictionary JSON file, for example, orex_ldd_OREX_1300.JSON
  • Custom CSV file
  • Data dump created by 'export-dd' command (NJSON)

For example, to load a standard PDS4 data dictionary, run the following command:

registry-manager load-dd -dd /home/pds/schema/orex_ldd_OREX_1300.JSON

The Elasticsearch data dictionary schema has the following fields:

  • class_ns - class namespace, e.g., "pds"
  • class_name - class name, e.g., "Element_Array"
  • attr_ns - attribute namespace, e.g., "pds"
  • attr_name - attribute name, e.g., "scaling_factor"
  • data_type - PDS data type, e.g., "ASCII_Real"
  • description - field description
  • es_field_name - Elasticsearch field name, e.g., "pds:Element_Array/pds:scaling_factor"
  • es_data_type - Elasticsearch data type, e.g., "double"

If you load a standard PDS JSON data dictionary, all these fields are populated automatically. If you use a custom CSV file, "es_field_name" and "es_data_type" are required and the other fields are optional. The CSV file must have a header row with data dictionary field names, for example,

es_field_name,es_data_type
my_namespace:My_Class/my_namespace:parameter_a,keyword
my_namespace:My_Class/my_namespace:parameter_b,integer
my_namespace:My_Class/my_namespace:parameter_c,double

To load a custom CSV file, run the following command:

registry-manager load-dd -csv /home/pds/schema/my-fields.csv
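
A custom CSV file of this shape can also be generated rather than typed by hand. The sketch below uses Python's standard csv module and writes only the two required columns. The function name is hypothetical and the rows reuse the example values above; it writes to an in-memory buffer so it can be run as is.

```python
import csv
import io

def write_dd_csv(rows, out):
    """Write (es_field_name, es_data_type) pairs as a CSV with the required header."""
    writer = csv.DictWriter(
        out,
        fieldnames=["es_field_name", "es_data_type"],
        lineterminator="\n",
    )
    writer.writeheader()
    for es_field_name, es_data_type in rows:
        writer.writerow(
            {"es_field_name": es_field_name, "es_data_type": es_data_type}
        )

# Example rows taken from the CSV sample above.
rows = [
    ("my_namespace:My_Class/my_namespace:parameter_a", "keyword"),
    ("my_namespace:My_Class/my_namespace:parameter_b", "integer"),
    ("my_namespace:My_Class/my_namespace:parameter_c", "double"),
]

buffer = io.StringIO()
write_dd_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

To produce a file on disk, pass a handle opened with `open(path, "w", newline="")` instead of the buffer.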

PDS to Elasticsearch Data Type Mapping

If you load a standard PDS JSON data dictionary, PDS data types such as pds.ASCII_Real are automatically mapped to Elasticsearch data types such as double. The mappings are stored in the REGISTRY_HOME/elastic/data-dic-types.cfg file. You can modify this file to add more mappings or change the default values.

The file has the following format:

<PDS data type> = <Elasticsearch data type>

For example,

pds.ASCII_Integer = integer
pds.ASCII_Boolean = boolean
pds.ASCII_Real = double
pds.ASCII_Short_String_Collapsed = keyword
...
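
To check a modified mapping file before using it, the format above can be parsed in a few lines. This is a sketch under stated assumptions, not the parser Registry Manager uses: it assumes one mapping per line, and skipping blank lines and lines starting with "#" is an assumption about the file rather than documented behavior. The function name is hypothetical.

```python
def parse_type_mappings(text):
    """Parse '<PDS data type> = <Elasticsearch data type>' lines into a dict."""
    mappings = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: blank lines and '#' comments may appear and carry no mapping.
        if not line or line.startswith("#"):
            continue
        pds_type, sep, es_type = line.partition("=")
        pds_type, es_type = pds_type.strip(), es_type.strip()
        if not (sep and pds_type and es_type):
            raise ValueError(
                f"line {line_no}: expected '<PDS data type> = <Elasticsearch data type>'"
            )
        mappings[pds_type] = es_type
    return mappings

# The four example mappings listed above.
sample_cfg = """\
pds.ASCII_Integer = integer
pds.ASCII_Boolean = boolean
pds.ASCII_Real = double
pds.ASCII_Short_String_Collapsed = keyword
"""

mappings = parse_type_mappings(sample_cfg)
print(mappings["pds.ASCII_Real"])
```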

Registry Schema Update

As described in the Overview section, the registry schema is updated automatically when you load data.

You can also update the registry schema by calling the update-schema command:

registry-manager update-schema -dir /tmp/harvest/out/

Here the -dir parameter points to the Harvest output directory that contains the missing_fields.txt and missing_xsds.txt files.