Registry Customization

Overview

When you create a registry

registry-manager create-registry

two Elasticsearch indices are created:

  • registry - main index for PDS4 product metadata.
  • registry-dd - data dictionary, a list of searchable fields that the main registry index can have.

The default Elasticsearch schemas for these indices (registry.json and data-dic.json) are located in the REGISTRY_HOME/elastic/ directory, where REGISTRY_HOME is the directory in which you installed Registry Manager, for example /home/pds/registry.

The default registry schema defines a few common fields such as lid, vid, title, internal references, and basic file information. There is also a binary field that stores the whole PDS label as a BLOB. The lidvid is the primary key.

Dynamic field mapping is disabled.

...
"mappings": {
    "dynamic": false, 
...
}

After you run Harvest, the output folder (default is /tmp/harvest/out/) contains two files:

  • es-docs.json - metadata extracted from PDS4 labels, stored in Newline-delimited JSON (NJSON) format.
  • fields.txt - a list of field names extracted from PDS4 labels.

When you load data

registry-manager load-data -file /home/pds/harvest/out/es-docs.json

by default, Registry Manager tries to update the registry schema (add more fields) from the fields.txt file. Those fields must exist in the data dictionary index. You can disable the schema update by passing an extra parameter

registry-manager load-data -file /home/pds/harvest/out/es-docs.json -updateSchema n

If you do that, fields that are not in the Elasticsearch schema are not indexed, but they are still available in the "_source" field.

Data Dictionary

When the registry is created, the data dictionary is populated with fields (attributes) from the PDS common dictionary and a few discipline dictionaries. The latest versions of PDS4 data dictionaries in different formats are available on the PDS website.

The following naming convention is used for Elasticsearch fields:

<namespace>/<class name>/<namespace>/<attribute name>

For example,

disp/Display_Direction/disp/vertical_display_direction
geom/Articulation_Device_Parameters/geom/device_id
pds/XML_Schema/pds/name
proc/Software_Program/proc/name
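
The naming convention can be sketched in a few lines of Python. This is only an illustration, not part of Registry Manager: the helper names es_field_name and split_es_field_name are hypothetical, and the sketch assumes that the first namespace belongs to the class and the second to the attribute, and that none of the four parts contains a "/" character.

```python
def es_field_name(class_ns, class_name, attr_ns, attr_name):
    """Build a field name as <namespace>/<class name>/<namespace>/<attribute name>."""
    parts = [class_ns, class_name, attr_ns, attr_name]
    # Assumption: parts are non-empty and never contain the separator.
    if any(not p or "/" in p for p in parts):
        raise ValueError("each part must be non-empty and must not contain '/'")
    return "/".join(parts)


def split_es_field_name(name):
    """Split a field name back into (class_ns, class_name, attr_ns, attr_name)."""
    parts = name.split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError("expected four non-empty parts separated by '/': " + name)
    return tuple(parts)


if __name__ == "__main__":
    print(es_field_name("geom", "Articulation_Device_Parameters", "geom", "device_id"))
    print(split_es_field_name("proc/Software_Program/proc/name"))
```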

You can load more data dictionaries in the following formats:

  • Standard PDS4 data dictionary JSON file, for example, orex_ldd_OREX_1300.JSON
  • Data dump created by 'export-dd' command (NJSON)
  • Custom CSV file

For example, to load a standard PDS4 data dictionary, run the following command:

registry-manager load-dd -dd /home/pds/schema/orex_ldd_OREX_1300.JSON

The Elasticsearch data dictionary schema has the following fields:

  • class_ns - class namespace, e.g., "pds"
  • class_name - class name, e.g., "Element_Array"
  • attr_ns - attribute namespace, e.g., "pds"
  • attr_name - attribute name, e.g., "scaling_factor"
  • data_type - PDS data type, e.g., "ASCII_Real"
  • description - field description
  • es_field_name - Elasticsearch field name, e.g., "pds/Element_Array/pds/scaling_factor"
  • es_data_type - Elasticsearch data type, e.g., "double"

If you load a standard PDS JSON data dictionary, all these fields are populated automatically. If you use a custom CSV file, "es_field_name" and "es_data_type" are required and the other fields are optional. The CSV file must have a header row with data dictionary field names, for example,

es_field_name,es_data_type
my_namespace/My_Class/my_namespace/parameter_a,keyword
my_namespace/My_Class/my_namespace/parameter_b,integer
my_namespace/My_Class/my_namespace/parameter_c,double

To load a custom CSV file, run the following command:

registry-manager load-dd -csv /home/pds/schema/my-fields.csv
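
If the list of custom fields comes from another program, the CSV file can be generated rather than written by hand. The following is a minimal sketch that uses Python's standard csv module and writes only the two required columns. The function name write_dd_csv and the sample rows are hypothetical, and the sketch assumes that Registry Manager accepts ordinary comma-separated rows with a single header line, as in the example above.

```python
import csv
import io


def write_dd_csv(rows, out):
    """Write (es_field_name, es_data_type) pairs as CSV with the required header."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["es_field_name", "es_data_type"])
    for field_name, data_type in rows:
        writer.writerow([field_name, data_type])


if __name__ == "__main__":
    # Hypothetical fields, same shape as the example above.
    rows = [
        ("my_namespace/My_Class/my_namespace/parameter_a", "keyword"),
        ("my_namespace/My_Class/my_namespace/parameter_b", "integer"),
    ]
    buf = io.StringIO()
    write_dd_csv(rows, buf)
    print(buf.getvalue(), end="")
```

To produce a file on disk, pass a file opened with open(path, "w", newline="") instead of the in-memory buffer.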

PDS to Elasticsearch Data Type Mapping

If you load a standard PDS JSON data dictionary, PDS data types, such as pds.ASCII_Real, are automatically mapped to Elasticsearch data types, such as double. The mappings are stored in the REGISTRY_HOME/elastic/data-dic-types.cfg file. You can modify this file to add more mappings or change the default values.

The file has the following format:

<PDS data type> = <Elasticsearch data type>

For example,

pds.ASCII_Integer = integer
pds.ASCII_Boolean = boolean
pds.ASCII_Real = double
pds.ASCII_Short_String_Collapsed = keyword
...
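
Before editing the mapping file, it can be useful to check which Elasticsearch type a given PDS type currently maps to. The sketch below reads text in the "<PDS data type> = <Elasticsearch data type>" format into a dictionary. It is not the parser used by Registry Manager: the function name parse_type_mappings is hypothetical, and lines without an "=" sign (blank lines, or the "..." placeholder in the example above) are simply skipped because the file's comment syntax is not described here.

```python
def parse_type_mappings(text):
    """Parse '<PDS data type> = <Elasticsearch data type>' lines into a dict."""
    mappings = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: anything without '=' is not a mapping and is ignored.
        if "=" not in line:
            continue
        pds_type, es_type = (part.strip() for part in line.split("=", 1))
        if pds_type and es_type:
            mappings[pds_type] = es_type
    return mappings


if __name__ == "__main__":
    sample = """
pds.ASCII_Integer = integer
pds.ASCII_Boolean = boolean
pds.ASCII_Real = double
pds.ASCII_Short_String_Collapsed = keyword
"""
    types = parse_type_mappings(sample)
    print(types["pds.ASCII_Real"])
```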

Registry Schema Update

As described in the Overview section, the registry schema is updated automatically when you load data. You can disable the schema update by passing an extra parameter to the load-data command:

registry-manager load-data -file /home/pds/harvest/out/es-docs.json -updateSchema n

You can also update the registry schema by calling the update-schema command:

registry-manager update-schema -file /home/pds/harvest/out/fields.txt

You can either use the fields.txt file generated by Harvest or use your own file. The file contains a list of the field names you want to add, one entry per line. Each field definition must exist in the data dictionary index. Fields that already exist in the registry index are ignored.
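
If you prepare your own file, the one-entry-per-line format is easy to produce from a script. The following is a minimal sketch: the function name fields_file_text is hypothetical, and dropping blank and duplicate entries is a convenience chosen here, not a documented requirement of update-schema.

```python
def fields_file_text(field_names):
    """Return file content with one field name per line, without blanks or duplicates."""
    seen = set()
    lines = []
    for name in field_names:
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            lines.append(name)
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Field names taken from the naming convention examples in this document.
    names = [
        "pds/XML_Schema/pds/name",
        "proc/Software_Program/proc/name",
        "pds/XML_Schema/pds/name",
    ]
    print(fields_file_text(names), end="")
```

Write the returned text to a file and pass that file to update-schema with the -file parameter.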