Class ExternalTableDefn


public class ExternalTableDefn extends TableDefn
Definition of an external table, primarily for ingestion. The components are derived from those for Druid ingestion: an input source, a format and a set of columns. Also provides properties, as do all table definitions.

Partial Tables and Connections

An input source is a template for an external table. The input source says how to get data, and optionally the format and structure of that data. Since Druid never ingests the same data twice, the actual external table needs details that say which data to read on any specific ingestion. Thus, an external table is usually a "partial table": it holds all the information that remains constant across ingestions, but not the information that changes. The changing information is typically the list of files (or objects or URLs) to ingest.

The pattern is:
external table spec + parameters --> external table

Since an input source is a parameterized (partial) external table, we can reuse the table metadata structures and APIs, avoiding the need to have a separate (but otherwise identical) structure for external tables. An external table can be thought of as a "connection", though Druid does not use that term. When used as a connection, the external table spec will omit the format. Instead, the format will also be provided at ingest time, along with the list of files (or objects).
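The "spec + parameters" pattern can be illustrated with plain Java maps. This is only a minimal sketch: the class PartialSpecSketch, its complete method, and the example keys (type, baseDir, files) are illustrative assumptions, not part of this class or its API. The sketch also assumes that a parameter may not override a value already fixed by the spec.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Druid code base.
final class PartialSpecSketch {

  // Combine the constant part (the partial spec) with the values that change
  // per ingestion (the parameters) to obtain a completed table description.
  static Map<String, Object> complete(Map<String, Object> partialSpec, Map<String, Object> params) {
    Map<String, Object> completed = new LinkedHashMap<>(partialSpec);
    for (Map.Entry<String, Object> entry : params.entrySet()) {
      // Assumption: parameters only fill gaps, they never replace spec values.
      if (completed.containsKey(entry.getKey())) {
        throw new IllegalArgumentException("Already defined in the spec: " + entry.getKey());
      }
      completed.put(entry.getKey(), entry.getValue());
    }
    return completed;
  }

  public static void main(String[] args) {
    Map<String, Object> partial = new LinkedHashMap<>();
    partial.put("type", "local");          // illustrative input source type
    partial.put("baseDir", "/data/logs");  // constant across ingestions

    Map<String, Object> params = new LinkedHashMap<>();
    params.put("files", List.of("2024-01-01.csv")); // changes on each ingestion

    System.out.println(complete(partial, params));
  }
}
```

The partial spec map is copied rather than modified, so the same spec can be completed again with a different file list on the next ingestion.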

To keep all this straight, we adopt the following terms:

External table spec
The JSON serialized version of an external table which can be partial or complete. The spec is a named entry in the Druid catalog.
Complete spec
An external table spec that provides all information needed to access an external table. Each use identifies the same set of data. Useful if MSQ is used to query an external data source. A complete spec can be referenced as a first-class table in a FROM clause in an MSQ query.
Partial spec
An external table spec that omits some information. That information must be provided at query time in the form of a TABLE function. If the partial spec includes a format, then it is essentially a partial table. If it omits the format, then it is essentially a connection.
Completed table
The full external table that results from a partial spec and a set of SQL table function parameters.
Ad-hoc table
An external table that users define using the generic EXTERN function or one of the input-source-specific functions. In this case, there is no catalog entry: all information comes from the SQL table function.
Partial table function
The SQL table function used to "complete" a partial spec. The function defines parameters to fill in the missing information. The function is generated on demand and has the same name as the catalog entry for the partial spec. The function will include parameters for format if the catalog spec does not specify a format. Else, the format parameters are omitted and the completed table uses the format provided in the catalog spec.
Ad-hoc table function
The SQL table function used to create an ad-hoc external table. The function has a name defined by the InputSourceDefn, and has parameters for all supported formats: the user must specify all input source and format properties.
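The rule for which parameters a partial table function exposes can be sketched as follows. The class PartialTableFunctionSketch, its method, and the parameter names are hypothetical and chosen only for illustration. The sketch assumes the parameters are plain names and covers only the format rule described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the parameter rule for partial table functions.
final class PartialTableFunctionSketch {

  // Source parameters are always offered. Format parameters are offered only
  // when the catalog spec does not already define a format.
  static List<String> parameters(
      List<String> sourceParams,
      List<String> formatParams,
      boolean specHasFormat) {
    List<String> params = new ArrayList<>(sourceParams);
    if (!specHasFormat) {
      params.addAll(formatParams);
    }
    return params;
  }

  public static void main(String[] args) {
    List<String> source = List.of("files");             // illustrative names
    List<String> format = List.of("format", "columns"); // illustrative names

    // Partial table: the spec already carries the format.
    System.out.println(parameters(source, format, true));
    // Connection: the format must be supplied at query time.
    System.out.println(parameters(source, format, false));
  }
}
```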

External Table Structure

The external table is generic: it represents all valid combinations of input sources and formats. Rather than have a different table definition for each, we instead split out input sources and formats into their own definitions, and those definitions are integrated and used by this external table class. As a result, the properties field will contain the source property, which holds the JSON-serialized form of the input source (minus items to be parameterized).

Similarly, if the external table also defines a format (rather than requiring the format at ingest time), then the format property holds the JSON-serialized form of the input format, minus columns. The columns can be provided in the spec, in the columns field. The InputFormatDefn converts the columns to the form needed by the input format.

Druid's input sources all require formats. However, some sources may not actually need the format. A JDBC input source, for example, needs no format. In other cases, there may be a subset of formats. Each InputSourceDefn is responsible for working out which formats (if any) are required. This class is agnostic about whether the format is supplied. (Remember that, when used as a connection, the external table will provide no format until ingest time.)

By contrast, the input source is always required.
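As a rough illustration of this structure, the sketch below models the properties field as a map with a required source entry and an optional format entry. The class ExternalSpecPropertiesSketch and its methods are hypothetical. The key names "source" and "format" follow the wording above and are an assumption about the actual values of SOURCE_PROPERTY and FORMAT_PROPERTY. The JSON strings are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the properties held by an external table spec.
final class ExternalSpecPropertiesSketch {
  // Assumed key names, taken from the prose rather than from the constants.
  static final String SOURCE = "source";
  static final String FORMAT = "format";

  // The input source is always required. The format is optional.
  static void validate(Map<String, Object> properties) {
    Object source = properties.get(SOURCE);
    if (!(source instanceof String) || ((String) source).isEmpty()) {
      throw new IllegalArgumentException("An external table spec requires an input source");
    }
  }

  // A spec without a format acts as a "connection".
  static boolean actsAsConnection(Map<String, Object> properties) {
    validate(properties);
    return !properties.containsKey(FORMAT);
  }

  public static void main(String[] args) {
    Map<String, Object> properties = new LinkedHashMap<>();
    properties.put(SOURCE, "{\"type\": \"local\", \"baseDir\": \"/data\"}");
    System.out.println(actsAsConnection(properties));

    properties.put(FORMAT, "{\"type\": \"csv\"}");
    System.out.println(actsAsConnection(properties));
  }
}
```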

Data Formats and Conversions

Much of the code here handles conversion of an external table specification to the form needed by SQL. Since SQL is not visible here, we instead create an instance of ExternalTableSpec which holds the input source, input format and row signature in the form required by SQL.

This class handles table specifications in three forms:

  1. From a fully-defined table specification, converted to an ExternalTableSpec by the convert(ResolvedTable) function.
  2. From a fully-defined set of arguments to a SQL table function. The InputSourceDefn.adHocTableFn() method provides the function definition which handles the conversion.
  3. From a partially-defined table specification in the catalog, augmented by parameters passed from a SQL function. The tableFn(ResolvedTable) method creates the required function by caching the table spec. That function then combines the parameters to produce the required ExternalTableSpec.

To handle these formats, and the need to adjust JSON, conversion to an ExternalTableSpec occurs in multiple steps:

  • When using a table spec, the serialized JSON is first converted to a generic Java map: one for the input source, another for the format.
  • When using a SQL function, the SQL arguments are converted (if needed) and written into a Java map. If the function references an existing table spec, then the JSON map is first populated with the deserialized spec.
  • Validation and/or adjustments are made to the Java map. Adjustments are those described elsewhere in this Javadoc.
  • The column specifications from either SQL or the table spec are converted to a list of column names, and placed into the Java map for the input format.
  • The maps are converted to the InputSource or InputFormat objects using a Jackson conversion.
The actual conversions are handled in the InputSourceDefn and InputFormatDefn classes, either directly (for a fully-defined table function) or starting here (for other use cases).
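The map-based steps above can be sketched for the input format as follows. This is a simplified illustration, not the actual conversion code: the class ConversionStepsSketch, the method buildFormatMap, and the "columns" key are assumptions. The final Jackson conversion to an InputFormat object is left out so that the sketch depends only on the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the map-building steps for the input format.
final class ConversionStepsSketch {

  static Map<String, Object> buildFormatMap(
      Map<String, Object> specFormat,   // deserialized catalog spec, or null if none
      Map<String, Object> sqlArgs,      // converted SQL function arguments
      List<String> columnNames) {       // column names from SQL or from the spec
    Map<String, Object> map = new LinkedHashMap<>();

    // Step 1: start from the deserialized spec, if the function references one.
    if (specFormat != null) {
      map.putAll(specFormat);
    }
    // Step 2: write the SQL arguments into the map.
    map.putAll(sqlArgs);

    // Step 3: validate. Each JSON object needs a type field.
    if (!(map.get("type") instanceof String)) {
      throw new IllegalArgumentException("The format requires a type");
    }
    // Step 4: place the column names into the format map (assumed key name).
    map.put("columns", new ArrayList<>(columnNames));

    // Step 5 (not shown): convert the map to an InputFormat using Jackson.
    return map;
  }

  public static void main(String[] args) {
    Map<String, Object> specFormat = Map.of("type", "csv");
    Map<String, Object> sqlArgs = Map.of();
    System.out.println(buildFormatMap(specFormat, sqlArgs, List.of("ts", "host")));
  }
}
```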

Property and Parameter Names

Pay careful attention to names, which may differ in each of the above cases:
  • The table specification stores the input source and input format specs using the names defined by the classes themselves. That is, the table spec holds a string that represents the Jackson-serialized form of those classes. In some cases, the JSON can be a subset: some sources and formats have obscure checks, or options which are not available via this path. The code that does conversions will adjust the JSON prior to conversion. Each JSON object has a type field: the value of that type field must match that defined in the Jackson annotations for the corresponding class.
  • SQL table functions use argument names that are typically selected for user convenience, and may not be the same as the JSON field name. For example, a field name may be a SQL reserved word, or may be overly long, or may be obscure. The code for each input source and input format definition does the needed conversion.
  • Each input source and input format has a type. The input format type is given, in SQL, by the format property. The format type name is typically the same as the JSON type name, but need not be.
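The name translation between SQL arguments and JSON fields can be sketched with a simple lookup table. Everything here is hypothetical: the class ArgumentNameSketch and the skipRows to skipHeaderRows pair are invented for illustration. Only the use of a SQL format property to carry the format type, which lands in the JSON type field, comes from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of translating SQL argument names to JSON field names.
final class ArgumentNameSketch {
  // SQL argument name -> JSON field name. The second pair is invented.
  private static final Map<String, String> SQL_TO_JSON = Map.of(
      "format", "type",
      "skipRows", "skipHeaderRows");

  static Map<String, Object> toJsonFields(Map<String, Object> sqlArgs) {
    Map<String, Object> json = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : sqlArgs.entrySet()) {
      // Names without an explicit mapping are assumed to be identical.
      String field = SQL_TO_JSON.getOrDefault(entry.getKey(), entry.getKey());
      json.put(field, entry.getValue());
    }
    return json;
  }

  public static void main(String[] args) {
    Map<String, Object> sqlArgs = new LinkedHashMap<>();
    sqlArgs.put("format", "csv");
    sqlArgs.put("skipRows", 1);
    System.out.println(toJsonFields(sqlArgs));
  }
}
```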

Extensions

This class is designed to work both with "well known" Druid input sources and formats, and those defined in an extension. For extension-defined sources and formats to work, the extension must define an InputSourceDefn or InputFormatDefn, which is put into the TableDefnRegistry and is thus available to this class. The result is that this class is ignorant of the actual details of sources and formats: it instead delegates to the input source and input format definitions for that work.

Input sources and input formats defined in an extension are considered "ephemeral": they can go away if the corresponding extension is removed from the system. In that case, any table functions defined by those extensions are no longer available, and any SQL statements that use those functions will no longer work. The catalog may contain an external table spec that references those definitions. Such specs will continue to reside in the catalog, and can be retrieved, but they will fail any query that attempts to reference them.
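The "ephemeral" behavior can be pictured with a small registry sketch. The class DefnRegistrySketch and its nested SourceDefn interface are hypothetical stand-ins and do not reflect the actual TableDefnRegistry API. The sketch only shows that a spec can outlive the definition it depends on, and that resolution fails once that definition is gone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a registry of input source definitions.
final class DefnRegistrySketch {

  // Hypothetical stand-in for an input source definition.
  interface SourceDefn {
    String typeValue();
  }

  private final Map<String, SourceDefn> sourceDefns = new HashMap<>();

  // Called when Druid or an extension contributes a definition.
  void register(SourceDefn defn) {
    sourceDefns.put(defn.typeValue(), defn);
  }

  // Models removal of the extension that supplied the definition.
  void unregister(String type) {
    sourceDefns.remove(type);
  }

  // A catalog spec that names an unknown type can still be stored and
  // retrieved, but resolving it for a query fails here.
  SourceDefn resolve(String type) {
    SourceDefn defn = sourceDefns.get(type);
    if (defn == null) {
      throw new IllegalStateException("No input source definition for type: " + type);
    }
    return defn;
  }

  public static void main(String[] args) {
    DefnRegistrySketch registry = new DefnRegistrySketch();
    registry.register(() -> "custom");
    System.out.println(registry.resolve("custom").typeValue());
  }
}
```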

  • Field Details

    • TABLE_TYPE

      public static final String TABLE_TYPE
      Identifier for external tables.
    • EXTERNAL_COLUMN_TYPE

      public static final String EXTERNAL_COLUMN_TYPE
      Column type for external tables.
    • SOURCE_PROPERTY

      public static final String SOURCE_PROPERTY
      Property which holds the input source specification, serialized as JSON.
    • FORMAT_PROPERTY

      public static final String FORMAT_PROPERTY
      Property which holds the optional input format specification, serialized as JSON.
    • MAP_TYPE_REF

      public static final com.fasterxml.jackson.core.type.TypeReference<Map<String,Object>> MAP_TYPE_REF
      Type reference used to deserialize JSON to a generic map.
  • Constructor Details

    • ExternalTableDefn

      public ExternalTableDefn()
  • Method Details

    • bind

      public void bind(TableDefnRegistry registry)
      Description copied from class: TableDefn
      Called after the table definition is added to the registry, along with all other definitions. Allows external tables to look up additional information, such as the set of input formats.
      Overrides:
      bind in class TableDefn
    • validate

      public void validate(ResolvedTable table)
      Description copied from class: TableDefn
      Validate a table spec using the table, field and column definitions defined here. The column definitions validate the type of each property value using the object mapper.
      Overrides:
      validate in class TableDefn
    • tableFn

      public TableFunction tableFn(ResolvedTable table)
      Return a table function definition for a partial table as given by the catalog table spec. The function defines parameters to gather the values needed to convert the partial table into a fully-defined table which can be converted to an ExternalTableSpec.
    • validateColumn

      protected void validateColumn(ColumnSpec colSpec)
      Description copied from class: TableDefn
      Table-specific validation of a column spec. Override for table definitions that need table-specific validation rules.
      Overrides:
      validateColumn in class TableDefn
    • convert

      public ExternalTableSpec convert(ResolvedTable table)
      Return the ExternalTableSpec for a catalog entry for a fully-defined table. This form exists for completeness, since ingestion never reads the same data twice. This form is handy for tests, and will become generally useful when MSQ fully supports queries and those queries can read from external tables.
    • isExternalTable

      public static boolean isExternalTable(ResolvedTable table)