Class CoercionUtils

java.lang.Object
com.linkedin.feathr.common.util.CoercionUtils

public class CoercionUtils extends Object
Utilities to coerce untyped data into feature vectors. This is needed to make it easy to define features via simple expressions. As a motivating example, I might want to say "A" instead of {"A": 1}, or I might want to say ["A", "X", "Z"] instead of {"A": 1, "X": 1, "Z": 1}.
  • Method Details

    • coerceToVector

      public static Map<String,Float> coerceToVector(Object item, FeatureTypes featureType)
      Coerce an item to term vector format according to the provided FeatureTypes. General rules for the dimensions (terms) and values in the resulting term vector:
      1. Terms are generated from Character (as is), CharSequence (as is) and Number (whole numbers only)
      2. Values are generated from the float representations of Numbers
      Specific rules for each FeatureTypes are as follows:

      FeatureTypes.BOOLEAN accepts Boolean as input, which is interpreted as a scalar feature value with the following encoding:
        true => { "": 1.0f }
        false => { } (empty map)

      FeatureTypes.NUMERIC accepts Number as input, which is interpreted as a scalar value without any names/dimensions. The encoding produces vectors having a "unit" dimension, represented by the empty string, with the float value.
        0.12345f => { "": 0.12345f }
        100.01d => { "": 100.01f }
        BigDecimal.TEN => { "": 10f }

      FeatureTypes.DENSE_VECTOR accepts List<Number> as input. The terms are the original indices of the elements in the input list.
        [ 10.0f, 3.0f, 6.0f ] => { "0": 10.0f, "1": 3.0f, "2": 6.0f }
        [ 10, 20, 30 ] => { "0": 10f, "1": 20f, "2": 30f }

      FeatureTypes.CATEGORICAL accepts Character, CharSequence and whole Number as input. The term is the string representation of the input with a value of 1.0f.
        "foo" => { "foo": 1.0f }
        2.0000f => { "2": 1.0f } // whole number
        2.000000000d => { "2": 1.0f } // whole number
        1.4f => throws an error since 1.4f cannot be treated as a whole number

      FeatureTypes.CATEGORICAL_SET accepts List<String> as input, which is represented as a term vector with each string term having a value of 1.0f. Values in the input list are de-duped: no matter how many times an element appears in the list, its value is always 1.0f.
        ["A", "B", "C"] => { "A": 1.0f, "B": 1.0f, "C": 1.0f }
        [100, 200, 300] => { "100": 1.0f, "200": 1.0f, "300": 1.0f }
        [100, 1.5f, 200] => throws an error since 1.5f cannot be treated as a whole number

      FeatureTypes.TERM_VECTOR accepts Map and List<Map> as inputs:
      1. A standard Map is interpreted as a vector with minimal munging to ensure that each dimension can be safely encoded as a string and each value as a float.
      2. A List<Map> is first merged into a single map and then handled as a Map as described above. If any key is repeated, an error is thrown.
        { 123: 20.0f } => { "123": 20.0f }
        { 123.000f: 1.0d } => { "123": 1.0f } // acceptable since 123.000f can be treated as the whole number 123
        { 123: "1" } => throws an error since "1" is not a Number
        { 123.1f: 1 } => throws an error since 123.1f cannot be treated as a whole number
      Throws:
      RuntimeException - if the input item does not match the data format expected by the feature type
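The encoding rules above can be sketched in plain Java. This is only an illustrative re-implementation under stated assumptions, not the Feathr source: the class `CoerceToVectorSketch`, the enum `Kind` (a stand-in for FeatureTypes) and all helper names are hypothetical, and the whole-number check is exact because the documentation does not state the precision the library tolerates.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative re-implementation of the documented rules; not the Feathr source.
final class CoerceToVectorSketch {
  // Hypothetical stand-in for FeatureTypes.
  enum Kind { BOOLEAN, NUMERIC, DENSE_VECTOR, CATEGORICAL, CATEGORICAL_SET, TERM_VECTOR }

  // Term rule: Character and CharSequence as is, Number only when it is whole.
  static String term(Object o) {
    if (o instanceof CharSequence || o instanceof Character) {
      return o.toString();
    }
    if (o instanceof Number) {
      double d = ((Number) o).doubleValue();
      // Assumption: exact wholeness check; the real tolerance is not documented.
      if (!Double.isInfinite(d) && d == Math.rint(d)) {
        return Long.toString((long) d);
      }
    }
    throw new RuntimeException("Not usable as a term: " + o);
  }

  // Value rule: the float representation of a Number.
  static float value(Object o) {
    if (o instanceof Number) {
      return ((Number) o).floatValue();
    }
    throw new RuntimeException("Not usable as a value: " + o);
  }

  static List<?> list(Object item) {
    if (item instanceof List) {
      return (List<?>) item;
    }
    throw new RuntimeException("Expected a List: " + item);
  }

  // Adds one map's entries to the output, rejecting repeated keys.
  static void putAll(Object map, Map<String, Float> out) {
    if (!(map instanceof Map)) {
      throw new RuntimeException("Expected a Map: " + map);
    }
    for (Map.Entry<?, ?> e : ((Map<?, ?>) map).entrySet()) {
      if (out.put(term(e.getKey()), value(e.getValue())) != null) {
        throw new RuntimeException("Repeated key: " + e.getKey());
      }
    }
  }

  static Map<String, Float> coerce(Object item, Kind kind) {
    Map<String, Float> out = new LinkedHashMap<>();
    switch (kind) {
      case BOOLEAN:
        if (!(item instanceof Boolean)) {
          throw new RuntimeException("Expected a Boolean: " + item);
        }
        if ((Boolean) item) {
          out.put("", 1.0f); // true => { "": 1.0f }, false => { }
        }
        return out;
      case NUMERIC:
        out.put("", value(item)); // the empty string is the "unit" dimension
        return out;
      case DENSE_VECTOR: {
        List<?> elements = list(item);
        for (int i = 0; i < elements.size(); i++) {
          out.put(Integer.toString(i), value(elements.get(i)));
        }
        return out;
      }
      case CATEGORICAL:
        out.put(term(item), 1.0f);
        return out;
      case CATEGORICAL_SET:
        for (Object element : list(item)) {
          out.put(term(element), 1.0f); // duplicates collapse to one entry
        }
        return out;
      case TERM_VECTOR:
        if (item instanceof Map) {
          putAll(item, out);
        } else {
          for (Object map : list(item)) {
            putAll(map, out);
          }
        }
        return out;
      default:
        throw new RuntimeException("Unsupported kind: " + kind);
    }
  }
}
```

In this sketch every mismatch surfaces as a RuntimeException, mirroring the Throws clause above. One consequence of the sketch's choices, which the documentation neither confirms nor rules out, is that two keys within a single map that coerce to the same term (for example 123 and 123.0f) are also rejected as repeated.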
    • coerceToVector

      public static Map<String,Float> coerceToVector(Object item)
      Coerce an item to a term vector map, inferring the feature type from the item's Java type. Basic rules:
      1. Treat a single number (int, float, double, etc.) as FeatureTypes.NUMERIC
      2. Treat a single string as FeatureTypes.CATEGORICAL
      3. Treat a vector of numbers (int, float, double, etc.) as FeatureTypes.DENSE_VECTOR
      4. Treat a collection of strings as FeatureTypes.CATEGORICAL_SET
      5. Treat a map or a list of maps as FeatureTypes.TERM_VECTOR
      6. Treat a FeatureValue as FeatureTypes.TERM_VECTOR
      This function may be used to handle default values from a JSON configuration file and field values extracted by MVEL expressions.
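The inference rules above can be sketched as a chain of type checks. This is a hypothetical illustration, not the library's implementation: `InferTypeSketch` and its enum `Inferred` are invented names, the FeatureValue rule is omitted so that the example has no Feathr dependency, and rejecting empty or mixed collections is an assumption, since the documentation does not say how they are handled.

```java
import java.util.Collection;
import java.util.Map;

// Illustrative type inference following the documented rules; not the Feathr source.
final class InferTypeSketch {
  // Hypothetical stand-in for the FeatureTypes values named in the rules.
  enum Inferred { NUMERIC, CATEGORICAL, DENSE_VECTOR, CATEGORICAL_SET, TERM_VECTOR }

  static Inferred infer(Object item) {
    if (item instanceof Number) {
      return Inferred.NUMERIC;         // rule 1: single number
    }
    if (item instanceof CharSequence) {
      return Inferred.CATEGORICAL;     // rule 2: single string
    }
    if (item instanceof Map) {
      return Inferred.TERM_VECTOR;     // rule 5: map
    }
    if (item instanceof Collection) {
      Collection<?> elements = (Collection<?>) item;
      // Assumption: an empty collection carries no type information.
      if (elements.isEmpty()) {
        throw new RuntimeException("Cannot infer a type from an empty collection");
      }
      if (allInstanceOf(elements, Number.class)) {
        return Inferred.DENSE_VECTOR;    // rule 3: vector of numbers
      }
      if (allInstanceOf(elements, CharSequence.class)) {
        return Inferred.CATEGORICAL_SET; // rule 4: collection of strings
      }
      if (allInstanceOf(elements, Map.class)) {
        return Inferred.TERM_VECTOR;     // rule 5: list of maps
      }
    }
    throw new RuntimeException("Cannot infer a feature type for: " + item);
  }

  // True only if every element is a non-null instance of the given type.
  static boolean allInstanceOf(Collection<?> elements, Class<?> type) {
    for (Object element : elements) {
      if (!type.isInstance(element)) {
        return false;
      }
    }
    return true;
  }
}
```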
    • getCoercedFeatureType

      public static FeatureTypes getCoercedFeatureType(Object item)
      Get the feature type that the input item would be coerced to
      Parameters:
      item - input item to coerce
      Returns:
      coerced feature type
    • isNumeric

      public static boolean isNumeric(FeatureValue featureValue)
      Returns true if the input FeatureValue is a numeric feature and false otherwise. Numeric features, when represented as a term vector, have the form: {""=3.0f}
    • isBoolean

      public static boolean isBoolean(FeatureValue featureValue)
      Returns true if the input FeatureValue is a boolean feature and false otherwise. Boolean features, when represented as a term vector, have the form: {""=1.0f} (true) or an empty map (false)
    • isCategorical

      public static boolean isCategorical(FeatureValue featureValue)
      Returns true if the input FeatureValue is a categorical feature and false otherwise. Categorical features, when represented as a term vector, have the form: {"term"=1.0f}
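The three term-vector forms described for isNumeric, isBoolean and isCategorical can be checked on a plain Map<String,Float>. The sketch below is an assumption-laden illustration rather than the library's logic: `TermVectorShapeSketch` and its method names are hypothetical, it operates on a map instead of a FeatureValue, and it tests shape only. Note that the shapes overlap: {""=1.0f} fits all three forms as written, so how the real methods disambiguate is not something this documentation states.

```java
import java.util.Map;

// Shape-only checks on a term vector; not the Feathr implementation.
final class TermVectorShapeSketch {
  private static final Float ONE = 1.0f;

  // Numeric form: exactly one entry, keyed by the empty string.
  static boolean looksNumeric(Map<String, Float> vector) {
    return vector.size() == 1 && vector.containsKey("");
  }

  // Boolean form: the empty map (false) or { "": 1.0f } (true).
  static boolean looksBoolean(Map<String, Float> vector) {
    return vector.isEmpty() || (vector.size() == 1 && ONE.equals(vector.get("")));
  }

  // Categorical form: exactly one entry whose value is 1.0f.
  // Assumption: the sketch does not exclude the empty-string term.
  static boolean looksCategorical(Map<String, Float> vector) {
    return vector.size() == 1 && ONE.equals(vector.values().iterator().next());
  }
}
```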
    • safeToString

      public static String safeToString(Object item)
      Safely convert an input object into its string representation. Safe conversions are supported for the following:
      1. CharSequence and Character are simply converted to String
      2. Number is converted to String from its longValue()
      Throws:
      RuntimeException - if the input Number is not a whole number within some precision
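A minimal sketch of the conversion described above, assuming a tolerance of 1e-6 for "whole number within some precision". The documentation does not give the actual precision, so `EPSILON` is a placeholder, and the class `SafeToStringSketch` is hypothetical. The sketch also rounds to the nearest long before printing, which may differ from the library's use of longValue() for inputs that sit just below a whole number.

```java
// Illustrative version of the documented conversion; not the Feathr source.
final class SafeToStringSketch {
  // Assumed tolerance; the real precision is not documented.
  static final double EPSILON = 1e-6;

  static String safeToString(Object item) {
    if (item instanceof CharSequence || item instanceof Character) {
      return item.toString();
    }
    if (item instanceof Number) {
      double d = ((Number) item).doubleValue();
      long nearest = Math.round(d);
      boolean finite = !Double.isNaN(d) && !Double.isInfinite(d);
      if (finite && Math.abs(d - nearest) <= EPSILON) {
        return Long.toString(nearest);
      }
      throw new RuntimeException("Not a whole number: " + item);
    }
    throw new RuntimeException("Unsupported type for safe conversion: " + item);
  }
}
```

With this tolerance, 123.000f converts to "123", while 123.1f and 1.4f are rejected, matching the examples given for coerceToVector.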
    • safeToFloat

      public static float safeToFloat(Object item)
      Safely convert an input object to its float representation. If the input is not a valid float, an error is thrown.