Package com.linkedin.feathr.common.util
Class CoercionUtils
java.lang.Object
com.linkedin.feathr.common.util.CoercionUtils
Utilities to coerce untyped data into feature vectors.
This is needed in order to make it easy to define features via simple expressions.
As a motivating example, I might want to say "A" instead of {"A": 1},
or I might want to say ["A", "X", "Z"] instead of {"A":1, "X":1, "Z":1}.
-
Method Summary
Modifier and Type   Method
    Description
coerceToVector(Object item)
    Coerce item to a term vector map, inferring the feature type from the item's type.
coerceToVector(Object item, FeatureTypes featureType)
    Coerce an item to term vector format according to the provided FeatureTypes.
static FeatureTypes   getCoercedFeatureType(Object item)
    Get the feature type that the input item would be coerced to.
static boolean   isBoolean(FeatureValue featureValue)
    Returns true if the input FeatureValue is a boolean feature and false otherwise. Boolean features, when represented as a term vector, have the form {""=1.0f} (true) or the empty map (false).
static boolean   isCategorical(FeatureValue featureValue)
    Returns true if the input FeatureValue is a categorical feature and false otherwise. Categorical features, when represented as a term vector, have the form {"term"=1.0f}.
static boolean   isNumeric(FeatureValue featureValue)
    Returns true if the input FeatureValue is a numeric feature and false otherwise. Numeric features, when represented as a term vector, have the form {""=3.0f}.
static float   safeToFloat(Object item)
    Safely convert an input object to its float representation.
static String   safeToString(Object item)
    Safely convert an input object into its string representation.
-
Method Details
-
coerceToVector
Coerce an item to term vector format according to the provided FeatureTypes.
General rules for the dimensions (terms) and values in the resulting term vector:
- Terms are generated from Character (as is), CharSequence (as is) and Number (whole numbers only)
- Values are generated from the float representations of Numbers
The supported FeatureTypes are as follows:
- FeatureTypes.BOOLEAN accepts Boolean as input, interpreted as a scalar feature value with the following encoding:
    true => { "": 1.0f }
    false => { } (empty map)
- FeatureTypes.NUMERIC accepts Number as input, interpreted as a scalar value without any names/dimensions. The encoding produces a vector with a single "unit" dimension, represented by the empty string, that holds the float value.
    0.12345f => { "": 0.12345f }
    100.01d => { "": 100.01f }
    BigDecimal.TEN => { "": 10f }
- FeatureTypes.DENSE_VECTOR accepts List<Number> as input. The terms are the original indices of the elements in the input list.
    [ 10.0f, 3.0f, 6.0f ] => { "0": 10.0f, "1": 3.0f, "2": 6.0f }
    [ 10, 20, 30 ] => { "0": 10f, "1": 20f, "2": 30f }
- FeatureTypes.CATEGORICAL accepts Character, CharSequence and whole Number as input. The term is the string representation of the input, with a value of 1.0f.
    "foo" => { "foo": 1.0f }
    2.0000f => { "2": 1.0f } // whole number
    2.000000000d => { "2": 1.0f } // whole number
    1.4f => throws an error since 1.4f cannot be treated as a whole number
- FeatureTypes.CATEGORICAL_SET accepts List<String> as input, represented as a term vector in which each string term has a value of 1.0f. No matter how many times an element appears in the list, its value is always 1.0f; values in the input list are de-duplicated.
    ["A", "B", "C"] => { "A": 1.0f, "B": 1.0f, "C": 1.0f }
    [100, 200, 300] => { "100": 1.0f, "200": 1.0f, "300": 1.0f }
    [100, 1.5f, 200] => throws an error since 1.5f cannot be treated as a whole number
- FeatureTypes.TERM_VECTOR accepts Map and List<Map> as inputs:
    1) Standard Maps are interpreted as vectors, with minimal munging to ensure that the dimension can be safely encoded as a string and the value as a float.
    2) List<Map> inputs are first merged into a single map and then handled as a Map as described above. If any key is repeated, an error is thrown.
    { 123: 20.0f } => { "123": 20.0f }
    { 123.000f: 1.0d } => { "123": 1.0f } // acceptable since 123.000f can be treated as the whole number 123
    { 123: "1" } => throws an error since "1" is not a Number
    { 123.1f: 1 } => throws an error since 123.1f cannot be treated as a whole number
Throws:
RuntimeException - when the item does not match the data format expected for the feature type
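The encoding rules above can be sketched as a small standalone class. This is an illustrative re-implementation under stated assumptions, not the Feathr source: `TypedCoercionSketch`, its `Kind` enum (a stand-in for FeatureTypes) and the helper names are hypothetical, and the whole-number check is exact, whereas the library is only documented as checking "within some precision".

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative re-implementation of the documented rules; NOT the Feathr source.
class TypedCoercionSketch {
    // Hypothetical stand-in for FeatureTypes.
    enum Kind { BOOLEAN, NUMERIC, DENSE_VECTOR, CATEGORICAL, CATEGORICAL_SET, TERM_VECTOR }

    // Terms: Character and CharSequence as is, Number only when it is whole.
    static String term(Object o) {
        if (o instanceof CharSequence || o instanceof Character) {
            return o.toString();
        }
        if (o instanceof Number) {
            double d = ((Number) o).doubleValue();
            if (d != Math.rint(d)) { // assumption: exact check, no tolerance
                throw new RuntimeException(o + " cannot be treated as a whole number");
            }
            return Long.toString((long) d);
        }
        throw new RuntimeException("Unsupported term: " + o);
    }

    // Values: the float representation of a Number.
    static float value(Object o) {
        if (o instanceof Number) {
            return ((Number) o).floatValue();
        }
        throw new RuntimeException(o + " is not a Number");
    }

    static List<?> asList(Object item) {
        if (item instanceof List) {
            return (List<?>) item;
        }
        throw new RuntimeException("Expected a List but got: " + item);
    }

    static Map<String, Float> coerce(Object item, Kind kind) {
        Map<String, Float> out = new LinkedHashMap<>();
        switch (kind) {
            case BOOLEAN:
                if (!(item instanceof Boolean)) {
                    throw new RuntimeException("Expected a Boolean but got: " + item);
                }
                if ((Boolean) item) {
                    out.put("", 1.0f); // false stays as the empty map
                }
                break;
            case NUMERIC:
                out.put("", value(item)); // single "unit" dimension
                break;
            case DENSE_VECTOR: {
                List<?> list = asList(item);
                for (int i = 0; i < list.size(); i++) {
                    out.put(Integer.toString(i), value(list.get(i)));
                }
                break;
            }
            case CATEGORICAL:
                out.put(term(item), 1.0f);
                break;
            case CATEGORICAL_SET:
                for (Object o : asList(item)) {
                    out.put(term(o), 1.0f); // repeated elements collapse to one term
                }
                break;
            case TERM_VECTOR: {
                List<?> maps = item instanceof Map ? Collections.singletonList(item) : asList(item);
                for (Object m : maps) {
                    if (!(m instanceof Map)) {
                        throw new RuntimeException("Expected a Map but got: " + m);
                    }
                    for (Map.Entry<?, ?> e : ((Map<?, ?>) m).entrySet()) {
                        if (out.put(term(e.getKey()), value(e.getValue())) != null) {
                            throw new RuntimeException("Repeated key: " + e.getKey());
                        }
                    }
                }
                break;
            }
        }
        return out;
    }
}
```

For example, `TypedCoercionSketch.coerce(2.0f, TypedCoercionSketch.Kind.CATEGORICAL)` yields a map with the single term "2", while passing 1.4f with the same kind throws a RuntimeException.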
-
coerceToVector
Coerce item to a term vector map, inferring the feature type from the item's type.
Basic rules:
1. Treat a single number (int, float, double, etc.) as FeatureTypes.NUMERIC
2. Treat a single string as FeatureTypes.CATEGORICAL
3. Treat a vector of numbers (int, float, double, etc.) as FeatureTypes.DENSE_VECTOR
4. Treat a collection of strings as FeatureTypes.CATEGORICAL_SET
5. Treat a map or a list of maps as FeatureTypes.TERM_VECTOR
6. Treat a FeatureValue as FeatureTypes.TERM_VECTOR
This function may be used to handle default values from a configuration JSON file and field values extracted by MVEL expressions.
-
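The inference rules for the single-argument overload can be sketched as a plain type dispatch. The class `InferenceSketch` and its `Kind` enum are hypothetical names used only for illustration. The sketch assumes a collection's kind can be decided from its first element and that an empty collection is rejected; neither detail is stated in the documentation. The FeatureValue rule is omitted because that type is not available in a standalone example.

```java
import java.util.Collection;
import java.util.Map;

// Illustrative type dispatch following the documented inference rules; NOT the Feathr source.
class InferenceSketch {
    // Hypothetical stand-in for the FeatureTypes values named in the rules.
    enum Kind { NUMERIC, CATEGORICAL, DENSE_VECTOR, CATEGORICAL_SET, TERM_VECTOR }

    static Kind infer(Object item) {
        if (item instanceof Number) {
            return Kind.NUMERIC;            // rule 1: single number
        }
        if (item instanceof CharSequence) {
            return Kind.CATEGORICAL;        // rule 2: single string
        }
        if (item instanceof Map) {
            return Kind.TERM_VECTOR;        // rule 5: map
        }
        if (item instanceof Collection) {
            Collection<?> c = (Collection<?>) item;
            if (c.isEmpty()) {
                // Assumption: an empty collection carries no element type to infer from.
                throw new IllegalArgumentException("Cannot infer a type from an empty collection");
            }
            Object first = c.iterator().next(); // assumption: the first element decides
            if (first instanceof Number) {
                return Kind.DENSE_VECTOR;   // rule 3: vector of numbers
            }
            if (first instanceof CharSequence) {
                return Kind.CATEGORICAL_SET; // rule 4: collection of strings
            }
            if (first instanceof Map) {
                return Kind.TERM_VECTOR;    // rule 5: list of maps
            }
        }
        throw new IllegalArgumentException("No inference rule for " + item);
    }
}
```

Under these assumptions, `InferenceSketch.infer("A")` returns CATEGORICAL and a list such as ["A", "X", "Z"] returns CATEGORICAL_SET, which matches the motivating example in the class description.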
getCoercedFeatureType
Get the feature type that the input item would be coerced to.
Parameters:
item - input item to coerce
Returns:
coerced feature type
-
isNumeric
Returns true if the input FeatureValue is a numeric feature and false otherwise. Numeric features, when represented as a term vector, have the form {""=3.0f}.
-
isBoolean
Returns true if the input FeatureValue is a boolean feature and false otherwise. Boolean features, when represented as a term vector, have the form {""=1.0f} (true) or the empty map (false).
-
isCategorical
Returns true if the input FeatureValue is a categorical feature and false otherwise. Categorical features, when represented as a term vector, have the form {"term"=1.0f}.
-
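The three term-vector shapes described for isNumeric, isBoolean and isCategorical can be checked on a plain `Map<String, Float>`. The sketch below is an assumption-laden illustration: `TermVectorShapeSketch` and its method names are hypothetical, and it works on a raw map rather than on a FeatureValue. Note that in this sketch the map {""=1.0f} matches all three shapes; how the library resolves that overlap is not stated in the documentation.

```java
import java.util.Map;

// Illustrative shape checks on a raw term vector map; NOT the Feathr source.
class TermVectorShapeSketch {
    // Numeric shape: exactly one entry, keyed by the empty string, e.g. {""=3.0f}.
    static boolean looksNumeric(Map<String, Float> vector) {
        return vector.size() == 1 && vector.containsKey("");
    }

    // Boolean shape: {""=1.0f} for true, the empty map for false.
    static boolean looksBoolean(Map<String, Float> vector) {
        return vector.isEmpty()
                || (vector.size() == 1 && Float.valueOf(1.0f).equals(vector.get("")));
    }

    // Categorical shape: exactly one term whose value is 1.0f, e.g. {"term"=1.0f}.
    static boolean looksCategorical(Map<String, Float> vector) {
        return vector.size() == 1
                && Float.valueOf(1.0f).equals(vector.values().iterator().next());
    }
}
```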
safeToString
Safely convert an input object into its string representation. Safe conversions are supported for the following:
- CharSequence and Character are simply converted to String
- Number is converted to String from its longValue()
Throws:
RuntimeException - if the input Number is not a whole number within some precision
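A minimal sketch of a "whole number within some precision" check follows. The class name `SafeToStringSketch` and the tolerance `EPSILON` are assumptions; the documentation does not give the actual precision. The sketch also rounds to the nearest long instead of calling longValue(), so that a value just below a whole number is not truncated downward; whether the library behaves this way is not stated.

```java
// Illustrative tolerance-based conversion; NOT the Feathr source.
class SafeToStringSketch {
    // Hypothetical tolerance; the documentation only says "within some precision".
    static final double EPSILON = 1e-6;

    static String safeToString(Object item) {
        if (item instanceof CharSequence || item instanceof Character) {
            return item.toString(); // used as is
        }
        if (item instanceof Number) {
            double d = ((Number) item).doubleValue();
            long nearest = Math.round(d);
            // Written as a negated <= so that NaN and infinities are rejected too.
            if (!(Math.abs(d - nearest) <= EPSILON)) {
                throw new RuntimeException(item + " is not a whole number");
            }
            return Long.toString(nearest);
        }
        throw new RuntimeException("Cannot convert " + item + " to a String term");
    }
}
```

With this tolerance, 2.0000f converts to "2" and 1.4f is rejected, matching the CATEGORICAL examples given for coerceToVector.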
-
safeToFloat
Safely convert an input object to its float representation. If the input is not a valid float, an error is thrown.
-