Class TermNormalizer
- java.lang.Object
  - de.julielab.jules.ae.genemapping.utils.norm.TermNormalizer
-
public class TermNormalizer extends java.lang.Object
-
Constructor Summary
Constructor | Description
TermNormalizer() |
-
Method Summary
Modifier and Type | Method | Description
java.util.List<java.lang.String> | generateVariants(java.lang.String term) |
boolean | isNonDescriptive(java.lang.String term) |
static void | main(java.lang.String[] args) | Run the term normalizer on a file to be normalized (biothesaurus?).
java.lang.String | normalize(java.lang.String term) | Normalize a single synonym.
void | normalizeFile(java.io.File inputFile, java.io.File outputFile) | Normalize all synonyms in a file (biothesaurus) where the first column is the synonym and the second column is the ID.
java.lang.String | removeNonDescriptives(java.lang.String term) |
java.util.List<java.lang.String> | splitAwayRomanNumbers(java.util.List<java.lang.String> term) |
java.lang.String | stemNameTokens(java.lang.String normalizedTerm) | Splits the input term at white spaces, stems the resulting tokens and joins the stemmed tokens with white spaces again.
-
Method Detail
-
main
public static void main(java.lang.String[] args) throws java.io.IOException
Run the term normalizer on a file to be normalized (biothesaurus?).
Parameters:
  args -
Throws:
  java.io.IOException
-
normalize
public java.lang.String normalize(java.lang.String term)
Normalize a single synonym.
Parameters:
  term -
Returns:
-
generateVariants
public java.util.List<java.lang.String> generateVariants(java.lang.String term)
-
stemNameTokens
public java.lang.String stemNameTokens(java.lang.String normalizedTerm) throws java.io.IOException
Splits the input term at white spaces, stems the resulting tokens and joins the stemmed tokens with white spaces again.
Parameters:
  normalizedTerm -
Returns:
Throws:
  java.io.IOException
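The split, stem, and join steps described for stemNameTokens can be sketched in isolation as follows. This is a minimal illustration and not the TermNormalizer implementation: the class StemTokensSketch, the method stemTokens, and the toy stemmer toyStem are hypothetical names, and the stemmer that TermNormalizer actually uses is not stated in this documentation.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Sketch only: hypothetical class, not part of the documented library.
class StemTokensSketch {

    // Hypothetical toy stemmer: strips a trailing "s" from tokens longer
    // than three characters. A real stemmer would be plugged in here.
    static String toyStem(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Splits the term at white space, applies the stemmer to every token,
    // and joins the stemmed tokens with single spaces.
    static String stemTokens(String normalizedTerm, UnaryOperator<String> stemmer) {
        return Arrays.stream(normalizedTerm.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .map(stemmer)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // prints "growth factor receptor"
        System.out.println(stemTokens("growth factor receptors", StemTokensSketch::toyStem));
    }
}
```

Passing the stemmer as a function keeps the tokenization logic independent of whichever stemming algorithm is chosen.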
-
normalizeFile
public void normalizeFile(java.io.File inputFile, java.io.File outputFile)
Normalize all synonyms in a file (biothesaurus) where the first column is the synonym and the second column is the ID. All other columns are ignored. Columns have to be tab-separated.
Parameters:
  inputFile - the input file (biothesaurus)
  outputFile - output file for normalized synonyms
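To make the expected input layout concrete, the following sketch processes lines of the form synonym, tab, ID, with any further columns ignored. It is an illustration under assumptions, not the normalizeFile code: the class SynonymFileSketch and the method normalizeLines are hypothetical, the output layout (normalized synonym, tab, ID) is assumed because the documentation does not specify it, and skipping lines with fewer than two columns is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Sketch only: hypothetical class, not part of the documented library.
class SynonymFileSketch {

    // Takes "synonym<TAB>id[<TAB>ignored columns...]" lines and returns
    // "normalizedSynonym<TAB>id" lines (assumed output layout).
    static List<String> normalizeLines(List<String> lines, UnaryOperator<String> normalizer) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] columns = line.split("\t");
            if (columns.length < 2) {
                continue; // assumption: skip lines lacking a synonym or an ID
            }
            // Only the first two columns are used; the rest are ignored.
            out.add(normalizer.apply(columns[0]) + "\t" + columns[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Example Synonym\tid-1\tignored column",
                "line without tab");
        // Lowercasing is a placeholder for the real normalization step.
        for (String line : normalizeLines(input, s -> s.toLowerCase(Locale.ROOT))) {
            System.out.println(line);
        }
    }
}
```

In real use the lines would come from inputFile and the results would be written to outputFile; the in-memory lists here only keep the sketch self-contained.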
-
splitAwayRomanNumbers
public java.util.List<java.lang.String> splitAwayRomanNumbers(java.util.List<java.lang.String> term)
-
removeNonDescriptives
public java.lang.String removeNonDescriptives(java.lang.String term)
-
isNonDescriptive
public boolean isNonDescriptive(java.lang.String term)