Class TermNormalizer


  • public class TermNormalizer
    extends java.lang.Object
    • Constructor Summary

      Constructors 
      Constructor Description
      TermNormalizer()  
    • Method Summary

      Modifier and Type Method Description
      java.util.List<java.lang.String> generateVariants(java.lang.String term)
      boolean isNonDescriptive(java.lang.String term)
      static void main(java.lang.String[] args)
      Runs the term normalizer on a file to be normalized (presumably a BioThesaurus file).
      java.lang.String normalize(java.lang.String term)
      Normalizes a single synonym.
      void normalizeFile(java.io.File inputFile, java.io.File outputFile)
      Normalizes all synonyms in a file (BioThesaurus) where the first column is the synonym and the second column is the ID.
      java.lang.String removeNonDescriptives(java.lang.String term)
      java.util.List<java.lang.String> splitAwayRomanNumbers(java.util.List<java.lang.String> term)
      java.lang.String stemNameTokens(java.lang.String normalizedTerm)
      Splits the input term at whitespace, stems the resulting tokens, and joins the stemmed tokens with whitespace again.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TermNormalizer

        public TermNormalizer()
    • Method Detail

      • main

        public static void main(java.lang.String[] args)
                         throws java.io.IOException
        Runs the term normalizer on a file to be normalized (presumably a BioThesaurus file).
        Parameters:
        args - command-line arguments
        Throws:
        java.io.IOException
      • normalize

        public java.lang.String normalize(java.lang.String term)
        Normalizes a single synonym.
        Parameters:
        term - the synonym to normalize
        Returns:
        the normalized synonym
      • generateVariants

        public java.util.List<java.lang.String> generateVariants(java.lang.String term)
      • stemNameTokens

        public java.lang.String stemNameTokens(java.lang.String normalizedTerm)
                                        throws java.io.IOException
        Splits the input term at whitespace, stems the resulting tokens, and joins the stemmed tokens with whitespace again.
        Parameters:
        normalizedTerm - the normalized term whose tokens are stemmed
        Returns:
        the stemmed tokens, joined with whitespace
        Throws:
        java.io.IOException
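        The split, stem, and join steps described for stemNameTokens can be illustrated with a minimal sketch. This is not the class's actual implementation: the class name StemTokensSketch, the method stemTokens, and the TOY_STEMMER function (which only strips a trailing "s") are hypothetical stand-ins, since this page does not say which stemmer TermNormalizer uses.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class StemTokensSketch {

    // Hypothetical stand-in for a real stemmer (assumption): strips a
    // trailing "s" from tokens longer than three characters.
    static final UnaryOperator<String> TOY_STEMMER =
            token -> token.length() > 3 && token.endsWith("s")
                    ? token.substring(0, token.length() - 1)
                    : token;

    // Splits the term at runs of whitespace, applies the stemmer to each
    // token, and joins the stemmed tokens with single spaces.
    static String stemTokens(String normalizedTerm, UnaryOperator<String> stemmer) {
        String trimmed = normalizedTerm.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return Arrays.stream(trimmed.split("\\s+"))
                .map(stemmer)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // prints "growth factor receptor"
        System.out.println(stemTokens("growth  factors receptors", TOY_STEMMER));
    }
}
```

        Passing the stemmer in as a function keeps the whitespace handling separate from the stemming itself, so a real stemmer could be substituted without changing the splitting and joining logic.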
      • normalizeFile

        public void normalizeFile(java.io.File inputFile,
                                  java.io.File outputFile)
        Normalizes all synonyms in a file (BioThesaurus) where the first column is the synonym and the second column is the ID. All other columns are ignored. Columns must be tab-separated.
        Parameters:
        inputFile - the input file (BioThesaurus)
        outputFile - output file for normalized synonyms
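        The input layout described for normalizeFile (tab-separated, synonym in the first column, ID in the second, remaining columns ignored) can be sketched as follows. The sketch rests on several assumptions that this page does not confirm: the output is written as "normalized synonym, tab, ID", lines with fewer than two columns are skipped, and files are UTF-8. The class NormalizeFileSketch and the TOY_NORMALIZER function (trim and lower-case) are hypothetical placeholders for the real normalization.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class NormalizeFileSketch {

    // Hypothetical placeholder for the real normalization (assumption).
    static final UnaryOperator<String> TOY_NORMALIZER =
            s -> s.trim().toLowerCase(Locale.ROOT);

    // Normalizes column 1 (synonym), keeps column 2 (ID), ignores the rest.
    static List<String> normalizeLines(List<String> lines, UnaryOperator<String> normalizer) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] columns = line.split("\t", -1);
            if (columns.length < 2) {
                continue; // assumption: lines without an ID column are skipped
            }
            out.add(normalizer.apply(columns[0]) + "\t" + columns[1]);
        }
        return out;
    }

    // Reads the input file, normalizes each line, and writes the result.
    static void normalizeFile(Path input, Path output, UnaryOperator<String> normalizer)
            throws IOException {
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        Files.write(output, normalizeLines(lines, normalizer), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("synonyms", ".tsv");
        Path out = Files.createTempFile("normalized", ".tsv");
        Files.write(in,
                List.of("IL-2 Receptor\tP01589\textra column", "line without tab"),
                StandardCharsets.UTF_8);
        normalizeFile(in, out, TOY_NORMALIZER);
        Files.readAllLines(out, StandardCharsets.UTF_8).forEach(System.out::println);
    }
}
```

        Keeping the per-line logic in normalizeLines, separate from the file I/O, makes the column handling easy to check without touching the file system.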
      • splitAwayRomanNumbers

        public java.util.List<java.lang.String> splitAwayRomanNumbers(java.util.List<java.lang.String> term)
      • removeNonDescriptives

        public java.lang.String removeNonDescriptives(java.lang.String term)
      • isNonDescriptive

        public boolean isNonDescriptive(java.lang.String term)