Class UnspecificNameFilter


  • public class UnspecificNameFilter
    extends Object
    Also known as the CrazyRegExFilter. Removes gene names that are too unspecific (e.g. "DNA binding protein") and names that most likely refer to a gene family, process, cell, disease, or experimental technique (GST, LPS).
    Author:
    Joerg
    • Field Detail

      • leftWordBoundary

        public static String leftWordBoundary
        Characters that appear on the left side of a proper word: white space, opening brackets, etc.
      • rightWordBoundary

        public static String rightWordBoundary
        Characters that appear on the right side of a proper word: white space, commas, closing brackets, etc.
      • blacklist_pattern

        public static String blacklist_pattern
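
        The boundary fields above describe character sets rather than the regex \b anchor. A minimal sketch of how such sets could delimit a term inside a sentence follows. The class name WordBoundarySketch, the constants LEFT and RIGHT, and their exact contents are assumptions for illustration, not the values stored in leftWordBoundary and rightWordBoundary.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch; the character sets are assumed, not taken from the library. */
final class WordBoundarySketch {
    // Assumed left boundary: start of input, white space, or an opening bracket.
    static final String LEFT = "(?:^|[\\s(\\[{])";
    // Assumed right boundary: end of input, white space, punctuation, or a closing bracket.
    static final String RIGHT = "(?:$|[\\s,.;:)\\]}])";

    /** Returns true if term occurs in text delimited by the assumed boundaries. */
    static boolean containsWord(String text, String term) {
        // Pattern.quote keeps regex metacharacters in the term literal.
        Pattern p = Pattern.compile(LEFT + Pattern.quote(term) + RIGHT);
        return p.matcher(text).find();
    }
}
```

        With these assumed sets, "GST" is found in "binding of (GST) to beads" but not inside "GSTP1".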
    • Constructor Detail

      • UnspecificNameFilter

        public UnspecificNameFilter()
    • Method Detail

      • isUnspecific

        public static boolean isUnspecific(String term)
        Parameters:
        term -
        Returns:
      • isUnspecific

        public static boolean isUnspecific(String term,
                                           HashSet<Integer> speciesIDs)
        Decides whether a proposed gene name is too unspecific: mainly checks for compound names that do not have any reference concerning the exact identity of a protein. Also removes units, diseases/syndromes, and single-letter names.
        Examples: protease, DNA binding protein, nerve growth factor, human polymerase, cell-surface glycoprotein.
        Parameters:
        term -
        Returns:
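
        One way to picture the compound-name check is to treat a name as unspecific when every token is a generic keyword. The sketch below illustrates only that idea. The class name CompoundNameSketch and the keyword list GENERIC are assumptions built from the examples above; the actual method relies on regular expressions that are not documented here, and this sketch ignores the speciesIDs argument.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; the keyword list is assumed, not the library's. */
final class CompoundNameSketch {
    // Assumed generic keywords, taken from the documented examples only.
    static final Set<String> GENERIC = Set.of(
            "protease", "dna", "binding", "protein", "nerve", "growth",
            "factor", "human", "polymerase", "cell", "surface", "glycoprotein");

    /** Returns true for single-letter names and for names made up only of generic keywords. */
    static boolean isUnspecific(String term) {
        String trimmed = term.trim();
        if (trimmed.length() <= 1) {
            return true; // single-letter (or empty) names carry no identity
        }
        // Split on white space and hyphens, so "cell-surface" yields two tokens.
        for (String token : trimmed.toLowerCase(Locale.ROOT).split("[\\s-]+")) {
            if (!GENERIC.contains(token)) {
                return false; // at least one token may identify a specific protein
            }
        }
        return true;
    }
}
```

        Under these assumptions, "DNA binding protein" is rejected, while "p53 binding protein" is kept because "p53" is not a generic keyword.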
      • isUnspecificSingleWord

        public static boolean isUnspecificSingleWord(String name)
        Checks a gene name against a case-sensitive list of single words that are almost always false positives: "aim", "fat", "up".
        Parameters:
        name -
        Returns:
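
        A case-sensitive list lookup can be sketched as an exact set membership test. The class name SingleWordSketch is hypothetical, the list contains only the three documented examples, and the reading that an upper-case spelling such as "FAT" passes the check is an assumption about what "case sensitive" means here.

```java
import java.util.Set;

/** Hypothetical sketch of a case-sensitive single-word blacklist. */
final class SingleWordSketch {
    // Only the three words named in the documentation; the real list is presumably longer.
    static final Set<String> BLACKLIST = Set.of("aim", "fat", "up");

    /** Exact, case-sensitive match: "fat" is listed, "FAT" is not. */
    static boolean isUnspecificSingleWord(String name) {
        return BLACKLIST.contains(name);
    }
}
```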
      • isUnspecificSingleWordCaseInsensitive

        public static boolean isUnspecificSingleWordCaseInsensitive(String name)
        Parameters:
        name -
        Returns:
      • isUnspecificAbbreviation

        public static boolean isUnspecificAbbreviation(String name,
                                                       String sentence)
        Parameters:
        name -
        sentence -
        Returns:
      • isUnspecificAbbreviation

        public static boolean isUnspecificAbbreviation(String name,
                                                       String sentence,
                                                       HashSet<Integer> speciesIDs)
        Returns:
      • isAminoAcid

        public static boolean isAminoAcid(String name)
        Checks if the given term matches an amino acid, using three-letter code (Ala) and full names (Alanine, alanines). Does not check against one-letter codes (A).
        Parameters:
        name -
        Returns:
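
        The behaviour described for isAminoAcid (three-letter codes and full names with an optional plural, but no one-letter codes) can be sketched as follows. The class name AminoAcidSketch is hypothetical. Matching three-letter codes case-sensitively and using "aspartate" and "glutamate" as the full names of the acidic residues are assumptions of this sketch.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of an amino acid name check; not the library's implementation. */
final class AminoAcidSketch {
    // Three-letter codes of the 20 standard amino acids (matched case-sensitively here).
    static final Set<String> THREE_LETTER = Set.of(
            "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
            "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val");

    // Full names in lower case (matched case-insensitively here).
    static final Set<String> FULL = Set.of(
            "alanine", "arginine", "asparagine", "aspartate", "cysteine",
            "glutamine", "glutamate", "glycine", "histidine", "isoleucine",
            "leucine", "lysine", "methionine", "phenylalanine", "proline",
            "serine", "threonine", "tryptophan", "tyrosine", "valine");

    static boolean isAminoAcid(String name) {
        if (THREE_LETTER.contains(name)) {
            return true;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        if (FULL.contains(lower)) {
            return true;
        }
        // Accept a simple plural such as "alanines".
        return lower.endsWith("s") && FULL.contains(lower.substring(0, lower.length() - 1));
    }
}
```

        One-letter codes such as "A" fall through all three tests and are therefore not reported as amino acids.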
      • isDiseaseName

        public static boolean isDiseaseName(String name)
        Returns true if the name represents a disease name.
        Parameters:
        name -
        Returns:
      • keepDiseaseName

        public static boolean keepDiseaseName(String name,
                                              String sentence)
        Returns true if this disease name is mentioned together with a locus or genes/proteins in this sentence. In these cases, we prefer to keep the name of the gene.
        Parameters:
        name -
        sentence -
        Returns:
      • isTissueCellCompartment

        public static boolean isTissueCellCompartment(String name)
        Parameters:
        name -
        Returns:
      • isCellLine

        public static boolean isCellLine(String name,
                                         String sentence)
        Returns true if the name is followed by a keyword indicating that this name is a cell line.
        Parameters:
        name -
        sentence -
        Returns:
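
        A check of the kind described for isCellLine can be sketched with a single regular expression that looks for a keyword directly after the name. The class name CellLineSketch is hypothetical, and the keywords ("cell", "cells", "cell line", "cell lines") are assumptions; the keywords the library actually uses are not documented here.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: a name counts as a cell line if a cell keyword follows it. */
final class CellLineSketch {
    static boolean isCellLine(String name, String sentence) {
        // The name is matched literally; the keyword part is case-insensitive.
        Pattern p = Pattern.compile(
                Pattern.quote(name) + "\\s+(?i:cell(?:s|\\s+lines?)?)\\b");
        return p.matcher(sentence).find();
    }
}
```

        For example, "HeLa" followed by "cells" matches, while a name that is merely mentioned in the same sentence as a cell line does not.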
      • isSpecies

        public static boolean isSpecies(String name)
        Naive implementation: checks if the name refers to a species.
        Parameters:
        name -
        Returns:
        true if the name refers to a species
      • isChromosome

        public static boolean isChromosome(String name,
                                           String sentence)
        Checks if the given name found in a sentence refers to a (human!) chromosome:
        it should have at least a chromosome number and arm (13p), maybe also bands (13p.23), and the name should also appear next to the word 'chromosome'
        Parameters:
        name -
        sentence -
        Returns:
        true if the given name refers to a chromosome
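
        The two conditions described above (a chromosome-like shape, and the word 'chromosome' next to the name) can be sketched as two regular expressions. The class name ChromosomeSketch and both patterns are assumptions for illustration. In particular, the sketch accepts bands with or without a dot after the arm and treats "next to" as separated by white space only.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of a human chromosome location check. */
final class ChromosomeSketch {
    // Chromosome 1-22, X or Y, then arm p/q, then an optional band such as 14, .23, or 14.3.
    static final Pattern SHAPE =
            Pattern.compile("(?:[1-9]|1[0-9]|2[0-2]|X|Y)[pq](?:\\.?\\d+(?:\\.\\d+)?)?");

    static boolean isChromosome(String name, String sentence) {
        if (!SHAPE.matcher(name).matches()) {
            return false; // the name does not look like a chromosome location
        }
        String q = Pattern.quote(name);
        // The word "chromosome(s)" must stand directly before or after the name.
        Pattern context = Pattern.compile(
                "(?i:chromosomes?)\\s+" + q + "\\b|\\b" + q + "\\s+(?i:chromosomes?)");
        return context.matcher(sentence).find();
    }
}
```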
      • isNegativePair

        public static boolean isNegativePair(String name,
                                             String sentence)
        Some gene names refer to the actual gene only in very few cases. Examples are
        - GST (glutathione-S-transferase, an experimental technique for pulldowns),
        - LPS (lipopolysaccharide in most cases, not IRF6),
        - polymerase (... chain reaction), ...
        which need to be filtered out.
        Parameters:
        name -
        sentence -
        Returns:
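
        A name/context pairing of this kind can be sketched as a map from an ambiguous name to a pattern describing the context in which it is not a gene mention. The class name NegativePairSketch and all three context patterns are assumptions chosen to fit the examples above; they are not the pairs the library ships with.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: names paired with contexts that indicate a non-gene sense. */
final class NegativePairSketch {
    static final Map<String, Pattern> NEGATIVE_CONTEXT = Map.of(
            // GST used as a tag or in pulldown experiments (assumed keywords).
            "GST", Pattern.compile("\\bGST[- ](?:pull-?down|fusion|tagged)"),
            // LPS as a stimulus, i.e. lipopolysaccharide (assumed keywords).
            "LPS", Pattern.compile("\\bLPS[- ](?:induced|stimulated|treated)"),
            // "polymerase" as part of "polymerase chain reaction".
            "polymerase", Pattern.compile("\\bpolymerase\\s+chain\\s+reaction",
                    Pattern.CASE_INSENSITIVE));

    /** Returns true if the name appears in a context that marks it as a non-gene mention. */
    static boolean isNegativePair(String name, String sentence) {
        Pattern context = NEGATIVE_CONTEXT.get(name);
        return context != null && context.matcher(sentence).find();
    }
}
```

        Names without an entry in the map are never filtered by this sketch, so a lookup for an unambiguous name simply returns false.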
      • textHasPlural

        public static boolean textHasPlural(String name,
                                            String text)
        Parameters:
        name -
        text -
        Returns:
      • filter

        public void filter(de.julielab.geneexpbase.genemodel.GeneDocument document)
        Removes all recognized gene names that are made up entirely of unspecific keywords, or that point to something other than a gene, e.g. a tissue, species, or amino acid.