Package de.julielab.genemapper.filtering
Class UnspecificNameFilter
- java.lang.Object
  - de.julielab.genemapper.filtering.UnspecificNameFilter
public class UnspecificNameFilter extends Object
Also known as the CrazyRegExFilter. Removes gene names that are too unspecific ("DNA binding protein") and those that most likely refer to a gene family, process, cell, disease, experimental technique (GST, LPS), etc.
- Author:
- Joerg
-
Field Summary
Fields
Modifier and Type | Field | Description
static String | blacklist_pattern |
static String | leftWordBoundary | characters that appear on the left side of a proper word: white space, bracket starts, ..
static String | rightWordBoundary | characters that appear on the right side of a proper word: white space, comma, bracket ends, ..
-
Constructor Summary
Constructors
Constructor | Description
UnspecificNameFilter() |
-
Method Summary
Modifier and Type | Method | Description
void | filter(de.julielab.geneexpbase.genemodel.GeneDocument document) | Removes all recognized gene names that are made up entirely of unspecific keywords or where names point to something else than a gene name, e.g. a tissue, species, amino acid.
static boolean | isAminoAcid(String name) | Checks if the given term matches an amino acid, using three-letter code (Ala) and full names (Alanine, alanines).
static boolean | isCellLine(String name, String sentence) | Returns true if the name is followed by a keyword indicating that this name is a cell line.
static boolean | isChromosome(String name, String sentence) | Checks if the given name found in a sentence refers to a (human!) chromosome: it should have at least a chromosome number and arm (13p), maybe also bands (13p.23), and the name should also appear next to the word 'chromosome'.
static boolean | isDiseaseName(String name) | Returns true if the name represents a disease name.
static boolean | isNegativePair(String name, String sentence) | Some gene names refer to the actual gene only in very few cases.
static boolean | isSpecies(String name) | Naive implementation: checks if the name refers to a species.
static boolean | isTissueCellCompartment(String name) |
static boolean | isUnspecific(String term) |
static boolean | isUnspecific(String term, HashSet<Integer> speciesIDs) | Decides whether a proposed gene name is too unspecific: mainly checks for compound names that do not have any reference concerning the exact identity of a protein.
static boolean | isUnspecificAbbreviation(String name, String sentence) |
static boolean | isUnspecificAbbreviation(String name, String sentence, HashSet<Integer> speciesIDs) |
static boolean | isUnspecificSingleWord(String name) | Checks a gene name against a list of single word, case sensitive words that are pretty much always false positives: "aim", "fat", "up".
static boolean | isUnspecificSingleWordCaseInsensitive(String name) |
static boolean | keepDiseaseName(String name, String sentence) | Returns true if this disease name is mentioned together with a locus or genes/proteins in this sentence.
static boolean | textHasPlural(String name, String text) |
-
Field Detail
-
leftWordBoundary
public static String leftWordBoundary
characters that appear on the left side of a proper word: white space, bracket starts, ..
-
rightWordBoundary
public static String rightWordBoundary
characters that appear on the right side of a proper word: white space, comma, bracket ends, ..
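The actual values of leftWordBoundary and rightWordBoundary are not shown on this page. As a minimal sketch of how such boundary strings might be used, the example below wraps a quoted term between two assumed boundary groups. The class name `WordBoundarySketch`, the method `containsWord`, and both character classes are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

class WordBoundarySketch {
    // Assumed boundary definitions (hypothetical, not the library's actual values).
    static final String LEFT = "(?:^|[\\s(\\[{])";          // start of text, white space, opening brackets
    static final String RIGHT = "(?:$|[\\s,.;:)\\]}])";     // end of text, white space, punctuation, closing brackets

    // True if term occurs in text as a whole word delimited by the assumed boundaries.
    static boolean containsWord(String term, String text) {
        Pattern p = Pattern.compile(LEFT + Pattern.quote(term) + RIGHT);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("GST", "a (GST) pulldown"));
        System.out.println(containsWord("GST", "GSTP1 expression"));
    }
}
```

Quoting the term with `Pattern.quote` keeps characters such as parentheses or dots in a gene name from being read as regex syntax.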
-
blacklist_pattern
public static String blacklist_pattern
-
-
Method Detail
-
isUnspecific
public static boolean isUnspecific(String term)
- Parameters:
term -
- Returns:
-
isUnspecific
public static boolean isUnspecific(String term, HashSet<Integer> speciesIDs)
Decides whether a proposed gene name is too unspecific: mainly checks for compound names that do not have any reference concerning the exact identity of a protein. Also removes units, diseases/syndromes, and single-letter names.
Examples: protease, DNA binding protein, nerve growth factor, human polymerase, cell-surface glycoprotein.
- Parameters:
term -
- Returns:
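One way to picture the compound-name check described above is a token test: a name counts as unspecific when every token is a generic keyword. The sketch below is an assumption about the general idea, not the library's implementation. The class `UnspecificCompoundSketch`, the method `looksUnspecific`, and the keyword list (taken from the examples above) are hypothetical, and the species-dependent behavior of the two-argument method is not modeled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class UnspecificCompoundSketch {
    // Hypothetical keyword list built only from the examples in the description.
    static final Set<String> GENERIC = new HashSet<>(Arrays.asList(
            "protease", "dna", "binding", "protein", "nerve", "growth", "factor",
            "human", "polymerase", "cell", "surface", "glycoprotein"));

    // True if the term has at most one character or consists only of generic keywords.
    static boolean looksUnspecific(String term) {
        String trimmed = term.trim();
        if (trimmed.length() <= 1) {
            return true;
        }
        // Split on white space and hyphens so "cell-surface" yields two tokens.
        for (String token : trimmed.toLowerCase(Locale.ROOT).split("[\\s-]+")) {
            if (!GENERIC.contains(token)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksUnspecific("DNA binding protein"));
        System.out.println(looksUnspecific("p53 binding protein"));
    }
}
```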
-
isUnspecificSingleWord
public static boolean isUnspecificSingleWord(String name)
Checks a gene name against a list of single word, case sensitive words that are pretty much always false positives: "aim", "fat", "up".
- Parameters:
name -
- Returns:
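A case-sensitive single-word check of this kind can be sketched as an exact set lookup. The example below assumes a list holding only the three words quoted above; the real list is not shown on this page, and the class `SingleWordSketch` and method `matchesCaseSensitive` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SingleWordSketch {
    // Only the three examples from the description; the actual list is an unknown here.
    static final Set<String> FALSE_POSITIVES = new HashSet<>(Arrays.asList("aim", "fat", "up"));

    // Exact, case-sensitive lookup: "aim" is rejected, "AIM" is not.
    static boolean matchesCaseSensitive(String name) {
        return FALSE_POSITIVES.contains(name);
    }

    public static void main(String[] args) {
        System.out.println(matchesCaseSensitive("aim"));
        System.out.println(matchesCaseSensitive("AIM"));
    }
}
```

In this sketch an upper-case mention is left alone, which is the practical difference from a case-insensitive variant.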
-
isUnspecificSingleWordCaseInsensitive
public static boolean isUnspecificSingleWordCaseInsensitive(String name)
- Parameters:
name -
- Returns:
-
isUnspecificAbbreviation
public static boolean isUnspecificAbbreviation(String name, String sentence)
- Parameters:
name -
sentence -
- Returns:
-
isUnspecificAbbreviation
public static boolean isUnspecificAbbreviation(String name, String sentence, HashSet<Integer> speciesIDs)
- Returns:
-
isAminoAcid
public static boolean isAminoAcid(String name)
Checks if the given term matches an amino acid, using three-letter code (Ala) and full names (Alanine, alanines). Does not check against one-letter codes (A).
- Parameters:
name -
- Returns:
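The matching rules described above can be approximated with two regular expressions. This is a sketch under assumptions: three-letter codes are matched only in their capitalized form, full names are matched case-insensitively with an optional plural "s", and "aspartate" and "glutamate" stand in for the acidic residues. The class `AminoAcidSketch` and method `looksLikeAminoAcid` are hypothetical.

```java
import java.util.regex.Pattern;

class AminoAcidSketch {
    // Three-letter codes of the 20 standard amino acids (capitalized form only, by assumption).
    static final Pattern CODE = Pattern.compile(
            "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val");

    // Full names with an optional plural "s", matched case-insensitively.
    static final Pattern FULL = Pattern.compile(
            "(?:alanine|arginine|asparagine|aspartate|cysteine|glutamine|glutamate|glycine"
            + "|histidine|isoleucine|leucine|lysine|methionine|phenylalanine|proline|serine"
            + "|threonine|tryptophan|tyrosine|valine)s?",
            Pattern.CASE_INSENSITIVE);

    // matches() requires the whole name to match, so one-letter codes such as "A" fail.
    static boolean looksLikeAminoAcid(String name) {
        return CODE.matcher(name).matches() || FULL.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeAminoAcid("Ala"));
        System.out.println(looksLikeAminoAcid("alanines"));
        System.out.println(looksLikeAminoAcid("A"));
    }
}
```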
-
isDiseaseName
public static boolean isDiseaseName(String name)
Returns true if the name represents a disease name.
- Parameters:
name -
- Returns:
-
keepDiseaseName
public static boolean keepDiseaseName(String name, String sentence)
Returns true if this disease name is mentioned together with a locus or genes/proteins in this sentence. In these cases, we prefer to keep the name of the gene.
- Parameters:
name -
sentence -
- Returns:
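As a rough illustration of the rule above, the following sketch keeps a disease name when the sentence also contains a context word such as "locus", "gene", or "protein". The keyword pattern, the class `KeepDiseaseSketch`, and the method `keep` are assumptions; the library may use different cues or require them to be adjacent to the name.

```java
import java.util.regex.Pattern;

class KeepDiseaseSketch {
    // Assumed context keywords: locus/loci, gene(s), protein(s).
    static final Pattern CONTEXT = Pattern.compile(
            "\\b(?:locus|loci|genes?|proteins?)\\b", Pattern.CASE_INSENSITIVE);

    // True if the sentence mentions the name and at least one context keyword.
    static boolean keep(String name, String sentence) {
        return sentence.contains(name) && CONTEXT.matcher(sentence).find();
    }

    public static void main(String[] args) {
        System.out.println(keep("Huntington disease",
                "The Huntington disease gene was mapped to 4p16."));
        System.out.println(keep("Huntington disease",
                "Patients with Huntington disease show chorea."));
    }
}
```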
-
isTissueCellCompartment
public static boolean isTissueCellCompartment(String name)
- Parameters:
name -
- Returns:
-
isCellLine
public static boolean isCellLine(String name, String sentence)
Returns true if the name is followed by a keyword indicating that this name is a cell line.
- Parameters:
name -
sentence -
- Returns:
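A "name followed by keyword" test can be sketched with a single regular expression built from the quoted name. The keywords used here ("cell", "cells", "clone", "line") are assumptions, as are the class `CellLineSketch` and the method `followedByCellKeyword`; the keywords the library actually checks are not listed on this page.

```java
import java.util.regex.Pattern;

class CellLineSketch {
    // True if the name is directly followed by an assumed cell-line keyword.
    static boolean followedByCellKeyword(String name, String sentence) {
        Pattern p = Pattern.compile(
                Pattern.quote(name) + "[\\s-]+(?:cells?|clones?|lines?)\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(sentence).find();
    }

    public static void main(String[] args) {
        System.out.println(followedByCellKeyword("HeLa", "HeLa cells were transfected."));
        System.out.println(followedByCellKeyword("HeLa", "HeLa was discussed."));
    }
}
```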
-
isSpecies
public static boolean isSpecies(String name)
Naive implementation: checks if the name refers to a species.
- Parameters:
name -
- Returns:
- true if the name refers to a species
-
isChromosome
public static boolean isChromosome(String name, String sentence)
Checks if the given name found in a sentence refers to a (human!) chromosome:
it should have at least a chromosome number and arm (13p), maybe also bands (13p.23), and the name should also appear next to the word 'chromosome'.
- Parameters:
name -
sentence -
- Returns:
- true if the given name refers to a chromosome
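The two conditions described above (a locus-shaped name, plus the word 'chromosome' next to it) can be sketched as follows. The exact pattern is an assumption: it accepts human chromosome numbers 1 to 22, X, and Y, an arm (p or q), and optional band digits, so that both 13p and 13p.23 match. The class `ChromosomeSketch` and method `looksLikeChromosome` are hypothetical.

```java
import java.util.regex.Pattern;

class ChromosomeSketch {
    // Chromosome number (1-22, X, Y), arm (p/q), optional band digits such as "11", ".23" or "11.2".
    static final Pattern LOCUS = Pattern.compile(
            "(?:[1-9]|1[0-9]|2[0-2]|X|Y)[pq](?:\\d+)?(?:\\.\\d+)?");

    static boolean looksLikeChromosome(String name, String sentence) {
        if (!LOCUS.matcher(name).matches()) {
            return false;
        }
        String quoted = Pattern.quote(name);
        // The word "chromosome" must appear directly before or after the name.
        Pattern context = Pattern.compile(
                "chromosomes?\\s+" + quoted + "|" + quoted + "\\s+chromosomes?",
                Pattern.CASE_INSENSITIVE);
        return context.matcher(sentence).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeChromosome("13p", "mapped to chromosome 13p in humans"));
        System.out.println(looksLikeChromosome("13p", "the 13p region"));
    }
}
```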
-
isNegativePair
public static boolean isNegativePair(String name, String sentence)
Some gene names refer to the actual gene only in very few cases. Examples are
- GST (glutathione-S-transferase, an experimental technique for pulldowns),
- LPS (lipopolysaccharide in most cases, not IRF6),
- polymerase (... chain reaction), ...
which need to be filtered out.
- Parameters:
name -
sentence -
- Returns:
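One plausible way to model such name/context pairs is a map from a name to a pattern describing the context in which the name should be rejected. The sketch below is an illustration only: the context patterns for GST and LPS are invented placeholders, "polymerase chain reaction" follows the example above, and the class `NegativePairSketch` and method `isNegative` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class NegativePairSketch {
    // Name -> context in which the name most likely does not denote the gene (assumed patterns).
    static final Map<String, Pattern> NEGATIVE_CONTEXT = new LinkedHashMap<>();
    static {
        NEGATIVE_CONTEXT.put("GST",
                Pattern.compile("pull-?down|fusion", Pattern.CASE_INSENSITIVE));
        NEGATIVE_CONTEXT.put("LPS",
                Pattern.compile("lipopolysaccharide|stimulat", Pattern.CASE_INSENSITIVE));
        NEGATIVE_CONTEXT.put("polymerase",
                Pattern.compile("polymerase\\s+chain\\s+reaction", Pattern.CASE_INSENSITIVE));
    }

    // True if the name has a registered negative context and that context occurs in the sentence.
    static boolean isNegative(String name, String sentence) {
        Pattern context = NEGATIVE_CONTEXT.get(name);
        return context != null && context.matcher(sentence).find();
    }

    public static void main(String[] args) {
        System.out.println(isNegative("GST", "GST pull-down assays confirmed binding."));
        System.out.println(isNegative("LPS", "LPS mutations were found."));
    }
}
```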
-
textHasPlural
public static boolean textHasPlural(String name, String text)
- Parameters:
name -
- Returns:
-
filter
public void filter(de.julielab.geneexpbase.genemodel.GeneDocument document)
Removes all recognized gene names that are made up entirely of unspecific keywords, or that refer to something other than a gene, e.g. a tissue, species, or amino acid.
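The GeneDocument API is not documented on this page, so the following sketch uses a plain list of names to illustrate the in-place removal pattern. The class `FilterSketch`, the method `rejected`, and its three placeholder rules are hypothetical stand-ins for the tissue, species, and amino acid checks; they are not the library's logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FilterSketch {
    // Placeholder rejection rules (hypothetical); the real checks are far more elaborate.
    static boolean rejected(String name) {
        return name.equalsIgnoreCase("liver")   // stands in for a tissue check
                || name.equalsIgnoreCase("mouse") // stands in for a species check
                || name.equals("Ala");            // stands in for an amino acid check
    }

    // Removes, in place, every name that a rule rejects.
    static void filter(List<String> names) {
        names.removeIf(FilterSketch::rejected);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("BRCA1", "liver", "mouse", "Ala"));
        filter(names);
        System.out.println(names);
    }
}
```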