Class GeneDocument
- java.lang.Object
-
- de.julielab.jules.ae.genemapping.genemodel.GeneDocument
-
public class GeneDocument extends java.lang.Object
-
-
Field Summary
static java.util.regex.Pattern ecNumberRegExp
static java.util.regex.Pattern lociRegExp
-
Constructor Summary
GeneDocument()
GeneDocument(GeneDocument template) - Copies the template document.
GeneDocument(java.lang.String id)
-
Method Summary
void addGene(GeneMention gene)
void agglomerateByAcronyms() - Merges those gene sets that are connected via acronym resolution or, for gene mentions that are not covered by any acronym, merges by name.
void agglomerateByNames()
void allowGeneMentionsByRegularExpression(GeneMention.GeneTagger tagger, java.util.regex.Pattern... regExes) - Adds gene mentions to the selected set of gene mentions based on a tagger (optional) and regular expressions matched on the mention string.
AhoCorasickOptimized buildGeneNameTrie() - Builds an instance of AhoCorasickOptimized from the currently selected genes.
boolean equals(java.lang.Object o)
AcronymLongform getAcronymLongformAndOffsets(Acronym acronym)
de.julielab.java.utilities.spanutils.OffsetMap<AcronymLongform> getAcronymLongforms()
de.julielab.java.utilities.spanutils.OffsetMap<Acronym> getAcronyms()
java.util.List<GeneMention> getAllGenes() - Returns the raw gene mentions in this document, without any filtering, unification, aggregation or whatsoever and possibly from multiple taggers.
de.julielab.java.utilities.spanutils.OffsetMap<java.lang.String> getChunks()
java.lang.String getCoveredText(int begin, int end)
java.lang.String getCoveredText(de.julielab.java.utilities.spanutils.Span span)
java.lang.String getCoveredText(org.apache.commons.lang3.Range<java.lang.Integer> range)
java.lang.String getDefaultSpecies()
java.lang.String getDocumentText()
java.lang.String getDocumentTitle()
de.julielab.java.utilities.spanutils.OffsetMap<java.util.List<GeneMention>> getGeneMap()
java.util.stream.Stream<GeneMention> getGeneMentionsAtOffsets(org.apache.commons.lang3.Range<java.lang.Integer> offsets)
AhoCorasickOptimized getGeneNameDictionary()
java.util.stream.Stream<GeneMention> getGenes() - Returns those genes that have been selected from the original set of all genes.
GeneSets getGeneSets() - On first call, creates a trivial GeneSets object where each gene is in its own set.
java.lang.Iterable<GeneMention> getGenesIterable()
java.util.Iterator<GeneMention> getGenesIterator()
java.util.stream.Stream<GeneMention> getGenesWithText(java.lang.String text)
java.lang.String getId()
java.util.Optional<PosTag> getLastPosTag(org.apache.commons.lang3.Range<java.lang.Integer> range, java.util.Set<java.lang.String> excludedTags)
java.util.Collection<MeshHeading> getMeshHeadings()
java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,SpeciesMention> getNearestNextSpeciesMention(org.apache.commons.lang3.Range<java.lang.Integer> range, java.lang.String taxId)
java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,SpeciesMention> getNearestPreviousSpeciesMention(org.apache.commons.lang3.Range<java.lang.Integer> range, java.lang.String taxId)
org.apache.commons.lang3.Range<java.lang.Integer> getovappingSentence(de.julielab.java.utilities.spanutils.Span span)
org.apache.commons.lang3.Range<java.lang.Integer> getovappingSentence(org.apache.commons.lang3.Range<java.lang.Integer> range)
java.util.Collection<AcronymLongform> getOverlappingAcronymLongforms(org.apache.commons.lang3.Range<java.lang.Integer> range)
java.util.Collection<Acronym> getOverlappingAcronyms(org.apache.commons.lang3.Range<java.lang.Integer> range) - Returns acronyms (not full forms!) overlapping with the given range.
java.util.Set<java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,java.lang.String>> getOverlappingChunks(org.apache.commons.lang3.Range<java.lang.Integer> range) - Returns chunks overlapping with the given range.
java.util.Set<java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,java.lang.String>> getOverlappingChunks(org.apache.commons.lang3.Range<java.lang.Integer> range, java.lang.String chunkType) - Returns chunks of the given type overlapping with the given range.
java.util.stream.Stream<GeneMention> getOverlappingGenes(org.apache.commons.lang3.Range<java.lang.Integer> range) - Returns genes overlapping with the given range.
java.util.stream.Stream<GeneMention> getOverlappingGoldGenes(org.apache.commons.lang3.Range<java.lang.Integer> range)
java.util.Collection<PosTag> getOverlappingPosTags(org.apache.commons.lang3.Range<java.lang.Integer> range)
de.julielab.java.utilities.spanutils.OffsetMap<PosTag> getPosTags()
java.util.NavigableSet<org.apache.commons.lang3.Range<java.lang.Integer>> getSentences()
SpeciesCandidates getSpecies()
TermNormalizer getTermNormalizer()
int hashCode()
void putGoldGene(GeneMention gm)
void removeGene(GeneMention gm)
void removeGenesWithoutCandidates()
void removeSpeciesMention(com.fulmicoton.multiregexp.MultiPatternSearcher searcher) - Removes all prefixes belonging to a species.
void selectAllGenes() - Builds the internal gene offset map with all available genes, overlapping or not.
void selectGene(GeneMention gm) - Adds the given GeneMention to the set of currently selected genes but not to the allGenes set.
void selectGeneMentionsByTagger(GeneMention.GeneTagger... tagger) - Builds the internal gene offset map and only keeps gene mentions found by the given taggers.
void setAcronyms(de.julielab.java.utilities.spanutils.OffsetMap<Acronym> acronyms)
void setAcronyms(Acronym... acronyms)
void setAcronyms(java.util.Collection<Acronym> acronyms)
void setAcronyms(java.util.stream.Stream<Acronym> acronyms)
void setChunks(de.julielab.java.utilities.spanutils.OffsetMap<java.lang.String> chunks)
void setDefaultSpecies(java.lang.String defaultSpecies)
void setDocumentText(java.lang.String documentText)
void setDocumentTitle(java.lang.String documentTitle)
void setGenes(GeneMention... genes)
void setGenes(java.util.Collection<GeneMention> genes)
void setGenes(java.util.stream.Stream<GeneMention> genes)
void setId(java.lang.String id)
void setMeshHeadings(java.util.Collection<MeshHeading> meshHeadings)
void setPosTags(java.util.Collection<PosTag> posTags)
void setPosTags(java.util.stream.Stream<PosTag> posTags)
void setSentences(de.julielab.java.utilities.spanutils.OffsetSet sentences)
void setSpecies(SpeciesCandidates species)
com.google.common.collect.Multimap<java.lang.String,GeneSpeciesOccurrence> setSpeciesHints(GeneMention gm) - This will try to map genes to species using a multi-stage procedure as detailed in "Inter-species normalization of gene mentions with GNAT" by Hakenberg et al.
void setSpeciesMeshHeadings(java.util.Collection<MeshHeading> meshHeadings)
void setTermNormalizer(TermNormalizer termNormalizer)
void unifyAcronymsLongerFirst()
void unifyAllGenesLongerFirst() - Unifies all genes with the longer-span-first strategy.
void unifyAllGenesLongerFirst(GeneMention.GeneTagger... taggers)
void unifyGeneMentionsAtEqualOffsets(GeneMention.GeneTagger... taggerPriorities) - Creates the internal gene map without allowing exact duplicate ranges where begin and end are equal but still allows overlapping.
void unifyGenesPrioritizeTagger(java.util.NavigableSet<GeneMention> sortedGenes, GeneMention.GeneTagger tagger)
-
-
-
Constructor Detail
-
GeneDocument
public GeneDocument()
-
GeneDocument
public GeneDocument(java.lang.String id)
-
GeneDocument
public GeneDocument(GeneDocument template)
Copies the template document. This is mostly a shallow copy, except for the genes, which are deep-copied and put into the respective structures (the "genes" and "geneSets" fields).
- Parameters:
template - The document to copy.
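The difference between the shallow and the deep part of such a copy can be illustrated with a minimal sketch. MentionSketch and DocumentCopySketch are hypothetical stand-ins, not the library's types, and the choice of meshHeadings as the shared field is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a gene mention; not the library's GeneMention.
class MentionSketch {
    String text;

    MentionSketch(String text) {
        this.text = text;
    }

    // Copy constructor used for the deep part of the document copy.
    MentionSketch(MentionSketch other) {
        this.text = other.text;
    }
}

// Hypothetical stand-in for a document with one shared and one deep-copied field.
class DocumentCopySketch {
    final List<String> meshHeadings;   // shared with the template (shallow)
    final List<MentionSketch> genes;   // copied element by element (deep)

    DocumentCopySketch(List<String> meshHeadings, List<MentionSketch> genes) {
        this.meshHeadings = meshHeadings;
        this.genes = genes;
    }

    DocumentCopySketch(DocumentCopySketch template) {
        this.meshHeadings = template.meshHeadings;
        this.genes = new ArrayList<>();
        for (MentionSketch mention : template.genes) {
            this.genes.add(new MentionSketch(mention));
        }
    }
}
```

In this sketch, changing a mention in the copy leaves the template's mention untouched, while both documents still refer to the same meshHeadings list.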
-
-
Method Detail
-
getAcronymLongformAndOffsets
public AcronymLongform getAcronymLongformAndOffsets(Acronym acronym)
-
getAcronyms
public de.julielab.java.utilities.spanutils.OffsetMap<Acronym> getAcronyms()
-
setAcronyms
public void setAcronyms(de.julielab.java.utilities.spanutils.OffsetMap<Acronym> acronyms)
-
setAcronyms
public void setAcronyms(Acronym... acronyms)
-
setAcronyms
public void setAcronyms(java.util.Collection<Acronym> acronyms)
-
setAcronyms
public void setAcronyms(java.util.stream.Stream<Acronym> acronyms)
-
getAcronymLongforms
public de.julielab.java.utilities.spanutils.OffsetMap<AcronymLongform> getAcronymLongforms()
-
getChunks
public de.julielab.java.utilities.spanutils.OffsetMap<java.lang.String> getChunks()
-
setChunks
public void setChunks(de.julielab.java.utilities.spanutils.OffsetMap<java.lang.String> chunks)
-
getDocumentText
public java.lang.String getDocumentText()
-
setDocumentText
public void setDocumentText(java.lang.String documentText)
-
getDocumentTitle
public java.lang.String getDocumentTitle()
-
setDocumentTitle
public void setDocumentTitle(java.lang.String documentTitle)
-
getGeneMap
public de.julielab.java.utilities.spanutils.OffsetMap<java.util.List<GeneMention>> getGeneMap()
-
getGeneMentionsAtOffsets
public java.util.stream.Stream<GeneMention> getGeneMentionsAtOffsets(org.apache.commons.lang3.Range<java.lang.Integer> offsets)
-
getGenes
public java.util.stream.Stream<GeneMention> getGenes()
Returns those genes that have been selected from the original set of all genes. A selection method must therefore be called before this method.
- Returns:
- The currently selected genes.
- See Also:
selectGeneMentionsByTagger(GeneTagger...), unifyGeneMentionsAtEqualOffsets(GeneTagger...)
-
setGenes
public void setGenes(GeneMention... genes)
-
setGenes
public void setGenes(java.util.stream.Stream<GeneMention> genes)
-
setGenes
public void setGenes(java.util.Collection<GeneMention> genes)
-
addGene
public void addGene(GeneMention gene)
-
getGenesIterable
public java.lang.Iterable<GeneMention> getGenesIterable()
-
getGenesIterator
public java.util.Iterator<GeneMention> getGenesIterator()
-
getGeneSets
public GeneSets getGeneSets()
On first call, creates a trivial GeneSets object where each gene is in its own set. From here, one can begin to agglomerate sets, e.g. because of the same name, an acronym connection or other measures. Subsequent calls return the same instance.
- Returns:
- A GeneSets object where each gene has its own set.
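The create-on-first-call behaviour described above can be sketched as lazy initialization. LazyGeneSetsSketch is a hypothetical class that models gene mentions as plain strings and gene sets as a list of sets; it is not the library's GeneSets type.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; gene mentions are plain strings here.
class LazyGeneSetsSketch {
    private final List<String> genes;
    private List<Set<String>> geneSets; // stays null until first requested

    LazyGeneSetsSketch(List<String> genes) {
        this.genes = genes;
    }

    List<Set<String>> getGeneSets() {
        if (geneSets == null) {
            // First call: one singleton set per gene mention.
            geneSets = new ArrayList<>();
            for (String gene : genes) {
                Set<String> singleton = new HashSet<>();
                singleton.add(gene);
                geneSets.add(singleton);
            }
        }
        // Later calls return the same, possibly already agglomerated, instance.
        return geneSets;
    }
}
```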
-
getId
public java.lang.String getId()
-
setId
public void setId(java.lang.String id)
-
getOverlappingAcronyms
public java.util.Collection<Acronym> getOverlappingAcronyms(org.apache.commons.lang3.Range<java.lang.Integer> range)
Returns acronyms (not full forms!) overlapping with the given range.
- Parameters:
range - An offset range.
- Returns:
- Acronyms overlapping the given range.
-
getOverlappingAcronymLongforms
public java.util.Collection<AcronymLongform> getOverlappingAcronymLongforms(org.apache.commons.lang3.Range<java.lang.Integer> range)
-
getovappingSentence
public org.apache.commons.lang3.Range<java.lang.Integer> getovappingSentence(de.julielab.java.utilities.spanutils.Span span)
-
getovappingSentence
public org.apache.commons.lang3.Range<java.lang.Integer> getovappingSentence(org.apache.commons.lang3.Range<java.lang.Integer> range)
-
getOverlappingChunks
public java.util.Set<java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,java.lang.String>> getOverlappingChunks(org.apache.commons.lang3.Range<java.lang.Integer> range)
Returns chunks overlapping with the given range.
- Parameters:
range - An offset range.
- Returns:
- Chunks overlapping the given range.
-
getOverlappingChunks
public java.util.Set<java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,java.lang.String>> getOverlappingChunks(org.apache.commons.lang3.Range<java.lang.Integer> range, java.lang.String chunkType)
Returns chunks of the given type overlapping with the given range.
- Parameters:
range - An offset range.
chunkType - The chunk type (e.g. ChunkNP) to return.
- Returns:
- Chunks with the given type overlapping the given range.
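What an overlap query with an optional type filter amounts to can be sketched with a linear scan. ChunkOverlapSketch and its Chunk class are hypothetical, and the sketch assumes half-open offsets (begin inclusive, end exclusive); whether the library treats range ends as inclusive is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the library's OffsetMap-based implementation.
class ChunkOverlapSketch {
    static final class Chunk {
        final int begin;
        final int end;
        final String type;

        Chunk(int begin, int end, String type) {
            this.begin = begin;
            this.end = end;
            this.type = type;
        }
    }

    // Two half-open ranges overlap if each starts before the other ends.
    static boolean overlaps(int begin1, int end1, int begin2, int end2) {
        return begin1 < end2 && begin2 < end1;
    }

    // A null chunkType means "any type".
    static List<Chunk> overlappingChunks(List<Chunk> chunks, int begin, int end, String chunkType) {
        List<Chunk> result = new ArrayList<>();
        for (Chunk chunk : chunks) {
            boolean typeMatches = chunkType == null || chunkType.equals(chunk.type);
            if (typeMatches && overlaps(chunk.begin, chunk.end, begin, end)) {
                result.add(chunk);
            }
        }
        return result;
    }
}
```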
-
getOverlappingPosTags
public java.util.Collection<PosTag> getOverlappingPosTags(org.apache.commons.lang3.Range<java.lang.Integer> range)
-
getLastPosTag
public java.util.Optional<PosTag> getLastPosTag(org.apache.commons.lang3.Range<java.lang.Integer> range, java.util.Set<java.lang.String> excludedTags)
-
getPosTags
public de.julielab.java.utilities.spanutils.OffsetMap<PosTag> getPosTags()
-
setPosTags
public void setPosTags(java.util.Collection<PosTag> posTags)
-
setPosTags
public void setPosTags(java.util.stream.Stream<PosTag> posTags)
-
getOverlappingGenes
public java.util.stream.Stream<GeneMention> getOverlappingGenes(org.apache.commons.lang3.Range<java.lang.Integer> range)
Returns genes overlapping with the given range.
- Parameters:
range - An offset range.
- Returns:
- Genes overlapping the given range.
-
getOverlappingGoldGenes
public java.util.stream.Stream<GeneMention> getOverlappingGoldGenes(org.apache.commons.lang3.Range<java.lang.Integer> range)
-
getSentences
public java.util.NavigableSet<org.apache.commons.lang3.Range<java.lang.Integer>> getSentences()
-
setSentences
public void setSentences(de.julielab.java.utilities.spanutils.OffsetSet sentences)
-
getSpecies
public SpeciesCandidates getSpecies()
-
setSpecies
public void setSpecies(SpeciesCandidates species)
-
setSpeciesHints
public com.google.common.collect.Multimap<java.lang.String,GeneSpeciesOccurrence> setSpeciesHints(GeneMention gm)
Tries to map genes to species using a multi-stage procedure as detailed in "Inter-species normalization of gene mentions with GNAT" by Hakenberg et al. (2008).
- Parameters:
gm - A mention of a gene.
- Returns:
- A map of all species mentions found at the first stage that yields any. If no species can be inferred, this is an empty map.
- See Also:
GeneSpeciesOccurrence
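The "first stage that yields any" contract can be sketched as a fallback chain. StagedSpeciesLookupSketch is hypothetical: the stages are passed in as opaque suppliers of taxonomy IDs, and the sketch makes no claim about which stages the library or the GNAT paper actually uses.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a multi-stage lookup; the stages themselves are assumptions.
class StagedSpeciesLookupSketch {
    // Runs the stages in order and returns the result of the first one that finds anything.
    static List<String> firstNonEmptyStage(List<Supplier<List<String>>> stages) {
        for (Supplier<List<String>> stage : stages) {
            List<String> found = stage.get();
            if (!found.isEmpty()) {
                return found;
            }
        }
        // No stage produced a species hint.
        return Collections.emptyList();
    }
}
```

Because the stages are suppliers, later and presumably weaker stages are never evaluated once an earlier one has produced a hint.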
-
removeGenesWithoutCandidates
public void removeGenesWithoutCandidates()
-
removeSpeciesMention
public void removeSpeciesMention(com.fulmicoton.multiregexp.MultiPatternSearcher searcher)
Removes all prefixes belonging to a species, e.g. "human FGF-22" will be turned into "FGF-22".
- Parameters:
searcher - A MultiPatternSearcher containing a compiled multi-regex of all species to be considered.
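The effect on a single mention string can be sketched with java.util.regex as a stand-in for the MultiPatternSearcher. SpeciesPrefixSketch and its three-species pattern are hypothetical; the real method operates on the document's gene mentions and a caller-supplied species multi-regex.

```java
import java.util.regex.Pattern;

// Hypothetical sketch using java.util.regex instead of a MultiPatternSearcher.
class SpeciesPrefixSketch {
    // Assumed, deliberately tiny species list for illustration only.
    private static final Pattern SPECIES_PREFIX =
            Pattern.compile("^(?:human|mouse|rat)\\s+", Pattern.CASE_INSENSITIVE);

    // Removes one leading species word plus the whitespace that follows it.
    static String stripSpeciesPrefix(String mention) {
        return SPECIES_PREFIX.matcher(mention).replaceFirst("");
    }
}
```

For example, `SpeciesPrefixSketch.stripSpeciesPrefix("human FGF-22")` yields `"FGF-22"`, while a mention without a species prefix is returned unchanged.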
-
selectAllGenes
public void selectAllGenes()
Builds the internal gene offset map with all available genes, overlapping or not. Mentions with duplicate offsets override items that were already in the offset map.
-
selectGeneMentionsByTagger
public void selectGeneMentionsByTagger(GeneMention.GeneTagger... tagger)
Builds the internal gene offset map and only keeps gene mentions found by the given taggers.
- Parameters:
tagger - The taggers for which gene mentions should be kept.
-
allowGeneMentionsByRegularExpression
public void allowGeneMentionsByRegularExpression(GeneMention.GeneTagger tagger, java.util.regex.Pattern... regExes)
Adds gene mentions to the selected set of gene mentions based on a tagger (optional) and regular expressions matched on the mention string.
- Parameters:
tagger - Optional, may be null.
regExes - A list of regular expressions. Each gene mention matching one of the expressions (and, if given, the tagger) will be added to the selected list of genes.
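The selection rule (optional tagger, at least one matching expression) can be sketched as a filter. RegexSelectionSketch, its Mention class and the use of plain strings for taggers are hypothetical, and the sketch assumes an expression must match the whole mention string; the library may instead search for partial matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; taggers are plain strings rather than GeneMention.GeneTagger.
class RegexSelectionSketch {
    static final class Mention {
        final String text;
        final String tagger;

        Mention(String text, String tagger) {
            this.text = text;
            this.tagger = tagger;
        }
    }

    // Returns the mentions that would be added to the selection.
    static List<Mention> select(List<Mention> all, String tagger, Pattern... regExes) {
        List<Mention> selected = new ArrayList<>();
        for (Mention mention : all) {
            if (tagger != null && !tagger.equals(mention.tagger)) {
                continue; // a tagger was given and this mention comes from another one
            }
            for (Pattern regEx : regExes) {
                if (regEx.matcher(mention.text).matches()) {
                    selected.add(mention);
                    break; // one matching expression is enough
                }
            }
        }
        return selected;
    }
}
```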
-
unifyGeneMentionsAtEqualOffsets
public void unifyGeneMentionsAtEqualOffsets(GeneMention.GeneTagger... taggerPriorities)
Creates the internal gene map without allowing exact duplicate ranges, where begin and end are equal, but still allows overlapping.
- Parameters:
taggerPriorities - The order that decides which gene mention to keep when multiple candidates share the exact same location. A lower position means higher priority. Taggers that are not mentioned have minimum priority, i.e. their mentions are discarded most readily.
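The priority rule for mentions at identical offsets can be sketched as follows. EqualOffsetUnificationSketch is hypothetical and uses strings for taggers; how the library breaks ties between two unlisted taggers is not documented here, so the sketch simply keeps the mention seen first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; taggers are plain strings rather than GeneMention.GeneTagger.
class EqualOffsetUnificationSketch {
    static final class Mention {
        final int begin;
        final int end;
        final String tagger;

        Mention(int begin, int end, String tagger) {
            this.begin = begin;
            this.end = end;
            this.tagger = tagger;
        }
    }

    // Lower rank means higher priority; unlisted taggers rank last.
    static int rank(String tagger, List<String> priorities) {
        int index = priorities.indexOf(tagger);
        return index < 0 ? Integer.MAX_VALUE : index;
    }

    // Keeps one mention per exact (begin, end) pair; merely overlapping mentions all survive.
    static List<Mention> unify(List<Mention> mentions, List<String> priorities) {
        Map<String, Mention> bestAtOffsets = new LinkedHashMap<>();
        for (Mention mention : mentions) {
            String key = mention.begin + "-" + mention.end;
            Mention current = bestAtOffsets.get(key);
            if (current == null || rank(mention.tagger, priorities) < rank(current.tagger, priorities)) {
                bestAtOffsets.put(key, mention);
            }
        }
        return new ArrayList<>(bestAtOffsets.values());
    }
}
```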
-
unifyAcronymsLongerFirst
public void unifyAcronymsLongerFirst()
-
unifyAllGenesLongerFirst
public void unifyAllGenesLongerFirst()
Unifies all genes with the longer-span-first strategy.
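One plausible reading of a longer-span-first strategy is a greedy pass that visits spans from longest to shortest and drops every span overlapping one that was already kept. LongerFirstSketch below is a hypothetical sketch under that assumption, with half-open offsets; it is not taken from the library's source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a greedy longer-span-first unification.
class LongerFirstSketch {
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        int length() {
            return end - begin;
        }
    }

    static List<Span> unifyLongerFirst(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        // Longest spans first; ties are broken by the earlier begin offset.
        sorted.sort((a, b) -> a.length() != b.length()
                ? Integer.compare(b.length(), a.length())
                : Integer.compare(a.begin, b.begin));

        List<Span> kept = new ArrayList<>();
        for (Span candidate : sorted) {
            boolean clashes = false;
            for (Span winner : kept) {
                if (candidate.begin < winner.end && winner.begin < candidate.end) {
                    clashes = true;
                    break;
                }
            }
            if (!clashes) {
                kept.add(candidate);
            }
        }
        // Return the survivors in document order.
        kept.sort((a, b) -> Integer.compare(a.begin, b.begin));
        return kept;
    }
}
```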
-
unifyAllGenesLongerFirst
public void unifyAllGenesLongerFirst(GeneMention.GeneTagger... taggers)
-
unifyGenesPrioritizeTagger
public void unifyGenesPrioritizeTagger(java.util.NavigableSet<GeneMention> sortedGenes, GeneMention.GeneTagger tagger)
-
getAllGenes
public java.util.List<GeneMention> getAllGenes()
Returns the raw gene mentions in this document, without any filtering, unification, aggregation or other processing, and possibly from multiple taggers.
- Returns:
- All gene mentions in this document.
-
putGoldGene
public void putGoldGene(GeneMention gm)
-
getCoveredText
public java.lang.String getCoveredText(de.julielab.java.utilities.spanutils.Span span)
-
getCoveredText
public java.lang.String getCoveredText(org.apache.commons.lang3.Range<java.lang.Integer> range)
-
getCoveredText
public java.lang.String getCoveredText(int begin, int end)
-
selectGene
public void selectGene(GeneMention gm)
Adds the given GeneMention to the set of currently selected genes but not to the allGenes set.
- Parameters:
gm - The gene mention to add.
-
getTermNormalizer
public TermNormalizer getTermNormalizer()
-
setTermNormalizer
public void setTermNormalizer(TermNormalizer termNormalizer)
-
removeGene
public void removeGene(GeneMention gm)
-
getGeneNameDictionary
public AhoCorasickOptimized getGeneNameDictionary()
-
buildGeneNameTrie
public AhoCorasickOptimized buildGeneNameTrie()
Builds an instance of AhoCorasickOptimized from the currently selected genes. The instance is stored internally and can also be retrieved by getGeneNameDictionary().
- Returns:
- A trie dictionary compiled from the names (text occurrence) of all selected genes.
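For readers unfamiliar with this kind of dictionary, the following is a textbook Aho-Corasick automaton built from a list of names. GeneNameAutomatonSketch is a hypothetical class written for illustration; the internals and API of AhoCorasickOptimized are not documented on this page and may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical textbook Aho-Corasick sketch; not the AhoCorasickOptimized class.
class GeneNameAutomatonSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();
        Node fail;
    }

    private final Node root = new Node();

    GeneNameAutomatonSketch(List<String> names) {
        // Phase 1: insert every name into the trie.
        for (String name : names) {
            Node node = root;
            for (char c : name.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(name);
        }
        // Phase 2: breadth-first computation of failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> edge : current.next.entrySet()) {
                Node child = edge.getValue();
                Node fallback = current.fail;
                while (fallback != root && !fallback.next.containsKey(edge.getKey())) {
                    fallback = fallback.fail;
                }
                Node target = fallback.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                // Names ending at the failure target also end here.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Returns every dictionary name found in the text, in order of match end.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            hits.addAll(node.outputs);
        }
        return hits;
    }
}
```

The failure links let the automaton scan the text once, regardless of how many names are in the dictionary.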
-
agglomerateByAcronyms
public void agglomerateByAcronyms()
Merges those gene sets that are connected via acronym resolution or, for gene mentions that are not covered by any acronym, merges by name.
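The merging rule can be sketched on plain strings: a mention that is a known acronym is grouped under its long form, and every other mention is grouped by its own name. AgglomerationSketch and the acronym-to-long-form map are hypothetical simplifications; the library resolves acronyms via offsets in the document rather than via a string lookup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; mentions and acronyms are modeled as plain strings.
class AgglomerationSketch {
    static List<List<String>> agglomerate(List<String> mentions, Map<String, String> acronymToLongform) {
        Map<String, List<String>> sets = new LinkedHashMap<>();
        for (String mention : mentions) {
            // Acronyms are keyed by their long form, everything else by its own name.
            String key = acronymToLongform.getOrDefault(mention, mention);
            sets.computeIfAbsent(key, k -> new ArrayList<>()).add(mention);
        }
        return new ArrayList<>(sets.values());
    }
}
```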
-
agglomerateByNames
public void agglomerateByNames()
-
equals
public boolean equals(java.lang.Object o)
- Overrides:
equals in class java.lang.Object
-
hashCode
public int hashCode()
- Overrides:
hashCode in class java.lang.Object
-
setSpeciesMeshHeadings
public void setSpeciesMeshHeadings(java.util.Collection<MeshHeading> meshHeadings)
-
getMeshHeadings
public java.util.Collection<MeshHeading> getMeshHeadings()
-
setMeshHeadings
public void setMeshHeadings(java.util.Collection<MeshHeading> meshHeadings)
-
getGenesWithText
public java.util.stream.Stream<GeneMention> getGenesWithText(java.lang.String text)
-
getDefaultSpecies
public java.lang.String getDefaultSpecies()
-
setDefaultSpecies
public void setDefaultSpecies(java.lang.String defaultSpecies)
-
getNearestPreviousSpeciesMention
public java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,SpeciesMention> getNearestPreviousSpeciesMention(org.apache.commons.lang3.Range<java.lang.Integer> range, java.lang.String taxId)
-
getNearestNextSpeciesMention
public java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,SpeciesMention> getNearestNextSpeciesMention(org.apache.commons.lang3.Range<java.lang.Integer> range, java.lang.String taxId)
-
-