Class GeneDocument


  • public class GeneDocument
    extends java.lang.Object
    • Field Detail

      • ecNumberRegExp

        public static final java.util.regex.Pattern ecNumberRegExp
      • lociRegExp

        public static final java.util.regex.Pattern lociRegExp
    • Constructor Detail

      • GeneDocument

        public GeneDocument()
      • GeneDocument

        public GeneDocument​(java.lang.String id)
      • GeneDocument

        public GeneDocument​(GeneDocument template)
        Copies the template document. The copy is mostly shallow; only the genes are deep-copied and placed into the corresponding structures (the "genes" and "geneSets" fields).
        Parameters:
        template - The document to copy.
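        The difference between the shallow and the deep part of such a copy can be sketched as follows. This is a minimal illustration, not the actual GeneDocument code: CopySketch, its Gene class and the sentences field are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a document; not the real GeneDocument.
final class CopySketch {
    // Hypothetical mutable gene mention with its own copy constructor.
    static final class Gene {
        String text;

        Gene(String text) {
            this.text = text;
        }

        Gene(Gene other) {
            this.text = other.text;
        }
    }

    final List<String> sentences; // shared between original and copy (shallow)
    final List<Gene> genes;       // duplicated element by element (deep)

    CopySketch(List<String> sentences, List<Gene> genes) {
        this.sentences = sentences;
        this.genes = genes;
    }

    CopySketch(CopySketch template) {
        // Shallow part: the same list instance is reused.
        this.sentences = template.sentences;
        // Deep part: every gene is cloned, so edits to the copy do not leak back.
        this.genes = new ArrayList<>();
        for (Gene g : template.genes) {
            this.genes.add(new Gene(g));
        }
    }

    public static void main(String[] args) {
        List<Gene> genes = new ArrayList<>();
        genes.add(new Gene("IL-2"));
        CopySketch original = new CopySketch(new ArrayList<>(Arrays.asList("A sentence.")), genes);
        CopySketch copy = new CopySketch(original);
        copy.genes.get(0).text = "changed";
        System.out.println(original.genes.get(0).text);
        System.out.println(copy.sentences == original.sentences);
    }
}
```

        In this sketch, changing a gene in the copy leaves the template untouched, while changes to the shared sentence list are visible in both objects.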
    • Method Detail

      • getAcronyms

        public de.julielab.java.utilities.spanutils.OffsetMap<Acronym> getAcronyms()
      • setAcronyms

        public void setAcronyms​(de.julielab.java.utilities.spanutils.OffsetMap<Acronym> acronyms)
      • setAcronyms

        public void setAcronyms​(Acronym... acronyms)
      • setAcronyms

        public void setAcronyms​(java.util.Collection<Acronym> acronyms)
      • setAcronyms

        public void setAcronyms​(java.util.stream.Stream<Acronym> acronyms)
      • getAcronymLongforms

        public de.julielab.java.utilities.spanutils.OffsetMap<AcronymLongform> getAcronymLongforms()
      • getChunks

        public de.julielab.java.utilities.spanutils.OffsetMap<java.lang.String> getChunks()
      • setChunks

        public void setChunks​(de.julielab.java.utilities.spanutils.OffsetMap<java.lang.String> chunks)
      • getDocumentText

        public java.lang.String getDocumentText()
      • setDocumentText

        public void setDocumentText​(java.lang.String documentText)
      • getDocumentTitle

        public java.lang.String getDocumentTitle()
      • setDocumentTitle

        public void setDocumentTitle​(java.lang.String documentTitle)
      • getGeneMap

        public de.julielab.java.utilities.spanutils.OffsetMap<java.util.List<GeneMention>> getGeneMap()
      • getGeneMentionsAtOffsets

        public java.util.stream.Stream<GeneMention> getGeneMentionsAtOffsets​(org.apache.commons.lang3.Range<java.lang.Integer> offsets)
      • setGenes

        public void setGenes​(GeneMention... genes)
      • setGenes

        public void setGenes​(java.util.stream.Stream<GeneMention> genes)
      • setGenes

        public void setGenes​(java.util.Collection<GeneMention> genes)
      • getGenesIterable

        public java.lang.Iterable<GeneMention> getGenesIterable()
      • getGenesIterator

        public java.util.Iterator<GeneMention> getGenesIterator()
      • getGeneSets

        public GeneSets getGeneSets()
        On first call, creates a trivial GeneSets object where each gene is in its own set. From here, one can begin to agglomerate sets, e.g. due to the same name, an acronym connection or other measures. Subsequent calls return the same GeneSets instance.
        Returns:
        A GeneSets object where each gene has its own set.
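        The lazy creation of singleton sets and a subsequent merge by name can be sketched as follows. This is only an illustration under assumptions: GeneSetsSketch is a hypothetical class that tracks mention indices in plain java.util sets, and the case-insensitive name comparison is an assumption, not the behavior of the real GeneSets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in; not the real GeneDocument or GeneSets.
final class GeneSetsSketch {
    private final List<String> mentions;
    private List<Set<Integer>> sets; // created on first access, then reused

    GeneSetsSketch(List<String> mentions) {
        this.mentions = mentions;
    }

    // First call: one singleton set per mention index. Later calls: the same list instance.
    List<Set<Integer>> getSets() {
        if (sets == null) {
            sets = new ArrayList<>();
            for (int i = 0; i < mentions.size(); i++) {
                Set<Integer> single = new LinkedHashSet<>();
                single.add(i);
                sets.add(single);
            }
        }
        return sets;
    }

    // Regroups the mention indices so that mentions with equal text
    // (ignoring case, an assumption of this sketch) end up in one set.
    void agglomerateByName() {
        Map<String, Set<Integer>> byName = new LinkedHashMap<>();
        for (Set<Integer> set : getSets()) {
            for (Integer index : set) {
                String key = mentions.get(index).toLowerCase(Locale.ROOT);
                byName.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(index);
            }
        }
        // Mutate in place so callers holding the list keep seeing the current state.
        List<Set<Integer>> current = getSets();
        current.clear();
        current.addAll(byName.values());
    }

    public static void main(String[] args) {
        GeneSetsSketch doc = new GeneSetsSketch(Arrays.asList("IL-2", "il-2", "FGF-22"));
        System.out.println(doc.getSets());
        doc.agglomerateByName();
        System.out.println(doc.getSets());
    }
}
```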
      • getId

        public java.lang.String getId()
      • setId

        public void setId​(java.lang.String id)
      • getOverlappingAcronyms

        public java.util.Collection<Acronym> getOverlappingAcronyms​(org.apache.commons.lang3.Range<java.lang.Integer> range)
        Returns acronyms (not full forms!) overlapping with the given range.
        Parameters:
        range - An offset range.
        Returns:
        Acronyms overlapping the given range.
      • getOverlappingAcronymLongforms

        public java.util.Collection<AcronymLongform> getOverlappingAcronymLongforms​(org.apache.commons.lang3.Range<java.lang.Integer> range)
      • getovappingSentence

        public org.apache.commons.lang3.Range<java.lang.Integer> getovappingSentence​(de.julielab.java.utilities.spanutils.Span span)
      • getovappingSentence

        public org.apache.commons.lang3.Range<java.lang.Integer> getovappingSentence​(org.apache.commons.lang3.Range<java.lang.Integer> range)
      • getOverlappingChunks

        public java.util.Set<java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,​java.lang.String>> getOverlappingChunks​(org.apache.commons.lang3.Range<java.lang.Integer> range)
        Returns chunks overlapping with the given range.
        Parameters:
        range - An offset range.
        Returns:
        Chunks overlapping the given range.
      • getOverlappingChunks

        public java.util.Set<java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,​java.lang.String>> getOverlappingChunks​(org.apache.commons.lang3.Range<java.lang.Integer> range,
                                                                                                                                                 java.lang.String chunkType)
        Returns chunks of the given type overlapping with the given range.
        Parameters:
        range - An offset range.
        chunkType - The chunk type - e.g. ChunkNP - to return.
        Returns:
        Chunks with the given type overlapping the given range.
      • getOverlappingPosTags

        public java.util.Collection<PosTag> getOverlappingPosTags​(org.apache.commons.lang3.Range<java.lang.Integer> range)
      • getLastPosTag

        public java.util.Optional<PosTag> getLastPosTag​(org.apache.commons.lang3.Range<java.lang.Integer> range,
                                                        java.util.Set<java.lang.String> excludedTags)
      • getPosTags

        public de.julielab.java.utilities.spanutils.OffsetMap<PosTag> getPosTags()
      • setPosTags

        public void setPosTags​(java.util.Collection<PosTag> posTags)
      • setPosTags

        public void setPosTags​(java.util.stream.Stream<PosTag> posTags)
      • getOverlappingGenes

        public java.util.stream.Stream<GeneMention> getOverlappingGenes​(org.apache.commons.lang3.Range<java.lang.Integer> range)
        Returns genes overlapping with the given range.
        Parameters:
        range - An offset range.
        Returns:
        Genes overlapping the given range.
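        What "overlapping" means for the getOverlapping* methods can be sketched with plain integer offsets. This is a simplified illustration: OverlapSketch and its Span class are hypothetical, and the sketch assumes an inclusive begin and an exclusive end, which may differ from how the library interprets a Range<Integer>.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of GeneDocument.
final class OverlapSketch {
    // Hypothetical annotation with inclusive begin and exclusive end offsets.
    static final class Span {
        final int begin;
        final int end;
        final String text;

        Span(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    // Two half-open intervals overlap if each one starts before the other ends.
    static boolean overlaps(int begin1, int end1, int begin2, int end2) {
        return begin1 < end2 && begin2 < end1;
    }

    // Linear scan; a sorted offset map could answer the same query faster.
    static List<Span> overlapping(List<Span> spans, int begin, int end) {
        List<Span> result = new ArrayList<>();
        for (Span span : spans) {
            if (overlaps(span.begin, span.end, begin, end)) {
                result.add(span);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Span> genes = Arrays.asList(
                new Span(0, 5, "BRCA1"),
                new Span(10, 14, "TP53"),
                new Span(12, 20, "p53 gene"));
        for (Span hit : overlapping(genes, 4, 11)) {
            System.out.println(hit.text);
        }
    }
}
```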
      • getOverlappingGoldGenes

        public java.util.stream.Stream<GeneMention> getOverlappingGoldGenes​(org.apache.commons.lang3.Range<java.lang.Integer> range)
      • getSentences

        public java.util.NavigableSet<org.apache.commons.lang3.Range<java.lang.Integer>> getSentences()
      • setSentences

        public void setSentences​(de.julielab.java.utilities.spanutils.OffsetSet sentences)
      • setSpeciesHints

        public com.google.common.collect.Multimap<java.lang.String,​GeneSpeciesOccurrence> setSpeciesHints​(GeneMention gm)
        This will try to map genes to species using a multi-stage procedure as detailed in "Inter-species normalization of gene mentions with GNAT" by Hakenberg et al. (2008).
        Parameters:
        gm - A mention of a gene
        Returns:
        A multimap of the species mentions found at the first stage that yields any. If no species can be inferred, the multimap is empty.
        See Also:
        GeneSpeciesOccurrence
      • removeGenesWithoutCandidates

        public void removeGenesWithoutCandidates()
      • removeSpeciesMention

        public void removeSpeciesMention​(com.fulmicoton.multiregexp.MultiPatternSearcher searcher)
        Removes all prefixes belonging to a species, e.g. "human FGF-22" will be turned into "FGF-22".
        Parameters:
        searcher - A MultiPatternSearcher containing a compiled multi-regex of all species to be considered.
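        The effect on a single mention text can be sketched with java.util.regex. This is only an approximation: the real method works with a MultiPatternSearcher, whereas SpeciesPrefixSketch, its fixed species list and the whitespace requirement after the prefix are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the real method receives a compiled multi-regex instead.
final class SpeciesPrefixSketch {
    // Illustrative species list, anchored at the start of the mention text.
    private static final Pattern SPECIES_PREFIX =
            Pattern.compile("^(?:human|mouse|murine|rat|yeast)\\s+", Pattern.CASE_INSENSITIVE);

    static String stripSpeciesPrefix(String mentionText) {
        Matcher matcher = SPECIES_PREFIX.matcher(mentionText);
        // Only strip if some gene name remains after the prefix.
        if (matcher.find() && matcher.end() < mentionText.length()) {
            return mentionText.substring(matcher.end());
        }
        return mentionText;
    }

    public static void main(String[] args) {
        System.out.println(stripSpeciesPrefix("human FGF-22"));
        System.out.println(stripSpeciesPrefix("FGF-22"));
    }
}
```

        A complete implementation would also have to move the begin offset of the mention forward by the length of the removed prefix; the sketch only handles the text.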
      • selectAllGenes

        public void selectAllGenes()
        Builds the internal gene offset map with all available genes, overlapping or not. A gene with the same offsets as an item already in the offset map overrides that item.
      • selectGeneMentionsByTagger

        public void selectGeneMentionsByTagger​(GeneMention.GeneTagger... tagger)
        Builds the internal gene offset map and only keeps gene mentions found by the given taggers.
        Parameters:
        tagger - The taggers for which gene mentions should be kept.
      • allowGeneMentionsByRegularExpression

        public void allowGeneMentionsByRegularExpression​(GeneMention.GeneTagger tagger,
                                                         java.util.regex.Pattern... regExes)
        Adds gene mentions to the selected set of gene mentions based on a tagger (optional) and regular expressions matched on the mention string.
        Parameters:
        tagger - Optional, may be null
        regExes - A list of regular expressions. Each gene mention matching one of the expressions (and, if given, the tagger) will be added to the selected list of genes.
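        The selection rule described above can be sketched as follows. This is a hedged illustration: RegexAllowSketch, its Mention class, the tagger names and the example pattern are hypothetical, and requiring a match of the whole mention string (rather than a substring) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not the real GeneDocument logic.
final class RegexAllowSketch {
    // Hypothetical mention: the text plus the name of the tagger that found it.
    static final class Mention {
        final String text;
        final String tagger;

        Mention(String text, String tagger) {
            this.text = text;
            this.tagger = tagger;
        }
    }

    // Returns the mentions that match at least one expression and,
    // if a tagger is given (non-null), were found by that tagger.
    static List<Mention> allow(List<Mention> all, String tagger, Pattern... regExes) {
        List<Mention> selected = new ArrayList<>();
        for (Mention mention : all) {
            if (tagger != null && !tagger.equals(mention.tagger)) {
                continue;
            }
            for (Pattern regEx : regExes) {
                // Assumption: the whole mention string has to match.
                if (regEx.matcher(mention.text).matches()) {
                    selected.add(mention);
                    break;
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Illustrative pattern only; not the value of ecNumberRegExp.
        Pattern fourPartNumber = Pattern.compile("\\d+\\.\\d+\\.\\d+\\.\\d+");
        List<Mention> all = Arrays.asList(
                new Mention("1.1.1.1", "TAGGER_A"),
                new Mention("IL-2", "TAGGER_B"),
                new Mention("2.7.11.1", "TAGGER_B"));
        for (Mention mention : allow(all, "TAGGER_B", fourPartNumber)) {
            System.out.println(mention.text);
        }
    }
}
```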
      • unifyGeneMentionsAtEqualOffsets

        public void unifyGeneMentionsAtEqualOffsets​(GeneMention.GeneTagger... taggerPriorities)
        Builds the internal gene map so that no two gene mentions share exactly the same range (equal begin and equal end). Mentions that merely overlap are still allowed.
        Parameters:
        taggerPriorities - The priority order used to decide which gene mention to keep when multiple candidates occupy the exact same location. A lower position means higher priority. Taggers that are not mentioned have minimum priority, i.e. their mentions are discarded most easily.
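        The priority rule can be sketched as follows. This is a minimal illustration under assumptions: EqualOffsetSketch and its Mention class are hypothetical, taggers are represented by plain strings instead of GeneMention.GeneTagger, and on equal priority the mention seen first is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not the real GeneDocument logic.
final class EqualOffsetSketch {
    // Hypothetical mention: offsets plus the name of the tagger that found it.
    static final class Mention {
        final int begin;
        final int end;
        final String tagger;

        Mention(int begin, int end, String tagger) {
            this.begin = begin;
            this.end = end;
            this.tagger = tagger;
        }
    }

    // Keeps one mention per exact (begin, end) pair; overlapping ranges survive.
    static List<Mention> unify(List<Mention> mentions, String... taggerPriorities) {
        List<String> order = Arrays.asList(taggerPriorities);
        Map<String, Mention> best = new LinkedHashMap<>();
        for (Mention mention : mentions) {
            String key = mention.begin + ":" + mention.end;
            Mention current = best.get(key);
            // A lower rank wins; on equal rank the earlier mention stays.
            if (current == null || rank(order, mention.tagger) < rank(order, current.tagger)) {
                best.put(key, mention);
            }
        }
        return new ArrayList<>(best.values());
    }

    // Position in the priority list; unlisted taggers get the lowest priority.
    private static int rank(List<String> order, String tagger) {
        int index = order.indexOf(tagger);
        return index < 0 ? Integer.MAX_VALUE : index;
    }

    public static void main(String[] args) {
        List<Mention> mentions = Arrays.asList(
                new Mention(0, 5, "B"),
                new Mention(0, 5, "A"),
                new Mention(3, 9, "C"),
                new Mention(0, 5, "C"));
        for (Mention kept : unify(mentions, "A", "B")) {
            System.out.println(kept.begin + "-" + kept.end + " " + kept.tagger);
        }
    }
}
```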
      • unifyAcronymsLongerFirst

        public void unifyAcronymsLongerFirst()
      • unifyAllGenesLongerFirst

        public void unifyAllGenesLongerFirst()
        Unifies all genes with the longer-span-first strategy.
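        One common reading of a longer-span-first strategy is a greedy selection: sort the spans by length, longest first, and keep a span only if it does not overlap one that was already kept. The sketch below follows that reading. It is an assumption about the strategy, not the library's implementation, and LongerFirstSketch with its Span class (inclusive begin, exclusive end) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; not the real GeneDocument logic.
final class LongerFirstSketch {
    // Hypothetical span with inclusive begin and exclusive end offsets.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Greedy selection: longer spans are considered first and block overlapping shorter ones.
    static List<Span> unifyLongerFirst(List<Span> spans) {
        List<Span> byLength = new ArrayList<>(spans);
        // Longest first; ties are broken by the smaller begin offset.
        byLength.sort(Comparator.comparingInt((Span s) -> s.end - s.begin).reversed()
                .thenComparingInt(s -> s.begin));
        List<Span> kept = new ArrayList<>();
        for (Span candidate : byLength) {
            boolean clash = false;
            for (Span other : kept) {
                if (candidate.begin < other.end && other.begin < candidate.end) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                kept.add(candidate);
            }
        }
        // Return the survivors in document order.
        kept.sort(Comparator.comparingInt(s -> s.begin));
        return kept;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
                new Span(0, 4), new Span(0, 10), new Span(8, 15),
                new Span(12, 15), new Span(20, 25));
        for (Span kept : unifyLongerFirst(spans)) {
            System.out.println(kept.begin + "-" + kept.end);
        }
    }
}
```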
      • getAllGenes

        public java.util.List<GeneMention> getAllGenes()
        Returns the raw gene mentions in this document, without any filtering, unification or aggregation, and possibly from multiple taggers.
        Returns:
        All gene mentions in this document.
      • putGoldGene

        public void putGoldGene​(GeneMention gm)
      • getCoveredText

        public java.lang.String getCoveredText​(de.julielab.java.utilities.spanutils.Span span)
      • getCoveredText

        public java.lang.String getCoveredText​(org.apache.commons.lang3.Range<java.lang.Integer> range)
      • getCoveredText

        public java.lang.String getCoveredText​(int begin,
                                               int end)
      • selectGene

        public void selectGene​(GeneMention gm)
        Adds the given GeneMention to the set of currently selected genes but not to the allGenes set.
        Parameters:
        gm - The gene mention to add.
      • setTermNormalizer

        public void setTermNormalizer​(TermNormalizer termNormalizer)
      • removeGene

        public void removeGene​(GeneMention gm)
      • buildGeneNameTrie

        public AhoCorasickOptimized buildGeneNameTrie()
        Builds an instance of AhoCorasickOptimized from the currently selected genes. The instance is stored internally and can also be retrieved by getGeneNameDictionary().
        Returns:
        A trie dictionary compiled from the names (text occurrence) of all selected genes.
      • agglomerateByAcronyms

        public void agglomerateByAcronyms()
        Merges those gene sets that are connected via acronym resolution or, for gene mentions that are not covered by any acronym, merges by name.
      • agglomerateByNames

        public void agglomerateByNames()
      • equals

        public boolean equals​(java.lang.Object o)
        Overrides:
        equals in class java.lang.Object
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class java.lang.Object
      • setSpeciesMeshHeadings

        public void setSpeciesMeshHeadings​(java.util.Collection<MeshHeading> meshHeadings)
      • getMeshHeadings

        public java.util.Collection<MeshHeading> getMeshHeadings()
      • setMeshHeadings

        public void setMeshHeadings​(java.util.Collection<MeshHeading> meshHeadings)
      • getGenesWithText

        public java.util.stream.Stream<GeneMention> getGenesWithText​(java.lang.String text)
      • getDefaultSpecies

        public java.lang.String getDefaultSpecies()
      • setDefaultSpecies

        public void setDefaultSpecies​(java.lang.String defaultSpecies)
      • getNearestPreviousSpeciesMention

        public java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,​SpeciesMention> getNearestPreviousSpeciesMention​(org.apache.commons.lang3.Range<java.lang.Integer> range,
                                                                                                                                            java.lang.String taxId)
      • getNearestNextSpeciesMention

        public java.util.Map.Entry<org.apache.commons.lang3.Range<java.lang.Integer>,​SpeciesMention> getNearestNextSpeciesMention​(org.apache.commons.lang3.Range<java.lang.Integer> range,
                                                                                                                                        java.lang.String taxId)