Class DypsisContextRanker

    • Method Detail

      • setClassifier

        public void setClassifier​(cc.mallet.classify.Classifier classifier)
      • setRanker

        public void setRanker​(de.julielab.ml.RankLibRanker ranker)
      • createClassificationInstances

        public cc.mallet.types.InstanceList createClassificationInstances​(de.julielab.geneexpbase.genemodel.GeneDocument document,
                                                                          Map<String,​Map<String,​Double>> ids2scores,
                                                                          de.julielab.geneexpbase.configuration.Parameters parameters)
      • getSemanticRerankingFeatureInstance

        public cc.mallet.types.Instance getSemanticRerankingFeatureInstance​(de.julielab.geneexpbase.genemodel.GeneDocument document,
                                                                            Map<String,​Map<String,​Double>> ids2scores,
                                                                            cc.mallet.types.LabelAlphabet targetAlphabet,
                                                                            int gmid,
                                                                            de.julielab.geneexpbase.genemodel.GeneMention gm,
                                                                            de.julielab.geneexpbase.candidateretrieval.SynHit candidate,
                                                                            Map<String,​cc.mallet.types.FeatureVector> candidateRerankingFeatureVectors,
                                                                            cc.mallet.pipe.Pipe featurePipes,
                                                                            de.julielab.geneexpbase.configuration.Parameters parameters)

        Create a single disambiguation instance for the given candidate database entry relative to the GeneMention gm.

        The candidate itself is stored in the sh property of the instance for later retrieval.

        Parameters:
        document - The gene document.
        ids2scores - The disambiguation score map for each ID found in the document.
        targetAlphabet - MALLET label alphabet.
        gmid - An integer to unambiguously identify the gene mentions in the document.
        gm - The gene mention to disambiguate.
        candidate - The current candidate to create a feature instance for.
        candidateRerankingFeatureVectors - Precomputed feature vectors from the lexical reranking step.
        featurePipes - MALLET pipes to be used for feature extraction.
        parameters - Parameter settings.
        Returns:
        The created feature instance with the SynHit candidate stored in the sh property.
      • collectCandidateDisambiguationScores

        public Map<String,​Map<String,​Double>> collectCandidateDisambiguationScores​(de.julielab.geneexpbase.genemodel.GeneDocument document,
                                                                                               de.julielab.geneexpbase.configuration.Parameters parameters)
      • addSynonym2ContextItemsScores

        public void addSynonym2ContextItemsScores​(Map<String,​Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities,
                                                  Map<String,​Map<String,​Double>> ids2scores,
                                                  de.julielab.geneexpbase.configuration.Parameters parameters)
                                           throws IOException

        For each gene ID occurring in the document, this method collects all textual occurrences associated with that ID and builds a query from them. The query is then matched against the gene ID's generif, interaction, summary and description fields to measure how well the gene names in the text correspond to those context items.

        Scores:

        • gene names from the text to generif text
        • gene names from the text to interaction text
        • gene names from the text to summary text
        • gene names from the text to description text

        Parameters:
        ids2entities - ID to gene name map.
        ids2scores - Gene ID features.
        parameters - Parameter settings.
        Throws:
        IOException - If index search fails.
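
        The ids2scores argument shared by the scoring methods has the type Map<String, Map<String, Double>>. The following is a minimal sketch of how such a map could be filled, assuming the outer key is a gene ID and the inner key is a score name. The class ScoreMapSketch, the helper addScore, and the score names and values are hypothetical and not part of this API.

```java
import java.util.HashMap;
import java.util.Map;

public class ScoreMapSketch {

    // Hypothetical helper: stores one named score for one gene ID,
    // creating the inner map on first use.
    static void addScore(Map<String, Map<String, Double>> ids2scores,
                         String geneId, String scoreName, double value) {
        ids2scores.computeIfAbsent(geneId, k -> new HashMap<>()).put(scoreName, value);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ids2scores = new HashMap<>();
        // Assumed score names, one per context item type listed above.
        addScore(ids2scores, "672", "generif", 3.2);
        addScore(ids2scores, "672", "summary", 1.7);
        addScore(ids2scores, "7157", "generif", 0.4);
        System.out.println(ids2scores);
    }
}
```

        Keeping all scores for one ID in a single inner map lets later steps read them as one feature set per candidate.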
      • setGeneSetContext2ContextItemsScores

        public void setGeneSetContext2ContextItemsScores​(de.julielab.geneexpbase.genemodel.GeneDocument document,
                                                         Set<String> geneIdsToScore,
                                                         de.julielab.geneexpbase.configuration.Parameters parameters)
                                                  throws IOException

        Obtains scores by assembling the token context of the gene set in which each gene ID occurs and matching it against the context items via a Lucene disjunction query. Also computes Jaro-Winkler comparisons between the gene set's textual contexts and the context item texts.

        Scores:

        • gscontext on generif text
        • gscontext on interaction text
        • gscontext on summary text
        • gscontext on description text

        Parameters:
        document - The GeneDocument to disambiguate.
        geneIdsToScore - The IDs to obtain scores for. For each gene set, only those IDs are scored that have actual candidates in the set.
        parameters - The parameter settings.
        Throws:
        IOException - If index access fails.
      • addEntity2SynonymsJaroWinklerScores

        public void addEntity2SynonymsJaroWinklerScores​(Map<String,​Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities,
                                                        Map<String,​com.google.common.collect.Multimap<String,​String>> id2synonyms,
                                                        boolean exactMatches,
                                                        Map<String,​Map<String,​Double>> ids2scores,
                                                        de.julielab.geneexpbase.configuration.Parameters parameters)

        Performs Jaro-Winkler string similarity scoring between the gene names in the document and the database synonyms of the gene IDs found by those gene names.

        This yields a second score alongside the Lucene score, which is not normalized.

        Parameters:
        ids2entities - The IDs found for the gene names in the document.
        id2synonyms - The database synonyms for each gene ID.
        exactMatches - Whether the given synonyms stem from exact matching between text gene names and database synonyms.
        ids2scores - The score assembly map.
        parameters - The algorithmic parameters.
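
        For reference, the Jaro-Winkler measure named above can be sketched in plain Java as follows. This is the textbook definition with a prefix scale of 0.1 and a maximum prefix length of 4, not necessarily the implementation or settings this class uses. The class name JaroWinklerSketch is hypothetical.

```java
public class JaroWinklerSketch {

    // Jaro similarity in [0, 1].
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int start = Math.max(0, i - window);
            int end = Math.min(len2 - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 characters.
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int maxPrefix = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < maxPrefix && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA"));
    }
}
```

        Unlike the Lucene score, this measure always lies in [0, 1], with 1 meaning identical strings.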
      • addEntity2SynonymsLuceneScores

        public Map<String,​com.google.common.collect.Multimap<String,​String>> addEntity2SynonymsLuceneScores​(Map<String,​Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities,
                                                                                                                        String queryType,
                                                                                                                        Map<String,​Map<String,​Double>> ids2scores,
                                                                                                                        de.julielab.geneexpbase.configuration.Parameters parameters)

        Obtains Lucene scores for all gene names in the document. Depending on the queryType parameter - "exact" or "approx" - the gene names are searched as single, exact terms or as a bag-of-words disjunction.

        This is our version of GNormPlus' entity and bag-of-words inference. However, we get unnormalized Lucene scores where GNormPlus uses its inference network to get scores in [0, 1].

        Parameters:
        ids2entities - The map listing the possible gene IDs based on the gene names in the document and which names point to which ID.
        queryType - "exact" or "approx" for entity or bag-of-words inference, respectively.
        ids2scores - The map for scoring the possible gene IDs.
        parameters - The algorithmic parameters.
        Returns:
        All synonyms of the given gene IDs that were found in the Lucene index. This value is used to calculate other similarity metrics on the candidate synonyms.
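
        The difference between the two queryType values can be illustrated with a toy matcher in plain Java. It only mimics the matching semantics (whole name as one term versus any shared token) and does not use Lucene or produce Lucene scores. The class QueryTypeSketch, its methods, and the whitespace tokenization are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class QueryTypeSketch {

    // Lowercased whitespace tokens of a name (assumed tokenization).
    static Set<String> tokens(String name) {
        return new HashSet<>(Arrays.asList(
                name.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    // "exact": the whole name must equal the synonym (case-insensitive).
    // "approx": at least one token is shared, like a disjunction over tokens.
    static boolean matches(String geneName, String synonym, String queryType) {
        if ("exact".equals(queryType)) {
            return geneName.equalsIgnoreCase(synonym);
        }
        if ("approx".equals(queryType)) {
            Set<String> shared = tokens(geneName);
            shared.retainAll(tokens(synonym));
            return !shared.isEmpty();
        }
        throw new IllegalArgumentException("Unknown queryType: " + queryType);
    }

    public static void main(String[] args) {
        System.out.println(matches("interleukin 2 receptor", "interleukin 2", "exact"));
        System.out.println(matches("interleukin 2 receptor", "interleukin 2", "approx"));
    }
}
```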
      • getMaxAgglomerationCandidates

        public int getMaxAgglomerationCandidates()
      • setMaxAgglomerationCandidates

        public void setMaxAgglomerationCandidates​(int maxCandidates)
      • shutdown

        public void shutdown()
      • getParameters

        public de.julielab.geneexpbase.configuration.Parameters getParameters()