Class DypsisContextRanker
- java.lang.Object
-
- de.julielab.genemapper.disambig.DypsisContextRanker
-
- All Implemented Interfaces:
ContextRanker
public class DypsisContextRanker extends Object implements ContextRanker
-
-
Constructor Summary
Constructors
Constructor and Description
- DypsisContextRanker(Configuration config, CandidateRetrieval candidateRetrieval, ContextItemsIndex contextItemsIndex, de.julielab.geneexpbase.services.CacheService cacheService)
-
Method Summary
Modifier and Type, Method and Description
- void addEntity2SynonymsJaroWinklerScores(Map<String,Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities, Map<String,com.google.common.collect.Multimap<String,String>> id2synonyms, boolean exactMatches, Map<String,Map<String,Double>> ids2scores, de.julielab.geneexpbase.configuration.Parameters parameters)
  Performs Jaro-Winkler string similarity scoring between the gene names in the document and the database synonyms of the gene IDs found by those gene names.
- Map<String,com.google.common.collect.Multimap<String,String>> addEntity2SynonymsLuceneScores(Map<String,Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities, String queryType, Map<String,Map<String,Double>> ids2scores, de.julielab.geneexpbase.configuration.Parameters parameters)
  Obtains Lucene scores for all gene names in the document.
- void addSynonym2ContextItemsScores(Map<String,Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities, Map<String,Map<String,Double>> ids2scores, de.julielab.geneexpbase.configuration.Parameters parameters)
  For each gene ID occurring in the document, this method uses all textual occurrences associated with this ID and formulates a query from them.
- de.julielab.geneexpbase.genemodel.MentionMappingResult assignContextScore(MentionDisambiguationData disambiguationData)
- void clear()
- Map<String,Map<String,Double>> collectCandidateDisambiguationScores(de.julielab.geneexpbase.genemodel.GeneDocument document, de.julielab.geneexpbase.configuration.Parameters parameters)
- cc.mallet.types.InstanceList createClassificationInstances(de.julielab.geneexpbase.genemodel.GeneDocument document, Map<String,Map<String,Double>> ids2scores, de.julielab.geneexpbase.configuration.Parameters parameters)
- cc.mallet.types.InstanceList doContextRanking(DocumentDisambiguationData disambiguationData, de.julielab.geneexpbase.configuration.Parameters parameters)
- ContextItemsIndex getContextItemsIndex()
- int getMaxAgglomerationCandidates()
- de.julielab.geneexpbase.configuration.Parameters getParameters()
- SemanticIndex getSemanticIndex()
- cc.mallet.types.Instance getSemanticRerankingFeatureInstance(de.julielab.geneexpbase.genemodel.GeneDocument document, Map<String,Map<String,Double>> ids2scores, cc.mallet.types.LabelAlphabet targetAlphabet, int gmid, de.julielab.geneexpbase.genemodel.GeneMention gm, de.julielab.geneexpbase.candidateretrieval.SynHit candidate, Map<String,cc.mallet.types.FeatureVector> candidateRerankingFeatureVectors, cc.mallet.pipe.Pipe featurePipes, de.julielab.geneexpbase.configuration.Parameters parameters)
  Create a single disambiguation instance for the given candidate database entry relative to the GeneMention gm.
- void setClassifier(cc.mallet.classify.Classifier classifier)
- void setGeneSetContext2ContextItemsScores(de.julielab.geneexpbase.genemodel.GeneDocument document, Set<String> geneIdsToScore, de.julielab.geneexpbase.configuration.Parameters parameters)
  Obtains scores by assembling the token context of the gene set where each gene ID occurs in and matching this to the context items via a Lucene disjunction query.
- void setMaxAgglomerationCandidates(int maxCandidates)
- void setRanker(de.julielab.ml.RankLibRanker ranker)
- void shutdown()
-
-
-
Constructor Detail
-
DypsisContextRanker
@Inject public DypsisContextRanker(Configuration config, CandidateRetrieval candidateRetrieval, ContextItemsIndex contextItemsIndex, de.julielab.geneexpbase.services.CacheService cacheService) throws GeneMapperException
- Throws:
GeneMapperException
-
-
Method Detail
-
setClassifier
public void setClassifier(cc.mallet.classify.Classifier classifier)
-
setRanker
public void setRanker(de.julielab.ml.RankLibRanker ranker)
-
assignContextScore
public de.julielab.geneexpbase.genemodel.MentionMappingResult assignContextScore(MentionDisambiguationData disambiguationData)
- Specified by:
assignContextScore in interface ContextRanker
-
doContextRanking
public cc.mallet.types.InstanceList doContextRanking(DocumentDisambiguationData disambiguationData, de.julielab.geneexpbase.configuration.Parameters parameters)
- Specified by:
doContextRanking in interface ContextRanker
-
createClassificationInstances
public cc.mallet.types.InstanceList createClassificationInstances(de.julielab.geneexpbase.genemodel.GeneDocument document, Map<String,Map<String,Double>> ids2scores, de.julielab.geneexpbase.configuration.Parameters parameters)
-
getSemanticRerankingFeatureInstance
public cc.mallet.types.Instance getSemanticRerankingFeatureInstance(de.julielab.geneexpbase.genemodel.GeneDocument document, Map<String,Map<String,Double>> ids2scores, cc.mallet.types.LabelAlphabet targetAlphabet, int gmid, de.julielab.geneexpbase.genemodel.GeneMention gm, de.julielab.geneexpbase.candidateretrieval.SynHit candidate, Map<String,cc.mallet.types.FeatureVector> candidateRerankingFeatureVectors, cc.mallet.pipe.Pipe featurePipes, de.julielab.geneexpbase.configuration.Parameters parameters)
Create a single disambiguation instance for the given candidate database entry relative to the GeneMention gm.
The candidate itself is stored in the sh property of the instance for later retrieval.
- Parameters:
document - The gene document.
ids2scores - The disambiguation score map for each ID found in the document.
targetAlphabet - MALLET label alphabet.
gmid - An integer to unambiguously identify the gene mentions in the document.
gm - The gene mention to disambiguate.
candidate - The current candidate to create a feature instance for.
candidateRerankingFeatureVectors - Precomputed feature vectors from the lexical reranking step.
featurePipes - MALLET pipes to be used for feature extraction.
parameters - Parameter settings.
- Returns:
- The created feature instance with the SynHit candidate stored in the sh property.
-
collectCandidateDisambiguationScores
public Map<String,Map<String,Double>> collectCandidateDisambiguationScores(de.julielab.geneexpbase.genemodel.GeneDocument document, de.julielab.geneexpbase.configuration.Parameters parameters)
-
addSynonym2ContextItemsScores
public void addSynonym2ContextItemsScores(Map<String,Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities, Map<String,Map<String,Double>> ids2scores, de.julielab.geneexpbase.configuration.Parameters parameters) throws IOException
For each gene ID occurring in the document, this method uses all textual occurrences associated with this ID and formulates a query from them. This query is then matched against the gene ID's generif, interaction, summary and description fields to measure how well the gene names in the text correspond to those context items.
Scores:
- gene names from the text to generif text
- gene names from the text to interaction text
- gene names from the text to summary text
- gene names from the text to description text
- Parameters:
ids2entities - ID to gene name map.
ids2scores - Gene ID features.
parameters - Parameter settings.
- Throws:
IOException - If index search fails.
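The ids2scores parameter that recurs throughout this class is a nested map. The following is a minimal sketch of how such a map could be filled, assuming the outer key is the gene ID and the inner key is a score name. That reading is inferred from the parameter descriptions, not confirmed by them. The class ScoreMapSketch, the helper addScore, the score names and the values are all hypothetical and are not part of DypsisContextRanker.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ScoreMapSketch {

    /** Stores one named score for a gene ID, creating the inner map on first use. */
    static void addScore(Map<String, Map<String, Double>> ids2scores,
                         String geneId, String scoreName, double value) {
        ids2scores.computeIfAbsent(geneId, id -> new LinkedHashMap<>()).put(scoreName, value);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ids2scores = new LinkedHashMap<>();
        // Hypothetical score names for the four context item fields; the values are made up.
        addScore(ids2scores, "geneA", "names2generif", 4.2);
        addScore(ids2scores, "geneA", "names2interaction", 0.0);
        addScore(ids2scores, "geneA", "names2summary", 1.5);
        addScore(ids2scores, "geneA", "names2description", 2.7);
        System.out.println(ids2scores);
    }
}
```

Keeping all scores for one gene ID in a single inner map means that later steps, such as feature instance creation, can read every score for a candidate with one lookup.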
-
setGeneSetContext2ContextItemsScores
public void setGeneSetContext2ContextItemsScores(de.julielab.geneexpbase.genemodel.GeneDocument document, Set<String> geneIdsToScore, de.julielab.geneexpbase.configuration.Parameters parameters) throws IOException
Obtains scores by assembling the token context of the gene set where each gene ID occurs in and matching this to the context items via a Lucene disjunction query. Also creates Jaro-Winkler comparisons between gene set textual contexts and the context item texts.
Scores:
- gscontext on generif text
- gscontext on interaction text
- gscontext on summary text
- gscontext on description text
- Parameters:
document - The GeneDocument to disambiguate.
geneIdsToScore - The IDs to obtain scores for. For each gene set, only those IDs are scored that have actual candidates in the set.
parameters - The parameter settings.
- Throws:
IOException - If index access fails.
-
addEntity2SynonymsJaroWinklerScores
public void addEntity2SynonymsJaroWinklerScores(Map<String,Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities, Map<String,com.google.common.collect.Multimap<String,String>> id2synonyms, boolean exactMatches, Map<String,Map<String,Double>> ids2scores, de.julielab.geneexpbase.configuration.Parameters parameters)
Performs Jaro-Winkler string similarity scoring between the gene names in the document and the database synonyms of the gene IDs found by those gene names.
This provides a second score in addition to the Lucene score, which is unnormalized.
- Parameters:
ids2entities - The IDs found for the gene names in the document.
id2synonyms - The database synonyms for each gene ID.
exactMatches - Whether the given synonyms stem from exact matching between text gene names and database synonyms.
ids2scores - The score assembly map.
parameters - The algorithmic parameters.
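For orientation, the Jaro-Winkler measure itself can be sketched in a few lines. This is a textbook version that assumes the common defaults (prefix scale 0.1, prefix length capped at 4). It is not the implementation used by this class, whose underlying library and settings are not stated here. The class name JaroWinklerSketch is hypothetical.

```java
final class JaroWinklerSketch {

    /** Jaro similarity in [0, 1]; 1.0 means the strings are identical. */
    static double jaro(String s, String t) {
        if (s.equals(t)) return 1.0;
        if (s.isEmpty() || t.isEmpty()) return 0.0;
        // Characters only count as matching if they lie within this distance of each other.
        int window = Math.max(0, Math.max(s.length(), t.length()) / 2 - 1);
        boolean[] sMatched = new boolean[s.length()];
        boolean[] tMatched = new boolean[t.length()];
        int matches = 0;
        for (int i = 0; i < s.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(t.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Matched characters that appear in a different order count as half transpositions.
        int k = 0;
        int halfTranspositions = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) halfTranspositions++;
            k++;
        }
        double m = matches;
        return (m / s.length() + m / t.length() + (m - halfTranspositions / 2.0) / m) / 3.0;
    }

    /** Jaro-Winkler similarity: the Jaro score boosted for a shared prefix of up to 4 characters. */
    static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        int maxPrefix = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < maxPrefix && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

Because the result always lies in [0, 1], it complements the unnormalized Lucene score without further scaling.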
-
addEntity2SynonymsLuceneScores
public Map<String,com.google.common.collect.Multimap<String,String>> addEntity2SynonymsLuceneScores(Map<String,Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities, String queryType, Map<String,Map<String,Double>> ids2scores, de.julielab.geneexpbase.configuration.Parameters parameters)
Obtains Lucene scores for all gene names in the document. Depending on the queryType parameter - "exact" or "approx" - the gene names are searched as single, exact terms or as a bag-of-words disjunction.
This is our version of GNormPlus' entity and bag-of-words inference. However, we obtain unnormalized Lucene scores, whereas GNormPlus uses its inference network to obtain scores in [0, 1].
- Parameters:
ids2entities - The map listing the possible gene IDs based on the gene names in the document and which names point to which ID.
queryType - "exact" or "approx" for entity or bag-of-words inference, respectively.
ids2scores - The map for scoring the possible gene IDs.
parameters - The algorithmic parameters.
- Returns:
- All synonyms of the given gene IDs that were found in the Lucene index. This value is used to calculate other similarity metrics on the candidate synonyms.
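The difference between the two query types can be illustrated with a toy matcher that does not use Lucene at all. In this sketch, "exact" treats the whole gene name as a single term, while "approx" counts the tokens shared with a synonym, which mimics a disjunction over the name's tokens. The class QueryTypeSketch, its tokenization and its integer scores are assumptions for illustration only. Real Lucene scores depend on the index and similarity in use and will differ.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class QueryTypeSketch {

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static Set<String> tokens(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!tok.isEmpty()) result.add(tok);
        }
        return result;
    }

    /** Toy score of a gene name against one synonym for the given query type. */
    static int score(String geneName, String synonym, String queryType) {
        if ("exact".equals(queryType)) {
            // The whole name is one term: it either matches the synonym or it does not.
            return geneName.equalsIgnoreCase(synonym) ? 1 : 0;
        }
        if ("approx".equals(queryType)) {
            // Bag-of-words disjunction: every shared token contributes to the score.
            Set<String> shared = tokens(geneName);
            shared.retainAll(tokens(synonym));
            return shared.size();
        }
        throw new IllegalArgumentException("Unknown query type: " + queryType);
    }
}
```

With this sketch, the name "interleukin 2 receptor" scores 0 against the synonym "interleukin-2" in exact mode but 2 in approx mode, since two tokens overlap.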
-
getSemanticIndex
public SemanticIndex getSemanticIndex()
- Specified by:
getSemanticIndex in interface ContextRanker
-
clear
public void clear()
- Specified by:
clear in interface ContextRanker
-
getMaxAgglomerationCandidates
public int getMaxAgglomerationCandidates()
-
setMaxAgglomerationCandidates
public void setMaxAgglomerationCandidates(int maxCandidates)
-
shutdown
public void shutdown()
-
getContextItemsIndex
public ContextItemsIndex getContextItemsIndex()
-
getParameters
public de.julielab.geneexpbase.configuration.Parameters getParameters()
-
-