Class DBReader
- java.lang.Object
- org.apache.uima.resource.Resource_ImplBase
- org.apache.uima.resource.ConfigurableResource_ImplBase
- org.apache.uima.collection.CollectionReader_ImplBase
- org.apache.uima.fit.component.JCasCollectionReader_ImplBase
- de.julielab.jcore.reader.db.DBReaderBase
- de.julielab.jcore.reader.db.DBSubsetReader
- de.julielab.jcore.reader.db.DBReader
-
- All Implemented Interfaces:
org.apache.uima.collection.base_cpm.BaseCollectionReader, org.apache.uima.collection.CollectionReader, org.apache.uima.resource.ConfigurableResource, org.apache.uima.resource.Resource
public abstract class DBReader extends DBSubsetReader
Base class for UIMA collection readers that retrieve their documents from a (PostgreSQL) database.
The reader interacts with two tables. The first is a 'subset' table that lists the document collection with one document per row. Each row additionally contains fields for the current processing status of the document as well as its error status and processing host. This table is locked while a batch of documents to process is fetched, so it also serves as a synchronization medium.
The second table holds the actual data and is therefore called the 'data table'. The subset table has to define foreign keys to the data table. In this way, the reader is able to determine from which table to retrieve the document data.
This data management is done by the julie-medline-manager package.
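The role of the subset table as a synchronization medium can be pictured with a small in-memory analogy. This is only a sketch and not the reader's actual database code: the class SubsetTableSketch, its Row fields and the method claimBatch are hypothetical names, and a synchronized method stands in for the table lock.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the subset table. All names here are hypothetical. */
public class SubsetTableSketch {

    /** One row per document: processing status and the host that claimed it. */
    static final class Row {
        boolean inProcess;
        String host;
    }

    // Insertion-ordered so that batches are handed out in a predictable order.
    private final Map<String, Row> rows = new LinkedHashMap<>();

    public SubsetTableSketch(List<String> documentIds) {
        for (String id : documentIds) {
            rows.put(id, new Row());
        }
    }

    /**
     * Claims up to batchSize unclaimed documents for the given host.
     * The synchronized keyword plays the role of the table lock here:
     * two readers can never mark the same row as their own.
     */
    public synchronized List<String> claimBatch(String host, int batchSize) {
        List<String> claimed = new ArrayList<>();
        for (Map.Entry<String, Row> entry : rows.entrySet()) {
            if (claimed.size() >= batchSize) {
                break;
            }
            Row row = entry.getValue();
            if (!row.inProcess) {
                row.inProcess = true;
                row.host = host;
                claimed.add(entry.getKey());
            }
        }
        return claimed;
    }

    public static void main(String[] args) {
        SubsetTableSketch table = new SubsetTableSketch(List.of("1", "2", "3", "4", "5"));
        // Two readers alternate; each call receives a disjoint batch.
        System.out.println(table.claimBatch("hostA", 2));
        System.out.println(table.claimBatch("hostB", 2));
        System.out.println(table.claimBatch("hostA", 2));
        System.out.println(table.claimBatch("hostB", 2));
    }
}
```

Because each batch is marked as in process before the lock is released, several readers on different hosts can work on the same subset without receiving the same document twice.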
Please note that this class does not implement JCasCollectionReader_ImplBase.getNext(org.apache.uima.cas.CAS). Instead, getNextArtifactData() is offered to expose the documents read from the database. Until this point, no assumption about the document's structure has been made. That is, we don't care in this class whether we deal with Medline abstracts, plain texts, some HTML documents or whatever. Translating these documents into a CAS with respect to a particular type system is delegated to the extending class.
- Author:
- landefeld/hellrich/faessler
-
Nested Class Summary
Nested Classes
- protected class DBReader.RetrievingThread: This class is charged with retrieving batches of document IDs and documents while previously fetched documents are in process.
-
Field Summary
Fields
- protected String dataTimestamp
-
Fields inherited from class de.julielab.jcore.reader.db.DBSubsetReader
additionalTableNames, additionalTableSchemas, dataTable, fetchIdsProactively, hostName, PARAM_ADDITIONAL_TABLES, PARAM_ADDITONAL_TABLES_STORAGE_PG_SCHEMA, PARAM_RESET_TABLE, pid, readDataTable, resetTable, schemas, tables
-
Fields inherited from class de.julielab.jcore.reader.db.DBReaderBase
batchSize, costosysConfig, dbc, driver, hasNext, joinTables, limitParameter, numberFetchedDocIDs, PARAM_BATCH_SIZE, PARAM_COSTOSYS_CONFIG_NAME, PARAM_TABLE, processedDocuments, selectionOrder, tableName, totalDocumentCount, whereCondition
-
Constructor Summary
Constructors
- DBReader()
-
Method Summary
- void close()
- byte[][] getNextArtifactData(): Returns the next byte[][] containing a byte[] for the pmid at [0] and a byte[] for the XML at [1] or null if there are no unprocessed documents left.
- org.apache.uima.util.Progress[] getProgress()
- protected abstract String getReaderComponentName()
- boolean hasNext()
- void initialize(org.apache.uima.UimaContext context)
- static String setDBProcessingMetaData(de.julielab.costosys.dbconnection.DataBaseConnector dbc, boolean readDataTable, String tableName, byte[][] data, org.apache.uima.jcas.JCas cas)
-
Methods inherited from class de.julielab.jcore.reader.db.DBSubsetReader
checkAdditionalTableParameters, checkAndAdjustAdditionalTables, getAllRetrievedColumns
-
Methods inherited from class org.apache.uima.fit.component.JCasCollectionReader_ImplBase
getLogger, getNext, getNext, initialize
-
Methods inherited from class org.apache.uima.collection.CollectionReader_ImplBase
destroy, getCasInitializer, getProcessingResourceMetaData, initialize, isConsuming, reconfigure, setCasInitializer, typeSystemInit
-
Methods inherited from class org.apache.uima.resource.ConfigurableResource_ImplBase
getConfigParameterValue, getConfigParameterValue, setConfigParameterValue, setConfigParameterValue
-
Methods inherited from class org.apache.uima.resource.Resource_ImplBase
getCasManager, getMetaData, getRelativePathResolver, getResourceManager, getUimaContext, getUimaContextAdmin, setLogger, setMetaData
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Field Detail
-
dataTimestamp
protected String dataTimestamp
-
Method Detail
-
setDBProcessingMetaData
public static String setDBProcessingMetaData(de.julielab.costosys.dbconnection.DataBaseConnector dbc, boolean readDataTable, String tableName, byte[][] data, org.apache.uima.jcas.JCas cas)
-
initialize
public void initialize(org.apache.uima.UimaContext context) throws org.apache.uima.resource.ResourceInitializationException
- Overrides:
initialize in class DBSubsetReader
- Throws:
org.apache.uima.resource.ResourceInitializationException
-
hasNext
public boolean hasNext() throws IOException, org.apache.uima.collection.CollectionException
- Throws:
IOException
org.apache.uima.collection.CollectionException
-
getNextArtifactData
public byte[][] getNextArtifactData() throws org.apache.uima.collection.CollectionException
Returns the next byte[][] containing a byte[] for the pmid at [0] and a byte[] for the XML at [1] or null if there are no unprocessed documents left.
- Returns:
- the next document as a byte[][] (pmid at [0], XML at [1]), or null if there are no unprocessed documents left
- Throws:
org.apache.uima.collection.CollectionException
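As an illustration of how an extending class might consume this return value, here is a minimal sketch. It does not use the real DBReader: FakeReader is a hypothetical stand-in that only mimics the documented contract (ID at [0], XML at [1], null when nothing is left), Document is a hypothetical holder class, and UTF-8 is an assumed encoding that the documentation above does not specify.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class ArtifactDataSketch {

    /** Hypothetical holder for one decoded document. */
    static final class Document {
        final String id;
        final String xml;

        Document(String id, String xml) {
            this.id = id;
            this.xml = xml;
        }
    }

    /** Hypothetical stand-in mimicking the getNextArtifactData() contract. */
    static final class FakeReader {
        private final Queue<byte[][]> pending = new ArrayDeque<>();

        void add(String id, String xml) {
            pending.add(new byte[][] {
                id.getBytes(StandardCharsets.UTF_8),
                xml.getBytes(StandardCharsets.UTF_8)
            });
        }

        /** Returns the next document, or null if none is left. */
        byte[][] getNextArtifactData() {
            return pending.poll();
        }
    }

    /** Decodes the ID at [0] and the XML at [1]; UTF-8 is an assumption. */
    static Document decode(byte[][] data) {
        String id = new String(data[0], StandardCharsets.UTF_8);
        String xml = new String(data[1], StandardCharsets.UTF_8);
        return new Document(id, xml);
    }

    /** Reads until the reader signals the end of the collection with null. */
    static List<Document> readAll(FakeReader reader) {
        List<Document> documents = new ArrayList<>();
        byte[][] data;
        while ((data = reader.getNextArtifactData()) != null) {
            documents.add(decode(data));
        }
        return documents;
    }

    public static void main(String[] args) {
        FakeReader reader = new FakeReader();
        reader.add("12345", "<MedlineCitation/>");
        for (Document doc : readAll(reader)) {
            System.out.println(doc.id + " -> " + doc.xml);
        }
    }
}
```

In a real extending class, the decoded ID and XML would be translated into a CAS according to the chosen type system rather than collected into a list.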
-
getProgress
public org.apache.uima.util.Progress[] getProgress()
-
close
public void close()
- Specified by:
close in interface org.apache.uima.collection.base_cpm.BaseCollectionReader
- Overrides:
close in class org.apache.uima.fit.component.JCasCollectionReader_ImplBase
-
getReaderComponentName
protected abstract String getReaderComponentName()
- Returns:
- The component name of the reader to fill in the subset table's pipeline status field