Class DBReader

  • All Implemented Interfaces:
    org.apache.uima.collection.base_cpm.BaseCollectionReader, org.apache.uima.collection.CollectionReader, org.apache.uima.resource.ConfigurableResource, org.apache.uima.resource.Resource

    public abstract class DBReader
    extends DBSubsetReader
    Base class for UIMA collection readers that retrieve their documents from a (PostgreSQL) database.

    The reader interacts with two tables. The first is the 'subset' table, which lists the document collection with one document per row. Each row also contains fields for the document's current processing status, its error status and the processing host. This table is locked while a batch of documents is fetched for processing, so it also serves as a synchronization medium.

    The second table holds the actual data and is therefore called the 'data table'. The subset table has to define foreign keys to the data table. This is how the reader determines which table to retrieve the document data from.
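
The role of the subset table as a synchronization medium can be illustrated with a small in-memory analogy. This is only a sketch under stated assumptions: the real reader locks a database table, whereas here a `ReentrantLock` guards a map. `SubsetTableSketch`, `Row`, `addDocument` and `claimBatch` are hypothetical names that do not exist in this API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** In-memory stand-in for the subset table (hypothetical, not part of DBReader). */
final class SubsetTableSketch {
    /** Hypothetical per-row status fields: processing flag and processing host. */
    static final class Row {
        boolean inProcess;
        String host;
    }

    // One row per document, kept in insertion order.
    private final Map<String, Row> rows = new LinkedHashMap<>();
    // Stands in for the table lock taken while a batch is fetched.
    private final ReentrantLock tableLock = new ReentrantLock();

    void addDocument(String id) {
        tableLock.lock();
        try {
            rows.put(id, new Row());
        } finally {
            tableLock.unlock();
        }
    }

    /** Claims up to batchSize unclaimed documents for the given host while holding the lock. */
    List<String> claimBatch(String host, int batchSize) {
        tableLock.lock();
        try {
            List<String> claimed = new ArrayList<>();
            for (Map.Entry<String, Row> entry : rows.entrySet()) {
                if (claimed.size() == batchSize) {
                    break;
                }
                Row row = entry.getValue();
                if (!row.inProcess) {
                    // Mark the row so that no other reader receives this document.
                    row.inProcess = true;
                    row.host = host;
                    claimed.add(entry.getKey());
                }
            }
            return claimed;
        } finally {
            tableLock.unlock();
        }
    }
}
```

Because marking and selecting happen under the same lock, two readers calling `claimBatch` concurrently never receive the same document, which is the property the table lock is described as providing.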

    This data management is done by the julie-medline-manager package.

    Please note that this class does not implement JCasCollectionReader_ImplBase.getNext(org.apache.uima.cas.CAS). Instead, getNextArtifactData() is offered to expose the documents read from the database. Up to this point, no assumption about the document's structure is made: this class does not care whether it deals with Medline abstracts, plain text, HTML documents or anything else. Translating these documents into a CAS with respect to a particular type system is delegated to the extending class.
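
The division of labour described above might look roughly like the following sketch. It deliberately avoids the UIMA classes so that it stays self-contained: `RawArtifactReaderSketch` stands in for this class, `PlainTextReaderSketch` for an extending class, and a `StringBuilder` stands in for the CAS. All of these names are hypothetical, and the sketch assumes the document content sits at index 1 and is UTF-8 encoded.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

/** Stand-in for the base reader: hands out raw artifacts without interpreting them. */
abstract class RawArtifactReaderSketch {
    private final Deque<byte[][]> pending = new ArrayDeque<>();

    /** Test hook replacing the database: queues one raw artifact. */
    void enqueue(byte[][] artifact) {
        pending.add(artifact);
    }

    boolean hasNext() {
        return !pending.isEmpty();
    }

    /** Returns the next raw artifact, or null when none are left. */
    byte[][] getNextArtifactData() {
        return pending.poll();
    }
}

/** Stand-in for an extending class that knows the document format (plain text here). */
final class PlainTextReaderSketch extends RawArtifactReaderSketch {
    /** Fills the given StringBuilder (a stand-in for a CAS) with the document text. */
    void getNext(StringBuilder cas) {
        byte[][] artifact = getNextArtifactData();
        if (artifact == null) {
            throw new IllegalStateException("no documents left");
        }
        cas.setLength(0);
        // Assumption: index 1 holds the document content as UTF-8 text.
        cas.append(new String(artifact[1], StandardCharsets.UTF_8));
    }
}
```

Only the subclass knows how to interpret the bytes, so a reader for a different format would replace `PlainTextReaderSketch` and leave the base untouched.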

    Author:
    landefeld/hellrich/faessler
    • Field Detail

      • dataTimestamp

        protected String dataTimestamp
    • Constructor Detail

      • DBReader

        public DBReader()
    • Method Detail

      • setDBProcessingMetaData

        public static String setDBProcessingMetaData​(de.julielab.costosys.dbconnection.DataBaseConnector dbc,
                                                     boolean readDataTable,
                                                     String tableName,
                                                     byte[][] data,
                                                     org.apache.uima.jcas.JCas cas)
      • initialize

        public void initialize​(org.apache.uima.UimaContext context)
                        throws org.apache.uima.resource.ResourceInitializationException
        Overrides:
        initialize in class DBSubsetReader
        Throws:
        org.apache.uima.resource.ResourceInitializationException
      • hasNext

        public boolean hasNext()
                        throws IOException,
                               org.apache.uima.collection.CollectionException
        Throws:
        IOException
        org.apache.uima.collection.CollectionException
      • getNextArtifactData

        public byte[][] getNextArtifactData()
                                     throws org.apache.uima.collection.CollectionException
        Returns the next document as a byte[][] holding a byte[] for the PMID at [0] and a byte[] for the XML at [1], or null if there are no unprocessed documents left.
        Returns:
        the next document's data as byte[][], or null if there are no unprocessed documents left
        Throws:
        org.apache.uima.collection.CollectionException
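
A caller could unpack the returned array along the following lines. This is a minimal sketch: `DecodedArtifact` and `ArtifactDecoder` are hypothetical helpers, not part of this API, and the UTF-8 encoding is an assumption because the description above does not state how the bytes are encoded.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical holder for one decoded artifact; not part of DBReader. */
final class DecodedArtifact {
    final String pmid;
    final String xml;

    DecodedArtifact(String pmid, String xml) {
        this.pmid = pmid;
        this.xml = xml;
    }
}

/** Hypothetical helper that unpacks the byte[][] layout described above. */
final class ArtifactDecoder {
    private ArtifactDecoder() {
    }

    /**
     * Decodes an artifact laid out as [0] = PMID, [1] = XML.
     * Returns null when the input is null (no unprocessed documents left).
     * Assumption: both fields are UTF-8 encoded text.
     */
    static DecodedArtifact decode(byte[][] artifact) {
        if (artifact == null) {
            return null;
        }
        if (artifact.length < 2) {
            throw new IllegalArgumentException(
                    "expected at least 2 fields, got " + artifact.length);
        }
        String pmid = new String(artifact[0], StandardCharsets.UTF_8);
        String xml = new String(artifact[1], StandardCharsets.UTF_8);
        return new DecodedArtifact(pmid, xml);
    }
}
```

Checking for null before decoding mirrors the documented contract that null signals the end of the unprocessed documents.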
      • getProgress

        public org.apache.uima.util.Progress[] getProgress()
      • close

        public void close()
        Specified by:
        close in interface org.apache.uima.collection.base_cpm.BaseCollectionReader
        Overrides:
        close in class org.apache.uima.fit.component.JCasCollectionReader_ImplBase
      • getReaderComponentName

        protected abstract String getReaderComponentName()
        Returns:
        The component name of the reader to fill in the subset table's pipeline status field