Class DocumentReleaseCheckpoint


  • public class DocumentReleaseCheckpoint
    extends Object

    This class is a synchronization point at which JeDIS components report that they have completely finished processing a document.

    Problem explanation: This synchronization is necessary because most components that write to the database work in batch mode for performance reasons. If multiple components use batching, their batches can get out of sync due to different batch sizes and possibly other factors. At a particular point in time, one component may have sent a batch of document data to the database while other components have not. If the pipeline crashes or is manually interrupted at such a point, the data actually written is incoherent in the sense that some components have sent their data for a particular document and others have not.

    This class does not completely resolve this issue: batches are still sent asynchronously when it is used. However, the DBCheckpointAE uses this class to determine whether all registered components have released a given DocumentId before marking it as successfully processed in the JeDIS database subset table. In this way, an incoherent state becomes visible in the database as items that are marked as in process but not as processed after the pipeline finishes.

    Those documents can then easily be reprocessed by removing the in-process mark with CoStoSys.

    Note that this requires the DBCheckpointAE, which marks documents as processed, to be the last component in the pipeline.
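
The batching problem described above can be illustrated with a small, self-contained simulation. This is only a sketch: the class BatchDriftSketch, its methods, and the use of plain integers as document IDs are hypothetical and not part of JeDIS. It assumes that each component writes to the database only when a batch is full.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not part of JeDIS.
public class BatchDriftSketch {

    /** IDs (1..n) that a component with the given batch size has written after seeing n documents. */
    static Set<Integer> written(int batchSize, int documentsSeen) {
        // Assumption: only full batches are sent to the database.
        int flushed = (documentsSeen / batchSize) * batchSize;
        Set<Integer> ids = new TreeSet<>();
        for (int id = 1; id <= flushed; id++) {
            ids.add(id);
        }
        return ids;
    }

    /** Documents written by at least one component but not by all of them. */
    static Set<Integer> incoherent(List<Set<Integer>> writtenPerComponent) {
        Set<Integer> byAny = new TreeSet<>();
        Set<Integer> byAll = null;
        for (Set<Integer> w : writtenPerComponent) {
            byAny.addAll(w);
            if (byAll == null) {
                byAll = new TreeSet<>(w);
            } else {
                byAll.retainAll(w);
            }
        }
        if (byAll != null) {
            byAny.removeAll(byAll);
        }
        return byAny;
    }

    public static void main(String[] args) {
        // The pipeline is interrupted after five documents.
        Set<Integer> componentA = written(2, 5); // batches of 2: documents 1 to 4 written
        Set<Integer> componentB = written(3, 5); // batches of 3: documents 1 to 3 written
        System.out.println(incoherent(List.of(componentA, componentB))); // prints [4]
    }
}
```

In this scenario document 4 is the incoherent one: the first component has written its data, the second has not.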

    • Method Detail

      • register

        public void register(String componentKey)

        Registers a component that will add DocumentIds via the release(String, Stream) method.

        Parameters:
        componentKey - A canonical identifier of the component taking part in synchronization.
      • unregister

        public void unregister(String componentKey)

        Removes a component from the list of document ID releasing components.

        This method is not commonly required and only here for functional completeness.

        Parameters:
        componentKey - The canonical identifier provided in register(String) earlier.
      • release

        public void release(String componentKey,
                            Stream<DocumentId> releasedDocumentIds)

        To be called by synchronizing components. They pass their registration key and the IDs of the documents they have definitely finished processing.

        Parameters:
        componentKey - The canonical identifier provided in register(String) earlier.
        releasedDocumentIds - The document IDs to be released.
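
A batching component would typically release document IDs only after the batch containing them has actually been sent. The following is a minimal sketch of that pattern, not JeDIS code: the class BatchingWriterSketch is hypothetical, document IDs are plain strings instead of DocumentId, and the BiConsumer stands in for a call to release(String, Stream).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.stream.Stream;

// Hypothetical illustration, not part of JeDIS.
public class BatchingWriterSketch {
    private final String componentKey;
    private final int batchSize;
    // Stand-in for the checkpoint's release(String, Stream) method.
    private final BiConsumer<String, Stream<String>> releaseCall;
    private final List<String> pending = new ArrayList<>();

    BatchingWriterSketch(String componentKey, int batchSize,
                         BiConsumer<String, Stream<String>> releaseCall) {
        this.componentKey = componentKey;
        this.batchSize = batchSize;
        this.releaseCall = releaseCall;
    }

    /** Buffers a document and flushes once the batch is full. */
    void process(String documentId) {
        pending.add(documentId);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends the batch, then releases exactly the documents contained in it. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // The actual database write would happen here (omitted in this sketch).
        // Copy the buffer so the stream stays valid after the buffer is cleared.
        releaseCall.accept(componentKey, new ArrayList<>(pending).stream());
        pending.clear();
    }

    public static void main(String[] args) {
        List<String> released = new ArrayList<>();
        BatchingWriterSketch writer = new BatchingWriterSketch(
                "writer-a", 2, (key, ids) -> ids.forEach(released::add));
        writer.process("doc1");
        System.out.println(released); // prints []
        writer.process("doc2");
        System.out.println(released); // prints [doc1, doc2]
    }
}
```

Releasing after the write, rather than when a document is first seen, keeps the released IDs in line with what is actually in the database.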
      • getReleasedDocumentIds

        public Set<DocumentId> getReleasedDocumentIds()

        Used by the DBCheckpointAE to determine documents that can safely be marked as being finished with processing.

        Gets the document IDs that the synchronizing components have released. Since the return type is a Set, each document ID appears at most once in the result, even when multiple components have released that document. The DBCheckpointAE only marks a document as processed once it has been released by as many synchronizing components as have been registered with register(String).

        Returns:
        The currently released document IDs.
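
The rule that a document counts as finished only when it has been released as often as components are registered can be sketched as follows. This is an assumption-laden illustration rather than the actual DBCheckpointAE logic: the class ReleaseCountSketch is hypothetical, document IDs are plain strings, and each component is assumed to release a given document at most once.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration, not part of JeDIS.
public class ReleaseCountSketch {

    /**
     * Takes one entry per release of a document by some component and returns
     * the documents released exactly as often as components are registered.
     */
    static Set<String> fullyReleased(List<String> releaseEvents, int registeredComponents) {
        // Count how often each document ID has been released.
        Map<String, Long> counts = releaseEvents.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        // Keep only the IDs whose release count matches the number of components.
        return counts.entrySet().stream()
                .filter(e -> e.getValue() == registeredComponents)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        // Two registered components; doc3 has been released by only one of them.
        List<String> events = List.of("doc1", "doc2", "doc1", "doc3", "doc2");
        System.out.println(fullyReleased(events, 2)); // prints [doc1, doc2]
    }
}
```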
      • getNumberOfRegisteredComponents

        public int getNumberOfRegisteredComponents()

        Returns the number of currently registered components.

        Returns:
        The number of currently registered components.