org.apache.oodt.cas.pushpull.retrievalsystem
Class FileRetrievalSystem

java.lang.Object
  extended by org.apache.oodt.cas.pushpull.retrievalsystem.FileRetrievalSystem

public class FileRetrievalSystem
extends Object

    Crawls external directory structures and downloads the files found within them.
    
    This class is configured through a Java .properties file, which is read and parsed by Config.java.
    This .properties file should have the following properties set:
     
        #list of sites to crawl
   	protocol.external.sources=<path-to-xml-file>
   
   	#protocol types
   	protocolfactory.types=<list-of-protocols-separated-by-commas> (e.g. ftp,http,https,sftp)
   
   	#protocol factories per type (there must be one for each protocol mentioned in protocolfactory.types -- the property must be named
    	# as such: protocolfactory.<name-of-protocol-type>)
   	protocolfactory.ftp=<path-to-java-protocolfactory-class> (e.g. org.apache.oodt.cas.protocol.ftp.FtpClientFactory)
   	protocolfactory.http=<path-to-java-protocolfactory-class>
   	protocolfactory.https=<path-to-java-protocolfactory-class>
   	protocolfactory.sftp=<path-to-java-protocolfactory-class>
   
   	#configuration to make java.net.URL accept unsupported protocols -- must exist just as shown
   	java.protocol.handler.pkgs=org.apache.oodt.cas.url.handlers
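
    The rule that every type listed in protocolfactory.types needs a matching
    protocolfactory.<type> entry can be checked before the crawler is started.
    The following is a minimal sketch using only java.util.Properties; the class
    PushPullPropertiesCheck and its method are hypothetical helpers, not part of
    OODT, and the property values are placeholders.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper (not part of OODT) that checks the properties described above. */
final class PushPullPropertiesCheck {

    /** Returns every type in protocolfactory.types that has no protocolfactory.<type> entry. */
    static List<String> missingFactories(Properties props) {
        List<String> missing = new ArrayList<>();
        String types = props.getProperty("protocolfactory.types", "");
        for (String type : types.split(",")) {
            String name = type.trim();
            if (name.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            if (props.getProperty("protocolfactory." + name) == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder configuration: the sftp factory is left out on purpose.
        String text = String.join("\n",
            "protocol.external.sources=/path/to/sources.xml",
            "protocolfactory.types=ftp,sftp",
            "protocolfactory.ftp=org.apache.oodt.cas.protocol.ftp.FtpClientFactory",
            "java.protocol.handler.pkgs=org.apache.oodt.cas.url.handlers");
        Properties props = new Properties();
        props.load(new StringReader(text));
        System.out.println("Missing factories: " + missingFactories(props)); // prints "Missing factories: [sftp]"
    }
}
```

    In real use the Properties object would be loaded from the .properties file
    itself rather than from an inline string.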
    
    
    To specify which external sites to crawl, you must create an XML file that lists each
    site and the information needed to crawl it, such as username and password.
    protocol.external.sources must contain the path to this file so the crawler knows where to find it.
    You can also train this class on how to crawl each given site.  This is specified in a second XML
    file, whose path must be given in the first XML file (the one that contains the username and password).
    
    The schema for the external sites XML file is as follows:
    
        <sources>
    	   <source url="url-of-server">
    	      <username>username</username>
    	      <password>password</password>
    	      <dirstruct>path-to-xml-file</dirstruct>
    	      <crawl>yes-or-no</crawl>
    	   </source>
    	   ...
    	   ...
    	   ...
    	</sources>
    
    You may specify as many sources as you like by adding multiple <source> tags.
    In the <source> tag, the attribute 'url' must be specified.  This is the URL of the server
    you want the crawler to connect to.  It should have the following format:
    <protocol>://<host> (e.g. sftp://remote.computer.gov)
    If the site requires no username and password, these elements can be omitted (they are optional).
    <crawl> takes the value yes or no.  It lets you keep a record of a site and its
    information in this XML file even after you decide that you no longer need to crawl it
    (just put <crawl>no</crawl> and the crawler will skip that site).
    <dirStruct> contains a path to another XML file, which is documented in the DirStruct.java javadoc.  This
    element is optional.  If no <dirStruct> is given, every directory on the site will be crawled
    and every file encountered will be downloaded.
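
    As an illustration of the file layout described above, the following sketch
    reads a sources document with the JDK's DOM parser and lists the URLs whose
    <crawl> element is yes.  It is not the parser this class uses; the class
    SourcesFileSketch, its method, and the sample hosts are hypothetical, and
    treating a missing <crawl> element as "no" is an assumption of the sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader (not part of OODT) for the external sites XML layout. */
final class SourcesFileSketch {

    /** Returns the url attribute of every <source> whose <crawl> element is "yes". */
    static List<String> crawlableUrls(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sources XML", e);
        }
        List<String> urls = new ArrayList<>();
        NodeList sources = doc.getElementsByTagName("source");
        for (int i = 0; i < sources.getLength(); i++) {
            Element source = (Element) sources.item(i);
            NodeList crawl = source.getElementsByTagName("crawl");
            // Assumption of this sketch: a missing <crawl> element counts as "no".
            boolean enabled = crawl.getLength() > 0
                && "yes".equalsIgnoreCase(crawl.item(0).getTextContent().trim());
            if (enabled) {
                urls.add(source.getAttribute("url"));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample document with made-up hosts and credentials.
        String xml = "<sources>"
            + "<source url=\"sftp://remote.computer.gov\">"
            + "<username>username</username><password>password</password>"
            + "<crawl>yes</crawl></source>"
            + "<source url=\"ftp://old.example.org\"><crawl>no</crawl></source>"
            + "</sources>";
        System.out.println(crawlableUrls(xml)); // prints [sftp://remote.computer.gov]
    }
}
```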
 

Author:
bfoster

Constructor Summary
FileRetrievalSystem(Config config, SiteInfo siteInfo)
          Creates a FileRetrievalSystem based on the Config and SiteInfo objects passed in.
 
Method Summary
 boolean addToDownloadQueue(ProtocolFile file, String renamingString, File downloadToDir, String uniqueMetadataElement, boolean deleteAfterDownload)
           
 boolean addToDownloadQueue(RemoteSite remoteSite, String file, String renamingString, File downloadToDir, String uniqueMetadataElement, boolean deleteAfterDownload)
           
 void changeToDir(ProtocolFile pFile)
           
 void changeToDir(String dir, RemoteSite remoteSite)
           
 void changeToHOME(RemoteSite remoteSite)
           
 void changeToRoot(RemoteSite remoteSite)
           
 void clearErrorFlag()
          Resets the error flag.
 void clearFailedDownloadsList()
           
 boolean closeSessions()
          Disconnects all downloading Protocol sessions in the avaiableSessions list.
 ProtocolFile getCurrentFile(RemoteSite remoteSite)
           
 LinkedList<ProtocolFile> getCurrentlyDownloadingFiles()
           
 ProtocolFile getHomeDir(RemoteSite remoteSite)
           
 LinkedList<ProtocolFile> getListOfFailedDownloads()
           
 List<ProtocolFile> getNextPage(ProtocolFile dir, ProtocolFileFilter filter)
           
 ProtocolFile getProtocolFile(RemoteSite remoteSite, String file, boolean isDir)
           
 void initialize()
           
 boolean isAlreadyInDatabase(RemoteFile rf)
           
 boolean isDownloading(ProtocolFile pFile)
           
 void registerDownloadListener(DownloadListener dListener)
           
 void shutdown()
           
 boolean validate(RemoteSite remoteSite)
           
 void waitUntilAllCurrentDownloadsAreComplete()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FileRetrievalSystem

public FileRetrievalSystem(Config config,
                           SiteInfo siteInfo)
                    throws InstantiationException
Creates a FileRetrievalSystem based on the Config and SiteInfo objects passed in.

Parameters:
config - The configuration that is passed to this object's ProtocolHandler.
siteInfo - The SiteInfo object describing the external sites this system connects to.
Throws:
InstantiationException
DatabaseException
Method Detail

registerDownloadListener

public void registerDownloadListener(DownloadListener dListener)

initialize

public void initialize()
                throws IOException
Throws:
IOException

clearErrorFlag

public void clearErrorFlag()
Resets the error flag.


isAlreadyInDatabase

public boolean isAlreadyInDatabase(RemoteFile rf)
                            throws org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException
Throws:
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException

getNextPage

public List<ProtocolFile> getNextPage(ProtocolFile dir,
                                      ProtocolFileFilter filter)
                               throws RemoteConnectionException
Throws:
RemoteConnectionException

changeToRoot

public void changeToRoot(RemoteSite remoteSite)
                  throws ProtocolException,
                         MalformedURLException
Throws:
ProtocolException
MalformedURLException

changeToHOME

public void changeToHOME(RemoteSite remoteSite)
                  throws ProtocolException,
                         MalformedURLException
Throws:
ProtocolException
MalformedURLException

changeToDir

public void changeToDir(String dir,
                        RemoteSite remoteSite)
                 throws MalformedURLException,
                        ProtocolException
Throws:
MalformedURLException
ProtocolException

changeToDir

public void changeToDir(ProtocolFile pFile)
                 throws ProtocolException,
                        MalformedURLException
Throws:
ProtocolException
MalformedURLException

getHomeDir

public ProtocolFile getHomeDir(RemoteSite remoteSite)
                        throws ProtocolException
Throws:
ProtocolException

getProtocolFile

public ProtocolFile getProtocolFile(RemoteSite remoteSite,
                                    String file,
                                    boolean isDir)
                             throws ProtocolException
Throws:
ProtocolException

getCurrentFile

public ProtocolFile getCurrentFile(RemoteSite remoteSite)
                            throws ProtocolFileException,
                                   ProtocolException,
                                   MalformedURLException
Throws:
ProtocolFileException
ProtocolException
MalformedURLException

addToDownloadQueue

public boolean addToDownloadQueue(RemoteSite remoteSite,
                                  String file,
                                  String renamingString,
                                  File downloadToDir,
                                  String uniqueMetadataElement,
                                  boolean deleteAfterDownload)
                           throws ToManyFailedDownloadsException,
                                  RemoteConnectionException,
                                  ProtocolFileException,
                                  ProtocolException,
                                  AlreadyInDatabaseException,
                                  UndefinedTypeException,
                                  org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException,
                                  IOException
Throws:
ToManyFailedDownloadsException
RemoteConnectionException
ProtocolFileException
ProtocolException
AlreadyInDatabaseException
UndefinedTypeException
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException
IOException

validate

public boolean validate(RemoteSite remoteSite)

waitUntilAllCurrentDownloadsAreComplete

public void waitUntilAllCurrentDownloadsAreComplete()
                                             throws ProtocolException
Throws:
ProtocolException

addToDownloadQueue

public boolean addToDownloadQueue(ProtocolFile file,
                                  String renamingString,
                                  File downloadToDir,
                                  String uniqueMetadataElement,
                                  boolean deleteAfterDownload)
                           throws ToManyFailedDownloadsException,
                                  RemoteConnectionException,
                                  AlreadyInDatabaseException,
                                  UndefinedTypeException,
                                  org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException,
                                  IOException
Throws:
ToManyFailedDownloadsException
RemoteConnectionException
AlreadyInDatabaseException
UndefinedTypeException
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException
IOException

isDownloading

public boolean isDownloading(ProtocolFile pFile)

getCurrentlyDownloadingFiles

public LinkedList<ProtocolFile> getCurrentlyDownloadingFiles()

getListOfFailedDownloads

public LinkedList<ProtocolFile> getListOfFailedDownloads()

clearFailedDownloadsList

public void clearFailedDownloadsList()

shutdown

public void shutdown()

closeSessions

public boolean closeSessions()
                      throws RemoteConnectionException
Disconnects all downloading Protocol sessions in the avaiableSessions list. The ThreadPoolExecutor must be completely shut down before this method is called. Otherwise, some Protocols might not be disconnected or might be left downloading.

Returns:
True if successful, false otherwise
Throws:
RemoteConnectionException
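
The ordering requirement described for closeSessions (shut the executor down completely, then disconnect) can be sketched with plain java.util.concurrent classes. This is only an illustration of the ordering: ShutdownOrderSketch and drainThenClose are hypothetical names, the Runnable stands in for a call to closeSessions(), and the real FileRetrievalSystem is not used.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in (not part of OODT) showing "drain the pool, then close sessions". */
final class ShutdownOrderSketch {

    /**
     * Stops accepting new tasks, waits for running and queued tasks to finish,
     * and only then runs the close action. Returns true if the pool terminated in time.
     */
    static boolean drainThenClose(ExecutorService pool, Runnable closeSessions, long timeoutSeconds) {
        pool.shutdown(); // no new tasks are accepted; already submitted tasks still run
        boolean terminated;
        try {
            terminated = pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            terminated = false;
        }
        if (terminated) {
            closeSessions.run(); // no task can still be using a session at this point
        }
        return terminated;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AtomicInteger finished = new AtomicInteger();
        for (int i = 0; i < 4; i++) {
            pool.execute(finished::incrementAndGet); // stand-in for a download task
        }
        boolean ok = drainThenClose(pool,
            () -> System.out.println("closing sessions after " + finished.get() + " downloads"), 10);
        System.out.println("terminated: " + ok);
    }
}
```

If the pool does not terminate within the timeout, the sketch skips the close action and returns false, leaving the caller to decide whether to wait longer or force a shutdown.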


Copyright © 1999-2011 Apache Incubator. All Rights Reserved.