org.apache.oodt.cas.pushpull.retrievalsystem
Class FileRetrievalSystem
java.lang.Object
org.apache.oodt.cas.pushpull.retrievalsystem.FileRetrievalSystem
public class FileRetrievalSystem
- extends Object
Crawls external directory structures and downloads the files within them.
This class is configured through a Java .properties file, which can be read in and parsed by Config.java.
This .properties file should have the following properties set:
#list of sites to crawl
protocol.external.sources=<path-to-xml-file>
#protocol types
protocolfactory.types=<list-of-protocols-separated-by-commas> (e.g. ftp,http,https,sftp)
#Protocol factories per type (must have one for each protocol mentioned in protocolfactory.types -- the property must be named
# as such: protocolfactory.<name-of-protocol-type>)
protocolfactory.ftp=<path-to-java-protocolfactory-class> (e.g. org.apache.oodt.cas.protocol.ftp.FtpClientFactory)
protocolfactory.http=<path-to-java-protocolfactory-class>
protocolfactory.https=<path-to-java-protocolfactory-class>
protocolfactory.sftp=<path-to-java-protocolfactory-class>
#configuration to make java.net.URL accept unsupported protocols -- must exist just as shown
java.protocol.handler.pkgs=org.apache.oodt.cas.url.handlers
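The rule above (one protocolfactory.<name-of-protocol-type> property for every entry in protocolfactory.types) can be checked before the file is handed to the crawler. The following is a minimal sketch using only java.util.Properties; it is not part of OODT. The class ProtocolFactoryConfigCheck, its methods, and the sample path /path/to/sources.xml are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of OODT: checks the properties described above.
class ProtocolFactoryConfigCheck {

    /** Parses .properties text held in a string. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns every type in protocolfactory.types that has no protocolfactory.<type> value. */
    static List<String> missingFactories(Properties props) {
        List<String> missing = new ArrayList<>();
        String types = props.getProperty("protocolfactory.types", "");
        for (String type : types.split(",")) {
            String name = type.trim();
            if (name.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            String factory = props.getProperty("protocolfactory." + name);
            if (factory == null || factory.trim().isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample configuration; the sources path is a placeholder.
        String sample = String.join("\n",
            "#list of sites to crawl",
            "protocol.external.sources=/path/to/sources.xml",
            "protocolfactory.types=ftp,sftp",
            "protocolfactory.ftp=org.apache.oodt.cas.protocol.ftp.FtpClientFactory",
            "java.protocol.handler.pkgs=org.apache.oodt.cas.url.handlers");
        // prints: Missing factories: [sftp]
        System.out.println("Missing factories: " + missingFactories(parse(sample)));
    }
}
```

A non-empty result means the configuration lists a protocol type without a matching factory class.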
To specify which external sites to crawl, you must create an XML file that contains each
site and the information needed to crawl it, such as username and password.
protocol.external.sources must contain the path to this file so the crawler knows where to find it.
You can also train this class on how to crawl each given site. This is also specified in an XML
file, whose path must be given in the first-mentioned XML file, which contains the username and password.
The schema for the external sites XML file is as follows:
<sources>
<source url="url-of-server">
<username>username</username>
<password>password</password>
<dirstruct>path-to-xml-file</dirstruct>
<crawl>yes-or-no</crawl>
</source>
...
...
...
</sources>
You may specify as many sources as you would like by specifying multiple <source> tags.
In the <source> tag, the parameter 'url' must be specified. This is the url of the server
you want the crawler to connect to. It should be of the following format:
<protocol>://<host> (e.g. sftp://remote.computer.gov)
If no username and password exist, these elements can be omitted (they are optional).
<crawl> takes the value yes or no. This lets you keep a record of a site and its
information in this XML file even after you decide that you no longer need to crawl it
(just put <crawl>no</crawl> and the crawler will skip that site).
<dirStruct> contains a path to another XML file which is documented in DirStruct.java javadoc. This
element is optional. If no <dirStruct> is given, then every directory will be crawled on the site
and every encountered file will be downloaded.
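To make the file layout above concrete, here is a minimal sketch that reads a sources document with the JDK's DOM parser and lists the sites marked for crawling. It is not how FileRetrievalSystem itself reads the file. The class SourcesFileSketch, the second sample site ftp://archive.example.org, and the choices to compare the <crawl> value case-insensitively and to skip a <source> that has no <crawl> element are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of OODT: reads the <sources> layout described above.
class SourcesFileSketch {

    /** Returns "<protocol>://<host>" for every <source> whose <crawl> value is "yes". */
    static List<String> sitesToCrawl(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> sites = new ArrayList<>();
            NodeList sources = doc.getElementsByTagName("source");
            for (int i = 0; i < sources.getLength(); i++) {
                Element source = (Element) sources.item(i);
                // Assumption: a missing <crawl> element is treated like "no".
                if (!"yes".equalsIgnoreCase(childText(source, "crawl"))) {
                    continue;
                }
                // java.net.URI parses schemes such as sftp without a registered URL handler.
                URI uri = URI.create(source.getAttribute("url"));
                sites.add(uri.getScheme() + "://" + uri.getHost());
            }
            return sites;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sources XML", e);
        }
    }

    /** Returns the trimmed text of the first descendant element with the given tag, or null. */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = String.join("\n",
            "<sources>",
            "  <source url=\"sftp://remote.computer.gov\">",
            "    <username>username</username>",
            "    <password>password</password>",
            "    <crawl>yes</crawl>",
            "  </source>",
            "  <source url=\"ftp://archive.example.org\">",
            "    <crawl>no</crawl>",
            "  </source>",
            "</sources>");
        // prints: [sftp://remote.computer.gov]
        System.out.println(sitesToCrawl(xml));
    }
}
```

The second source stays in the file as a record but is left out of the result because its <crawl> value is no.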
- Author:
- bfoster
Method Summary
boolean addToDownloadQueue(ProtocolFile file, String renamingString, File downloadToDir, String uniqueMetadataElement, boolean deleteAfterDownload)
boolean addToDownloadQueue(RemoteSite remoteSite, String file, String renamingString, File downloadToDir, String uniqueMetadataElement, boolean deleteAfterDownload)
void changeToDir(ProtocolFile pFile)
void changeToDir(String dir, RemoteSite remoteSite)
void changeToHOME(RemoteSite remoteSite)
void changeToRoot(RemoteSite remoteSite)
void clearErrorFlag()
- reset error flag
void clearFailedDownloadsList()
boolean closeSessions()
- Disconnects all downloading Protocol sessions in the avaiableSessions list.
ProtocolFile getCurrentFile(RemoteSite remoteSite)
LinkedList<ProtocolFile> getCurrentlyDownloadingFiles()
ProtocolFile getHomeDir(RemoteSite remoteSite)
LinkedList<ProtocolFile> getListOfFailedDownloads()
List<ProtocolFile> getNextPage(ProtocolFile dir, ProtocolFileFilter filter)
ProtocolFile getProtocolFile(RemoteSite remoteSite, String file, boolean isDir)
void initialize()
boolean isAlreadyInDatabase(RemoteFile rf)
boolean isDownloading(ProtocolFile pFile)
void registerDownloadListener(DownloadListener dListener)
void shutdown()
boolean validate(RemoteSite remoteSite)
void waitUntilAllCurrentDownloadsAreComplete()
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
FileRetrievalSystem
public FileRetrievalSystem(Config config,
SiteInfo siteInfo)
throws InstantiationException
- Creates a Crawler based on the URL, DirStruct, and Config objects passed
in. If no DirStruct is needed then set it to null.
- Parameters:
url - The URL for which you want this Crawler to crawl
dirStruct - The specified directory structure located at the host -- use to train crawler (see DirStruct).
config - The Configuration file that is passed to this object's ProtocolHandler.
- Throws:
InstantiationException
DatabaseException
registerDownloadListener
public void registerDownloadListener(DownloadListener dListener)
initialize
public void initialize()
throws IOException
- Throws:
IOException
clearErrorFlag
public void clearErrorFlag()
- reset error flag
isAlreadyInDatabase
public boolean isAlreadyInDatabase(RemoteFile rf)
throws org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException
- Throws:
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException
getNextPage
public List<ProtocolFile> getNextPage(ProtocolFile dir,
ProtocolFileFilter filter)
throws RemoteConnectionException
- Throws:
RemoteConnectionException
changeToRoot
public void changeToRoot(RemoteSite remoteSite)
throws ProtocolException,
MalformedURLException
- Throws:
ProtocolException
MalformedURLException
changeToHOME
public void changeToHOME(RemoteSite remoteSite)
throws ProtocolException,
MalformedURLException
- Throws:
ProtocolException
MalformedURLException
changeToDir
public void changeToDir(String dir,
RemoteSite remoteSite)
throws MalformedURLException,
ProtocolException
- Throws:
MalformedURLException
ProtocolException
changeToDir
public void changeToDir(ProtocolFile pFile)
throws ProtocolException,
MalformedURLException
- Throws:
ProtocolException
MalformedURLException
getHomeDir
public ProtocolFile getHomeDir(RemoteSite remoteSite)
throws ProtocolException
- Throws:
ProtocolException
getProtocolFile
public ProtocolFile getProtocolFile(RemoteSite remoteSite,
String file,
boolean isDir)
throws ProtocolException
- Throws:
ProtocolException
getCurrentFile
public ProtocolFile getCurrentFile(RemoteSite remoteSite)
throws ProtocolFileException,
ProtocolException,
MalformedURLException
- Throws:
ProtocolFileException
ProtocolException
MalformedURLException
addToDownloadQueue
public boolean addToDownloadQueue(RemoteSite remoteSite,
String file,
String renamingString,
File downloadToDir,
String uniqueMetadataElement,
boolean deleteAfterDownload)
throws ToManyFailedDownloadsException,
RemoteConnectionException,
ProtocolFileException,
ProtocolException,
AlreadyInDatabaseException,
UndefinedTypeException,
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException,
IOException
- Throws:
ToManyFailedDownloadsException
RemoteConnectionException
ProtocolFileException
ProtocolException
AlreadyInDatabaseException
UndefinedTypeException
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException
IOException
validate
public boolean validate(RemoteSite remoteSite)
waitUntilAllCurrentDownloadsAreComplete
public void waitUntilAllCurrentDownloadsAreComplete()
throws ProtocolException
- Throws:
ProtocolException
addToDownloadQueue
public boolean addToDownloadQueue(ProtocolFile file,
String renamingString,
File downloadToDir,
String uniqueMetadataElement,
boolean deleteAfterDownload)
throws ToManyFailedDownloadsException,
RemoteConnectionException,
AlreadyInDatabaseException,
UndefinedTypeException,
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException,
IOException
- Throws:
ToManyFailedDownloadsException
RemoteConnectionException
AlreadyInDatabaseException
UndefinedTypeException
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException
IOException
isDownloading
public boolean isDownloading(ProtocolFile pFile)
getCurrentlyDownloadingFiles
public LinkedList<ProtocolFile> getCurrentlyDownloadingFiles()
getListOfFailedDownloads
public LinkedList<ProtocolFile> getListOfFailedDownloads()
clearFailedDownloadsList
public void clearFailedDownloadsList()
shutdown
public void shutdown()
closeSessions
public boolean closeSessions()
throws RemoteConnectionException
- Disconnects all downloading Protocol sessions in the avaiableSessions
list. The ThreadPoolExecutor must be completely shut down before this
method is called; otherwise some Protocols might be left connected
or still downloading.
- Returns:
- True if successful, false otherwise
- Throws:
RemoteConnectionException
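The ordering requirement described for closeSessions (drain the thread pool first, disconnect afterwards) can be illustrated with plain java.util.concurrent. This is a sketch of the general pattern only, not the internals of FileRetrievalSystem. The class ShutdownOrderSketch, the nested Session type, the simulated 20 ms download, and the 10 second timeout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration, not part of OODT: drain the pool, then disconnect.
class ShutdownOrderSketch {

    /** Stand-in for a connected protocol session. */
    static class Session {
        volatile boolean connected = true;
        void disconnect() { connected = false; }
    }

    /**
     * Runs simulated downloads, waits for the pool to finish, then disconnects
     * every session. Returns how many downloads finished while still connected.
     */
    static int runAndClose(int downloads) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Session> sessions = new ArrayList<>();
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < downloads; i++) {
            Session session = new Session();
            sessions.add(session);
            pool.execute(() -> {
                try {
                    Thread.sleep(20); // simulated transfer time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (session.connected) {
                    completed.incrementAndGet();
                }
            });
        }
        // Step 1: stop accepting work and wait for running downloads to finish.
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        // Step 2: only now is it safe to disconnect the sessions.
        for (Session session : sessions) {
            session.disconnect();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("Downloads completed before disconnect: " + runAndClose(5));
    }
}
```

If the sessions were disconnected before awaitTermination returned, a task could still be mid-download when its connection disappears, which is the situation the note on closeSessions warns about.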
Copyright © 1999-2011 Apache OODT. All Rights Reserved.