
net.snowflake.spark.snowflake.s3upload

StreamTransferManager

class StreamTransferManager extends AnyRef

Manages streaming of data to S3 without knowing the size beforehand and without keeping it all in memory or writing to disk.

The data is split into chunks and uploaded using the multipart upload API. The uploading is done on separate threads, the number of which is configured by the user.

After creating an instance with details of the upload, use StreamTransferManager#getMultiPartOutputStreams() to get a list of MultiPartOutputStreams. As you write data to these streams, call MultiPartOutputStream#checkSize() regularly. When you finish, call MultiPartOutputStream#close(). Parts will be uploaded to S3 as you write.

Once all streams have been closed, call StreamTransferManager#complete(). Alternatively you can call StreamTransferManager#abort() at any point if needed.

Here is an example. Much of the code sets up threads that generate data and is unrelated to the library; the essential parts are commented.


    AmazonS3Client client = new AmazonS3Client(awsCreds);
    int numStreams = 2;
    int numUploadThreads = 2;
    int queueCapacity = 2;
    int partSize = 5;

    // Setting up
    final StreamTransferManager manager = new StreamTransferManager(bucket, key, client, new ObjectMetadata(),
                                                                    numStreams, numUploadThreads, queueCapacity, partSize);
    final List<MultiPartOutputStream> streams = manager.getMultiPartOutputStreams();

    ExecutorService pool = Executors.newFixedThreadPool(numStreams);
    for (int i = 0; i < numStreams; i++) {
        final int streamIndex = i;
        pool.submit(new Runnable() {
            public void run() {
                try {
                    MultiPartOutputStream outputStream = streams.get(streamIndex);
                    for (int lineNum = 0; lineNum < 1000000; lineNum++) {
                        String line = generateData(streamIndex, lineNum);

                        // Writing data and potentially sending off a part
                        outputStream.write(line.getBytes());
                        try {
                            outputStream.checkSize();
                        } catch (InterruptedException e) {
                            throw new RuntimeException(e);
                        }
                    }

                    // The stream must be closed once all the data has been written
                    outputStream.close();
                } catch (Exception e) {

                    // Aborts all uploads
                    manager.abort(e);
                }
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);

    // Finishing off
    manager.complete();

The final file on S3 will then usually be the result of concatenating all the data written to each stream, in the order in which the streams appear in the list obtained from getMultiPartOutputStreams(). However, this may not hold if multiple streams are used and some of them produce less than 5 MB of data. This is because the multipart upload API does not allow more than one part smaller than 5 MB to be uploaded, which places fundamental limits on what this class can accomplish. If the order of the data matters to you, either use only one stream or ensure that you write at least 5 MB to every stream.

While performing the multipart upload this class will create instances of InitiateMultipartUploadRequest, UploadPartRequest, and CompleteMultipartUploadRequest, fill in the essential details, and send them off. If you need to add additional details then override the appropriate customise*Request methods and set the required properties within.
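For instance, a subclass could request server-side encryption by overriding customiseInitiateRequest. This is a sketch, not part of the class itself; it assumes the AWS SDK v1 classes named above and the eight-argument constructor documented below:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Sketch: a subclass that asks S3 for AES256 server-side encryption
// on the initiation request. All other requests are left unchanged.
class EncryptedStreamTransferManager extends StreamTransferManager {

    EncryptedStreamTransferManager(String bucket, String key, AmazonS3 client, ObjectMetadata meta,
                                   int numStreams, int numUploadThreads, int queueCapacity, int partSize) {
        super(bucket, key, client, meta, numStreams, numUploadThreads, queueCapacity, partSize);
    }

    @Override
    public void customiseInitiateRequest(InitiateMultipartUploadRequest request) {
        // Reuse any metadata already attached to the request, or start fresh.
        ObjectMetadata metadata = request.getObjectMetadata();
        if (metadata == null) {
            metadata = new ObjectMetadata();
        }
        metadata.setSSEAlgorithm(ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION);
        request.setObjectMetadata(metadata);
    }
}
```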

This class does not perform retries when uploading. If an exception is thrown at any stage the upload will be aborted and the exception rethrown, wrapped in a RuntimeException.
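Because the manager itself never retries, callers who need resilience must restart the entire upload from scratch after a failure. A minimal, library-independent sketch of such an outer retry loop (the withRetries helper is hypothetical, not part of this class):

```java
import java.util.concurrent.Callable;

// Sketch: retry an entire upload up to maxAttempts times. Since a failed
// StreamTransferManager aborts its multipart upload, each attempt must
// create a fresh manager and rewrite all the data inside the Callable.
public class UploadRetry {

    static <T> T withRetries(Callable<T> upload, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return upload.call();
            } catch (Exception e) {
                last = e; // remember the failure; the next loop iteration retries
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an upload that fails twice and then succeeds.
        final int[] calls = {0};
        String result = withRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("transient failure");
            return "uploaded";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "uploaded after 3 attempts"
    }
}
```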

Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new StreamTransferManager(bucketName: String, putKey: String, s3Client: AmazonS3, meta: ObjectMetadata, numStreams: Int, numUploadThreads: Int, queueCapacity: Int, partSize: Int)

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def abort(): Unit

    Aborts the upload. Repeated calls have no effect.

  5. def abort(throwable: Throwable): Unit

    Aborts the upload and logs a message including the stack trace of the given throwable.

  6. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  7. def clone(): AnyRef
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @native() @throws( ... )
  8. def complete(): Unit

    Blocks while waiting for the threads uploading the contents of the streams returned by StreamTransferManager#getMultiPartOutputStreams() to finish, then sends a request to S3 to complete the upload. For the upload threads to finish, every stream must be closed; otherwise they will block forever waiting for more data.

  9. def customiseCompleteRequest(request: CompleteMultipartUploadRequest): Unit
  10. def customiseInitiateRequest(request: InitiateMultipartUploadRequest): Unit
  11. def customiseUploadPartRequest(request: UploadPartRequest): Unit
  12. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  13. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  14. def finalize(): Unit
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  15. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  16. def getMultiPartOutputStreams(): List[MultiPartOutputStream]
  17. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  18. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  19. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  20. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  21. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  22. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  23. def toString(): String
    Definition Classes
    StreamTransferManager → AnyRef → Any
  24. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  25. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @throws( ... )
