Streaming gzip JSON file from S3.
One of the most amazing things about node.js is streaming and how you could benefit from it while dealing with a big file. Today in this post, I will show you how to stream a big JSON file from the S3 bucket and perform specific actions while streaming it.
Here is the typescript module that I will explain in detail.
Imagine a backend service that receives events from various sources and stores them as a zipped JSON file into an S3 bucket. This is a pretty standard exercise, and a good example is user events collected from Web and Mobile applications.
Let's assume that you want to look into the collected events and filter the event name called cancelled_purchase and store them separately into a Redshift table or Postgres that later on accessed by the data team.
You have two options:
- Download the files from the S3 bucket, unzip them, read each file separately, and filter the cancelled_purchase events and process them.
- Unzip, filter, and process file while it gets streamed from the S3 bucket.
The first approach needs local storage and most probably a lot of processing power and RAM; you have to clean up after your done with the file.
The second approach is cleaner for few reasons:
- The service only uses a fixed amount of CPU and memory at any given time, no matter how big a file is.
- There is no need for cleanup as no physical file is downloaded.
Jumping to the code to understand it better.
Here is the list of npm packages needed for this project.
This package is used to communicate with AWS and specifically S3 in our case.
This package is used to access a stream of JSON
This package is used to unzip and stream a file.
This package is used to perform actions like filter and map on a stream of JSON
Importing npm packages
Here is the list of packages you have to import
Interface for an exported function
It is always good to create an interface that defines the exported module and helps the client know the contact in advance.
Define a type for the expected event
As explained earlier, The service streams an array of events stored as a gzip file in an S3 bucket;
Create an S3 client that communicates with AWS S3
getObjectStream is a function that holds streaming logic.
The stream is an ongoing process, and it could be interrupted by many events, like an error. A few functions must be created to respond to these events.
Start the stream
S3 client provides the ability to stream a file from the S3 bucket but just by calling.
Once the stream is ready, you can pipe it, and in each pipe, you could perform a specific action.
Unzip the file
The first action to perform is unzipping the file's content as it comes; The Zlib package is used for this purpose.
Access to the right property of the event
Events are stored as an array, and accessing each item of the array is possible by JSONStream. JSONStream does more than this, and you can read about it here.
Creating Observable Stream
To benefit from the functionalities that rxjs offers, the standard stream must be converted to an observable stream. The below module helps with this matter.
This module accepts a Readable stream and then returns an Rx.Observable stream.
User Observable stream
The below code uses the observableJSONStream module to create an observable stream.
Function like filter or map can be used after converting a stream to an observable.
The map function is the best place for performing an asynchronous action like calling an API per event.
Like any other observable, a subscription is needed to get the final result, and in this example, the subscriber receives a promise from the map function.