Reasoning
My dad is really into his wine. After I showed him the car scraper I'd set up for myself, he asked me to sort out something similar for him, but for wine instead. He told me that every day he goes through a list of wines he's after, checking whether they're in stock.
Initial Steps
As I had previously created something similar, I was aware of some potential issues that I would encounter as well as some processes that could be improved upon.
Packaging, Testing and Pushing Code to AWS Lambda
When working with AWS Lambda previously, my process for testing and deploying function code was really tedious: every time I changed the code I had to package it, zip it and manually upload it through the AWS console. To combat this, this time around I used the AWS SAM CLI, which lets me build, test, debug and publish my Lambda function straight from my local machine. Great!
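For anyone curious, the day-to-day loop with SAM looks roughly like this (assuming the default project layout created by the CLI):

```sh
# Build the function and its dependencies into .aws-sam/
sam build

# Run the function locally in a Lambda-like Docker container
sam local invoke

# Package and deploy the function and its resources to AWS
sam deploy --guided
```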
Creating a way to update wine searches easily
My previous scraper project was only for my personal use, so it was set up without an easy-to-use UI for updating the searches. As my dad would be the main user of this tool, I had to come up with a way for him to simply manage his desired searches. I toyed around with using a cloud-storage-based text file or something similar, but this didn't seem like the most intuitive method. I finally settled on creating a small React application, hosted on Netlify, with authentication handled by Netlify Identity.
The next hurdle was working out how to actually get, create and remove wine searches. I planned on creating a simple REST API using AWS API Gateway to handle this, but I didn't want just anyone to be able to call it. Luckily, with Netlify Identity and Netlify Functions I'm able to create authenticated functions that can only be run by users with the appropriate permissions.
Back-End
I decided to use NodeJS for my web scraper Lambda function; as this was a more complex setup than my previous car scraper, I thought it would be better to use a language I'm very familiar with.
Once I had my AWS SAM project set up, and I was successfully logging the classic “Hello World” from the cloud, I was able to create the scraper logic.
Firstly, I set up a DynamoDB table that would store all of the wine search queries. This way, the scraper function could fetch the search queries (and their URLs) as its first step.
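As a rough sketch, fetching those searches with the AWS SDK's DocumentClient looks something like this (the table and attribute names are placeholders):

```js
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

// Fetch every saved wine search from the (hypothetical) searches table.
async function getSearches() {
  const result = await dynamo.scan({ TableName: 'wine-searches' }).promise();
  return result.Items; // e.g. [{ id: 'abc123', name: 'Rioja', url: 'https://...' }]
}
```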
Now that I could gather the desired search URLs, I used an npm package called 'axios' to loop through the searches and request the source code of each listing page, incrementing the page number query parameter (e.g. "?page=8") until the last listing page was reached. Once the last page of one search had been requested, the scraper moved on to the next search URL until every search had been covered.
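A simplified version of that pagination loop might look like the following; how the "last page" is detected is site-specific, so the empty-page check and its selector are invented for illustration:

```js
const axios = require('axios');
const cheerio = require('cheerio');

// A made-up check: the page counts as "past the end" when it has no listings.
function hasListings(html) {
  return cheerio.load(html)('.product-card').length > 0;
}

// Request every listing page for one search URL, bumping ?page= each time
// until a page comes back with no listings on it.
async function fetchAllPages(searchUrl) {
  const pages = [];
  let page = 1;

  while (true) {
    const { data: html } = await axios.get(`${searchUrl}?page=${page}`);
    if (!hasListings(html)) break;
    pages.push(html);
    page += 1;
  }
  return pages;
}
```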
Once all of the search listing pages were gathered, I used another npm package called 'cheerio', which "parses markup and provides an API for traversing/manipulating the resulting data structure". Basically, it lets me query the HTML for specific elements using CSS selectors. With this, I gathered an array of info for every wine that had been scraped, such as its listing ID, image, link, price and so on.
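Extracting those details with cheerio looks roughly like the sketch below; the CSS selectors are made up, as the real ones depend entirely on the retailer's markup:

```js
const cheerio = require('cheerio');

// Pull the basic details for every wine on one listing page.
function parseListings(html) {
  const $ = cheerio.load(html);

  return $('.product-card')
    .map((i, el) => ({
      id: $(el).attr('data-product-id'),
      name: $(el).find('.product-name').text().trim(),
      price: parseFloat($(el).find('.product-price').text().replace(/[^0-9.]/g, '')),
      image: $(el).find('img').attr('src'),
      link: $(el).find('a').attr('href'),
    }))
    .get();
}
```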
With all of this data gathered, it was ready to be pushed to a separate DynamoDB table, which was simple enough using the AWS SDK for NodeJS. I also added a step at the start of the function to fetch the previous run's data from DynamoDB, so it could be compared against the newly scraped data later on.
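Writing the scraped wines could look something like this sketch; the table name is a placeholder, and the chunking is there because DynamoDB's batchWrite only accepts 25 items per call:

```js
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

// Persist the freshly scraped wines, 25 at a time.
async function saveListings(wines) {
  for (let i = 0; i < wines.length; i += 25) {
    const chunk = wines.slice(i, i + 25);

    await dynamo.batchWrite({
      RequestItems: {
        'wine-listings': chunk.map((wine) => ({ PutRequest: { Item: wine } })),
      },
    }).promise();
  }
}
```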
With both sets of data available, they could be compared against one another. A simple loop checks every listing ID in the new data against the old data: if a price drop is identified, the wine is added to a "price drop" array; if a new listing is identified, it is added to a "new listing" array.
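That comparison is essentially just a lookup keyed on the listing ID, roughly along these lines:

```js
// Compare the previous scrape against the new one, keyed on listing ID.
function compareListings(oldWines, newWines) {
  const previous = new Map(oldWines.map((wine) => [wine.id, wine]));
  const newListings = [];
  const priceDrops = [];

  for (const wine of newWines) {
    const existing = previous.get(wine.id);
    if (!existing) {
      newListings.push(wine);
    } else if (wine.price < existing.price) {
      priceDrops.push({ ...wine, oldPrice: existing.price });
    }
  }

  return { newListings, priceDrops };
}
```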
Using the new listing and price drop arrays, an email is generated and sent out using AWS SES (Simple Email Service). This email contains information on each wine, such as its price, an image and a link to the listing; for a price drop, it shows the old price against the new one. Please find an example of one of these emails below.
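Behind the scenes, the SES call might look roughly like this sketch; the addresses are placeholders, and the real email uses an HTML body with images rather than plain text:

```js
const AWS = require('aws-sdk');
const ses = new AWS.SES();

// Send a plain-text summary of the new listings and price drops.
async function sendAlertEmail(newListings, priceDrops) {
  const lines = [
    ...newListings.map((w) => `NEW: ${w.name} - £${w.price} ${w.link}`),
    ...priceDrops.map((w) => `PRICE DROP: ${w.name} - was £${w.oldPrice}, now £${w.price} ${w.link}`),
  ];

  await ses.sendEmail({
    Source: 'alerts@example.com',
    Destination: { ToAddresses: ['dad@example.com'] },
    Message: {
      Subject: { Data: 'Wine watch update' },
      Body: { Text: { Data: lines.join('\n') } },
    },
  }).promise();
}
```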
I then created a simple cron schedule using AWS CloudWatch that executes the Lambda function twice a day, at 6am and 6pm.
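The same schedule can also be declared on the function itself in the SAM template rather than through the console; a sketch, with a placeholder handler path and times in UTC:

```yaml
WineScraperFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: src/scraper.handler
    Runtime: nodejs12.x
    Events:
      TwiceDaily:
        Type: Schedule
        Properties:
          # AWS cron syntax: minutes hours day-of-month month day-of-week year
          Schedule: cron(0 6,18 * * ? *)
```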
Lastly, I needed a way for the front-end application to interact with the searches database from outside of the AWS console. For this, I created a simple REST API using AWS API Gateway with three methods: create (create a search), delete (delete a search) and get (get all searches). API keys were also generated, to stop just anyone from tinkering with the database.
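As a sketch of what sits behind one of those methods, the Lambda for the delete endpoint might look something like this (the table name and key shape are assumptions):

```js
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

// Remove a single search from the searches table.
exports.deleteSearch = async (event) => {
  const { id } = JSON.parse(event.body);

  await dynamo.delete({
    TableName: 'wine-searches',
    Key: { id },
  }).promise();

  return { statusCode: 200, body: JSON.stringify({ deleted: id }) };
};
```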
Front-End
Now that the scraping and emailing functionality was all set up and working as it should, I needed to create a simple front-end application that would allow users to create and delete searches.
Firstly, I set up the Netlify Identity functionality. As this was my first time using it, I kept the setup really simple: all users have the same permissions and see the same information. I used an npm package called 'react-netlify-identity-widget', which makes using Netlify Identity with React a bit simpler. With this in place, any user visiting the site without login details is simply redirected to the login page. Perfect!
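In practice the gate is only a few lines; a minimal sketch, assuming the app is wrapped in the widget's provider (the Netlify site URL is a placeholder, and the real app shows the login widget rather than a message):

```jsx
import React from 'react';
import { IdentityContextProvider, useIdentityContext } from 'react-netlify-identity-widget';

// Unauthenticated visitors never get past this component.
function Gate({ children }) {
  const identity = useIdentityContext();
  return identity.isLoggedIn ? children : <p>Please log in to manage wine searches.</p>;
}

export default function App() {
  return (
    <IdentityContextProvider url="https://example-wine-watcher.netlify.app">
      <Gate>{/* editor page goes here */}</Gate>
    </IdentityContextProvider>
  );
}
```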
With the authentication in place, I then had to create the editor page. This, again, was really basic: it has an input box for creating a new search, a list of all the current searches (each with the option to view more details or delete it), and a logout button.
With the front-end all set up, all that was left to do was create three Netlify Functions, which allow the site to get the current list of searches, delete a search and create a new one. Netlify Functions were used here because they let the site check that a user is authenticated and then fire the request to the AWS back-end, without ever exposing the API keys to the browser.
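A sketch of what one of those functions might look like, assuming the API Gateway URL and key are stored as environment variables in Netlify (the path and variable names are placeholders):

```js
// netlify/functions/get-searches.js
const axios = require('axios');

exports.handler = async (event, context) => {
  // Netlify only populates clientContext.user when the request carries a
  // valid Identity token, so this doubles as the authorisation check.
  const user = context.clientContext && context.clientContext.user;
  if (!user) {
    return { statusCode: 401, body: 'Unauthorised' };
  }

  // Forward the request to the AWS API without exposing the key to the browser.
  const { data } = await axios.get(`${process.env.AWS_API_URL}/searches`, {
    headers: { 'x-api-key': process.env.AWS_API_KEY },
  });

  return { statusCode: 200, body: JSON.stringify(data) };
};
```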