Reasoning

At the start of 2020, before we were all forced to work from home for the rest of the year, I was planning on treating myself to a new (used) car. I went straight to an unnamed used car website, added some filters and hit search. I wasn’t too sure exactly what I was after, nor did I know if I was getting a good price. I decided to save some searches and keep an eye on them, mainly to see whether the prices I was looking at were typical, but also to see if any new listings cropped up!

After about a week of doing this, I had racked up about 15 different saved searches and found that I was spending 15-30 minutes every day checking through them manually. It didn’t help that the website I was using gave no indication of which listings were new and which weren’t.

At the time, I had recently been given a “Python 101” by a colleague. In an attempt to build upon my newfound knowledge, I decided to automate this process by scraping the used car website and sending myself a daily list of all new cars matching my searches, using Python.

What did I do?

I created a Python script that takes the URL of any of my saved searches and finds all of the matching cars on the site. This was done by looping through the saved-search URLs, grabbing the page source and incrementing the page number query parameter (“?page=4”) for each page of results. Once the last page was reached, the code moved on to the next search URL until all URLs had been checked.
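The post doesn’t include the original code, but a minimal sketch of that pagination loop might look like the following, using the requests package. The search URLs, the site itself and the empty-results check are all hypothetical stand-ins, since the real website is deliberately unnamed:

```python
import requests

# Hypothetical saved-search URLs; the real site is deliberately unnamed.
SEARCH_URLS = [
    "https://example-used-cars.com/search?make=ford&model=fiesta&max_price=8000",
    "https://example-used-cars.com/search?make=vw&model=golf&max_price=9000",
]

def fetch_search_pages(search_url):
    """Yield the raw HTML of every results page for one saved search."""
    page = 1
    while True:
        # requests merges the "page" parameter into the existing query string.
        response = requests.get(search_url, params={"page": page}, timeout=30)
        response.raise_for_status()
        # Assumption: paging past the last page returns a "no results" page.
        if "no results" in response.text.lower():
            break
        yield response.text
        page += 1
```

Each saved search is exhausted page by page before the loop moves on to the next URL, mirroring the approach described above.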

Once the page source for every listing page had been obtained, each individual listing was detected and stored. This was then used to create an object containing each listing’s key info, such as the price, the mileage and, most importantly, the listing ID. Using the ID, the previous day’s results could be compared against the current day’s listings: any listing ID present in the current day’s data but absent from the previous day’s must belong to a NEW listing.
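Again as a sketch, the parsing step might use BeautifulSoup. The CSS selectors and attribute names below are hypothetical and would need to be read off the real site’s markup in the browser’s dev tools:

```python
from bs4 import BeautifulSoup

def parse_listings(page_html):
    """Extract each listing's key info from one results page.

    The selectors and attribute names are hypothetical stand-ins.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    listings = {}
    for card in soup.select("article.listing"):
        listing_id = card["data-listing-id"]
        listings[listing_id] = {
            "id": listing_id,
            "price": card.select_one(".price").get_text(strip=True),
            "mileage": card.select_one(".mileage").get_text(strip=True),
            "image": card.select_one("img")["src"],
            "url": card.select_one("a")["href"],
        }
    return listings

def find_new_listings(todays, yesterdays):
    """Any ID present today but absent yesterday is a new listing."""
    new_ids = set(todays) - set(yesterdays)
    return [todays[listing_id] for listing_id in new_ids]
```

Keying the listings by ID makes the day-to-day comparison a simple set difference.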

The Python script was updated to fetch the previous day’s data from AWS’ DynamoDB and compare the current day’s data against it. At the very end of the script, the current day’s data was pushed to the database, overwriting the previous day’s data.
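With boto3 (the AWS SDK for Python), that load/save round trip could look something like this. The table name and key schema are assumptions, storing all listings as one JSON blob under a fixed key:

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("car-scraper-listings")  # hypothetical table name

def load_previous_listings():
    """Fetch yesterday's listings, or an empty dict on the first run."""
    response = table.get_item(Key={"search_id": "all"})
    item = response.get("Item")
    return json.loads(item["listings"]) if item else {}

def save_current_listings(listings):
    """Overwrite the stored item so tomorrow's run compares against today."""
    table.put_item(Item={"search_id": "all", "listings": json.dumps(listings)})
```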

Now that there was a way to check and compare the listings, I needed a way of letting myself know about the new ones. This is where I used AWS’ SES (Simple Email Service). I updated the Python script to loop through each new listing and generate a simple email with some essential info: the price, the mileage, an image of the car and a link to the listing. As the email was only ever intended to be sent to myself, it wasn’t styled to look particularly impressive.
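A bare-bones version of that email step, again using boto3 with hypothetical addresses (both of which would need verifying in SES while the account is in the SES sandbox):

```python
import boto3

ses = boto3.client("ses", region_name="eu-west-2")  # region is an assumption

def send_digest(new_listings, sender, recipient):
    """Send a plain HTML digest of the new listings via SES."""
    rows = "".join(
        f'<p><img src="{car["image"]}" width="200"><br>'
        f'{car["price"]} | {car["mileage"]}<br>'
        f'<a href="{car["url"]}">View listing</a></p>'
        for car in new_listings
    )
    ses.send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": f"{len(new_listings)} new car listings today"},
            "Body": {"Html": {"Data": rows}},
        },
    )
```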

Now that everything was set up and working, I packaged the code and its dependencies into a zipped folder and uploaded it as an AWS Lambda function. I then used AWS CloudWatch to set up a cron-style scheduled rule that triggers the function every day at 8am, meaning that I could check the new listings on the train each morning!
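Tying the earlier sketches together, the Lambda entry point might look like this; it assumes the helper functions above live in the same module, and the email addresses are placeholders. The CloudWatch rule would use the schedule expression cron(0 8 * * ? *), i.e. 08:00 UTC every day:

```python
def lambda_handler(event, context):
    """Daily entry point, triggered by the CloudWatch scheduled rule."""
    yesterdays = load_previous_listings()

    # Scrape every page of every saved search into one dict keyed by ID.
    todays = {}
    for url in SEARCH_URLS:
        for page_html in fetch_search_pages(url):
            todays.update(parse_listings(page_html))

    new = find_new_listings(todays, yesterdays)
    if new:
        send_digest(new, sender="me@example.com", recipient="me@example.com")

    # Persist today's listings for tomorrow's comparison.
    save_current_listings(todays)
    return {"new_listings": len(new)}
```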

AWS Services Used

- Lambda
- DynamoDB
- SES (Simple Email Service)
- CloudWatch

Python Packages Used

[Image: Example of an email sent by the AWS Car Web Scraper]