How I Scratched My Own Itch is a series wherein I recount a project I wrote for myself. I discuss the problem, solution, and lessons learned in building it.
Context
My girlfriend, Addie, and I have been on a cross-country road trip throughout 2022. To keep us entertained and to broaden our horizons, we thought it would be fun to listen to Rolling Stone's "500 Greatest Albums of All Time" list.
Boy, were we right! Lots of great new music. My favorites so far:
If You're Feeling Sinister by Belle and Sebastian at #481
Journey in Satchidananda by Alice Coltrane at #446
Cosmo's Factory by Creedence Clearwater Revival at #413
Beauty and the Beat by The Go-Go's at #400
Note: As of this publish date, we have only gotten through the bottom 114 albums.
The Problem
It was frustrating just to find out which album was up next. Each entry on the Rolling Stone site has a blurb justifying the album's ranking by detailing its historical and musical significance. Addie and I love reading the blurbs.
Rolling Stone's site is a long list sectioned into 50 albums per advertisement-riddled page. On the rural roads of Wyoming over spotty data networks, it could take well over four minutes to load the page, let alone scroll to the album we were on.
We knew there was a better way.
Solution
The solution we devised was two-part:
Part 1. Scrape the Rolling Stone site.
We could distill the Rolling Stone pages down to only the information we cared about.
Fortunately, listicles such as the Top 500 Albums list are rooted in some sort of repeatable format within a content management system. This allowed us to select the album elements by class name, extract the information we needed, and save it to CSV.
The import script we created can be found here: rs500-simplified/import.js
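Our actual script is JavaScript, but the idea fits in a short Python sketch too. Everything below is illustrative rather than our real code: the class names and file names are placeholders, and it assumes a rendered page has been saved locally as HTML.

```python
import csv

from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

# Assumption: the rendered page was saved locally as page.html, and each album
# sits in an element whose class names look roughly like the ones below.
with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

rows = []
for entry in soup.select(".c-gallery-vertical-album"):  # placeholder class name
    rank = entry.select_one(".c-gallery-vertical-album__number")
    title = entry.select_one(".c-gallery-vertical-album__title")
    blurb = entry.select_one(".c-gallery-vertical-album__description")
    rows.append([
        rank.get_text(strip=True) if rank else "",
        title.get_text(strip=True) if title else "",
        blurb.get_text(strip=True) if blurb else "",
    ])

# Append the extracted rows to a CSV we can paste into the Google Sheet.
with open("albums.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```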
Part 2. Serve this scraped data in a simple web format.
Now that we had the data we cared about, we could build a simple, ad-free site.
Part 2a. Setting up the data
We scraped the pages we needed and saved them into a Google Sheet. This would serve as our backend - a purely static, uneditable sheet. Since we were not updating any cells, this was a perfect solution. No need for a "true" database - we were only reading/selecting information.
With this setup, we could use Python's googleapiclient package to query this Google Sheet.
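Here's roughly what that query looks like. The spreadsheet ID, range, and service-account credentials file below are placeholders, not our real values:

```python
from google.oauth2.service_account import Credentials
from googleapiclient.discovery import build

# Placeholder values - swap in your own sheet ID, tab/range, and credentials.
SPREADSHEET_ID = "your-spreadsheet-id"
RANGE = "Sheet1!A2:E501"  # one row per album

creds = Credentials.from_service_account_file(
    "credentials.json",
    scopes=["https://www.googleapis.com/auth/spreadsheets.readonly"],
)
service = build("sheets", "v4", credentials=creds)

# Returns the rows as lists of strings, in sheet order.
result = (
    service.spreadsheets()
    .values()
    .get(spreadsheetId=SPREADSHEET_ID, range=RANGE)
    .execute()
)
rows = result.get("values", [])
```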
Part 2b. Building the website
We wanted to get this deployed as soon as possible, as we started working on this a few days before our drive from Sun Prairie, Wisconsin to Fayetteville, Arkansas (11 hours!).
To do so, we used Python and Flask to create a simple website. When someone queries the /album/<rank> endpoint, we pull in the Google Sheet row corresponding to the rank. Then, we're able to render a template based on the data returned by the Google Sheet row.
It's a very simple site, but that's the point! No ads, just info!
The Flask server we made can be found here: rs500-simplified/app.py
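In spirit, the route looks something like this. The data is hardcoded here just to keep the sketch self-contained, and album.html is an assumed template name; the real app pulls the row from the Google Sheet instead:

```python
from flask import Flask, abort, render_template

app = Flask(__name__)

# Hardcoded so the sketch stands alone; the real app fetches this row
# from the Google Sheet via googleapiclient.
ALBUMS = {
    481: {
        "artist": "Belle and Sebastian",
        "title": "If You're Feeling Sinister",
        "blurb": "...",
    },
}

@app.route("/album/<int:rank>")
def album(rank):
    row = ALBUMS.get(rank)
    if row is None:
        abort(404)
    # Renders templates/album.html (an assumed name) with the row's fields.
    return render_template("album.html", rank=rank, **row)
```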
Part 2c. Deploying the website
I've had some experience deploying Python sites on PythonAnywhere (now that Heroku is paid-only...), which is part of why we chose Flask. We didn't need the architecture that Django provides.
We needed simple, fast, and efficient.
So we:
Got the googleapiclient all set up on a PythonAnywhere server
Created a .env file with our environment variables
Edited the WSGI configuration according to PythonAnywhere's guidelines (sketched below)
And hit deploy!
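For reference, a PythonAnywhere WSGI config for a Flask app ends up looking roughly like this. The project path is made up, and loading the .env file with python-dotenv is an assumption on my part, not necessarily how our config reads:

```python
import os
import sys

# Assumed project path on PythonAnywhere - yours will differ.
project_home = "/home/yourusername/rs500-simplified"
if project_home not in sys.path:
    sys.path.insert(0, project_home)

# Assumes python-dotenv is installed; loads the .env file before importing the app.
from dotenv import load_dotenv
load_dotenv(os.path.join(project_home, ".env"))

# PythonAnywhere looks for a module-level variable named "application".
from app import app as application
```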
And bada-bing-bada-boom, we were greeted with this simple page when we accessed album #481:
It loads SO much faster than the Rolling Stone site but still has all the information we want! And, we can load the next album with ease.
Lessons Learned
1. If you have the luxury to do so, avoid perfection.
This application was a great example of the 80/20 rule and diminishing returns.
There were so many opportunities to do things "right" and make our application bulletproof, but we ignored them for the sake of speed. We don't need a homepage. We don't need a foolproof web scraping solution (see Lessons Learned #2). We don't need any fancy templating.
We built this for ourselves and the minimum viable product was good enough.
2. A little manual work can go a long way.
Initially, when we were building the web scraper, we wanted to execute the script once and have it pull in every entry without fail. However, the Rolling Stone site is constructed such that the URL doesn't change between pages and page loads are unpredictable. We tried to accommodate those hiccups in the script but ran into issue after issue.
Eventually, we realized we could go one "loaded" page at a time. We may have to execute a simpler script ten times, but the overall time savings would be massive.
3. Use the right technology for the job.
Even though this was my first time with Flask, I had a feeling it would work perfectly. You can just spit out routes without having to set up the massive infrastructure my Django projects required.
A Google Sheet backend was the right choice since we were simply pulling data; we didn't need anything more complex than that. Addie and I were also able to work on the data at the same time in a very clear and visible way.
Thanks for reading!
If you want to check out the repository, click here.