Several years ago, my church (Peace Lutheran Church) began hosting audio files of the Sunday sermon on the church web site. In the fall of 2023, I felt compelled to archive all of the audio files locally on my own storage server. The site was hosted on Squarespace, and all of the audio files were stored within Squarespace's media library. The sermon audio was accessed at /sermons, a page that presented six sermons at a time with a link at the bottom to view older sermons.
Each sermon entry contained the sermon's metadata: the title, the preacher, and the date it was given. At the bottom of the entry, there was an embedded audio player that contained a download link for the static audio file.
Wayback Machine Archive of Squarespace Sermons Page
With all of this information available on the site, I decided that it would be trivial to employ some scraping techniques to download each of the sermon files and keep them organized on the local server. Since I've been using Python for a lot of programming the past few years, I used Python and the BeautifulSoup library to programmatically download each of the available sermon files.
Beginning at https://www.peaceinaiken.com/sermons, each sermon on the page was encapsulated within a DIV element. Each element contained most of the information needed to download the audio file and some relevant metadata about the file:
<div
class="sqs-audio-embed"
data-url="https://static1.squarespace.com/static/52e9496be4b0f93e56e46cfd/t/613a23444c27de3da961d410/1642276897733/Mark+7+Sermon.mp3/original/Mark+7+Sermon.mp3"
data-mime-type=""
data-title="See What God Does With Hearts"
data-author="Vicar Ethan Schultz"
data-show-download="false"
data-design-style="minimal"
data-duration-in-ms="1463000"
data-color-theme="dark"
>
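Collecting these elements with BeautifulSoup is straightforward. As a minimal sketch (the variable names here are illustrative, not taken verbatim from my script), the audio divs can be selected by their class and the relevant data attributes read from each one:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.peaceinaiken.com/sermons")
soup = BeautifulSoup(r.content, 'html.parser')

for div in soup.find_all('div', class_='sqs-audio-embed'):
    link = div.get("data-url")       # direct URL to the MP3 file
    title = div.get("data-title")    # sermon title
    author = div.get("data-author")  # preacher's name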
The only interesting part of this effort was getting the data elements used to build the ID string. These were stored in a DIV element above the element containing the sermon audio data. Accessing the previous_element attribute three times made this information available:
id_div = div.previous_element.previous_element.previous_element
id = id_div.get("id")
# Going back two DIV elements exposes the ID string:
# </div></div><div class="sqs-block audio-block sqs-block-audio" data-block-type="41" id="block-yui_3_17_2_1_1631199690387_61066"><div class="sqs-block-content">
Once this was accessible, the information needed to assemble the filename could be extracted:
id_elements = id.split('_')
elements = link.split('/')
# The millisecond timestamp is also available in the block's ID string:
# timestamp = int(id_elements[5]) / 1000
timestamp = int(elements[7]) / 1000  # timestamp is in milliseconds, but fromtimestamp() expects seconds
date = datetime.datetime.fromtimestamp(timestamp).strftime("%Y-%m-%d")
name_end = elements[8].replace('+', ' ')
name = date + ' ' + author + ' ' + title + ' ' + name_end
for char in invalid:  # invalid holds characters that aren't allowed in filenames
    name = name.replace(char, '')
The date value was derived from an epoch timestamp embedded in the page. The block's unique ID string contains one (the string was split on the underscore character and the values stored in the id_elements variable), but the download URL contains the same kind of timestamp, so the script uses the value from the URL. In either case, the timestamp is in milliseconds, so it's divided by 1000 before being passed to the datetime library's fromtimestamp() method. The final date value is then formatted to the desired yyyy-mm-dd format using the strftime() function; for the example above, the 1642276897733 in the URL becomes 2022-01-15. The date was also presented on the media page with the other sermon metadata, but deriving it this way was simpler and avoided the need to maintain a relation between values collected from different elements.
With the filename value set, a request would then be made to download the audio file contents from the MP3 link, and the audio data would be written to disk. These operations were contained in a for loop, so the program would iterate through each of the six sermons presented on the page. Once the loop was exhausted, the program would get the address from the Older Posts link and make a recursive function call using the new address:
for a in soup.find_all('a'):
    if a.getText() == "Older Posts":
        link = a.get("href")
        if "?offset=" in link:
            offset = link.replace("/sermons", "")
            next_url = base_url + offset
            get_links(next_url)
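The download-and-write step itself isn't shown above; a minimal sketch, assuming link holds the MP3 URL from the data-url attribute and name holds the assembled filename:
r = requests.get(link)      # fetch the raw MP3 data
with open(name, 'wb') as f:
    f.write(r.content)      # write the audio bytes to the local file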
Using this bit of Python, I was able to download the entire sermon archive within a few minutes. One thing I never implemented was a way of checking for the existence of previously downloaded files. Since I only ran this every few months, I handled this manually by deleting the files that were previously downloaded.
If you'd like to view the full source of the Squarespace sermons page, you can download it here: Sermons Source
The original Python script can be downloaded here: Squarespace Sermons Download
This method worked well enough until this past fall, when we migrated the church's web site to Subsplash. The new hosting platform provides a different method for storing and presenting media, including audio, though it's not terribly different from what was being done on Squarespace. I decided I would again use Python to create a new script to archive the audio files from the new site.
The new site has a media section, Peace Media, where the sermon audio can be accessed:
However, on this platform, there is no embedded media player provided on this page for each audio file. Clicking on one of the sermons takes the browser to a dedicated page for that sermon. This page (for example, the Christmas Day sermon, Christmas Day - Emmanuel) contains an embedded player for the audio file and some additional information about the sermon:
Although the embedded player is presented on this page, there's an additional level of complexity: the player is actually contained in an iframe and sourced from another file:
<iframe src="https://subsplash.com/+22px/embed/mi/+fzbf56b?&video&audio&embeddable&shareable&watermark" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
Within the file referenced by the iframe, the MP3 file is defined by a URL contained in a source element within an audio tag:
<audio preload="metadata">
<source src="https://cdn.subsplash.com/audios/5G6VH2/36562330-4de8-4a71-97bb-ab6c7cce9800/audio.mp3" type="audio/mp3">
</audio>
This required some additional code to retrieve the URL for the audio file, but, overall, didn't increase the complexity of the program significantly. I began the program by creating a function to retrieve all of the links to the sermon pages. This was fairly trivial, since each sermon page link was presented within an a tag with a specific class:
<a class="sp-media-item" href="/media/fzbf56b/christmas-day-emmanuel">
<div class="sp-media-thumb" style="color:#143743;background-color:#3e4b48;background-image:url(https://images.subsplash.com/image.jpg?id=f5f07752-d7cd-4d4d-82f6-a4db97280521&w=800&h=450);"><div class="sp-media-play-overlay"></div></div>
<div class="sp-media-title">Christmas Day - Emmanuel</div>
<div class="sp-media-subtitle">Dec 25, 2024 <span style="font-size:.8em;">•</span> Pastor Simeon Crass</div>
</a>
Using the class string value, I was able to collect all of the sermon entries on the page with the following loop:
for a in soup.find_all('a', class_='sp-media-item'):
    link = a.get("href")
At this point, another difference is encountered. Like the Squarespace media page, the sermon metadata is provided on the media page; however, the values are contained within child elements of the main a tag. To collect the metadata values needed to assemble the filename string, I used the findChildren() method to get the values and store them in a list:
divs = []
for div in a.findChildren('div'):
    # insert(1, ...) places each new value at index 1, so after the loop
    # the last child found (the subtitle div) ends up at divs[1]
    divs.insert(1, div.text)
Additionally, the date and preacher name values are stored within the same element. To get the individual values, I used the split() method to extract them from the string:
subs = divs[1].split(" • ")
date = subs[0]
preacher = subs[1]
When all of the values have been collected for a sermon, they're stored in a list and appended to a global list variable:
return_data = [base_url + link, title, date, preacher]
media_data.append(return_data)
Afterward, once all 15 sermons on the current page have been collected and stored, the program gets the URL for the next page from the Older button. This button is at the bottom of the page, placed there with the following markup:
<div style="text-align:center;"><a href="/media/page/2/" class="sp-post-older-button">Older <i class="fa fw fa-angle-right"></i></a></div>
After getting the value of the href and appending it to the base URL, the program makes a recursive call to the get_sermon_pages() function and proceeds with collecting data for the remaining sermons. At the end of the function execution, it returns the global list variable.
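A minimal sketch of this pagination step (assuming get_sermon_pages() is the collection function and base_url holds the site's root URL):
for a in soup.find_all('a', class_='sp-post-older-button'):
    next_url = base_url + a.get("href")  # e.g. the /media/page/2/ path
    get_sermon_pages(next_url)           # recurse into the next page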
At this point in the program's execution, we have links for all of the sermons stored on the site, along with their titles, preacher names, and the dates they were given. The program then iterates through the global list variable, pulling out each sermon's link, title, preacher name, and date. Since the date isn't in a filename-friendly format, it needs to be reformatted: the current value is parsed with the dateutil library's parse() function, and the strftime() method formats the result as yyyy-mm-dd for the filename. Two lines of code accomplish this:
from dateutil.parser import parse

datetime_res = parse(date, parserinfo=None)  # e.g. "Dec 25, 2024" -> datetime object
date = datetime_res.strftime('%Y-%m-%d')     # -> "2024-12-25"
Now, the only thing I needed to complete the task was the actual audio data. Having the link to the sermon page, I could use it to get the link for the iframe, and then use that link to get the link to the audio file. Since each of these links was within a unique tag on their respective pages, I wrote two functions to get the needed values:
def get_iframe_data(url):
    # Find the embedded player iframe on the sermon page and return its src
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    for iframe in soup.find_all('iframe'):
        link = iframe.get("src")
    return link

def get_mp3_link(url):
    # Find the audio source tag on the player page and return its src
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    for source in soup.find_all('source'):
        link = source.get("src")
    return link
Passing the sermon page link to the get_iframe_data() function returns the link to the iframe page. Subsequently, passing the iframe page link to the get_mp3_link() function returns the link for the MP3 file. Once that's been obtained, the same requests.get() method used previously downloads the audio file data, which is then written to a file on the local server.
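Put together, the retrieval chain for a single sermon might look like this (sermon_url standing in for the full sermon page URL stored in media_data, and filename for the name assembled from the metadata; both names are illustrative):
iframe_url = get_iframe_data(sermon_url)  # sermon page -> embedded player page
mp3_url = get_mp3_link(iframe_url)        # player page -> CDN audio URL
r = requests.get(mp3_url)                 # fetch the raw MP3 data
with open(filename, 'wb') as f:
    f.write(r.content)                    # write the audio bytes to disk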
Once I'd completed all of this functionality and verified that it worked, I decided to implement a way to automatically determine whether a file had previously been downloaded and skip it if it had. I decided to use a simple SQLite database and table to accomplish this.
After ensuring that import sqlite3 was added to the code, I created three more functions to handle the database operations. The first function, db_create_table(), creates a new sermons table if one doesn't already exist. In the event that the peace_sermons.db file doesn't exist yet, a new one is created when the function is called, and the sermons table is created with the correct schema. To ensure that the database and table are always available before the rest of the program executes, this function is the first action the program takes upon execution:
def db_create_table():
    conn = sqlite3.connect('peace_sermons.db')
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS sermons (date_download, date_sermon, title, file, link)''')
    conn.commit()
    cursor.close()
    conn.close()
The next database function that gets called is db_check_link(). It's called as the program iterates through the entries in the global list variable (media_data). If this function finds a record for the link value passed to it, it returns True, which causes the program to continue to the next iteration of the loop. Otherwise, the program proceeds with retrieving the audio file data.
def db_check_link(url):
    conn = sqlite3.connect('peace_sermons.db')
    cursor = conn.cursor()
    query = "SELECT * FROM sermons WHERE link = ?"
    cursor.execute(query, (url,))  # look for an existing record for this sermon link
    records = cursor.fetchall()
    cursor.close()
    conn.close()
    if len(records) > 0:
        return True
    else:
        return False
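As a sketch of how this fits into the main loop (the variable names are illustrative):
for entry in media_data:
    link, title, date, preacher = entry
    if db_check_link(link):
        continue  # already downloaded; skip this sermon
    # ...otherwise retrieve the iframe page, MP3 link, and audio data as above...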
The final database function is db_insert(), which is used to insert sermon records into the table. Five values are passed to the function, which include the link to the sermon page and the sermon metadata.
def db_insert(date_download, date_sermon, title, file, link):
    conn = sqlite3.connect('peace_sermons.db')
    cursor = conn.cursor()
    query = "INSERT INTO sermons (date_download, date_sermon, title, file, link) VALUES (?, ?, ?, ?, ?)"
    cursor.execute(query, (date_download, date_sermon, title, file, link,))
    conn.commit()
    cursor.close()
    conn.close()
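After a successful download, the record can then be inserted. A sketch, assuming the download date is recorded as the current date and filename is the name the audio file was written under (neither detail is shown in the excerpts above):
download_date = datetime.date.today().strftime('%Y-%m-%d')  # date of this archive run
db_insert(download_date, date, title, filename, link)       # record the sermon as downloaded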
With these functions in place, previously downloaded audio files are skipped. This reduces the total execution time, the bandwidth the script consumes, and the need for any manual intervention, so the program can be run as a scheduled job. Overall, despite the added complexity, this is an improvement over the previous method for downloading the sermon files.
I've made the source to the Subsplash pages available:
The program, in its entirety, is also available: scrape_peace_subsplash_sermons.py