Slimming down RSS feed with make and pandoc
A few days ago, I looked into the Arsouyes site's traffic and discovered that 97.8% of visitors were bots. Looking closer, I also found that aggregators coming to fetch our RSS feed on behalf of our fans account for 8.5% of visitors but generate 36% of the server's outgoing traffic.
On one hand, it's true that some of these bots are a bit dumb. Some come by way too often (up to 350 times per day for Nextcloud) and others request the full feed every time (up to 170 times per day, totaling 60 MB of data), even though the site (and thus the feed) is only updated once a week.
On the other hand, we also share some of the blame for this relatively high volume generated by RSS readers:
- Our pages are light. If we used images everywhere, custom fonts and loaded tons of JavaScript, the pages would be heavier, the generated traffic would be higher and, by comparison, the RSS would be tiny.
- Our RSS is too big. During some n^th update I can't remember, I started copying the full content of articles into the feed. Even if there are only 20 articles (and not the 158 the site counts up to this one), 3000 words is significantly heavier than a short description.
Since I can't change bot behavior (that's their responsibility) and I don't want to bloat my pages just to make the RSS proportionally smaller, I set out to reduce its size. And while I'm at it, I'll explain how I build it using small commands orchestrated by a makefile that generates the site.
make for orchestration
We could have coded something to generate the RSS feed by itself, but since we already had a makefile describing how to compile our markdown pages to HTML, it made more sense to extend it to handle the RSS file as well.
Rather than giving you the full makefile content, here's a diagram of the general workflow. In blue you have the source files (articles in markdown and some templates), in yellow the commands used (envsubst, pandoc and cat), in green the intermediate files and in purple the final file.
envsubst for replacements
The RSS feed is an XML file that includes an initial part which I call the header, that describes what feed it is. This part barely changes but, to make my life more interesting, there are still some variables to adjust each time:
- The site's address doesn't change, but I wanted to keep the possibility of generating a feed on some sort of staging server that would have its own address. So I have a first parameter "siteurl"
- The feed's title, it doesn't change either but I have a French one and another English one. Rather than making two templates (one per language), and since I already have a parameter anyway, I add a second parameter "title"
- The feed's path, for the same reason as the title, becomes my third parameter "path"
- Same for the language, "lang"
- Finally the publication date, corresponding to when the feed was generated, I call it "pubdate". This is probably the only true parameter in the whole thing and it justifies why I have to generate this part of the file each time.
With these parameters, here's the header template I use.
<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>${title}</title>
<link>${siteurl}</link>
<atom:link href="${siteurl}${path}" rel="self" type="application/rss+xml"/>
<description></description>
<language>${lang}</language>
<lastBuildDate>${pubdate}</lastBuildDate>
To replace parameters with their values, I was, until recently, using pandoc. I agree it's well suited for converting documents from one format to another (and it can therefore easily just replace parameters with their values), but I figured it's a bit overkill and I wanted something simpler.
I turned to the envsubst command, which is designed for just that. It reads a file, replaces the variables that are used in the file and defined in the environment with their values, and outputs the result. It's dead simple and available by default almost everywhere.
I can now show you the portion of the makefile that handles generating the header (SRC_DIR is the variable pointing to the directory with the site source code, SITE_URL is the variable defining the site's live address).
# Definition of useful variables for later
MD_TPL_NEWS_H := templates/rss_header.xml
SRC_NEWS_FR_H := $(SRC_DIR)/rss-header.fr.xml
SRC_NEWS_EN_H := $(SRC_DIR)/rss-header.en.xml
# Tells make these files are temporary and that it
# should delete them at the end.
.INTERMEDIATE: $(SRC_NEWS_FR_H) $(SRC_NEWS_EN_H)
# How to generate the French header
$(SRC_NEWS_FR_H): $(MD_TPL_NEWS_H)
	path=/rss.fr.xml \
	    pubdate="$$(date -R)" \
	    title="Flux RSS des arsouyes" \
	    siteurl=$(SITE_URL) \
	    lang="fr_FR" \
	    envsubst < $^ > $@
# How to generate the English header
$(SRC_NEWS_EN_H): $(MD_TPL_NEWS_H)
	path=/rss.en.xml \
	    pubdate="$$(date -R)" \
	    title="Arsouyes' RSS" \
	    siteurl=$(SITE_URL) \
	    lang="en_UK" \
	    envsubst < $^ > $@
The general idea is pretty standard for a makefile; here are the few subtleties I used:
- The .INTERMEDIATE target, to force make to treat the files in question as temporary and delete them when it's done generating the site. Without this line, these temporary files pollute the source code and git complains about them every time I want to see what I need to commit.
- I must pass my parameters to envsubst as environment variables. No need to export these variables to the environment (with the export command); I can just define them at the start of the command line (hence the \ at the end of lines so they form a single command).
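The second point relies on a general shell behaviour rather than anything specific to envsubst. As a minimal sketch, assuming any POSIX shell (the variable name `greeting` is hypothetical and only used for illustration), an assignment placed in front of a command is visible to that command alone:

```shell
# Assumption: any POSIX shell; `greeting` is a made-up variable name.
unset greeting

# The assignment prefix is placed in the environment of this one command only.
child_view=$(greeting="hello" sh -c 'printf "%s" "$greeting"')

# Back in the calling shell, the variable was never defined.
parent_view=${greeting:-unset}

printf 'child saw: %s\nparent sees: %s\n' "$child_view" "$parent_view"
```

In a makefile, the backslashes matter because make hands each logical recipe line to its own shell; without them the assignments and the command would run in separate shells.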
One might want to replace these two large recipes with a single, more generic one that would take the language as a parameter (using a % in the rule) and use some sort of configuration file per language. But since these parameters are only used in this one place, I find these config files superfluous.
pandoc for compilation
The RSS file includes, after this header, a list of items corresponding to each article on the site. Here there are more parameters:
- The article title, the publication date and a description (I'll get back to this later),
- The page address, and its identifier (here I use the link as the identifier, it works and it's more than enough),
- Categories and authors.
Since this information is stored in the article source files and pandoc is well suited to handle it, this time we use a pandoc template.
<item>
<title>$title$</title>
<link>$siteurl$$path$</link>
<guid>$siteurl$$path$</guid>
<pubDate>$date-rss$</pubDate>
<description><![CDATA[
$if(description)$$description$
$elseif(abstract)$$abstract$
$endif$
]]></description>
<author>$if(author)$$for(author)$$author$$sep$ &amp; $endfor$$else$Les arsouyes$endif$</author>
$for(keywords)$<category>$keywords$</category>$endfor$
</item>
Getting back to the reason I decided to review the RSS generation (namely, to slim it down), this is where the template comes in.
- Before: I placed the whole article body in the description (with the $body$ variable),
- Now: I only place the description (the $description$ variable) if there is one; if not I use the spoiler (the $abstract$ variable); otherwise, I leave it blank.
Not only will the file be lighter, but it will respect what I said in 2023: favor the content on the site and help the reader to identify who's talking to them. If some of them prefer to read content through their aggregators, they all have an option to download the full article.
January 23rd edit. The date in RSS files must follow RFC822. However, the dates in our markdown files follow ISO 8601...
Since I don't want to change the date format in the file, and I think having a second one in another format would be redundant, I added a Lua filter to pandoc to handle it for me.
function Meta(meta)
  if meta["date-iso"] then
    -- Number extraction (the offset sign may be + or -)
    local format = "(%d+)-(%d+)-(%d+)T(%d+):(%d+):(%d+)([+-]%d+):(%d+)"
    local year, month, day, hour, minute, second, tz1, tz2 =
      pandoc.utils.stringify(meta["date-iso"]):match(format)
    -- Date creation, then format with RFC822
    local date = os.time({
      year=year, month=month, day=day,
      hour=hour, min=minute, sec=second
    })
    local date_string = os.date("%a, %d %b %Y %H:%M:%S", date)
      .. " " .. tz1 .. tz2
    -- Adding the date to metadata
    meta["date-rss"] = pandoc.Str(date_string)
    return meta
  end
end
This filter parses the ISO date format (or at least the variant I use in our files) to extract the numbers. Then I let Lua create a date and format it according to RFC822, and finally, I add this info to the metadata. End of edit.
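To sanity-check what such a conversion should produce, the same ISO string can be fed to an independent tool. This is only a sketch, assuming GNU date from coreutils (the BSD/macOS and BusyBox versions handle -d differently), and the timestamp is a made-up example rather than one taken from our articles:

```shell
# Assumption: GNU date (coreutils). The timestamp below is illustrative only.
iso_date="2026-01-19T18:30:00+01:00"

# -d parses the ISO 8601 string, -R prints it in the RFC 822/2822 style.
# TZ=UTC makes the output independent of the machine's time zone.
rfc822_date=$(TZ=UTC LC_ALL=C date -R -d "$iso_date")

echo "$rfc822_date"
```

The Lua filter keeps the original offset (+0100 here) instead of converting to UTC, so its output shows 18:30:00 +0100 for the same instant.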
This time the makefile rules are a bit more numerous and verbose, but nothing too crazy, really... PANDOC_FLAGS contains specific arguments passed globally to pandoc and MD_FORMAT is the format (and its chosen extensions) I use everywhere.
# Template file
MD_TPL_NEWS_ITEM := templates/rss_item.xml
# Lua filter file
FILTER_RSS := filter-date-rss.lua
# Latest articles list per language
SRC_NEWS_FR := $(shell find $(SRC_DIR)/articles -name "index.fr.md" | sort | tail -n 20)
SRC_NEWS_EN := $(shell find $(SRC_DIR)/articles -name "index.en.md" | sort | tail -n 20)
# Temporary files list for each item in RSS feed
SRC_NEWS_FR_ITEMS := $(SRC_NEWS_FR:.md=.rss.xml)
SRC_NEWS_EN_ITEMS := $(SRC_NEWS_EN:.md=.rss.xml)
.INTERMEDIATE : $(SRC_NEWS_FR_ITEMS) $(SRC_NEWS_EN_ITEMS)
# French items generation
%.fr.rss.xml: %.fr.md $(MD_TPL_NEWS_ITEM) $(FILTER_RSS)
	pandoc $(PANDOC_FLAGS) \
	    -M path=$(patsubst $(SRC_DIR)%,%,$(dir $@)) \
	    -M siteurl=$(SITE_URL) \
	    --template $(MD_TPL_NEWS_ITEM) \
	    --lua-filter=$(FILTER_RSS) \
	    -f $(MD_FORMAT) \
	    -t html \
	    -o $@ \
	    $<
# English items generation
%.en.rss.xml: %.en.md $(MD_TPL_NEWS_ITEM) $(FILTER_RSS)
	pandoc $(PANDOC_FLAGS) \
	    -M path=$(patsubst $(SRC_DIR)%,%,$(dir $@)) \
	    -M siteurl=$(SITE_URL) \
	    --template $(MD_TPL_NEWS_ITEM) \
	    --lua-filter=$(FILTER_RSS) \
	    -f $(MD_FORMAT) \
	    -t html \
	    -o $@ \
	    $<
I also use the intermediate files trick here, so no need to go over that again. What's new is the list of files to include in the RSS. I'm pulling it out here:
find $(SRC_DIR)/articles -name "index.fr.md" \
    | sort \
    | tail -n 20
I use find to locate the source files corresponding to the articles in the language I want. I pass this list to sort to have them in order (the path name includes the date or a sequence number so it sorts correctly). Finally, I pass the list to tail to only keep the last 20.
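The pipeline can be tried outside the makefile on a throwaway tree. The sketch below assumes 25 hypothetical article directories named 001-demo to 025-demo (our real naming scheme differs) and checks that only the 20 most recent survive:

```shell
# Assumption: a temporary tree with 25 made-up articles, numbered 001..025.
workdir=$(mktemp -d)
i=1
while [ "$i" -le 25 ]; do
    dir=$(printf '%s/articles/%03d-demo' "$workdir" "$i")
    mkdir -p "$dir"
    : > "$dir/index.fr.md"
    i=$((i + 1))
done

# Same pipeline as in the makefile: list, order by path, keep the last 20.
latest=$(find "$workdir/articles" -name "index.fr.md" | sort | tail -n 20)

count=$(printf '%s\n' "$latest" | wc -l | tr -d ' ')
oldest_kept=$(printf '%s\n' "$latest" | head -n 1)
echo "$count files kept, oldest is ${oldest_kept#"$workdir"/}"

rm -rf "$workdir"
```

Here the five oldest articles (001 to 005) are dropped. Note that this only works because the numbers are zero-padded: sort compares text, so an unpadded 10 would sort before 9.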
cat for concatenation
The RSS feed ends with an epilogue of two XML tags that simply close the tags left open above. No variables to replace here.
</channel>
</rss>
So we have all the RSS file contents and it's just a matter of putting them together end-to-end. First the header, then the items, and finally the epilogue. Since it's just about concatenating files, the cat command is more than enough for this.
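As a small illustration of that assembly step, here is a sketch with three kinds of made-up fragments standing in for the real header, items and epilogue (file names and contents are hypothetical):

```shell
# Assumption: tiny stand-in fragments, not the real templates.
workdir=$(mktemp -d)
printf '<rss version="2.0">\n<channel>\n<title>Demo</title>\n' > "$workdir/header.xml"
printf '<item><title>First</title></item>\n' > "$workdir/item-1.xml"
printf '<item><title>Second</title></item>\n' > "$workdir/item-2.xml"
printf '</channel>\n</rss>\n' > "$workdir/footer.xml"

# Order matters: header first, then the items, then the closing tags.
cat "$workdir/header.xml" "$workdir"/item-*.xml "$workdir/footer.xml" > "$workdir/rss.xml"

line_count=$(wc -l < "$workdir/rss.xml" | tr -d ' ')
first_line=$(head -n 1 "$workdir/rss.xml")
last_line=$(tail -n 1 "$workdir/rss.xml")
cat "$workdir/rss.xml"

rm -rf "$workdir"
```

None of the fragments is a valid XML document on its own; only the concatenated result opens and closes every tag. In a make rule, $^ expands to the prerequisites in the order they are listed, so that order decides the layout of the final file.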
Here are the corresponding rules in the makefile. DST_DIR contains the directory name where the site files should be written and DST_FILES contains the list of files to generate.
# Epilogue template
MD_TPL_NEWS_F := templates/rss_footer.xml
# Complete RSS file
DST_RSS_FR := $(DST_DIR)/rss.fr.xml
DST_RSS_EN := $(DST_DIR)/rss.en.xml
# We produce these files
$(DST_RSS_FR): $(SRC_NEWS_FR_H) $(SRC_NEWS_FR_ITEMS) $(MD_TPL_NEWS_F)
	cat $^ > $@
$(DST_RSS_EN): $(SRC_NEWS_EN_H) $(SRC_NEWS_EN_ITEMS) $(MD_TPL_NEWS_F)
	cat $^ > $@
# We add them to the file generation list
DST_FILES += $(DST_RSS_FR) $(DST_RSS_EN)
And after?
Even if the makefile syntax isn't the most beautiful, I find it has a kind of elegance in file generation. Rather than one big script that does everything, we have here a range of small, simple, and autonomous tasks.
As expected, the generated files are lighter than before. The two RSS files went from around 350 KB to around 9 KB. That's 2.5% or a reduction by a factor of 40 (roughly). Some aggregators, smarter than average, ask if the file has changed before requesting the content, which will reduce the savings a bit, but by how much?
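For the record, the ratio can be recomputed from the two rounded sizes. This sketch takes the approximate 350 KB and 9 KB at face value, so the result is only as precise as those figures:

```shell
# Assumption: the rounded sizes quoted in the text, in KB.
before_kb=350
after_kb=9

summary=$(awk -v before="$before_kb" -v after="$after_kb" \
    'BEGIN { printf "%.1f%% of the original size, about %.0fx smaller", after / before * 100, before / after }')

echo "$summary"
```

This prints 2.6% and a factor of about 39, in line with the rough figures above.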
To get real numbers, I published this new RSS on Monday the 19th in the evening (at around 6:30pm), waited a day then pulled the logs since the previous article was published. Meaning from January 15th to 20th, 2026 inclusive. I used the same commands but I no longer try to identify humans vs. bots but the RSS proportion versus the rest.
Yesterday (January 20th), RSS represents 10% of visitors for about 38% of requests but only 3% of the volume! Sure, they still make plenty of requests, but now it doesn't waste as much bandwidth.
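To show how such proportions can be derived, here is a sketch that is not the set of commands from the previous article. It assumes access logs in the common "combined" format, where the request path is the 7th whitespace-separated field and the response size the 10th, and it runs on four made-up log lines rather than real traffic:

```shell
# Assumption: invented log lines in combined log format, for illustration only.
sample_log=$(cat <<'EOF'
192.0.2.1 - - [20/Jan/2026:08:00:00 +0100] "GET /rss.fr.xml HTTP/1.1" 200 9000 "-" "ExampleReader/1.0"
192.0.2.2 - - [20/Jan/2026:08:05:00 +0100] "GET /rss.en.xml HTTP/1.1" 304 0 "-" "ExampleReader/1.0"
192.0.2.3 - - [20/Jan/2026:08:10:00 +0100] "GET /articles/demo/ HTTP/1.1" 200 60000 "-" "ExampleBrowser/1.0"
192.0.2.3 - - [20/Jan/2026:08:10:01 +0100] "GET /style.css HTTP/1.1" 200 31000 "-" "ExampleBrowser/1.0"
EOF
)

# Count every request and its size, then the subset that targets the feeds.
rss_share=$(printf '%s\n' "$sample_log" | awk '
    { requests++; bytes += $10 }
    $7 ~ /^\/rss\.[a-z]+\.xml$/ { rss_requests++; rss_bytes += $10 }
    END { printf "requests: %.0f%%, volume: %.0f%%", rss_requests / requests * 100, rss_bytes / bytes * 100 }
')

echo "$rss_share"
```

On this sample the feeds account for half of the requests but only 9% of the bytes, the same kind of gap as in the real numbers: a 304 response costs a request but almost no volume.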
Over time, aside from a traffic spike on the 17th (following publication on the Journal du hacker), the site bandwidth has been quite stable over the last few days. On the other hand, we clearly see the impact of slimming down the RSS in the January 20th volume: it's much lower and RSS has almost disappeared (compared to other days).
To give you an idea of the impact on the visitor proportion, we took the January 15th data from the previous article and factored in the savings that slimming down the RSS represents. It's quite visible I think; seeing 34% vanish is quite noticeable. But the downside is the volume for Scunners goes from 56 to 85%...
Depressing? No.
Because if we look at the numbers that really count, we realize the bandwidth we send to our human readers (and their bots) has been reduced by about 82% on average per day. In other words, slimming down the feed made it possible to save bandwidth for our readers, without improving that of the Scunners. And that, I think, is good news.