scraping record.club reviews w/ playwright
i’ve been logging stuff on record.club for a while—it’s basically letterboxd but for music. lots of cool ppl, useful recommendations, high review quality.
but i didn’t wanna just hope the site sticks around forever. also i want my ratings embedded into my blog so they live in my database, alongside all my other stuff. migrating to a new platform later? trivial. record.club shutting down? lol i’m good.
problem: cloudflare bot shields. no free scrapes for me.
solution: full browser automation via playwright.
cloudflare = brick wall
simple http request? you get the “just a moment…” interstitial. the anti-bot curtain. traditional scraping = DOA.
// nope
const response = await fetch('https://record.club/keith/activity');
// -> cloudflare holding screen
but: the challenge is aimed at clients that can't run js like a browser does. real browsers get through. so we just… drive one, with playwright.
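quick aside: it helps to know when you've hit the holding screen instead of the real page. here's a rough heuristic sketch. the function name is made up, and matching on the page title is an assumption based on the "just a moment…" text, not an official cloudflare signal.

```javascript
// heuristic: does this html look like the "just a moment…" holding screen?
// assumption: the interstitial puts that phrase in <title>; this is a guess,
// not a documented contract, so treat a `false` as "probably fine".
function looksLikeChallenge(html) {
  const match = /<title[^>]*>([^<]*)<\/title>/i.exec(html || '');
  const title = match ? match[1].trim().toLowerCase() : '';
  return title.startsWith('just a moment');
}

// usage with the plain fetch from above:
//   const html = await response.text();
//   if (looksLikeChallenge(html)) { /* fall back to a real browser */ }
```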
playwright to the rescue
playwright spins up chromium/firefox/webkit w/ js running, waits for dom updates, handles weird client-side stuff. exactly what i needed.
barebones script
import { chromium } from 'playwright';

async function scrapeRecordClubActivity() {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage({
      userAgent: 'Mozilla/5.0 ... Firefox/141.0'
    });
    // load the feed and wait for the network to go quiet
    await page.goto('https://record.club/keith/activity', {
      waitUntil: 'networkidle',
      timeout: 30000
    });
    // feed items render as <article> elements
    await page.waitForSelector('article', { timeout: 30000 });
    // extractActivityData holds the selector + parsing logic covered below
    return await extractActivityData(page);
  } finally {
    // always close the browser, even when something above throws
    await browser.close();
  }
}
css selectors: the real boss fight
activity feed items vary a lot (ratings only, ratings+reviews, queue adds). i wrote a selector-roulette fn that tries multiple options until it hits gold. then parse text, links, etc.
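the selector-roulette idea, as a minimal sketch. the selector strings here are hypothetical placeholders (i'm not claiming these are record.club's real class names), and `firstMatch` is just an illustrative name.

```javascript
// hypothetical selectors, ordered from most to least specific.
// the real markup may use entirely different attributes/classes.
const RATING_SELECTORS = ['[data-rating]', '.rating', '[aria-label*="star"]'];

// try each selector in order and return the first hit under `root`.
// `root` is anything with querySelector: document, or a single <article>.
function firstMatch(root, selectors) {
  for (const selector of selectors) {
    const el = root.querySelector(selector);
    if (el) return { selector, el };
  }
  return null; // nothing matched: caller decides whether to skip the item
}
```

one catch: `page.evaluate` serializes its callback, so a helper like this has to be defined inside the callback you pass in. it can't be captured from the node side.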
important bit: record.club slugs are nice: `artist-album`. so parsing = not too bad.
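function parseAlbumFromURL(url) {
  // last non-empty path segment, so a trailing slash doesn't yield ''
  const slug = url.split('/').filter(Boolean).pop() || '';
  const parts = slug.split('-');
  // caveat: only the first word counts as the artist, so multi-word
  // artist names spill into `album`
  return parts.length >= 2
    ? { artist: parts[0], album: parts.slice(1).join(' ') }
    : { artist: 'unknown', album: 'unknown' };
}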
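one wrinkle: the slug alone can't tell you where a multi-word artist name ends. if the feed item's text already gives you the artist name, you can strip it off the front of the slug instead of guessing. a sketch, assuming record.club builds slugs by lowercasing and joining words with `-` (i haven't confirmed the exact rules); `slugify` and `splitSlug` are names i made up.

```javascript
// assumption: slugs are lowercase, accent-free, words joined by '-'
function slugify(name) {
  return name
    .toLowerCase()
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '') // strip combining accents
    .replace(/[^a-z0-9]+/g, '-')     // runs of other chars become one dash
    .replace(/^-+|-+$/g, '');        // trim leading/trailing dashes
}

// split a slug using an artist name we already know from the page text
function splitSlug(slug, artistName) {
  if (!artistName) return null;
  const prefix = slugify(artistName) + '-';
  if (!slug.startsWith(prefix)) return null; // fall back to the first-word guess
  return {
    artist: artistName,
    album: slug.slice(prefix.length).split('-').join(' ')
  };
}
```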
ratings, dates, dedupe
- stars: sometimes dom elements, sometimes text like “4/5”. had to check both.
- dates: “3 weeks ago” style → converted to iso using a lil relative-time parser.
- dedupe: activity feed shows repeats. i hash artist+album, skip duplicates. easy.
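roughly what those three bits can look like. this is a sketch, not my exact code: the function names are illustrative, the relative-time parser only handles the "N units ago" shape, and the dedupe uses a plain string key in a `Set` rather than a real hash.

```javascript
// rating: prefer text like "4/5", else fall back to a count of star elements
function pickRating(text, starCount = 0) {
  const m = /(\d+(?:\.\d+)?)\s*\/\s*5/.exec(text || '');
  if (m) return Number(m[1]);
  return starCount > 0 ? starCount : null;
}

const UNIT_MS = { minute: 60e3, hour: 3600e3, day: 86400e3, week: 7 * 86400e3 };

// "3 weeks ago" -> "YYYY-MM-DD" (utc). returns null for shapes it doesn't know.
// months/years use calendar math, which is approximate near month ends.
function relativeToISODate(text, now = new Date()) {
  const m = /^(\d+|an?)\s+(minute|hour|day|week|month|year)s?\s+ago$/i.exec(text.trim());
  if (!m) return null;
  const n = /^an?$/i.test(m[1]) ? 1 : Number(m[1]);
  const unit = m[2].toLowerCase();
  const d = new Date(now.getTime());
  if (unit === 'month') d.setUTCMonth(d.getUTCMonth() - n);
  else if (unit === 'year') d.setUTCFullYear(d.getUTCFullYear() - n);
  else d.setTime(d.getTime() - n * UNIT_MS[unit]);
  return d.toISOString().slice(0, 10);
}

// keep the first occurrence of each artist+album pair
function dedupe(items) {
  const seen = new Set();
  return items.filter(({ artist, album }) => {
    const key = `${artist}|${album}`.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

passing `now` in explicitly keeps the date math testable, since the result otherwise depends on when the scrape runs.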
album covers = annoying
getting artwork was the worst bc cloudflare also protects image urls.
- naive http fetch? nope → html back.
- canvas extraction? cors nuked it.
- actual fix: use playwright to load album page, grab the `<img>` src after dom resolves. those urls can then be fetched with proper headers.
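a sketch of that last approach. assumptions, loudly: the bare `'img'` selector, the `referer` header, and both function names are placeholders for illustration. i'm not claiming these are the exact selector or headers the cdn wants. `page` is a playwright `Page` you already have open.

```javascript
// true for image content types, false for the html challenge page
// that a blocked request gets back
function isImageResponse(contentType) {
  return typeof contentType === 'string' &&
    contentType.toLowerCase().startsWith('image/');
}

async function fetchCover(page, albumUrl) {
  await page.goto(albumUrl, { waitUntil: 'domcontentloaded', timeout: 30000 });

  // assumption: the first <img> on the album page is the cover
  const img = page.locator('img').first();
  await img.waitFor({ state: 'attached', timeout: 30000 });
  const src = await img.getAttribute('src');
  if (!src) return null;

  // resolve relative srcs against the album page url
  const absolute = new URL(src, albumUrl).href;

  // page.request shares cookies with the browser context
  const response = await page.request.get(absolute, {
    headers: { referer: albumUrl } // assumption: a referer is what's needed
  });
  const type = response.headers()['content-type'];
  if (!response.ok() || !isImageResponse(type)) return null;

  return { url: absolute, body: await response.body() };
}
```

checking the content type before saving is what catches the "html back" failure: a blocked request can look like a success, just with the wrong payload.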
debugging + gotchas
- `page.evaluate` logs won’t bubble out to your terminal. wire up `page.on('console')` if you want to debug.
- singles exist too (`/singles/` urls), not just albums. don’t miss them.
- timeouts matter—pages can hang. always catch and move on.
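small helpers for the last two gotchas, as a sketch. the `/releases/singles/<slug>` path is an assumption that mirrors the `/releases/albums/<slug>` url shape, and `releaseKind` / `orSkip` are made-up names.

```javascript
// classify a release url so singles don't get silently dropped.
// assumption: singles live under a 'singles' path segment.
function releaseKind(url) {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  if (segments.includes('singles')) return 'single';
  if (segments.includes('albums')) return 'album';
  return null; // not a release page (profile, activity feed, ...)
}

// run one step; on failure (e.g. a playwright timeout) log it and move on
async function orSkip(label, task) {
  try {
    return await task();
  } catch (err) {
    console.warn(`skipping ${label}: ${err.message}`);
    return null;
  }
}

// usage, given an open playwright `page`:
//   page.on('console', (msg) => console.log('[browser]', msg.text()));
//   const data = await orSkip(url, () => page.goto(url, { timeout: 30000 }));
```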
result 🎉
i now auto-import my record.club ratings as markdown, frontmatter included:
---
artist: "laufey"
album: "a matter of time"
rating: 4
dateListened: "2025-01-25"
cover: "/src/assets/images/db/music/laufey-a-matter-of-time.jpg"
year: 2023
genre: ["jazz", "pop"]
recordClubUrl: "https://record.club/releases/albums/laufey-a-matter-of-time"
---
🔗 imported from record.club
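for the curious, a minimal sketch of turning one scraped entry into a file like that. it leans on the fact that json strings, numbers, and arrays are also valid yaml, so `JSON.stringify` does the quoting. `toMarkdown` is an illustrative name, not necessarily how my importer is structured.

```javascript
// build frontmatter + body for one entry.
// keys are emitted in the object's own order; null/undefined are skipped.
function toMarkdown(entry) {
  const lines = Object.entries(entry)
    .filter(([, value]) => value !== undefined && value !== null)
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`);
  return ['---', ...lines, '---', '🔗 imported from record.club', ''].join('\n');
}

// usage:
//   toMarkdown({ artist: 'laufey', album: 'a matter of time', rating: 4 })
```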
so i can keep using record.club casually while my blog stays in sync. and if i need to migrate, i’ve got a whole markdown archive of my hot music takes. 🤝