scraping record.club reviews w/ playwright

i’ve been logging stuff on record.club for a while—it’s basically letterboxd but for music. lots of cool ppl, useful recommendations, high review quality.

but i didn’t wanna just hope the site sticks around forever. also i want my ratings embedded into my blog so they live in my database, alongside all my other stuff. migrating to a new platform later? trivial. record.club shutting down? lol i’m good.

problem: cloudflare bot shields. no free scrapes for me.

solution: full browser automation via playwright.


cloudflare = brick wall

simple http request? you get the “just a moment…” interstitial. the anti-bot curtain. traditional scraping = DOA.

// nope
const response = await fetch('https://record.club/keith/activity');
// -> cloudflare holding screen

but: cloudflare only blocks bots. real browsers get through. so we just… pretend to be one, with playwright.


playwright to the rescue

playwright spins up chromium/firefox/webkit w/ js running, waits for dom updates, handles weird client-side stuff. exactly what i needed.

barebones script

import { chromium } from 'playwright';

async function scrapeRecordClubActivity() {
  // headless chromium gets past the cloudflare check just fine
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage({
      // plain desktop user agent; nothing fancier needed
      userAgent: 'Mozilla/5.0 ... Firefox/141.0'
    });
    await page.goto('https://record.club/keith/activity', {
      waitUntil: 'networkidle', // let the client-side rendering settle
      timeout: 30000
    });
    // wait for the feed items (<article> elements) to show up
    await page.waitForSelector('article', { timeout: 30000 });
    return await extractActivityData(page); // the selector work happens in here, more below
  } finally {
    await browser.close();
  }
}

css selectors: the real boss fight

activity feed items vary a lot (ratings only, ratings+reviews, queue adds). i wrote a selector-roulette fn that tries multiple options until it hits gold. then parse text, links, etc.
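
the shape of it is roughly this (the selectors below are made-up placeholders, not record.club's actual markup):

async function firstMatch(page, selectors) {
  for (const selector of selectors) {
    const el = await page.$(selector); // null if no match, so try the next guess
    if (el) return el;
  }
  return null;
}

async function extractRating(page) {
  const el = await firstMatch(page, [
    '[data-rating]',            // guess: explicit data attribute
    '.rating',                  // guess: a rating container
    'span[aria-label*="star"]', // guess: accessible star widget
  ]);
  return el ? (await el.textContent())?.trim() : null;
}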

important bit: record.club slugs are nice: `artist-album`. so parsing = not too bad.

// record.club slugs look like `laufey-a-matter-of-time`
function parseAlbumFromURL(url) {
  const slug = url.split('/').pop() || '';
  const parts = slug.split('-');
  // naive split: first segment is the artist, everything after is the album title
  return parts.length >= 2
    ? { artist: parts[0], album: parts.slice(1).join(' ') }
    : { artist: 'unknown', album: 'unknown' };
}

ratings, dates, dedupe

  • stars: sometimes dom elements, sometimes text like “4/5”. had to check both.
  • dates: “3 weeks ago” style → converted to iso using a lil relative-time parser (sketched below).
  • dedupe: activity feed shows repeats. i hash artist+album, skip duplicates. easy.
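
the date + dedupe bits are small. a simplified sketch, not verbatim from my script — here i key duplicates on a plain artist+album string rather than a real hash:

function relativeToISO(text, now = new Date()) {
  // handles "3 weeks ago", "1 day ago", "5 months ago", etc.
  const match = text.match(/(\d+)\s+(minute|hour|day|week|month|year)s?\s+ago/i);
  if (!match) return now.toISOString().slice(0, 10); // fall back to today
  const unitMs = {
    minute: 60_000,
    hour: 3_600_000,
    day: 86_400_000,
    week: 7 * 86_400_000,
    month: 30 * 86_400_000, // rough, but fine for frontmatter dates
    year: 365 * 86_400_000,
  }[match[2].toLowerCase()];
  return new Date(now.getTime() - Number(match[1]) * unitMs)
    .toISOString()
    .slice(0, 10);
}

const seen = new Set();
function isDuplicate({ artist, album }) {
  const key = `${artist}::${album}`.toLowerCase();
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}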

album covers = annoying

getting artwork was the worst bc cloudflare also protects image urls.

  • naive http fetch? nope → html back.
  • canvas extraction? cors nuked it.
  • actual fix: use playwright to load album page, grab <img> src after dom resolves. those urls can then be fetched with proper headers (sketch below).
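
in playwright terms it looks roughly like this. i'm using page.request here, which reuses the page's cookies and headers; the img selector and the way the right image gets picked are simplified:

import { chromium } from 'playwright';
import { writeFile } from 'node:fs/promises';

async function downloadCover(albumUrl, outPath) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(albumUrl, { waitUntil: 'networkidle' });

    // once the page has rendered, the cover <img> carries a real, fetchable src
    const src = await page.locator('img').first().getAttribute('src');
    if (!src) throw new Error(`no cover found on ${albumUrl}`);

    // fetch through the page's own request context so the image url resolves
    const response = await page.request.get(new URL(src, albumUrl).toString());
    await writeFile(outPath, await response.body());
  } finally {
    await browser.close();
  }
}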

debugging + gotchas

  • `page.evaluate` logs won't bubble out of the browser. wire up `page.on('console')` if you want to debug (quick example after this list).
  • singles exist too (`/singles/` urls), not just albums. don’t miss them.
  • timeouts matter—pages can hang. always catch and move on.
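
the console hookup is tiny. drop something like this in right after you create the page:

// forward browser-side console output (including logs from page.evaluate) to node
page.on('console', (msg) => {
  console.log(`[browser:${msg.type()}]`, msg.text());
});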

result 🎉

i now auto-import my record.club ratings as markdown, frontmatter included:

---
artist: "laufey"
album: "a matter of time"
rating: 4
dateListened: "2025-01-25"
cover: "/src/assets/images/db/music/laufey-a-matter-of-time.jpg"
year: 2023
genre: ["jazz", "pop"]
recordClubUrl: "https://record.club/releases/albums/laufey-a-matter-of-time"
---
🔗 imported from record.club
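
the writer side is just string templating. a minimal sketch — field names mirror the frontmatter above, the output path is a placeholder, and year/genre are left out for brevity:

import { writeFile } from 'node:fs/promises';

async function writeReview(entry) {
  const markdown = [
    '---',
    `artist: "${entry.artist}"`,
    `album: "${entry.album}"`,
    `rating: ${entry.rating}`,
    `dateListened: "${entry.dateListened}"`,
    `cover: "${entry.cover}"`,
    `recordClubUrl: "${entry.recordClubUrl}"`,
    '---',
    '🔗 imported from record.club',
    '',
  ].join('\n');

  // slug doubles as the filename, e.g. laufey-a-matter-of-time.md
  const slug = `${entry.artist}-${entry.album}`.toLowerCase().replace(/\s+/g, '-');
  // output dir is a placeholder, not my actual content path
  await writeFile(`src/content/music/${slug}.md`, markdown);
}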

so i can keep using record.club casually while my blog stays in sync. and if i need to migrate, i’ve got a whole markdown archive of my hot music takes. 🤝