scraping record.club reviews w/ playwright

i’ve been logging stuff on record.club for a while—it’s basically letterboxd but for music. lots of cool ppl, useful recommendations, high review quality.

but i didn’t wanna just hope the site sticks around forever. also i want my ratings embedded into my blog so they live in my database, alongside all my other stuff. migrating to a new platform later? trivial. record.club shutting down? lol i’m good.

problem: cloudflare bot shields. no free scrapes for me.

solution: full browser automation via playwright.


cloudflare = brick wall

simple http request? you get the “just a moment…” interstitial. the anti-bot curtain. traditional scraping = DOA.

// nope
const response = await fetch('https://record.club/keith/activity');
// -> cloudflare holding screen

but: cloudflare only blocks bots. real browsers get through. so we just… pretend to be one, with playwright.


playwright to the rescue

playwright spins up chromium/firefox/webkit w/ js running, waits for dom updates, handles weird client-side stuff. exactly what i needed.

barebones script

import { chromium } from 'playwright';

async function scrapeRecordClubActivity() {
  // headless chromium gets past the cloudflare check just fine
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage({
      // plain desktop user agent; nothing fancier needed
      userAgent: 'Mozilla/5.0 ... Firefox/141.0'
    });
    await page.goto('https://record.club/keith/activity', {
      waitUntil: 'networkidle', // let the client-side rendering settle
      timeout: 30000
    });
    // wait for the feed items (<article> elements) to show up
    await page.waitForSelector('article', { timeout: 30000 });
    return await extractActivityData(page); // the selector work happens in here, more below
  } finally {
    await browser.close();
  }
}

css selectors: the real boss fight

activity feed items vary a lot (ratings only, ratings+reviews, queue adds). i wrote a selector-roulette fn that tries multiple options until it hits gold. then parse text, links, etc.
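
the shape of it is roughly this (the selectors below are made-up placeholders, not record.club's actual markup):

async function firstMatch(page, selectors) {
  for (const selector of selectors) {
    const el = await page.$(selector); // null if no match, so try the next guess
    if (el) return el;
  }
  return null;
}

async function extractRating(page) {
  const el = await firstMatch(page, [
    '[data-rating]',            // guess: explicit data attribute
    '.rating',                  // guess: a rating container
    'span[aria-label*="star"]', // guess: accessible star widget
  ]);
  return el ? (await el.textContent())?.trim() : null;
}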

important bit: record.club slugs are nice: `artist-album`. so parsing = not too bad.

// record.club slugs look like `laufey-a-matter-of-time`
function parseAlbumFromURL(url) {
  const slug = url.split('/').pop() || '';
  const parts = slug.split('-');
  // naive split: first segment is the artist, everything after is the album title
  return parts.length >= 2
    ? { artist: parts[0], album: parts.slice(1).join(' ') }
    : { artist: 'unknown', album: 'unknown' };
}

ratings, dates, dedupe

  • stars: sometimes dom elements, sometimes text like “4/5”. had to check both.
  • dates: “3 weeks ago” style → converted to iso using a lil relative-time parser (sketched below).
  • dedupe: activity feed shows repeats. i hash artist+album, skip duplicates. easy.
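
the date + dedupe bits are small. a simplified sketch, not verbatim from my script — here i key duplicates on a plain artist+album string rather than a real hash:

function relativeToISO(text, now = new Date()) {
  // handles "3 weeks ago", "1 day ago", "5 months ago", etc.
  const match = text.match(/(\d+)\s+(minute|hour|day|week|month|year)s?\s+ago/i);
  if (!match) return now.toISOString().slice(0, 10); // fall back to today
  const unitMs = {
    minute: 60_000,
    hour: 3_600_000,
    day: 86_400_000,
    week: 7 * 86_400_000,
    month: 30 * 86_400_000, // rough, but fine for frontmatter dates
    year: 365 * 86_400_000,
  }[match[2].toLowerCase()];
  return new Date(now.getTime() - Number(match[1]) * unitMs)
    .toISOString()
    .slice(0, 10);
}

const seen = new Set();
function isDuplicate({ artist, album }) {
  const key = `${artist}::${album}`.toLowerCase();
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}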

album covers = annoying

getting artwork was the worst bc cloudflare also protects image urls.

  • naive http fetch? nope → html back.
  • canvas extraction? cors nuked it.
  • actual fix: use playwright to load album page, grab <img> src after dom resolves. those urls can then be fetched with proper headers (sketch below).
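
in playwright terms it looks roughly like this. i'm using page.request here, which reuses the page's cookies and headers; the img selector and the way the right image gets picked are simplified:

import { chromium } from 'playwright';
import { writeFile } from 'node:fs/promises';

async function downloadCover(albumUrl, outPath) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(albumUrl, { waitUntil: 'networkidle' });

    // once the page has rendered, the cover <img> carries a real, fetchable src
    const src = await page.locator('img').first().getAttribute('src');
    if (!src) throw new Error(`no cover found on ${albumUrl}`);

    // fetch through the page's own request context so the image url resolves
    const response = await page.request.get(new URL(src, albumUrl).toString());
    await writeFile(outPath, await response.body());
  } finally {
    await browser.close();
  }
}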

debugging + gotchas

  • `page.evaluate` logs won't bubble out of the browser. wire up `page.on('console')` if you want to debug (quick example after this list).
  • singles exist too (`/singles/` urls), not just albums. don’t miss them.
  • timeouts matter—pages can hang. always catch and move on.
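
the console hookup is tiny. drop something like this in right after you create the page:

// forward browser-side console output (including logs from page.evaluate) to node
page.on('console', (msg) => {
  console.log(`[browser:${msg.type()}]`, msg.text());
});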

result 🎉

i now auto-import my record.club ratings as markdown, frontmatter included:

---
artist: "laufey"
album: "a matter of time"
rating: 4
dateListened: "2025-01-25"
cover: "/src/assets/images/db/music/laufey-a-matter-of-time.jpg"
year: 2023
genre: ["jazz", "pop"]
recordClubUrl: "https://record.club/releases/albums/laufey-a-matter-of-time"
---
🔗 imported from record.club
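
the writer side is just string templating. a minimal sketch — field names mirror the frontmatter above, the output path is a placeholder, and year/genre are left out for brevity:

import { writeFile } from 'node:fs/promises';

async function writeReview(entry) {
  const markdown = [
    '---',
    `artist: "${entry.artist}"`,
    `album: "${entry.album}"`,
    `rating: ${entry.rating}`,
    `dateListened: "${entry.dateListened}"`,
    `cover: "${entry.cover}"`,
    `recordClubUrl: "${entry.recordClubUrl}"`,
    '---',
    '🔗 imported from record.club',
    '',
  ].join('\n');

  // slug doubles as the filename, e.g. laufey-a-matter-of-time.md
  const slug = `${entry.artist}-${entry.album}`.toLowerCase().replace(/\s+/g, '-');
  // output dir is a placeholder, not my actual content path
  await writeFile(`src/content/music/${slug}.md`, markdown);
}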

so i can keep using record.club casually while my blog stays in sync. and if i need to migrate, i’ve got a whole markdown archive of my hot music takes. 🤝