Client Background
A law research firm needed bulk access to case opinions, filings, and dockets from public court websites for analytics.
The Challenge
Court sites used different CMS platforms, required multi‑step navigation to reach PDFs, and enforced request rate limits.
Objectives
✦ Crawl multiple court domains for case metadata and full‑text PDFs
✦ Maintain audit trail of source URLs and timestamps
✦ Export to JSON + file storage
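The audit-trail and JSON-export objectives can be met with one record per harvested case. A minimal sketch, with illustrative field names (the checksum ties the metadata to the exact PDF bytes that were archived):

```python
import hashlib
from datetime import datetime, timezone

def build_case_record(case_id, source_url, pdf_bytes):
    """Build an auditable, JSON-serializable record for one case.

    Field names here are illustrative, not the production schema.
    """
    return {
        "case_id": case_id,
        "source_url": source_url,  # audit trail: where the PDF came from
        "fetched_at": datetime.now(timezone.utc).isoformat(),  # audit trail: when
        "pdf_sha256": hashlib.sha256(pdf_bytes).hexdigest(),  # integrity check
    }

record = build_case_record(
    "2021-CV-00123",
    "https://example-court.gov/opinions/123.pdf",  # hypothetical URL
    b"%PDF-1.4 ...",
)
```

Because every record carries its source URL, fetch timestamp, and content hash, any downstream analytics result can be traced back to the exact page and file it came from.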
Our Approach
Domain Mapping: Cataloged 10 court sites and their navigation flows
Hybrid Fetching: Used Scrapy for HTML metadata; Selenium for PDF downloads behind JS buttons
Rate Management: Implemented exponential backoff and request quotas
Archiving: Uploaded PDFs to AWS S3; stored JSON metadata in MongoDB
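The exponential-backoff step above can be sketched as a small retry wrapper. This is an illustrative version, not the production code: `fetch` stands in for any callable that raises on a rate-limit response, and the retry counts and delays are assumed defaults.

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0, cap=60.0):
    """Retry a fetch callable, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Exponential growth, capped, with jitter so parallel
            # spiders don't retry in synchronized bursts.
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
```

The jitter matters in practice: without it, several spider processes that hit the same rate limit together would all retry at the same instant and trip it again.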
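The archiving step pairs each PDF's S3 object key with a MongoDB metadata document. A minimal sketch of the key layout, assuming a bucket name and one-prefix-per-court convention; the actual boto3 and pymongo calls are shown as comments so the logic stays self-contained:

```python
BUCKET = "court-case-archive"  # assumed bucket name

def s3_key_for(court, case_id):
    """Deterministic object key: one prefix per court, one object per case."""
    return f"{court}/{case_id}.pdf"

def archive_case(court, case_id, source_url, pdf_bytes):
    key = s3_key_for(court, case_id)
    # boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=pdf_bytes)
    doc = {
        "case_id": case_id,
        "source_url": source_url,
        "s3_uri": f"s3://{BUCKET}/{key}",  # links metadata to the stored PDF
    }
    # pymongo.MongoClient()["cases"]["metadata"].insert_one(doc)
    return doc
```

Deterministic keys make the upload idempotent: re-crawling a case overwrites the same object instead of accumulating duplicates.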
Results & Impact
✦ Harvested 25K cases with full-text PDFs in the first month
✦ Ensured 100% auditability via URL + timestamp logs
✦ Saved legal researchers dozens of hours per week
Tools & Technologies
Python, Scrapy, Selenium, MongoDB, AWS S3
Client Testimonial
“Their scraper gave us instant access to thousands of cases—we couldn’t have done it manually in a lifetime.”