The challenge
A legal intelligence platform covering South Florida's civil court system had a fundamental problem: it was running every day, spending money on API calls, and writing zero rows to its production database.
Not zero useful rows. Zero rows, period.
The platform was designed to discover civil depositions scheduled across Broward County, Miami-Dade County, and Palm Beach County — one of the largest legal markets in the United States. Court reporters, legal support firms, and litigation support services depend on this kind of advance intelligence to allocate staff, manage capacity, and respond to client opportunities.
The system had all the right components: a scraper, an LLM extraction engine, a BigQuery data warehouse, and a set of county-specific connectors. But it was using CourtListener as its primary data source for Florida state courts. CourtListener has virtually no Florida state civil court coverage. Everything downstream of that decision — the schema, the extraction logic, the gold views — was accumulating no data because the primary feed was empty.
The engagement
We embedded with the platform team to run a complete technical audit before writing any new code. The rule is simple: read before you write. In this case, reading meant tracing every data path from source to database to understand exactly where the pipeline was failing.
The audit identified five bugs and one foundational data source problem.
The data source problem
The scraper was built around CourtListener's RECAP archive. RECAP is excellent — for federal courts. It covers all three Florida federal districts (Southern, Middle, Northern) with near-real-time docket updates. For federal civil cases in Fort Lauderdale, Miami, and West Palm Beach, it remains the best free source available.
It does not cover Florida state circuit courts.
The 17th Circuit (Broward), 11th Circuit (Miami-Dade), and 15th Circuit (Palm Beach) are the courts where the overwhelming majority of civil depositions occur. They have their own clerk systems, their own APIs, and their own data formats. None of them are in CourtListener.
Fixing the data source wasn't a code change. It was a strategy change: CourtListener stays for federal cases only; Broward Clerk API, Miami-Dade commercial data, and a new Palm Beach scraper take over for state courts.
The Broward fix
The Broward Clerk API was already integrated. It was using the wrong endpoint.
The scraper was calling search_cases_filed, which returns cases by filing date and requires an individual docket pull for each case to find hearing entries. That approach generated hundreds of API calls per day to discover what the search_hearing endpoint returns in a single call:
GET /api/search_hearing?court_type_code=CV&date={date}&hearing_code=DEP&auth_key={key}
This returns every deposition scheduled in Broward County on a given date in one API request. We switched to this endpoint and immediately had real hearing data flowing into the pipeline.
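As a rough illustration, the single-call query could be assembled like this. The base URL is a placeholder (the article gives only the path and query parameters), the ISO date format is an assumption, and build_search_hearing_url is a hypothetical helper name, not part of the clerk's API.

```python
from urllib.parse import urlencode

# Placeholder host: only the path and query parameters come from the article.
BASE_URL = "https://clerk.example.gov"


def build_search_hearing_url(date: str, auth_key: str, base_url: str = BASE_URL) -> str:
    """Build one search_hearing request covering every DEP hearing on `date`.

    The date format (YYYY-MM-DD here) is an assumption; check the API docs.
    """
    params = {
        "court_type_code": "CV",  # value taken from the article's example
        "date": date,
        "hearing_code": "DEP",    # value taken from the article's example
        "auth_key": auth_key,
    }
    return f"{base_url}/api/search_hearing?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_hearing_url("2026-01-15", "YOUR_KEY"))
```

One request per calendar date replaces the earlier pattern of one search plus one docket pull per case.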
The Miami-Dade fix
Miami-Dade was the most expensive problem. The connector was generating sequential case numbers and probing a pay-per-request API at $0.20 per request. With a daily budget cap of $20, that allowed 100 probes per run. The case number estimation logic was wrong for 2026 numbers, so the hit rate was near zero.
The platform was spending $600/month on API calls that returned nothing.
Miami-Dade Clerk offers a Commercial Data Services program: a bulk FTP subscription to the Civil feed for $110/month. Daily bulk file dumps, weekly consolidated files, full docket events — everything the probe loop was trying to extract at a fraction of the cost.
We stopped the probe loop immediately (a five-minute change), submitted the notarized registration form, and rewrote the Miami-Dade connector to consume the bulk feed when credentials arrived.
Cost impact: $600/month to $110/month.
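The arithmetic behind these figures can be checked in a few lines. This is a sketch that assumes the daily cap was reached every day of a 30-day month; the per-request price, cap, and subscription fee are the figures quoted above.

```python
# All amounts in integer cents to avoid floating-point drift.
COST_PER_PROBE_CENTS = 20          # $0.20 per request (from the article)
DAILY_CAP_CENTS = 2_000            # $20 daily budget cap (from the article)
BULK_FEED_MONTHLY_CENTS = 11_000   # $110/month Civil feed (from the article)
DAYS_PER_MONTH = 30                # assumption: cap hit daily, 30-day month

probes_per_run = DAILY_CAP_CENTS // COST_PER_PROBE_CENTS
probe_monthly_cents = DAILY_CAP_CENTS * DAYS_PER_MONTH
monthly_savings_cents = probe_monthly_cents - BULK_FEED_MONTHLY_CENTS

print(f"probes per run: {probes_per_run}")
print(f"probe loop:     ${probe_monthly_cents / 100:.0f}/month")
print(f"savings:        ${monthly_savings_cents / 100:.0f}/month")
```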
The Palm Beach fix
Palm Beach County (15th Circuit) had no connector at all. The third-largest legal market in South Florida was entirely absent from the data warehouse.
The county has no API. Their eCaseView portal is a JavaScript-rendered web application that accepts case type and date range searches. We built an LLM-powered scraper using Playwright for navigation and GPT-4.1-mini for structured extraction from the HTML results.
Cost to run: approximately $0.02/day. Each page of results contains about 20 cases; at 500 new civil cases per day in Palm Beach, that is roughly 25 pages per day, so the daily LLM extraction cost is essentially a rounding error.
The five code bugs
Bug 1 (one-line fix, total pipeline blockage): The LLM engine defined its extraction method as extract_proceeding_info(). The orchestrator called it as extract_proceeding(). The two names differ only by the _info suffix, but that was enough: zero LLM extraction results from day one.
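A minimal sketch of how a name mismatch like this can stay silent, and of a fail-fast check that would surface it at startup. The broad except is an assumption about why nothing was raised; the article does not show the orchestrator. LLMEngine, run_silently, and require_methods are hypothetical names.

```python
class LLMEngine:
    """Stand-in for the extraction engine; only the method name matters here."""

    def extract_proceeding_info(self, text: str) -> dict:
        return {"raw": text}


def run_silently(engine, entries):
    """Anti-pattern (assumed): a broad except hides the AttributeError."""
    results = []
    for entry in entries:
        try:
            results.append(engine.extract_proceeding(entry))  # wrong name
        except Exception:
            continue  # swallowed, so the run "succeeds" with zero results
    return results


def require_methods(obj, names):
    """Fail fast at startup if a collaborator lacks an expected method."""
    missing = [n for n in names if not callable(getattr(obj, n, None))]
    if missing:
        raise TypeError(f"{type(obj).__name__} is missing: {', '.join(missing)}")


if __name__ == "__main__":
    engine = LLMEngine()
    print(run_silently(engine, ["Notice of Taking Deposition"]))  # prints []
    require_methods(engine, ["extract_proceeding"])  # raises TypeError
```

Running a check like require_methods when the pipeline is wired together turns an empty result set into an immediate, named error.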
Bug 2 (five-minute fix, $600/month savings): The Miami-Dade probe loop was the most expensive bug. Commenting out one function call stopped the waste.
Bug 3 (two-hour rewrite): Switching Broward from search_cases_filed to search_hearing required rewriting the discovery loop. The new endpoint returns structured hearing data directly; the old approach required individual docket pulls for each case.
Bug 4 (30-minute fix, 90% LLM cost reduction): The extraction engine was calling the LLM on every docket entry, including "Notice of Appearance," "Summons Issued," and "Filing Fee Paid." A regex pre-filter gate that checks for deposition-related keywords before any LLM call eliminated about 90% of inference calls.
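The pre-filter gate from Bug 4 might look something like the sketch below. The keyword pattern is illustrative only, since the article does not give the actual one, and extract_all with its llm_call parameter is a hypothetical wrapper around whatever function invokes the model.

```python
import re
from typing import Callable, Iterable, List

# Illustrative keyword pattern (assumption): matches deposition-related terms
# but not look-alikes such as "deposit".
DEPOSITION_PATTERN = re.compile(
    r"\b(?:deposition\w*|depose[ds]?|deposing|deponent\w*)\b",
    re.IGNORECASE,
)


def needs_llm(docket_text: str) -> bool:
    """Cheap gate: only deposition-related entries are worth an LLM call."""
    return DEPOSITION_PATTERN.search(docket_text) is not None


def extract_all(entries: Iterable[str], llm_call: Callable[[str], dict]) -> List[dict]:
    """Run the (expensive) llm_call only on entries that pass the gate."""
    return [llm_call(text) for text in entries if needs_llm(text)]


if __name__ == "__main__":
    docket = [
        "Notice of Appearance",
        "Summons Issued",
        "Filing Fee Paid",
        "Notice of Taking Deposition of John Doe",
    ]
    print([text for text in docket if needs_llm(text)])
```

A gate like this trades a small risk of missed entries for a large drop in inference calls, so the keyword list is worth reviewing against real docket text.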
Bug 5 (one-hour fix): The Florida Bar member directory scraper had no rate limiting or caching. It was hitting the public search interface without delays, triggering 429 errors. Adding a two-second delay between requests and writing successful lookups to a BigQuery cache table resolved both problems and turned the cache into a self-building attorney database.
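The rate-limit-plus-cache pattern from Bug 5 can be sketched as follows. An in-memory dict stands in for the BigQuery cache table, and CachedRateLimitedLookup and its fetch callable are hypothetical names; the two-second interval is the figure quoted above.

```python
import time
from typing import Callable, Dict, Optional


class CachedRateLimitedLookup:
    """Serve repeat lookups from a cache; space out real requests."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[dict]],
        min_interval: float = 2.0,
        cache: Optional[Dict[str, dict]] = None,  # stand-in for the cache table
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self._fetch = fetch
        self._min_interval = min_interval
        self._cache = {} if cache is None else cache
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def get(self, key: str) -> Optional[dict]:
        if key in self._cache:
            return self._cache[key]  # cache hit: no request, no delay
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        try:
            result = self._fetch(key)
        finally:
            self._last_call = self._clock()  # count failed requests too
        if result is not None:
            self._cache[key] = result  # only successful lookups are cached
        return result
```

Because only successful lookups are written back, the cache grows into a reusable attorney table as a side effect of normal runs.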
The result
After the fixes, the platform was discovering hundreds of civil proceedings daily across Broward, Miami-Dade, and Palm Beach counties. The database moved from zero rows to a growing, queryable record of scheduled depositions, hearings, and civil proceedings.
Total monthly infrastructure cost dropped from approximately $680 to $173 — a 75% reduction — while coverage expanded from zero to three counties.
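As a quick check on the percentage, using the approximate before and after figures quoted above:

```python
before_usd = 680  # approximate monthly cost before the fixes (from the article)
after_usd = 173   # monthly cost after the fixes (from the article)

reduction = (before_usd - after_usd) / before_usd
print(f"{reduction:.1%}")  # 74.6%, which rounds to the 75% quoted
```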
The schema migration from deposition_notices to proceeding_events reflected the broader shift: the platform now tracks all civil proceedings, not only depositions, which opened additional market applications for the intelligence.
Lessons
- Trace the pipeline before writing new code. Every hour spent auditing saved days of building in the wrong direction. The data source problem would have persisted through any number of improvements to downstream components.
- Free data has a cost model too. The Broward approach of pulling individual dockets to find hearings was technically correct but economically irrational once the direct hearings endpoint existed. Understanding the API's full endpoint set before building changed the cost model entirely.
- Bulk subscriptions beat probe loops. For well-organized public data sources, bulk subscriptions are almost always cheaper, more reliable, and more complete than API-based discovery. The $490/month saved on Miami-Dade paid for the engagement in two months.
- One wrong method name can silence an entire system. The LLM extraction failure came down to a single mismatched method name. Systems that fail silently, returning empty results rather than throwing exceptions, make this kind of bug invisible until you audit every component end-to-end.
PurviewX builds and repairs AI data pipelines for organizations whose systems should be generating intelligence but aren't.