The challenge
A distribution company needed to enrich its property database with owner contact information. They had 100,000+ property records with addresses but no reliable way to reach the owners. Manual research was too slow. Existing data vendors were too expensive. They needed a scalable, cost-effective pipeline that could process their entire database and keep it current.
The engagement
We embedded with the operations team to understand the data landscape. The property records came from multiple sources with inconsistent formatting — different address conventions, varying levels of completeness, duplicate records from overlapping sources.
Building the pipeline
The enrichment pipeline used a multi-source approach: geocoding for address normalization, public records for ownership data, and targeted API calls for contact information. Each step was designed to filter before spending — the cheapest operations run first to eliminate records that don't need expensive enrichment.
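The ordering is the point, so it's worth sketching. Here is a minimal Python rendering of the filter-before-spending idea — every function name is an illustrative stand-in, not the pipeline's actual code:

```python
# Sketch of filter-before-spending ordering. All names here are
# illustrative stand-ins, not the engagement's actual pipeline API.

def normalize_address(record):
    # Cheap, local: uppercase and collapse whitespace so later
    # comparisons use a consistent key.
    record["address"] = " ".join(record["address"].upper().split())
    return record

def fetch_contact_info(record, api_calls):
    # Stand-in for the expensive paid API call; invocations are
    # counted to show how few records survive the cheap filters.
    api_calls.append(record["address"])
    record["contact"] = f"owner-of:{record['address']}"
    return record

def enrich_batch(records):
    api_calls = []
    # 1. Free: drop records with no address at all.
    records = [r for r in records if r.get("address")]
    # 2. Cheap: normalize locally, then deduplicate on the normalized key.
    seen, unique = set(), []
    for r in map(normalize_address, records):
        if r["address"] not in seen:
            seen.add(r["address"])
            unique.append(r)
    # 3. Expensive: only the survivors hit the paid API.
    return [fetch_contact_info(r, api_calls) for r in unique], api_calls

batch = [
    {"address": "12 Oak St"},
    {"address": "  12  oak st "},  # duplicate after normalization
    {"address": ""},               # incomplete; filtered for free
]
enriched, calls = enrich_batch(batch)
print(len(calls))  # 1 paid call for 3 input records
```

Three input records, one paid call: the free and cheap stages absorb the rest before any money is spent.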
We cached every API result locally before writing to the production database. This wasn't optional caution — it was a hard rule. If the database write failed, we could replay from cache at zero additional API cost.
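A minimal sketch of that rule, assuming a simple file-per-record cache layout (the real cache's structure isn't shown in this writeup):

```python
import json
import os
import tempfile

class ResultCache:
    """Illustrative local cache: one JSON file per record key."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, f"{key}.json")

    def get(self, key):
        path = self._path(key)
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        return None

    def put(self, key, result):
        with open(self._path(key), "w") as f:
            json.dump(result, f)

def enrich(key, cache, call_api):
    # Replay from cache if present; otherwise pay once and persist
    # the result before anything downstream can fail.
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = call_api(key)
    cache.put(key, result)  # hard rule: cache before the database write
    return result

# Demo with a fake paid API that counts its own invocations.
calls = []
def fake_api(key):
    calls.append(key)
    return {"contact": f"owner-of:{key}"}

cache = ResultCache(tempfile.mkdtemp())
enrich("rec1", cache, fake_api)
enrich("rec1", cache, fake_api)  # replayed from cache
print(len(calls))  # 1 — the second call cost nothing
```

If the downstream write fails after `cache.put`, the replay path is the cache hit: same result, zero additional API spend.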
The mistakes
Six weeks in, we discovered two significant bugs.
Bug 1: The dedup query. Our deduplication logic was matching on city name strings across tables with different naming conventions. "Portland" didn't match "PORTLAND" didn't match "Portland, OR." Records that should have been flagged as duplicates were passing through as unique, generating redundant API calls.
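The mechanism is easy to reproduce. Here is a Python rendering of the mismatch — the production logic was a SQL query, but the failure mode is identical:

```python
# Naive string equality across sources treats the same city as
# three different values; normalization collapses them to one.

def naive_match(a, b):
    return a == b

def normalized_city(value):
    # Strip state suffixes like ", OR" and compare case-insensitively.
    return value.split(",")[0].strip().upper()

variants = ["Portland", "PORTLAND", "Portland, OR"]

print(all(naive_match(variants[0], v) for v in variants))  # False
print({normalized_city(v) for v in variants})              # {'PORTLAND'}
```

Every variant that failed the naive comparison became a "unique" record and another paid API call.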
Bug 2: Name parsing. Our name parser was reading first and last names backwards for a subset of records where the source data used "Last, First" format. We were enriching records with reversed names, generating contacts that would be addressed incorrectly.
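The fix was to detect the format before parsing rather than assume "First Last" everywhere. A simplified sketch, using an illustrative comma heuristic:

```python
def parse_name(raw):
    """Return (first, last), handling both 'First Last' and 'Last, First'."""
    raw = raw.strip()
    if "," in raw:
        # "Last, First" format: swap around the comma.
        last, first = (part.strip() for part in raw.split(",", 1))
    else:
        # "First Last" format: treat the final token as the surname.
        first, _, last = raw.rpartition(" ")
        if not first:  # single-token name
            first, last = last, ""
    return first, last

print(parse_name("Smith, Jane"))  # ('Jane', 'Smith')
print(parse_name("Jane Smith"))   # ('Jane', 'Smith')
```

The buggy version was effectively the `else` branch applied unconditionally, which is why "Smith, Jane" records came out reversed.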
The response
We didn't minimize these bugs. We quantified them.
The dedup bug accounted for 19% of total API spend — duplicate records that should have been caught before they reached paid APIs. We calculated the exact dollar amount, broken down by batch, with a timeline showing when the bug was introduced and when it was caught.
The name parsing bug affected approximately 8% of enriched records. We identified every affected record, corrected the data, and re-ran enrichment where the reversed names had produced incorrect contact matches.
We presented all of this to the client in a dedicated meeting. Not buried in a status update. A full presentation with root cause analysis, financial impact, fixes implemented, and safeguards added.
The safeguards
After the bugs, we added layers of protection:
- ZIP-based matching replaced city name matching for deduplication. ZIP codes are unambiguous.
- Python-side set checks before every API call. Even if SQL dedup missed something, the application layer catches it.
- Yield monitoring per batch. If the ratio of new records to total records in a batch drops below an expected threshold, the pipeline stops automatically. Zero-yield batches trigger an immediate alert.
- Name format detection that identifies "Last, First" patterns before parsing, handling both formats correctly.
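The yield monitor is the safeguard most worth sketching. A simplified version, with an assumed 20% threshold — the actual threshold values aren't specified in this writeup:

```python
MIN_YIELD = 0.20  # assumed threshold for illustration

def check_batch_yield(new_records, total_records, min_yield=MIN_YIELD):
    """Stop the pipeline when a batch yields too few new records.

    Raises RuntimeError so the caller halts instead of silently
    spending API budget on records that aren't new.
    """
    if total_records == 0:
        raise RuntimeError("empty batch: nothing to measure")
    if new_records == 0:
        raise RuntimeError("zero-yield batch: alert and stop immediately")
    ratio = new_records / total_records
    if ratio < min_yield:
        raise RuntimeError(
            f"yield {ratio:.0%} below threshold {min_yield:.0%}: stopping pipeline"
        )
    return ratio
```

Called once per batch before the expensive enrichment stage, this turns a silent waste pattern (reprocessing records that already exist) into a loud, immediate stop.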
The result
After fixes, the pipeline processed 100,000+ properties at an average cost of $0.05 per enriched contact. The safeguards we added after the bugs have prevented any similar waste in subsequent runs.
The client didn't just continue the engagement — they expanded it. New property databases, new geographic regions, new enrichment sources. The failure documentation became the foundation of trust that made expansion possible.
Lessons
- 19% of spend was preventable waste. We quantified every dollar. Hiding waste doesn't make it disappear. Quantifying it makes it fixable.
- The client expanded after seeing the failure report. Transparency builds more trust than perfection. Clients who see you handle failure well trust you to handle success accurately.
- SQL string matching across different sources is always broken. Different systems format data differently. Always normalize before comparing. Always add a second layer of verification.
- Cache before you spend. Every paid API result gets cached locally before any downstream processing. This isn't just good engineering — it's financial insurance. A $0 replay is always better than a repeat API call.
- Transparency is a better sales tool than a pitch deck. We didn't expand this engagement by pitching. We expanded it by proving we could be trusted with honest reporting.