Case study
Dallas County Eviction Pipeline
Production data pipeline processing 40,000+ eviction records annually.
40K+ records / year
12+ partner orgs
5+ years in production
Problem
Dallas County processes roughly 40,000 eviction filings each year, among the highest counts in the country. But the data describing those filings was effectively invisible to the people who needed it most: tenants facing displacement, the legal aid organizations trying to reach them, the journalists covering the housing crisis, and the researchers studying it.
Court records sat in a county system designed for case management, not analysis. Field definitions changed without notice. Address formatting was inconsistent. There was no reliable way to aggregate filings into trend lines, map them onto neighborhoods, or join them to demographic data. Each partner organization that wanted the data was rebuilding the same cleaning logic from scratch. More often, they were just doing without it.
When CPAL committed to making housing instability a focus area, we needed a single authoritative pipeline that everyone (internal teams, external partners, the public) could build on top of.
Approach
I built the first version in R in 2020 when I joined CPAL as an analyst. The conceptual architecture has been remarkably durable across five years of scaling. The implementation hasn't been: the pipeline started as R scripts running on cron, and has since been rebuilt on Databricks (Python notebooks orchestrated through Workflows on AWS) as the system grew past what one analyst could maintain.
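For a sense of what that orchestration looks like, here's a minimal sketch of a four-stage daily job expressed as a Databricks Jobs API payload; the task names, notebook paths, and schedule are illustrative, not the production configuration. The four stages themselves are described just below.

```python
# Illustrative only: a Jobs API 2.1-style payload describing a four-task DAG.
# Notebook paths, task keys, and the cron expression are hypothetical.
pipeline_job = {
    "name": "eviction-pipeline-daily",
    "schedule": {
        # Weekday mornings, before partner intake teams start work (illustrative).
        "quartz_cron_expression": "0 0 6 ? * MON-FRI",
        "timezone_id": "America/Chicago",
    },
    "tasks": [
        {"task_key": "ingest",
         "notebook_task": {"notebook_path": "/pipeline/ingest"}},
        {"task_key": "clean_geocode",
         "depends_on": [{"task_key": "ingest"}],
         "notebook_task": {"notebook_path": "/pipeline/clean_geocode"}},
        {"task_key": "enrich",
         "depends_on": [{"task_key": "clean_geocode"}],
         "notebook_task": {"notebook_path": "/pipeline/enrich"}},
        {"task_key": "publish",
         "depends_on": [{"task_key": "enrich"}],
         "notebook_task": {"notebook_path": "/pipeline/publish"}},
    ],
}
```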
The pipeline does four things:
- Ingest. Pulls the daily SFTP feed from the Dallas County court system and normalizes fields that change shape between releases.
- Clean and geocode. Standardizes addresses, geocodes them to coordinates, joins to census tract / zip / neighborhood boundaries, and surfaces records that fail validation for human review (sketched after this list).
- Enrich. Joins demographic data from the American Community Survey to enable equity analysis (filings by race/ethnicity, income, household composition).
- Publish. Outputs to CKAN for partner organizations, feeds the public-facing North Texas Evictions site, and powers internal Shiny dashboards used by CPAL teams.
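For a sense of what the clean-and-geocode stage involves, here's a minimal sketch, assuming a filings DataFrame with a raw `property_address` column and a tract boundary GeoDataFrame; the `geocode()` helper is a placeholder for whatever geocoding service a pipeline like this uses, not the production implementation.

```python
import geopandas as gpd
import pandas as pd

def geocode(address: str) -> tuple[float, float]:
    """Placeholder for the real geocoding call; returns (lat, lon), NaNs on failure."""
    return (float("nan"), float("nan"))

def clean_and_geocode(
    filings: pd.DataFrame, tracts: gpd.GeoDataFrame
) -> tuple[gpd.GeoDataFrame, pd.DataFrame]:
    """Standardize addresses, geocode, join to tract boundaries, flag failures."""
    df = filings.copy()
    # Light normalization; the real cleaning rules are more involved.
    df["address_std"] = (
        df["property_address"].str.upper().str.replace(r"\s+", " ", regex=True).str.strip()
    )
    coords = df["address_std"].map(geocode)
    df["lat"] = coords.map(lambda c: c[0])
    df["lon"] = coords.map(lambda c: c[1])

    # Anything that fails geocoding goes to a human review queue instead of being dropped.
    failed = df[df["lat"].isna()]
    ok = df.dropna(subset=["lat", "lon"])

    points = gpd.GeoDataFrame(
        ok, geometry=gpd.points_from_xy(ok["lon"], ok["lat"]), crs="EPSG:4326"
    )
    # Spatial join onto census tract boundaries; `tracts` is assumed to carry a GEOID column.
    joined = gpd.sjoin(points, tracts.to_crs("EPSG:4326"), how="left", predicate="within")
    return joined, failed
```

The real stage does more than this (zip and neighborhood joins, validation rules), but the shape is the same: normalize, geocode, spatially join, route failures to review.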
The single biggest design decision was treating partner organizations as the primary users rather than incidental consumers. That shaped every choice that followed: schema stability (partners write code against our outputs), versioned data releases (so analyses don't break mid-month), documented quirks (so partners don't rediscover the same edge cases), and reliability (so legal-aid intake systems aren't pinned to a flaky upstream).
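One concrete form those commitments take is checking every release against a pinned schema before anything is published, and writing each release to a versioned, dated location. The sketch below is illustrative; the column names, bucket, and layout are stand-ins, not the actual partner contract.

```python
import pandas as pd

# Pinned output contract for partner releases (example columns, not the real schema).
PARTNER_SCHEMA_V2 = {
    "case_number": "string",
    "filed_date": "datetime64[ns]",
    "tract_geoid": "string",
    "zip_code": "string",
}

def validate_release(df: pd.DataFrame, schema: dict[str, str]) -> None:
    """Refuse to publish a release that breaks the contracted schema."""
    missing = set(schema) - set(df.columns)
    if missing:
        raise ValueError(f"missing contracted columns: {sorted(missing)}")
    wrong = {
        col: (str(df[col].dtype), want)
        for col, want in schema.items()
        if str(df[col].dtype) != want
    }
    if wrong:
        raise TypeError(f"dtype drift against the contract: {wrong}")

def publish_release(df: pd.DataFrame, release: str) -> None:
    """Write a versioned release so partner analyses keep pointing at a stable snapshot."""
    validate_release(df, PARTNER_SCHEMA_V2)
    # Hypothetical bucket and layout; the point is that each release is addressable
    # by schema version + date, so nothing breaks mid-month.
    df.to_parquet(f"s3://example-bucket/evictions/v2/{release}/filings.parquet", index=False)
```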
On working with sensitive data. Eviction filings are court records: technically public, ethically loaded. The pipeline publishes aggregated data at the tract level on the public site; record-level data (names, addresses, case detail) goes only to partners with executed data-use agreements who actually need it to reach tenants. We don't publish defendant names. We vet who gets record-level access. Not every requester has the operational capacity to handle this data carefully, and not every requester has good intentions in the first place. The Princeton Eviction Lab's published ethics framework is part of how I think about this.
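The public/partner split comes down to one aggregation step. Here's a minimal sketch of what that boundary can look like in code; the column names and the small-cell suppression threshold are illustrative, not the site's actual rules.

```python
import pandas as pd

def public_tract_counts(record_level: pd.DataFrame, min_cell: int = 10) -> pd.DataFrame:
    """Aggregate record-level filings to tract x month for the public site.

    Names, addresses, and case detail never leave this function; only counts do.
    """
    counts = (
        record_level
        .assign(filed_month=record_level["filed_date"].dt.to_period("M").astype(str))
        .groupby(["tract_geoid", "filed_month"], as_index=False)
        .size()
        .rename(columns={"size": "filings"})
    )
    # Illustrative small-cell suppression: blank out sparse tract/month combinations
    # so individual households can't be re-identified from the public counts.
    counts["filings"] = counts["filings"].mask(counts["filings"] < min_cell)
    return counts
```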
Outcome
What the pipeline does in practice: every weekday, newly filed eviction cases are sent to local legal aid and tenant-outreach organizations, who use the data to contact the tenants named in those filings directly, offering support, legal advice, and connection to assistance programs. Partner orgs handle hundreds to thousands of cases a year through this work, reaching tenants who would otherwise never have been contacted. The pipeline is also the daily feed the Princeton Eviction Lab uses for its national eviction tracking, and the foundation behind North Texas Evictions, the public transparency site that advocates, residents, and reporters use to understand displacement patterns in Dallas County.
The more durable outcome is structural: 12+ organizations (legal aid, advocacy, government, journalism, academic research) now build on shared infrastructure rather than each maintaining their own brittle copies of the same court data. That's the part that compounds.
Reflection
Five years of running production data infrastructure for a small nonprofit taught me a handful of things I'd do the same way again:
- Reliability is a feature, not an afterthought. The first time a partner's intake system breaks because your schema changed, you understand exactly why upstream stability matters. Versioning and clear deprecation windows beat heroics.
- Document the weird stuff. Dallas court fields change definitions silently. Address parsing has corner cases that bite once a quarter. Every recurring quirk goes in a docs folder so the next analyst (or me, six months from now) doesn't burn an afternoon rediscovering it.
- Build for the people downstream. When you treat partners as users rather than as people lucky to receive your output, the system gets better at the things that actually matter: schema stability, data freshness, documentation, support.
- Architecture is the durable thing, not the stack. The pipeline started as R scripts running on cron. It now runs as Python notebooks on Databricks Workflows. The four-stage decomposition, the partner-as-user mindset, and the schema commitments held up across the rewrite. Stack churn is a real cost; designing for the next migration before you need it is the boring discipline that makes the migration cheap when it comes.