What Happens When All the World's Open Data Lives in One Place

Open data has a discovery problem, not an access problem. When you centralize datasets from hundreds of portals, entirely new capabilities emerge: knowledge graphs that reveal hidden connections, bridge datasets that make cross-agency joins possible, and a compounding network where every new dataset makes every existing one more useful.

Riley Hilliard
Creator of OpenData · Jan 22, 2026 · 11 min

A climate researcher wants to understand how rising temperatures affect agricultural productivity in Sub-Saharan Africa. Straightforward question. The data she needs exists and is publicly available.

NASA publishes surface temperature records going back decades. The UN Food and Agriculture Organization tracks crop yields by country and year. The World Bank publishes GDP figures so she can add economic context. Three datasets, three agencies, all public.

She spends two days before writing a single line of analysis.

Not because the data is hidden. Because NASA uses ISO numeric country codes, the FAO uses its own internal identifier system, and the World Bank uses ISO alpha-3 codes. The same country has three different IDs depending on who published the data. Before she can even put these datasets in the same spreadsheet, she has to build a mapping table between three identifier systems that were never designed to work together.
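To make the plumbing concrete, here's a minimal Python sketch of that mapping table. The crosswalk rows and every number below are hypothetical placeholders, standing in for the real NASA, FAO, and World Bank files:

```python
# Hypothetical mini-crosswalk: the same country under three identifier systems.
# These code values are illustrative, not authoritative mappings.
CROSSWALK = [
    {"iso_numeric": "404", "fao_code": "114", "iso_alpha3": "KEN", "name": "Kenya"},
    {"iso_numeric": "566", "fao_code": "159", "iso_alpha3": "NGA", "name": "Nigeria"},
]

nasa_temp = {"404": 25.1, "566": 27.3}    # keyed by ISO numeric code
fao_yield = {"114": 1.6, "159": 1.1}      # keyed by FAO's internal code
wb_gdp = {"KEN": 107.0, "NGA": 472.6}     # keyed by ISO alpha-3 code

# Join all three sources through the crosswalk row for each country.
merged = [
    {
        "country": row["name"],
        "temp_c": nasa_temp[row["iso_numeric"]],
        "yield_t_ha": fao_yield[row["fao_code"]],
        "gdp_bn_usd": wb_gdp[row["iso_alpha3"]],
    }
    for row in CROSSWALK
]
```

Ten lines of join logic, but only after someone has built and verified the crosswalk itself, which is where the two days go.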

This is normal. According to an Anaconda survey, data scientists spend roughly 45% of their time on data preparation and cleaning. For researchers working across agencies and countries, that number is conservative. Other studies put the broader category of “discovery and preparation” at 60-80% of total project time. The actual analysis, the part that produces insight, gets whatever time is left.

The problem isn’t that government data is hidden. It’s that it’s scattered. And governments are not great at building web portals. Each agency builds its own, with its own search interface, its own data format, its own documentation style (or lack thereof). The BLS serves directory listings that look like they were designed in 1996. The Census Bureau uses column codes like B01001_001E. FRED uses a literal dot for missing values. None of them coordinated.

You can’t search for “climate change data” and get back a useful, ranked list of the best sources. You get a verbose list of hard-to-navigate government websites. data.gov alone has 300,000+ dataset listings, but it’s a catalog of links, not a query interface. Finding the right dataset is like finding a book in a library where the card catalog was written by 50 different librarians who each invented their own filing system.

The real power of putting all open data in one place isn’t convenience. It’s that entirely new capabilities become possible. Capabilities that are literally impossible when data lives scattered across hundreds of separate portals.

Search Is the Wrong Tool for Data Discovery

Every data platform today works the same way: index the metadata, build a search bar, return ranked results. Kaggle has 50 million registered users and over 50,000 datasets, but it can’t tell you which of those datasets share a common key you could use to combine them. Google Dataset Search aggregates metadata from across the web but surfaces zero structural relationships between sources.

Search answers the question “which datasets match my keywords?” That’s rarely what someone actually needs. The real question is: “which datasets can I combine with this one, and how?”

When you search for “unemployment data,” you want more than a list of things that contain the word “unemployment.” You want to know which of these have county-level granularity. Which ones have a date column you can align with your existing data. Which ones share a geographic identifier with the Census data you’re already working with. Which ones update monthly versus annually.

These are structural questions about the data itself, not keyword questions about the metadata. Answering them requires knowing how datasets relate to each other, not just what they contain individually.

A keyword search for “unemployment” will never surface a FIPS code crosswalk table. But that crosswalk might be the single most important dataset for your project, because it’s the thing that lets you actually combine unemployment data from the BLS with health data from the CDC.

This is a graph problem.

The Dataset Graph

When all your datasets live in one place, you can build something that’s impossible on scattered portals: a knowledge graph of how every dataset relates to every other dataset.

Here’s what that looks like concretely. Take the Bureau of Labor Statistics’ Consumer Price Index, the number the news references when they say “inflation rose 3.2% this year.” We know several things about it:

  • It comes from the BLS (a specific provider)
  • It measures price changes (topics: economics, inflation)
  • It’s published at a national level, every month
  • Its columns include things like area_code, year, period, and value

Now take the Census Bureau’s American Community Survey. Different agency, different purpose, totally different format. But it also has geographic identifiers (FIPS codes for states and counties). It also has temporal data (annual). It also covers the United States.

These two datasets were created by different teams in different buildings in different decades. They don’t know about each other. But they share structural properties. And when you model those relationships explicitly (this dataset has these columns, covers this geography, belongs to this topic), something interesting happens.

The datasets stop being isolated search results. They start forming a network. A connected web where you can navigate from one dataset to related datasets through their actual structural relationships, not just through keyword matches.

             ┌──────────────┐
             │    Topic:    │
             │   Economics  │
             └──────┬───────┘
                    │
    ┌───────────────┼───────────────┐
    ▼               ▼               ▼
┌─────────┐    ┌──────────┐    ┌──────────┐
│ BLS CPI │    │ FRED GDP │    │World Bank│
│ Monthly │    │Quarterly │    │  Annual  │
└────┬────┘    └────┬─────┘    └────┬─────┘
     │              │               │
     ▼              ▼               ▼
┌─────────┐    ┌──────────┐    ┌──────────┐
│area_code│    │ country  │    │ iso_code │
│(geo ID) │    │ (geo ID) │    │ (geo ID) │
└────┬────┘    └────┬─────┘    └────┬─────┘
     │              │               │
     └──────────────┼───────────────┘
                    ▼
       ┌──────────────────────────┐
       │   FIPS / ISO Crosswalk   │
       │     (bridge dataset)     │
       └──────────────────────────┘

The three datasets at the top all share a topic (economics), but they use different geographic identifiers. BLS uses area codes. FRED uses country names. The World Bank uses ISO codes. The crosswalk table at the bottom is just a mapping file. It’s not interesting because of what it contains. It’s interesting because of where it sits in the network: it’s the bridge that connects three datasets from three different providers that would otherwise have no relationship to each other.

No search engine can tell you this. The crosswalk table doesn’t mention “CPI” or “GDP” in its metadata. It wouldn’t appear in any keyword search for economic data. But the graph reveals it as one of the most structurally important datasets in the catalog.

Bridge Datasets: The Invisible Infrastructure

Every network has nodes that sit at critical connection points. In a social network, these are the people who connect otherwise separate friend groups. The person who knows people in both the engineering team and the marketing team. Remove that person and two groups that used to be connected are suddenly isolated.

Dataset networks have the same structure. Some datasets sit at critical junctions, connecting clusters of data that would otherwise have no path between them. These are the reference tables and crosswalk files that make it possible for datasets from different agencies to talk to each other.

There’s a way to measure this. You count how often a dataset lies on the shortest path between any two other datasets in the graph. (In network science, this is called betweenness centrality, but the intuition is simpler than the name: it’s a measure of how much of a “bridge” something is.) When you run this measurement, you discover something surprising.

The most structurally important datasets in the catalog are often the most boring ones.

Dataset                What It Contains                       Joins It Enables
FIPS Code Crosswalk    State/county codes mapped to names     47 cross-provider joins
ISO Country Codes      Country identifiers across standards   34 international dataset joins
NAICS Industry Codes   Industry classification hierarchy      22 economic dataset joins
BLS Area Codes         BLS-specific geography mapping         18 labor data joins

FIPS code crosswalks. ISO country code mappings. NAICS industry classification tables. Nobody searches for these. Nobody writes papers about them. They wouldn’t show up on anyone’s “top 10 most interesting datasets” list. But they’re the connective tissue that makes everything else combinable.

The FIPS crosswalk is structurally the most important dataset in the catalog. Without it, Census data and BLS data and HUD data all exist in separate universes, despite all describing the same states and counties.
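The measurement itself is easy to sketch. Here's a toy version in plain Python: a hand-made five-node graph (the dataset names and edges are illustrative), a BFS shortest-path routine, and a crude bridge score that counts how often each node sits in the interior of a shortest path. Real betweenness centrality averages over all shortest paths between each pair; this counts just one per pair, which is enough to surface the bridge:

```python
from collections import deque
from itertools import combinations

# Toy dataset graph: the crosswalk bridges two otherwise separate clusters.
graph = {
    "bls_cpi":   {"bls_wages", "crosswalk"},
    "bls_wages": {"bls_cpi", "crosswalk"},
    "crosswalk": {"bls_cpi", "bls_wages", "wb_gdp", "fred_gdp"},
    "wb_gdp":    {"crosswalk", "fred_gdp"},
    "fred_gdp":  {"crosswalk", "wb_gdp"},
}

def shortest_path(g, start, goal):
    """BFS: the first path that reaches the goal is a shortest one."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in g[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Crude bridge score: on how many pairs' shortest paths does a node sit?
score = {n: 0 for n in graph}
for a, b in combinations(graph, 2):
    for mid in shortest_path(graph, a, b)[1:-1]:
        score[mid] += 1
```

On this toy graph, the crosswalk is the interior node of every cross-cluster path and every other node scores zero, which is exactly the shape the real catalog measurement reveals at scale.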

This has real practical implications. If the FIPS crosswalk is missing from your catalog, the graph reveals a gap: two large clusters of datasets that should be connected but aren’t. The system can automatically suggest bridge datasets when you need them: “These two datasets don’t share a direct key, but both connect through FIPS codes.” And if a bridge dataset breaks or goes stale, you know the blast radius immediately, because the graph tells you everything that depends on it.
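The suggestion logic can be as simple as: find a dataset that shares at least one column with each side. A minimal sketch with invented column inventories (all names here are illustrative, not the catalog's real schema):

```python
# Toy column inventories per dataset.
columns = {
    "bls_unemployment": {"area_code", "unemp_rate"},
    "cdc_health":       {"state_fips", "mortality_rate"},
    "fips_crosswalk":   {"area_code", "state_fips", "county_name"},
    "iso_crosswalk":    {"iso_alpha3", "iso_numeric"},
}

def suggest_bridges(cols, a, b):
    """Datasets that share at least one column with both a and b."""
    return [
        name for name, c in cols.items()
        if name not in (a, b) and c & cols[a] and c & cols[b]
    ]
```

`suggest_bridges(columns, "bls_unemployment", "cdc_health")` returns only the FIPS crosswalk: it overlaps the BLS data on `area_code` and the CDC data on `state_fips`, while the ISO crosswalk overlaps neither. A production version would also check column types and value overlap, but the core question is this set intersection.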

Multi-Hop Joins: Connecting the Unconnectable

This is where it gets fun.

A researcher has BLS unemployment data by county. She wants to combine it with CDC health outcomes data. These two datasets have nothing obvious in common. Different agencies, different schemas, different column names. In the old world, you’d either give up or spend a week figuring out the plumbing.

But the graph knows something you don’t. Both datasets contain geographic identifiers. The BLS data has a column called area_code. The CDC data has a column called state_fips. And there’s a FIPS crosswalk dataset that maps between them.

That’s a two-hop join. Think of it as a chain:

Unemployment data → FIPS Crosswalk (via area_code) → CDC health data (via state_fips).

The system discovers this path automatically. The researcher didn’t have to know the crosswalk existed. She didn’t have to know which column names to match. The graph found the path, verified that the column types are compatible, and can report the confidence level of the connection.
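Mechanically, a two-hop join is just two lookups chained together. A minimal Python sketch, with hypothetical rows and values (the column names follow the example above):

```python
# Illustrative single-row samples; real files have thousands of rows.
unemployment = [{"area_code": "ST0600000", "unemp_rate": 5.1}]
crosswalk    = [{"area_code": "ST0600000", "state_fips": "06"}]
cdc_health   = [{"state_fips": "06", "life_expectancy": 79.0}]

# Hop 1: area_code -> state_fips, via the bridge dataset.
fips_by_area = {r["area_code"]: r["state_fips"] for r in crosswalk}
# Hop 2: state_fips -> health outcomes.
health_by_fips = {r["state_fips"]: r for r in cdc_health}

joined = []
for row in unemployment:
    fips = fips_by_area.get(row["area_code"])
    health = health_by_fips.get(fips)
    if health:
        joined.append({**row, **health})
```

The hard part was never these ten lines. It was knowing the crosswalk existed and which columns to chain, which is precisely what the graph hands you.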

Now extend it. What else connects to unemployment through that same FIPS crosswalk?

Direct joins (1 hop): Census demographics (through FIPS codes), FRED economic indicators (through date columns), BLS wage data (through area codes).

Bridge joins (2 hops): HUD housing data (unemployment to FIPS to HUD fair market rents), CDC mortality data (unemployment to state FIPS to WONDER mortality records), Department of Education data (unemployment to FIPS to College Scorecard).

A researcher starts with one dataset and the graph fans out into an entire landscape of combinable data. This is the shift from “searching for data” to “navigating data.” You don’t type keywords into a search box. You start somewhere and follow connections. Each hop reveals new datasets you didn’t know existed, connected through paths you wouldn’t have found manually.
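That fan-out is easy to sketch as a breadth-first walk over a join graph. Everything below is illustrative (a hand-made graph whose edges mean "these datasets share a usable key"), grouping reachable datasets by how many joins away they are:

```python
from collections import deque

# Toy join graph; edges are illustrative.
joins = {
    "bls_unemployment": {"census_acs", "fred_rates", "fips_crosswalk"},
    "fips_crosswalk":   {"bls_unemployment", "hud_rents", "cdc_mortality"},
    "census_acs":       {"bls_unemployment"},
    "fred_rates":       {"bls_unemployment"},
    "hud_rents":        {"fips_crosswalk"},
    "cdc_mortality":    {"fips_crosswalk"},
}

def fan_out(graph, start, max_hops=2):
    """Group reachable datasets by join distance from the start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue
        for nxt in graph[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    by_hop = {}
    for node, h in hops.items():
        if h:
            by_hop.setdefault(h, set()).add(node)
    return by_hop

landscape = fan_out(joins, "bls_unemployment")
```

One-hop neighbors are the direct joins; the two-hop set is everything reachable only through the crosswalk bridge, which mirrors the direct-versus-bridge split above.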

A data scientist might look at this and think about the graph traversal algorithms underneath. A journalist might look at this and think “I can start with a county-level crime dataset and find out what else I can layer on top of it, without becoming a data engineer.” Both are right.

Communities That Emerge from the Data Itself

When you build a data catalog, you assign categories. Economics. Health. Demographics. Environment. These categories reflect how humans organize knowledge, and they’re useful. But they’re also static and somewhat arbitrary.

When you let the graph’s structure speak for itself (using algorithms that find clusters of densely connected datasets), the groups that emerge tell a different story.

You might expect housing datasets to cluster with demographics. Housing is a classic demographic indicator, and that’s where a human librarian would file it. But the graph might reveal that housing datasets are more densely connected to financial datasets: mortgage rates from FRED, bank lending data from the FDIC, Treasury interest rate data. The structural cluster is “the financial ecosystem around homeownership,” not “demographic statistics about where people live.”

Both framings describe the same data. But they point researchers in completely different directions. The structural cluster reflects actual analytical utility, because it’s defined by which datasets can actually be combined with each other, not by which category label someone assigned.

This means the catalog organizes itself. As new datasets are added, the communities shift. A new HUD homelessness dataset might pull the housing cluster closer to the health cluster, because homelessness data shares geographic keys with CDC data. A new agricultural trade dataset might create a bridge between USDA crop data and Treasury economic data that didn’t exist before.

The taxonomy is static, assigned once by humans. The community structure is dynamic, revealed by the data. And it gets more interesting as the catalog grows, because every new connection can reshape which clusters exist and how they relate to each other.

The Compounding Effect

Here’s the most important property of a centralized dataset graph: it compounds.

When the catalog has 50 datasets, the graph might contain 500 potential connections. Add 50 more and you don’t get 500 additional connections. You might get 2,000. Because the new datasets connect not just to each other but to every existing dataset they share geographic identifiers, temporal alignment, or entity references with.

A new country-level economic dataset from the IMF doesn’t just add one node to the graph. It creates potential connections to every existing dataset that contains country identifiers: World Bank, UN, FRED, Treasury. If one of those connections passes through a bridge dataset, it also creates indirect paths to every dataset on the other side of that bridge.

This is why the thousandth dataset is astronomically more useful than the first. The first dataset connects to nothing. The thousandth dataset has a thousand other things it might connect to. Every dataset added makes every existing dataset marginally more useful, because it’s one more potential node in a join path, one more potential member of a structural community, one more potential answer to someone’s question.

OpenData currently has 200+ datasets from 30+ providers. As that number grows, the graph doesn’t just get bigger. It gets denser. The ratio of connections to nodes increases. Clusters that were isolated start to touch. Bridge datasets that connected two clusters start connecting five.

This is why centralization matters. Not for control. For connection. You can’t build a knowledge graph across hundreds of separate portals. You need the data in one place so the relationships can be computed, the bridges can be detected, and the paths can be traversed. The graph is an emergent property of having everything together.

What You Can Actually Ask

All of this theory is only interesting if it changes what you can do in practice. Here are questions that become answerable when data lives in a connected graph instead of scattered across portals.

“What data exists about California at county level?”

This isn’t a keyword search. It’s a structural query: find every dataset that actually contains county-level California data. Not datasets that mention the word “California” in their description, but datasets whose rows include California counties, verified by the data itself.
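As a sketch, that query is a filter over what the rows actually contain, not over descriptions. The per-dataset samples below are invented, though the FIPS codes are real (county FIPS codes are five digits, and California's state prefix is 06):

```python
# Hypothetical row samples drawn from each dataset.
samples = {
    "bls_unemployment": [{"fips": "06037"}, {"fips": "48201"}],  # Los Angeles County; Harris County, TX
    "acs_demographics": [{"fips": "06075"}],                     # San Francisco County
    "wb_gdp":           [{"fips": None}],                        # country-level data: no county FIPS
}

def county_level_in_state(samples, state_fips):
    """Datasets whose rows actually contain county FIPS codes for a state."""
    return [
        name for name, rows in samples.items()
        if any(
            r["fips"] and len(r["fips"]) == 5 and r["fips"].startswith(state_fips)
            for r in rows
        )
    ]
```

`county_level_in_state(samples, "06")` surfaces the two datasets with real California county rows and skips the one that merely covers the US at country level, no matter what its description says.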

“I have unemployment data. What else can I combine it with?”

The graph fans out from your starting point. Census demographics connect through FIPS codes. FRED economic indicators connect through date alignment. HUD housing data connects through a two-hop bridge. You see the full analytical landscape around your dataset, ranked by connection strength.

“How does climate data connect to economic data?”

Multiple paths emerge. Temperature records connect to World Bank GDP through country codes. State-level temperature data connects to BLS employment through FIPS codes. Agricultural yield connects to trade data through FAO codes. Each path tells a different story about the climate-economics relationship. The researcher picks the one that matches her question.

“What breaks if BLS changes their area code format?”

Impact analysis through the graph: trace every downstream dependency. Every dataset that joins through BLS area codes, every bridge that relies on them, every multi-hop path that passes through them. You know the blast radius before anything breaks, not after.
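That trace is a reachability query over a dependency graph. A minimal sketch, with an invented dependency map (each key points to the datasets and join paths that rely on it):

```python
# Illustrative dependency edges: key -> things that rely on it.
depends_on = {
    "bls_area_codes":           {"bls_unemployment", "bls_wages"},
    "bls_unemployment":         {"unemployment_fips_bridge"},
    "unemployment_fips_bridge": {"cdc_join_path", "hud_join_path"},
    "bls_wages":                set(),
    "cdc_join_path":            set(),
    "hud_join_path":            set(),
}

def blast_radius(graph, changed):
    """All transitive dependents of a changed identifier or dataset."""
    impacted, stack = set(), [changed]
    while stack:
        for dep in graph[stack.pop()]:
            if dep not in impacted:
                impacted.add(dep)
                stack.append(dep)
    return impacted
```

A change to the area-code format fans out through the unemployment data, across the FIPS bridge, and into every multi-hop path on the far side, all computed before anything actually breaks.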

“What’s missing?”

The graph reveals gaps. Two large clusters of datasets that should be connected (say, environmental data and health data) but aren’t, because there’s no bridge dataset mapping EPA site codes to CDC geographic identifiers. The absence is visible in the graph structure. You know exactly which bridge dataset to add next.

From Search to Navigation

The shift is from searching for data to navigating it. From isolated datasets to a connected network. From “find me something about unemployment” to “show me everything that connects to unemployment, how confident those connections are, and what bridge datasets I’d need to get from here to health outcomes.”

The data already exists. The algorithms exist (betweenness centrality, community detection, shortest-path traversal are well-studied, efficient, and available in any graph database). What’s been missing is the structure connecting them. A knowledge graph for open datasets is that structure. And the intelligence it reveals (the bridges, the communities, the paths, the gaps) is genuinely new information about how the world’s public data fits together. Information that no individual dataset contains, but that the network makes visible the moment you look at it as a whole.

Nobody sat down and decided the FIPS code crosswalk is the most important dataset in the catalog. The graph figured that out from the topology. Nobody manually linked BLS unemployment to CDC health outcomes. The two-hop path through geographic identifiers exists whether or not a human noticed it. Nobody hand-assigned datasets to communities. The algorithm found them from connection density.

That’s what happens when you put the world’s open data in one place. The relationships between datasets become computable. And the questions you can ask stop being limited to “does this dataset exist?” and start being “how does everything connect?”

We’re building this at OpenData (source on GitHub). The catalog is growing, the graph is getting denser, and the join paths are getting more interesting every week. If you work with public data and you’ve spent time doing the plumbing work of connecting datasets by hand, you already know the problem. The graph is the fix.

Riley Hilliard

Creator of OpenData

At 13, I secretly drilled holes in my parents' wood floor to route a 56k modem line to my bedroom for late-night Age of Empires marathons. That same scrappy curiosity carried through 3 acquisitions, 9 years as a LinkedIn Staff Engineer building infrastructure for 1B+ users, and now fuels my side projects, like OpenData.

