MCP + DP + PCN =

Data Sharing Nirvana?

Bennett Hillenbrand
President, Working Paper
Andrew Gruen, PhD
CEO, Working Paper
Senior Fellow, Future of Privacy Forum
James Honacker, PhD
Co-Founder, Augusta Griffin
Sharon Gibbons
Co-Founder, Augusta Griffin

“What do LLMs actually do?”

Summarization? Translation?

Also: Why does MCP exist?

Challenge:

Privacy-safe data has tremendous promise, but operationalizing access to it has always had limitations.

The privacy/utility tradeoff

Differentially private data is the gold standard for privacy-preserving data access, which we usually apply in one of two ways:

  1. As a pre-computed set of aggregates (e.g., the 2020 Census)
  2. As a differentially private query engine, where a user writes queries (SQL or equivalent) and receives back differentially private results (a minimal sketch follows below)
    • Not Pareto optimized
    • Privacy budget is consumed ad hoc
    • Requires data querying and analysis skills
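To make option 2 concrete, a minimal sketch of such a query engine is below. The income values, epsilon amounts, and the `DPQueryEngine` name are illustrative assumptions, not an existing library.

```python
import numpy as np

class DPQueryEngine:
    """Minimal sketch of a DP query engine: each query spends budget ad hoc."""

    def __init__(self, total_epsilon: float):
        self.remaining_epsilon = total_epsilon

    def count(self, values, predicate, epsilon: float) -> float:
        """Return a noisy count of rows matching `predicate` (Laplace mechanism)."""
        if epsilon > self.remaining_epsilon:
            raise RuntimeError("Privacy budget exhausted")
        self.remaining_epsilon -= epsilon
        true_count = sum(1 for v in values if predicate(v))
        # Counting queries have sensitivity 1, so the noise scale is 1/epsilon.
        return true_count + np.random.laplace(scale=1.0 / epsilon)

# Hypothetical usage: the analyst still has to know how to express the query.
engine = DPQueryEngine(total_epsilon=1.0)
incomes = [42_000, 87_500, 19_300, 65_000]
noisy = engine.count(incomes, lambda x: x > 50_000, epsilon=0.1)
```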

In both cases, differentially private outputs or query engines have required analysis skills.

With modern AI tools, we can add a natural language layer on top of the differentially private data.

However, we still need to protect the underlying microdata, manage the privacy budgets, and ensure the provided data is real data with noise perturbation... rather than a hallucination.

This points us toward a solution: MCP + DP.
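As a sketch of what that pairing could look like, assuming the Python MCP SDK's `FastMCP` interface: an MCP server exposes a differentially private query as a tool, so the LLM can answer natural language questions without ever seeing the microdata. The dataset, tool name, and budget handling are invented for illustration.

```python
import numpy as np
from mcp.server.fastmcp import FastMCP  # Python MCP SDK

mcp = FastMCP("dp-data-server")

# Hypothetical microdata; in practice this never leaves the server.
INCOMES = [42_000, 87_500, 19_300, 65_000]
REMAINING_EPSILON = 1.0  # total privacy budget, spent ad hoc per query

@mcp.tool()
def noisy_count_above(threshold: float, epsilon: float = 0.1) -> float:
    """Count records with income above `threshold`, perturbed with Laplace noise."""
    global REMAINING_EPSILON
    if epsilon > REMAINING_EPSILON:
        raise ValueError("Privacy budget exhausted")
    REMAINING_EPSILON -= epsilon
    true_count = sum(1 for x in INCOMES if x > threshold)
    return true_count + np.random.laplace(scale=1.0 / epsilon)  # sensitivity 1 for a count

if __name__ == "__main__":
    mcp.run()  # the LLM client only ever sees the tool's noisy outputs
```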

Which still does not solve the problem of potential hallucination.

How do we know if we are right?

Proof-Carrying Numbers

provide “verification instead of description. It doesn’t replace metadata or statistical standards; it complements them by automatically checking whether an AI system’s numerical outputs remain faithful to the reference data.”

Wait. What?

Verifying LLM Output

Against the underlying data

“What matters is not how often a model is right, but whether users are informed when it might be wrong.” - Aivin Solatorio

How this works

  • Claim IDs
  • A policy that governs the verification process
  • A prompt instruction about how to surface empirical information and associate (or fail to associate) the relevant claims (see the sketch below)
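A rough sketch of that verification loop, not the actual PCN implementation: the prompt instructs the model to attach a claim ID to each number it emits, and a policy (here, a simple absolute tolerance) decides whether the emitted value stays faithful to the reference data. The reference values and tolerance are made up.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    claim_id: str  # ID the prompt instructs the model to attach to each number
    value: float   # the number the model actually emitted

# Hypothetical reference store and verification policy
REFERENCE = {"median_income_2020": 67_521.0}
TOLERANCE = 0.5  # how far a surfaced number may drift from the reference value

def verify(claim: Claim) -> str:
    """Return a flag for the claim: 'verified', 'mismatch', or 'unknown claim'."""
    if claim.claim_id not in REFERENCE:
        return "unknown claim"  # the aggregate doesn't exist
    if abs(claim.value - REFERENCE[claim.claim_id]) <= TOLERANCE:
        return "verified"
    return "mismatch"           # e.g. improper rounding by the model

print(verify(Claim("median_income_2020", 67_521.0)))  # verified
print(verify(Claim("median_income_2020", 67_500.0)))  # mismatch
```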

PCN means verification without disclosure

With our PCNs, we can know that a number produced by the LLM is valid, then inject noise under DP, providing a privacy-safe result based on the natural language query.
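Building on the `Claim`, `verify`, and `REFERENCE` names from the sketch above, the round trip could compose roughly as follows: verify the model's number server-side, then release only a DP-noised version of the reference value, so neither the raw value nor the model's unchecked output is disclosed. The sensitivity assumption is purely illustrative.

```python
import numpy as np

def answer_with_pcn_and_dp(claim: Claim, epsilon: float) -> tuple[float, str]:
    """Verify the model's claim, then release only a DP-noised reference value."""
    flag = verify(claim)  # 'verified', 'mismatch', or 'unknown claim'
    if flag == "unknown claim":
        raise ValueError("No reference aggregate exists for this claim")
    # The released number is always derived from the reference data, never from the
    # model's text, so a hallucinated value can't be echoed back as an answer.
    # (Noise scale assumes sensitivity 1 purely for illustration.)
    noisy = REFERENCE[claim.claim_id] + np.random.laplace(scale=1.0 / epsilon)
    return noisy, flag
```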

Implicit challenge:

Does this mean we need to precompute every aggregate a user may be interested in?

Nope!

The owner of the data decides what to do

And the user has a choice

The owner can reject non-verified results... but if the owner chooses to display them, the user can still decline rather than spend their budget. The owner's options (sketched in code after this list):

  1. Surface the result, but with an indicator (doesn't exist? negatively impacted, e.g., improper rounding?),
  2. Provide the differentially private answer with the flag,
  3. Don't provide any answer at all, or
  4. Let the user know there will be an issue with the data but their privacy budget will still be impacted, and ask them if they want to proceed.
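These four owner policies could be captured as configuration. A sketch, reusing the hypothetical `Claim` and `answer_with_pcn_and_dp` from the earlier sketches, with all names invented:

```python
from enum import Enum, auto

class OwnerPolicy(Enum):
    """Invented names for the four choices a data owner can configure."""
    SURFACE_WITH_INDICATOR = auto()  # 1: show the result plus an indicator of what went wrong
    DP_ANSWER_WITH_FLAG = auto()     # 2: provide the differentially private answer with the flag
    WITHHOLD = auto()                # 3: don't provide any answer at all
    ASK_USER_FIRST = auto()          # 4: warn that budget will be spent anyway, ask to proceed

def respond(policy: OwnerPolicy, claim: Claim, epsilon: float, user_confirms) -> dict | None:
    """Dispatch an unverified claim according to the owner's configured policy."""
    if policy is OwnerPolicy.WITHHOLD:
        return None
    if policy is OwnerPolicy.ASK_USER_FIRST and not user_confirms():
        return None  # the user declines rather than spend budget on a flawed answer
    noisy_value, flag = answer_with_pcn_and_dp(claim, epsilon)
    key = "indicator" if policy is OwnerPolicy.SURFACE_WITH_INDICATOR else "flag"
    return {"value": noisy_value, key: flag}
```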

OK, but actually...

Doesn’t this still run into the need to precompute the aggregates?

Yes and no.

Use Cases:

  1. All government administrative data everywhere
  2. Nearly every quantitative social science dataset ever produced
  3. Social Science One
  4. Raj Chetty's social mobility work: can we give schools access to grade-level socio-economic information (e.g., from the IRS)?
  5. Aggregates of individual medical data across providers?

Research Agenda

  1. Prototype build-out of the PCN round-trip service
  2. Trustworthiness of responses as a function of system design
  3. Rebuild existing data release use cases with this architecture
  4. Privacy budget consumption rates with natural language vs. SQL-like interfaces