ClickHouse has something similar called clickhouse-obfuscator [1]. It even works offline with data dumps, so you can quickly prepare and send somewhat realistic example data to others.
According to its --help output, it is designed to retain the following properties of data:
- cardinalities of values (number of distinct values) for every column and for every tuple of columns;
- conditional cardinalities: number of distinct values of one column under condition on value of another column;
- probability distributions of absolute value of integers; sign of signed integers; exponent and sign for floats;
- probability distributions of length of strings;
- probability of zero values of numbers; empty strings and arrays, NULLs;
- data compression ratio when compressed with LZ77 and entropy family of codecs;
- continuity (magnitude of difference) of time values across table; continuity of floating point values.
- date component of DateTime values;
- UTF-8 validity of string values;
- string values continue to look somewhat natural
[1]: https://clickhouse.com/docs/en/operations/utilities/clickhou...
The Dutch national office of statistics has tools called mu-argus[2] and tau-argus, intended to de-identify 'microdata' such that k-anonymity[1] is achieved.
[1] A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. https://en.wikipedia.org/wiki/K-anonymity
[2] https://research.cbs.nl/casc/mu.htm
This is really cool, and deserves a submission of its own, I'd say!
There's a write up from Alexey of different approaches considered for clickhouse-obfuscator here: https://clickhouse.com/blog/five-methods-of-database-obfusca....
The summary is pretty funny:
> "After trying four methods, I got so tired of this problem that it was time just to choose something, make it into a usable tool, and announce the solution"
I have some experience with the 'Masking Views' functionality. If you are going to rely on it, specifically in a Rails app, know that it goes against conventions and is generally inconvenient. The same likely applies to any other framework that features DB schema migrations.
More specifically, the integration of this functionality at a (fortunately) ex-employer was purposefully kept away from the dev team (no motivation was offered, though I suspect some sort of segmentation was sought), and thus did not take into account that tables with PII did in fact still need their schema changed from time to time.
This led to the anonymizer extension, together with the confidential views, being installed only on production DB instances, with dev, test, and staging instances running vanilla Postgres. As a result, the chance to catch DB migration issues related to the confidential views was pushed out to the release itself. This led to numerous failed releases, which involved having the ops team intervene, manually remove the views for the duration of the release, then manually re-create them.
So,
If you plan to use this extension, and specifically its views, make sure you have it set up in exactly the same way on all environments. Also make sure that its initialisation and view creation are part of your framework's DB migrations, so that they are documented and easy to reproduce precisely on new environments.
PostgreSQL Anonymizer developer here: the problem you are describing only affects version 1.x.
Version 2.0 was released a few days ago with a brand new masking system that does not block database migrations.
One of the best ways to handle this sort of thing is to put things like PII in a separate database entirely and replace it with a token in the "main" database. When the PII actually needs to be retrieved, you first retrieve the token and then look it up in the other database to get the real data.
It certainly complicates things, but it provides an additional layer of separation between the PII and its related data. You can give your end users access to a database without having to worry about them getting access to the "dangerous" data. If they do need access to the data pointed to via the token, they can request access to that related database.
This method also performs better, since you don't need to encrypt the entire database (which is often required when storing PII) and also don't need to add extra security-context function calls to every database request.
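For illustration, a minimal sketch of the pattern (all table and column names are made up): the PII database maps opaque tokens to real identities, and the main database stores only the token.

    -- in the separate PII database:
    CREATE TABLE people (
        token     uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        full_name text NOT NULL,
        email     text NOT NULL
    );

    -- in the main database, rows carry only the opaque token:
    CREATE TABLE orders (
        id             bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        customer_token uuid NOT NULL,  -- meaningless without the PII database
        total          numeric NOT NULL
    );

    -- resolving PII is a second, separately authorized query
    -- against the PII database:
    --   SELECT full_name, email FROM people WHERE token = $1;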
A nice tradeoff in many cases is to have a separate schema rather than a separate database, which allows preserving referential integrity and using the database’s RBAC to restrict access to the schemas. This also means things like cascading deletes can still work.
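A rough sketch of that variant (names again made up): keep the PII in its own schema, lock the schema down with grants, and let foreign keys and ON DELETE CASCADE work normally.

    CREATE SCHEMA pii;
    CREATE TABLE pii.people (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        full_name text NOT NULL
    );
    CREATE TABLE public.orders (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        person_id bigint NOT NULL REFERENCES pii.people (id) ON DELETE CASCADE
    );

    -- only a dedicated role may read the PII schema:
    REVOKE ALL ON SCHEMA pii FROM PUBLIC;
    GRANT USAGE ON SCHEMA pii TO pii_reader;
    GRANT SELECT ON ALL TABLES IN SCHEMA pii TO pii_reader;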
This is basically just a foreign database key and, in most cases, not sufficient to satisfy industry and regulatory requirements for anonymization and storage of PII.
Eh. I get your point, and I truly appreciate structural safeguards as opposed to aspirational ones, but this is really not as doable as you make it out to be, and doing it properly would be a full-blown product in its own right.
First, this only works if you have a very narrow interpretation of PII. Once you realize most of your non-int/uuid unique indexes (and all your join predicates) are probably PII in some way or another, the scope of the problem grows greatly. How does your solution work when you need to group by PII, full-text search by PII, filter by PII, etc.?
This is a very handy postgres extension! We've been using it at my job for a couple years now to generate test datasets for developers. We have a weekly job that restores a prod backup to a temporary DB, installs the `anon` extension, and runs pg_dump with the masking rules. Overall we've been very happy with this workflow since it gives us a very good idea of how new features will work with our production data. The masking rules do need maintenance as our DB schema changes, but that's par for the course with these kinds of dev tools.
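For anyone curious, a rough sketch of the in-database half of such a job (the masking rule is a made-up example, following the extension's documented v1 workflow; restore and dump are the usual pg_restore/pg_dump): once the backup is restored into the temporary DB, static masking is applied before dumping.

    -- inside the restored temporary database:
    CREATE EXTENSION anon;
    SELECT anon.init();  -- load the fake-data sets

    -- declare masking rules per column, e.g.:
    SECURITY LABEL FOR anon ON COLUMN users.email
        IS 'MASKED WITH FUNCTION anon.fake_email()';

    -- overwrite the real data in the copy, then pg_dump it as usual:
    SELECT anon.anonymize_database();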
All that said, I wouldn't rely on this extension as a way to deliver anonymized data to downstream consumers outside of our software team. As others have pointed out, this is really more of a pseudonymization technique. It's great for removing phone numbers, emails, etc. from your data set, but it's not going to eradicate PII. Pretty much all anonymized records can be traced back to their source data through PKs or FKs.
Pseudonymizing functions are just one way to mask the data.
There are many other masking functions that will actually anonymize the data.
And the extension does not force you to respect the foreign keys.
It's really up to you to decide how you want to implement your masking policy.
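To make the distinction concrete, a small example using two functions from the extension's documented catalog (the column name is made up):

    -- pseudonymizing: the same input always yields the same fake value,
    -- so linkability (and re-identification risk) remains:
    SECURITY LABEL FOR anon ON COLUMN users.email
        IS 'MASKED WITH FUNCTION anon.pseudo_email(users.email)';

    -- destructive: a random fake value with no link to the original:
    SECURITY LABEL FOR anon ON COLUMN users.email
        IS 'MASKED WITH FUNCTION anon.fake_email()';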
TIL that PostgreSQL has SECURITY LABEL! It seems like this could be useful for storing all sorts of metadata about database objects, not just security stuff. E.g. like the COMMENT but not global. From reading the docs it looks like you need a "label provider" to get it to work though. I can only seem to find a few label providers around, does anyone know of one that isn't security/anonymization related and could be used more generically?
Yep, it's a completely generic feature, and I'd argue the keyword 'SECURITY' is a noise word and 'LABEL' by itself would be even better, but that's what we're stuck with. As the documentation states: "PostgreSQL places no restrictions on whether or how a label provider must interpret security labels; it merely provides a mechanism for storing them."
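For reference, the syntax is generic: a provider registers itself and then stores whatever strings it wants as labels. This extension registers an 'anon' provider and keeps its masking rules there, e.g. (column name made up; anon.partial keeps the first and last characters and pads the middle):

    SECURITY LABEL FOR anon ON COLUMN users.phone
        IS 'MASKED WITH FUNCTION anon.partial(phone, 2, $$******$$, 2)';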
I've been working on something similar and starting a company around the idea! We realized that a lot of people had concerns or challenges with installing an extension on their production database, and also that they wanted non-technical folks in compliance or HR to be able to configure and maintain the rules for individual employees. pgAnonymizer is a database extension, but we structured ours as a proxy server that hides/anonymizes/filters the data. We made a web dashboard that simplifies the configuration process and allows you to configure what to do if columns get added to the database (mask or hide new columns by default). We're about to go GA, and if anyone has any feedback or wants a free beta-testing trial, I'd love to chat.
I probably should mention the name of the tool I'm working on... oops, it's VeilStream :)
Hi!
I'm the main developer of this extension. Happy to answer any questions you have about this project and anonymization in general!
This can work pretty well if you want to either mask the data in prod or update it in place.
A good use case that comes to mind is using prod data in a Retool app or something for your internal team, where you want to mask out certain bits.
I’ve been building Neosync [1] to handle more advanced use cases where you want to anonymize data for lower level environments. This is more useful for stage or dev data. Then prod stays completely unexposed to anyone.
It also has a transactional anonymization api too.
[1]: https://github.com/nucleuscloud/neosync
I was actually tasked with building essentially this same thing back in 2014, when I was a junior dev at a fintech startup. They needed an anonymized version of the prod database suitable for the support team to pull up when trying to reproduce bugs. I built this gigantic thing that would stream the DB dump into a C++ app and anonymize it on the fly, taking a similar approach to the masking they do here. Fun project. The company should have productized it.
This is the exact use case that we are building for with Neosync (https://github.com/nucleuscloud/neosync)
This is a fantastic idea. Now how to get it on RDS…
It can be done manually:
https://dba.stackexchange.com/questions/306661/how-to-instal...
PostgreSQL Anonymizer developer here: this was possible with version 0.8, which is now completely unmaintained.
This extension is currently not available on RDS, but it is available from many other DBaaS providers: Azure SQL, Google Cloud SQL, Crunchy Bridge, ...
Assuming it's for a support team or internal users with a lower SLA, I wonder if it's possible to have a small self-hosted PostgreSQL server that basically acts as a shim, holding a foreign-data-wrapper connection to the actual RDS instance.
This is called "Masking Data Wrappers":
https://postgresql-anonymizer.readthedocs.io/en/stable/maski...
On RDS, if you cannot use this, you can create masked views and use query rewriting to make it work.
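As a rough illustration of that hand-rolled approach (names made up, no extension required): expose a view that redacts the sensitive columns, and grant the restricted role access to the view instead of the table.

    CREATE VIEW users_masked AS
    SELECT id,
           'REDACTED'                          AS full_name,
           left(email, 2) || '***@example.com' AS email,
           created_at
    FROM users;

    REVOKE ALL ON users FROM support_role;
    GRANT SELECT ON users_masked TO support_role;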
In my experience PG anonymizer has performance issues when it comes to large queries.
Version 2.0 was released a few days ago. It's a complete rewrite in Rust.
Performance should be better than with v1.x
Same. Lack of RDS support is the only reason we aren’t using this.
It can be done manually:
https://dba.stackexchange.com/questions/306661/how-to-instal...
PostgreSQL Anonymizer developer here: this was possible with version 0.8, which is now completely unmaintained. This extension is currently not available on RDS, but it is available from many other DBaaS providers: Azure SQL, Google Cloud SQL, Crunchy Bridge, ...
Just to jump in here: Neosync supports RDS and more, and you can self-host.
https://github.com/nucleuscloud/neosync
(I'm one of the co-founders)
I tried to figure out how/if this does what I need, and your README had no examples. I clicked a couple of levels deep, found no obvious demonstrations, and left.
I checked the homepage, but I don't watch Loom-style demos personally, definitely not 5-minute ones, and so I left.
-
When I click on OP's link, or just search for it on Google, it takes less than a full page for the extension to show me an extremely straightforward demonstration of its value. You should have something like that.
A simple example of what queries will look like, what setup will look like, all concisely communicated, no 5-minute lectures involved.
+1 for RDS support. I have wanted to use this for a while in our production systems.
It can be done manually:
https://dba.stackexchange.com/questions/306661/how-to-instal...
Are these tools able to automatically identify PII, or do you have to specify columns and data types manually? What happens if you have PII in a string field? Do you just rely on something like spaCy to identify it?
From the website:
> The project has a declarative approach of anonymization.
> Finally, the extension offers a panel of detection functions that will try to guess which columns need to be anonymized.
https://postgresql-anonymizer.readthedocs.io/en/stable/detec...
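Per that page, detection is a heuristic scan you run yourself; roughly (the dictionary argument follows the documented example):

    -- scan column names/types against a dictionary of common
    -- identifiers and suggest candidate columns to mask:
    SELECT anon.detect('en_US');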
I'm going to repeat myself, as I do every time I encounter such tools. These tools DO NOT provide anonymization, and especially not at the level required by the EU's GDPR (where the notion of PII does not exist).
As a computer scientist and academic researcher who has worked on this topic for more than a decade now (some of my work, if you are interested: [1, 2]), re-identification is often possible from a few pieces of information. Masking or replacing a few values or columns will often not provide sufficient guarantees, especially when a lot of information is being released.
What this tool does is called 'pseudonymization' and maybe, if done very carefully, 'de-identification' in some cases. With colleagues, I reviewed all the literature and industry practices a few months ago [3], and our conclusion was:
> We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today.
This is clearly not what this tool is doing.
[1] https://www.nature.com/articles/s41467-019-10933-3
[2] https://www.nature.com/articles/s41467-024-55296-6
[3] https://www.science.org/doi/10.1126/sciadv.adn7053
Of course there's no perfect solution for anonymizing a dataset...
The extension offers a large panel of masking functions: some are pseudonymizing functions, but others are more destructive. For instance, there's a large collection of fake-data generators (names, addresses, phone numbers, etc.).
It's up to the database administrator or the application developer to decide which columns need to be masked and how they should be masked.
In some use cases pseudonymization is enough; in others, anonymization is required...
Seems like if you're doing static masking and you mask enough data, this works just great. Am I missing something?
This seems great. I wonder, though: how do you ensure new columns are masked by default? It seems a safer alternative would be to start with all columns statically masked and only unveil them selectively.
I guess you can add some CI steps when modifying the DB to ensure a given column is allowed or masked, but still, it would be nice if this defaulted the other way around.
This is called "Privacy By Default"
https://postgresql-anonymizer.readthedocs.io/en/stable/priva...
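Per the linked docs, this is a setting of the extension; roughly:

    -- with this on, any column without an explicit masking rule is
    -- replaced by NULL (or the column default) for masked users:
    ALTER DATABASE mydb SET anon.privacy_by_default = True;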
Just be careful that you don't anonymize your production data.
The principle of the software seems to be that the original data is never altered. It is a postgres extension that "masks" the data for certain postgres users. You can always connect as the root user and see everything when you need to.
It allows updating the original data - https://postgresql-anonymizer.readthedocs.io/en/stable/stati...
> These methods will destroy the original data. Use with care.
So basically running SELECT anon.anonymize_database(); will do it.
Both approaches are possible:
* Static Masking will destroy the authentic data once and for all.
* Dynamic Masking will only alter the data for the "masked users". Regular users will continue to see the real data.
Each technique has its own advantages, depending on your context.
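In code, the two modes look roughly like this (role name made up; dynamic masking shown with the v1 syntax):

    -- static: overwrite the real data in place (destructive):
    SELECT anon.anonymize_database();

    -- dynamic: real data stays intact; only flagged roles see masks:
    SECURITY LABEL FOR anon ON ROLE analyst IS 'MASKED';
    SELECT anon.start_dynamic_masking();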
With great power comes great responsibility.
Masking only changes the view for specific users.