A lot of upvotes, but no discussion :)
How would you approach migrating a ~5 TB OLTP database that fundamentally contains analytical time-series data? I’d think something like Apache Iceberg could be a better data store and would make writing much easier (almost just dumping a Parquet file in there).
It’s exposed to the outside world via APIs
That 5 TB of data will probably come to 300-400 GB in Parquet. Try to denormalise the data into a few datasets, or just one dataset if you can.
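As a rough sketch of that conversion using DuckDB's postgres extension (assuming the source is Postgres; the connection string, table and column names below are all made up):

    import duckdb

    con = duckdb.connect()

    # Attach the OLTP source through DuckDB's postgres extension
    # (placeholder connection string).
    con.execute("INSTALL postgres; LOAD postgres;")
    con.execute("ATTACH 'dbname=oltp host=localhost user=app' AS src (TYPE postgres);")

    # Denormalise the time-series table and its lookup table into one
    # Hive-partitioned Parquet dataset, one partition per day.
    con.execute("""
        COPY (
            SELECT r.sensor_id,
                   r.ts,
                   r.value,
                   s.name             AS sensor_name,
                   s.location         AS sensor_location,
                   CAST(r.ts AS DATE) AS day
            FROM src.public.readings r
            JOIN src.public.sensors  s USING (sensor_id)
        )
        TO 'readings' (FORMAT parquet, PARTITION_BY (day));
    """)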
DuckDB querying that data should be able to return results in milliseconds if only the smaller columns are being used, and better still if the row-group stats can be used to answer queries.
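For example (hypothetical path and column names), a query like this only reads the column chunks it touches, and DuckDB's Parquet reader can skip any row-group whose min/max stats fall outside the time filter:

    import duckdb

    # Only the sensor_id, value and ts column chunks are read, and row groups
    # whose ts min/max fall entirely outside the week are skipped.
    duckdb.sql("""
        SELECT sensor_id, avg(value) AS avg_value
        FROM 'readings/**/*.parquet'
        WHERE ts >= TIMESTAMP '2024-01-01'
          AND ts <  TIMESTAMP '2024-01-08'
        GROUP BY sensor_id
    """).show()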
You can host those Parquet files on a local disk or on S3. A local disk might be cheaper if this is exposed to the outside world, and it also gives you a price ceiling on hosting.
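Pointing the same query at S3 is mostly a matter of loading the httpfs extension and supplying credentials; something along these lines (bucket name, region and keys are placeholders):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")

    # Placeholder credentials/region for the bucket holding the Parquet files.
    con.execute("""
        CREATE SECRET s3_readings (
            TYPE s3,
            KEY_ID 'AKIA...',
            SECRET '...',
            REGION 'eu-west-1'
        );
    """)

    con.sql("""
        SELECT count(*)
        FROM read_parquet('s3://my-bucket/readings/**/*.parquet')
        WHERE ts >= TIMESTAMP '2024-01-01'
    """).show()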
If you have a Parquet file with billions of records and row-groups numbering in the thousands, then hosting on something like Cloudflare, where there is a per-request charge, could get a bit expensive if this is a popular dataset. At a minimum, DuckDB will look at the stats for each row-group for any column involved in a query. It might be cheaper just to pay for 400 GB of storage with your hosting provider.
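If you want to see how a file is laid out, DuckDB's parquet_metadata() function lists every row-group with its per-column min/max stats, and the ROW_GROUP_SIZE option on COPY lets you rewrite a file with fewer, larger row-groups so a remote reader makes fewer range requests. A rough sketch (paths are placeholders):

    import duckdb

    # One output row per (row group, column), including min/max stats.
    duckdb.sql("""
        SELECT row_group_id, column_id, row_group_num_rows, stats_min, stats_max
        FROM parquet_metadata('readings/day=2024-01-01/data_0.parquet')
    """).show()

    # Rewrite with ~1M rows per row group instead of the default.
    duckdb.sql("""
        COPY (SELECT * FROM 'readings/day=2024-01-01/data_0.parquet')
        TO 'readings_fewer_row_groups.parquet'
        (FORMAT parquet, ROW_GROUP_SIZE 1000000)
    """)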
There is a project that converts OSM to Parquet every week, and we had to look into some of those issues: https://github.com/osmus/layercake/issues/22
Great article.
Any other resources that provide a comprehensive treatment of open table formats?