Notes

#
  • Zed Decoded: Rope & SumTree — The Zed team explained how they use B+Tree (or SumTree) to deal with file content in their text editor. I have never considered how text editors use different structures to represent and manipulate huge strings while maintaining good performance. Interesting read.

  • Why does SQLite (in production) have such a bad rep? — A short post about the controversy around SQLite, a database often perceived as unsuitable for production. The most interesting thing is the discussions originated in lobste.rs and Reddit

  • In Codd we Trust (or not) — A brief reflection on Codd’s original idea, which separates logical and physical schema, and the significant role query planners and optimizers play in modern database performance. While I understand the point of letting experienced users write their queries and execute them as they are, I can’t fully agree.

  • Some opinionated thoughts on SQL databases — Another opinionated reflection on relational databases and SQL. I agree with some points. Still, SQL is here to stay. The point that resonated more is how inefficient it is to use SQL as an API in some workloads. I’ve seen query parsing take +50% of the processing time in low latency and high throughput environments. Of course, those numbers are for very specific use cases (e.g., point queries) where the data being read is small, taking just a few milliseconds. Still, the overhead is relevant, and there is little room for improvement when using plain SQL. On the other hand, it is good to use almost all relational databases using a single API language. If you are an experienced PostgreSQL user, you can query a MySQL database without investing hours learning how the new API works.

  • Senior Engineer Fatigue — An article to justify why older people get slower at their jobs. Jokes aside, it’s a good read about why sometimes you need to slow down to get faster. And it applies not only to senior engineers!

  • What Happens When You Put a Database in Your Browser? — The browser is the new OS. A DuckDB article about running their software in browsers using Wasm. The text provides a good example of pre-visualizing parquet file schema on the fly from the browser.

#

Footnotes

  1. https://twitter.com/DRichardHipp/status/1785037995101290772

#

I’ve never invested much time in understanding how operating systems work. It has never been on my interests list, nor have I considered it useful knowledge for my career. What can I say? We all make bad decisions in our lives.

Over the last two years, I have invested a lot of time in understanding how some things work in Linux. Those are the consequences (or benefits?) of working with a complex and low-level system like a database.

This week, I was working on tracking data that was read from the page cache. While debugging some ClickHouse metrics directly collected from the Kernel, I discovered the /proc/thread-self directory1.

/proc/thread-self

When a thread accesses this magic symbolic link, it resolves to the process’s own /proc/self/task/[tid] directory.

So /proc/thread-self points to the proc folder for the current thread. In those folders, you can find a lot of helpful information. In my case, I was interested in /proc/thread-self/io 2, where you have IO statistics.

I was focused on investigating whether the Kernel reported bytes read from S3 inside rchar. I shared more information in this PR in ClickHouse repository. Despite the PR being closed, the examples I shared there still hold valuable insights and a reproducer that can contribute to the collective understanding of the system.

Footnotes

  1. There is also a /proc/self for the current process. This is something I didn’t know either.

  2. https://docs.kernel.org/filesystems/proc.html#proc-pid-io-display-the-io-accounting-fields