Andrey Borodin



Tags:   postgresql    russia    cloud    yandex    gsoc    wal-g    odyssey    indexing   
Category:   Interviews   
Interviewed by: Andreas Scherbaum

PostgreSQL is the World’s most advanced Open Source Relational Database. The interview series “PostgreSQL Person of the Week” presents the people who make the project what it is today. Read all interviews here.

Please tell us about yourself, and where you are from.

I’m from Russia. I was born in a village called Sarana and eventually moved to the nearby city of Ekaterinburg in the Ural region. Technically it’s on the border of Europe and Asia, close to the midpoint between Beijing and Berlin.

How do you spend your free time? What are your hobbies?

My hobby is teaching. I’m an associate professor at Ural Federal University, where I teach the courses “Resilient Distributed Systems” and “Data Access Methods” (both are Postgres-based). At Yandex Data School I’m involved in an “Algorithms and Data Structures” course.

Any Social Media channels of yours we should be aware of?

Last book you read? Or a book you want to recommend to readers?

I’ve recently finished “Harry Potter and the Methods of Rationality”, which is somewhat like an “open source Harry Potter” :)

Now I’m reading “Time Travel and Other Mathematical Bewilderments” by Martin Gardner. This book can save you the time you would spend watching a lot of science fiction movies.

In fact I’m a really slow reader, but I like to be near a huge stack of books. The office manager at Yandex occasionally asks me if I’m going to return all my book towers to the corporate library. And I do return something, but then I see more interesting books in the library and take them with me. Last time I was there I grabbed “On Transactional Concurrency Control” by Goetz Graefe et al., but I have not started it yet.

Any favorite movie, or show?

So far I have only enjoyed “Well, Just You Wait!”, “South Park” and Postgres.tv.

What would your ideal weekend look like?

When I have enough spare time, I am happy to run 21 km. Preferably with my son, but currently this is too much for him, even on a bike. It’s also super cool to have an uninterrupted and focused coding or patch review session after lunch. Somehow it’s really hard to get lots of uninterrupted time in the office. In the evening I like to heat up my banya (a kind of sauna). And, perhaps, it would be great to see my friends more often.

What’s still on your bucket list?

Learning tmux and vim properly. If only I could pay someone to do this instead of me. I always have some tasks with higher priority, even in the learning queue.

What is the best advice you ever got?

  • Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. — B. Kernighan
  • Five second fuses only last three seconds.
  • Don’t design an asynchronous system to rely on timeouts.

When did you start using PostgreSQL, and why?

In 2011 I was designing an ML pipeline for customer churn prediction for a local ISP. Their billing was built on PG 9.1. I started with hand-written C# data transformation, but was amazed by the clarity of DML, the connectivity of extensions, and the overall performance of the PG-implemented parts. In later projects I always voted for PG as the RDBMS engine.

Yes, I hold a PhD in CS. My research project was about spatial indexing and I’m still porting ideas from my thesis to PostgreSQL. But I identify myself as an engineer, not as a scientist. In most cases I prefer WD-40 and duct tape to tedious scientific methods.

What other databases are you using? Which one is your favorite?

I’m on the team developing database-as-a-service managed cloud solutions for PG, MySQL, MS SQL and Greenplum.

PostgreSQL has the best codebase to learn from. It has a rich ecosystem of drivers, extensions, tools, etc. For every technical problem you have a solution with PG.

For almost every design decision, MySQL provides a ready answer to the question: “How can this be designed differently?” This is why MySQL’s codebase is very valuable for PostgreSQL hackers.

MS SQL has a very coherent ecosystem: SQL Server Management Studio is fantastic, T-SQL is expressive and simple, and Entity Framework and LINQ immerse the database into C# seamlessly and correctly. Greenplum solves complexities of scaling which the previous databases on this list are only starting to realise and describe.

I’m maintaining WAL-G and Odyssey.

WAL-G is a backup tool that is designed for the cloud and abstracts away the complexity of backups for PostgreSQL, Greenplum, MySQL, MS SQL, FDB, MongoDB and some other databases.

Odyssey is a scalable connection pooler.

How do you contribute to PostgreSQL?

I spend a big part of every Saturday hacking on my research projects. This hacking includes some patches for PostgreSQL, and some research in data management technologies. When I joined Yandex in 2017, the description of my typical tasks matched exactly how I was spending my Saturdays at that time. And I wondered: is it six Saturdays a week or a six-day work week?

But the focus has changed. When I was hacking on my own, I was working on things that seemed useful to me for some generic user. Now I focus on things that Yandex services need in the database engine.

Any contributions to PostgreSQL which do not involve writing code?

I write very little PostgreSQL core code. I’m happy if I can code 1k lines per week, and much less gets committed. The main value I add is in highlighting the problems and needs of developers who use the RDBMS. I also try to offer a balanced mix of advocacy and critique in my talks at conferences.

What is your favorite PostgreSQL extension?

I like the idea of Piecewise Geometric Model indexing and want to implement it as a B-tree derivative. This extension does not exist yet, only an empty repository :)

What is the most annoying PostgreSQL thing you can think of? And any chance to fix it?

Oh, I have a huge board of annoying things :)

  1. MVCC can be done better. HOT chains must favor access to new versions instead of old ones. VACUUM must not scan the whole changed page set. But I only vaguely understand what to do here; my codebase expertise is not nearly enough to work on this.
  2. We store uncompressed data on disk, send uncompressed data over the network, and WAL-log uncompressed data. Dan from my team is already working on changing some of this.
  3. Double buffering of data in the OS page cache and in shared buffers is wrong. I tried to add my 2 cents, but currently I’m not really involved in the large amount of work done here by the community.
  4. The SLRU subsystem is unscalable, is not protected with checksums, and is prone to bloat and bugs.
  5. Our checksums implementation does not protect from firmware/FS/OS bugs when a segment of pages is from the old version.

Etc etc etc.

Everything will be fixed eventually and of course new problems and annoyances will arise.

What is the feature you like most in the latest PostgreSQL version?

I really like that pg_surgery and heap checking in amcheck now ship with PostgreSQL. I’m a bit paranoid about corruption checks.

Adding to that, what feature/mechanism would you like to see in PostgreSQL? And why?

For decades DBMSs have relied on local disks as storage. This approach is not very scalable and can be improved. A good proof of concept is the Amazon Aurora implementation. I think a new layer of abstraction is necessary to make Postgres live on top of scalable network services. I believe projects like Zenith will help to bring this future closer to reality. At Yandex we are putting some effort in this direction as well.

Could you describe your PostgreSQL development toolbox?

I own Linux, macOS and Windows laptops and alternate development between them. As an IDE I use CLion and VS Code. Both integrate nicely with the PG codebase. A small fraction of the time I code from an iPad, using code-server installed on a VM in the cloud.

Which skills are a must have for a PostgreSQL developer/user?

Technical communication is the most crucial skill. I think this is true for every project with 100+ active members, especially when they are not bound by any organizational hierarchy, have very diverse views and represent a distributed community.

Do you use any git best practices which make working with PostgreSQL easier?

I think what I do is an antipattern. In $HOME I have directories postgres0, postgres1, postgres2…postgresE, postgresF, postgres10… Whenever I do not understand what I was changing in postgresX, I do git clone https://github.com/postgres/postgres postgresX+1.

Yes, I know git was invented for a reason. But to give a name to a branch I need at least a subtle understanding of what the purpose of the changes was.

Don’t do as I do. Perhaps I will stop this soon too.

Which PostgreSQL conferences do you visit? Do you submit talks?

I always have some ideas to discuss. Thus I submit a talk proposal to any community conference that fits in my schedule.

I was very impressed with the simplicity and effectiveness of communication at PGCon. I’m organizing the Yekaterinburg Database Meetup, but there has not been a single meetup since the COVID era. Every year I visit PGConf.Russia to run the 10 km of pg_run with Oleg Bartunov.

Do you think Postgres has a high entry barrier?

SQL has a high entry barrier. Few people can meaningfully tell serializability from linearizability. Programming has a high entry barrier. Try to explain recursion and abstraction to a student. Computers have a high entry barrier. Keyboards are far more complex than a piano, yet I haven’t heard of millions of talented pianists. Unfortunately, Postgres is not an exception. There is a high entry barrier. We TOAST the data if the slice is too big. You can VACUUM the whole production database, but never VACUUM FULL the production database.

What is your advice for people who want to start developing PostgreSQL, as in contributing to the project? Where and how should they start?

To make something meaningful you need to fix a real problem. You need to find a problem, verify that the problem is in PostgreSQL, and only then fix something.

Remember that PostgreSQL is not the only database in the world: there are a whole lot of forks! And the farther the fork is from vanilla Postgres, the bigger the chance to find an inconsistency or a bug. Postgres-related projects always have huge demand for spare hands and eyes to fix issues. You can always start by fixing something in WAL-G, for example :) There are 154 open issues now. And counting.

Do you think PostgreSQL will be here for many years in the future?

Postgres has amassed an enormous number of installations. There is no doubt that it will be used for many years in the future.

But the trickier question - will PostgreSQL still be the most advanced open source database? I hope so, but it’s a difficult challenge.

Postgres successfully played the extensibility card, but also accumulated architecturally questionable decisions. The most advanced open source database must:

  • Fix the annoying things from the list above
  • Operate just fine without connection poolers
  • Scale seamlessly out of the box beyond 1000 servers for write workloads with a minimum TCO
  • Not require maintenance like VACUUM, repack, and checkpointing at all
  • Avoid incidents caused by switching to a wrong plan
  • Provide support for ML pipelines and ride the wave of what data science needs

These and many other concerns will determine whether Postgres will be the most advanced database in the next decade. There will always be some entry barrier, but we must work to push this barrier down to the bare minimum.

Would you recommend Postgres for business, or for side projects?

If your expertise is in data management, I’d recommend using something else for side projects. Postgres is the lingua franca of databases; if you do not have extra time for experiments, Postgres is fine. But do not forget to try other ways too, just to compare.

This is important for Postgres itself: your awareness of competing technologies helps Postgres stay on the cutting edge. (Keep in mind that experiments can cost you all the data of your side project. But in most cases the experience is still worth it.)

Are you reading the -hackers mailinglist? Any other list?

I’m reading the -hackers, -bugs, -committers and -admin lists. From time to time I also read postgis-devel, greenplum-dev and nginx-dev.

What other places do you hang out?

I check Slack once a week.

Which other Open Source projects are you involved or interested in?

Open Source is not just a bunch of free programs to use. It’s a tool to share ideas implemented in strict and correct code. I really admire My First Calculator. It reminds me what perfect software should look like. Another interesting example of an idea in Open Source is Quine Relay. Though I can neither add anything to nor remove anything from either project. I’m not that creative yet.

Anything else you like to add?

COMMIT;