Soumyadeep Chakraborty

Tags:   postgresql    greenplum    vmware    jit    postgres    table am   
Category:   Interviews   
Interviewed by: Andreas Scherbaum

PostgreSQL is the World’s most advanced Open Source Relational Database. The interview series “PostgreSQL Person of the Week” presents the people who make the project what it is today. Read all interviews here.

Please tell us about yourself, and where you are from.

My hometown is Kolkata, India and I have spent the majority of my life there, including 3 years in the tech industry. More recently I have been living in the United States: I lived in Long Island, NY for a year and a half. I have been living in San Jose, California for the past two years, working at VMware.

How do you spend your free time? What are your hobbies?

I am into console gaming and I watch a lot of football (ardent Man United fan). I enjoy walks and hikes with my wife and I also play table tennis (sadly not a whole lot since the pandemic started).

Any Social Media channels of yours we should be aware of?


Last book you read? Or a book you want to recommend to readers?

1984 by George Orwell. A must read if you are into dystopias!

Any favorite movie, or show?

I watch quite a few TV shows. My favorites would be the anime Fullmetal Alchemist and Breaking Bad.

What’s still on your bucket list?

Watching Man United play at Old Trafford, Manchester!

What is the best advice you ever got?

This is a gem I picked up in grad school: advice on how to read research papers. The approach applies just as well to long mailing list threads. Approach the task in passes: a high-level pass, followed by a deep dive. The high-level pass is very useful for gathering the requirements and concerns voiced by contributors, and the deep dive helps verify how those requirements and concerns have been addressed.

When did you start using PostgreSQL, and why?

I started using Postgres pretty recently. Since 2018, I have been hacking at VMware on Greenplum, an MPP data warehouse that is a fork of Postgres. Fortunately, that involves quite a bit of spelunking and hacking on the Postgres source! Merging Greenplum with upstream versions of Postgres is also quite a fun activity that I get to participate in.

Do you remember which version of PostgreSQL you started with?

I think I used version 12 the first time.

Yep, I completed my MS in Computer Science at Stony Brook University, New York in 2018. The grad-level courses I took in Databases, Distributed Systems, Computer Networks and Compilers are very relevant to hacking on an MPP data warehouse. Plus, working on a compiler research project there introduced me to LLVM, which is essential for working on JITed query compilation.

What other databases are you using? Which one is your favorite?

In a past life, I was an applications developer writing backend and frontend code. So, I used a variety of databases. Oracle Community Edition was the first relational database I used for school projects when I was doing my Bachelor’s degree - that is what I used to learn SQL with. I used MySQL for a while at work - I would have to say that MySQL Workbench has amazing tools for schema modelling. Apart from that I have used SQLite for toy Android applications. MongoDB was a refreshing intro into NoSQL - they have great training material. Other than that I have worked with ORMs (JPA/Hibernate) and JDBC.

I work on Greenplum’s server, across most of the backend. I also contribute to the surrounding utilities (backup-restore, replication, disaster recovery, upgrade, etc.).

How do you contribute to PostgreSQL?

I mostly contribute through patch review, either after reading emails or looking at commitfest entries. If I come across a bug or a possible improvement for Postgres while hacking on Greenplum, I submit a patch or start a discussion on pgsql-hackers. I have also submitted patches addressing TODOs I found in the code. Along with my co-workers, I have proposed a fairly large patch that makes the table AM APIs friendlier to column stores such as Zedstore, a project that I have contributed to.

Any contributions to PostgreSQL which do not involve writing code?

I have evaluated the performance of a few patches.

What is your favorite PostgreSQL extension?

I think auto_explain is a really powerful tool and can be very helpful in debugging slow queries in customer environments. pgbench is a super useful tool for profiling and weeding out concurrency bugs! This one is not really an extension but is a very useful tool for TPC-H profiling maintained by Tomas Vondra.

What is the most annoying PostgreSQL thing you can think of? And any chance to fix it?

Having a way to dump and restore statistics would be really awesome. It would be a killer feature for customers who have large databases. Right now, users lose stats during a backup-restore or an upgrade. Then they have to run ANALYZE again which can take a long long time. We do this in Greenplum today with custom tools which require much maintenance overhead. There was a patch submitted long back which took a stab at it. One just needs to follow Tom Lane’s advice on that thread and resurrect it!

What is the feature you like most in the latest PostgreSQL version?

The work done by Andres Freund to improve snapshot scalability.

Adding to that, what feature/mechanism would you like to see in PostgreSQL? And why?

I would like to see improvements in and contribute to the JIT bits. Caching of generated code would really accelerate OLAP queries. With the adoption of LLVM 12 in PG and ORC v2, there is much scope to implement features such as parallel compilation, background compilation and lazy compilation.

Could you describe your PostgreSQL development toolbox?

I develop on Ubuntu, with CLion. I use gdb for debugging. I use perf and BPF for profiling. You can do a lot with psql too! EXPLAIN ANALYZE with the right options is also a very precise tool for evaluating performance. For hacking on JIT, I sometimes use a debug build of LLVM.

Which skills are a must have for a PostgreSQL developer/user?

For a developer, it is the ability to read a massive and often foreign codebase, with the right navigational and analysis tools. And to think of the most minimal way to exercise the code with a SQL statement. Then one can attach a debugger, flip on some debug GUCs, and inspect the behavior. For fixing bugs, it is ideal to write a minimal breaking test with the rich testing infrastructure that Postgres has. For both developers and users, reading the comprehensive and brilliant documentation that Postgres possesses is super important.

Do you use any git best practices, which makes working with PostgreSQL easier?

Not many; my git workflow is pretty minimal. git format-patch is what I use for submitting patches. Perhaps the most important git tools for developing on Postgres are git blame, git grep and git log with pickaxe (-S). Reading history is vital for gaining context, especially when chasing bugs. git bisect is a great tool for pinpointing the commit at which a bug was introduced.
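As a rough illustration of the history-reading tools mentioned above, here is a minimal shell sketch. It assumes git is installed. The throwaway repository, the file name sample.conf and the commit messages are all hypothetical, made up only so the commands have something to search.

```shell
#!/bin/sh
set -eu

# Throwaway repository (hypothetical) so the commands have history to search.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Hacker"
git config user.email "hacker@example.com"

# Commit 1: a file that does not yet contain the string we will look for.
printf 'shared_buffers = 128MB\n' > sample.conf
git add sample.conf
git commit -q -m "Add sample config"

# Commit 2: introduces the string "jit_above_cost".
printf 'jit_above_cost = 100000\n' >> sample.conf
git commit -q -am "Set jit_above_cost"

# Pickaxe: list commits that changed the number of occurrences of the string.
introduced=$(git log -S'jit_above_cost' --format='%s')
echo "introduced by: $introduced"

# git grep: find where the string lives in the current tree.
git grep -n 'jit_above_cost'

# git blame: show which commit last touched each line of the file.
git blame -s sample.conf
```

On a real checkout the same three commands work unchanged; only the search string and path differ. git bisect performs a similar search across commits, driven by a pass/fail check instead of a string.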

Which PostgreSQL conferences do you visit? Do you submit talks?

PGCon Ottawa is by far my favorite. I attended PGCon 2019 and I was a speaker at PGCon 2020 on Zedstore with my colleague Alexandra Wang. I spoke on Zedstore with Alexandra again at Linux Open Source Summit 2020, in the database track. I also attended PGDay SF in 2020.

Do you think Postgres has a high entry barrier?

Not at all! It has quite comprehensive documentation and well-written, well-documented code. Compared with the LLVM codebase, which sports sparse API, code and user-facing documentation, Postgres is miles ahead in this regard!

What is your advice for people who want to start PostgreSQL developing - as in, contributing to the project. Where and how should they start?

Clone the project and follow the instructions here and here. Subscribe to pgsql-hackers and attend conferences. Start out with an extension (contrib/) or even a utility (src/bin) - it is easier given their confined scope. For dealing with the backend code, always try to map it to a feature and consult the user-facing documentation. Visit the commitfest pages and look for threads! There are many ways to contribute - and not all of them have to be writing code.

Do you think PostgreSQL will be here for many years in the future?

Many many years! I believe the community is elastic enough to undergo change and is very inclusive and friendly to new contributors!

Would you recommend Postgres for business, or for side projects?

Absolutely! The possibilities are endless for side projects - there are so many fun extensions (take a look at PostGIS, MADlib, etc.).

Are you reading the -hackers mailing list? Any other list?

Yes, -hackers is the list to read. I also find the -bugs mailing list to be a treasure - especially if you are looking for a patch to write.

What other places do you hang out?

You can find me on the Greenplum Open Source Slack.

Which other Open Source projects are you involved or interested in?

LLVM! I wish I could contribute, but I can’t at the moment, given the lack of time.