Kohei Kaigai



Tags:   postgresql    heterodb    japan    pgsql-hackers    parallelization    custom-scan-provides    foreign-data-wrapper    gpu    nvme    pg_strom    postgis    selinux   
Category:   Interviews   
Interviewed by: Andreas Scherbaum

PostgreSQL is the World’s most advanced Open Source Relational Database. The interview series “PostgreSQL Person of the Week” presents the people who make the project what it is today. Read all interviews here.

Please tell us about yourself, your hobbies and where you are from.

I grew up in Shiga, Japan. It is known as the lake-side area of the greatest lake in our nation. I studied computer science and business administration at the University of Tsukuba. After that, I started my first career at NEC, as a developer of Linux kernel. In 2006, I started to develop PostgreSQL internal and joined the community for a series of security enhancements. Recently, I’m interested in the integration of GPU’s computing power with database workloads. 2017, I established HeteroDB for productization of the technology. I am also a father of two kids, and a player of curling.

Kohei Kaigai

Kohei Kaigai

Any Social Media channels of yours we should be aware of?

When did you start using PostgreSQL, and why?

As a user, I had used PostgreSQL 7.x for the backend of PHP scripts of a web system, likely 2000. As a developer, I began investigation and enhancement at PostgreSQL 8.2 for SELinux support in userspace.

Do you remember which version of PostgreSQL you started with?

8.2

Yes, I had studied computer engineering at the undergraduate course. Operating system (including parallel, distributed computing) and computer graphics courses were the most favorite for me, however, my score on database theory is not very good. :-)

What other databases are you using?

No. :-)

PG-Strom. Recently, I’m working on a transactional GPU memory store and GPU version of PostGIS functions. For the core 14devel, asymmetric partition-wise join, truncate on foreign-table, and 16bit float support are in progress.

How do you contribute to PostgreSQL?

As my community activity, I wrote and submitted various kind of patches over the last ten years. Some of them had a long-standing discussion on pgsql-hackers, to finalize the design and implementation. The most significant mind-set for open source developer is “never give up”, I think.

Any contributions to PostgreSQL which do not involve writing code?

For example, I had suggested to our customer who wanted to migrate legacy PostgreSQL 9.1 with a terabytes database to RedShift or ExaData, that a more reasonable schema definition and a bit of query tuning is way better. In the end, they are still using PostgreSQL, but upgraded to v10. Not only code development is important, or a presentation at a conference, I think people’s daily jobs are also contributions to PostgreSQL.

What is your favorite PostgreSQL extension?

Citusdb, to handle larger data than single node capacity, even if GPU and NVME are used. Auto_explain and pg_buffercache give us a clear hint for performance issues.

What is the feature you like most in the latest PostgreSQL version?

Pluggable storage APIs. Nowadays, PostgreSQL has three different ways to extend SQL execution (FDW and CustomScan in addition to Pluggable storage). They have different characteristics for each, so developers can choose the best way for their purpose.

Adding to that, what feature/mechanism would you like to see in PostgreSQL? And why?

Asymmetric partition-wise join - when we process large scale data (like IoT/M2M log data) with relatively small master tables, we often want to partition the fact table only. In this case, current version of PostgreSQL does not support join between partition-leafs and non-partition tables before merging. I submitted a patch for this feature in 13 development cycle, but could not get enough attention from the people. I like retry it in the v14.

Cost estimation with volatility consideration - when optimizer estimates the cost for candidate paths, it usually relies on estimated number of rows. In case of simple table scan with latest statistics, it is almost accurate. However, its accuracy goes worse and worse with complicated sub-queries and multi-level JOINs. This kind of gap from the actual number of rows may generate catastrophic query execution plan. For example, even if optimizer expects nested-loop with 2 x 1000 rows, it is not strange the child relation generates 100 x 5000 rows. In this case, execution cost is 250 times larger than the estimation. I like to have a concept of volatility factor for cost optimization. Even if estimated cost is small, it may be risky when estimated number of rows is not accurate. In this case, the “second best” might be the “actually best” path. It is just an idea status.

Could you describe your PostgreSQL development toolbox?

emacs, cscope, and less. Nothing special…

Which skills are a must have for a PostgreSQL developer/user?

C programing language and widespread knowledge around operating systems, especially memory management and storage, are baseline. However, the most significant skill is to introduce your project goal and involve other persons.

Which PostgreSQL conferences do you visit? Do you submit talks?

I’ll submit two talks PGconf.EU 2020. Likely, PostgreSQL Conference Japan and PGconf.ASIA are candidates. Unfortunately, PGconf.NYC in March was canceled. I want to have a chance in USA again.

What is your advice for people who want to start PostgreSQL developing - as in, contributing to the project. Where and how should they start?

The easiest way to develop something workable is implementation of your own SQL-function in C-language. PostgreSQL provides various kind of hooks and infrastructure (like FDW APIs) to extend the feature without modification of the core code. Development of SQL-function in C-language will educate newbie people how to use the build toolchain, how internal modules works mutually, how resources are managed, and so on… Something “workable” is a great first step towards the core development.

Do you think PostgreSQL will be here for many years in the future?

Definitely. PostgreSQL community has flexibility to adopt new concepts and features according to the progress of technologies. Probably, what we call PostgreSQL in ten years later is much different from the one we are looking at now.

Are you reading the -hackers mailinglist? Any other list?

Yes, but not all. Only related topics.

Which other Open Source projects are you involved or interested in?

  • Apache Arrow - a common columnar data format for big-data solutions. PG-Strom also supports direct read of Apache Arrow files for quick data import.
  • RAPIDS cuDF - a data-frame management on GPU device memory. I’ve been interested in the software to connect the world of database and the world of machine-learning.