By: Andrea Gnemmi | Updated: 2024-01-12 | Comments | Related: > PostgreSQL
Problem
PostgreSQL is an open-source RDBMS that has received much attention in the last 2 or 3 years, with many companies worldwide switching or migrating to it. As the open-source RDBMS go-to, it offers a wide variety of features, is SQL ANSI-complaint (more than other RDBMS), and is a stable and working platform. But is it new? What are the roots and history of PostgreSQL? What are some of its features and terminology?
Solution
In this tip, I will highlight PostgreSQL's history and features. This introduction to PostgreSQL is also intended as a framework for future articles on the technology.
History
PostgreSQL began in 1975 at the University of California, Berkeley, with a project named Ingres. From that original project came three different "forks":
- Ingres, which continued as Ingres Corporation and is still active today.
- Informix
- Sybase, which evolved into SQL Server. Some say PostgreSQL and SQL Server share the same roots!
What happened to the original project at UC Berkeley? It evolved into Postgres, then PostgreSQL, the current community-based, open-source product. Apart from the open-source version, there are some other PostgreSQL-based RDBMS like EDB, which maintains the pgAdmin GUI open-source tool, and some specialized tools such as Timescale, a time series database built on PostgreSQL. These are only a few examples of a vast community-based and licensed ecosystem.
So, PostgreSQL is a mature product with a long development history and an engaged community. The current PostgreSQL is version 16, which has many new features.
Limits, OS Supported, and Terminology
PostgreSQL is available in a wide range of OS and cloud environments. Since it's an open-source project, most installations are under a Linux distribution, such as Red Hat or Ubuntu. PostgreSQL is also available in Windows and on all leading cloud platforms with proprietary versions, such as Aurora DB in AWS or Azure Cosmos DB, all based on PostgreSQL.
General limits of PostgreSQL include:
Limit | Value |
---|---|
Maximum Database Size | Unlimited |
Maximum Table Size | 32 TB |
Maximum Row Size | 1.6 TB |
Maximum Column Size | 1 GB |
Maximum Rows per Table | Unlimited |
Maximum Columns per Table | 250-1600 (Data type depending: Must fit on a single page) |
Maximum Indexes per Table | Unlimited |
Note: This limit on the maximum columns per table depends on the data type of the columns, as it must fit on a single memory page. The rest of the limits are self-explanatory.
What is quite different from other RDBMSs is the terminology of PostgreSQL. Below is a "translation table."
"Normal" Term | PostgreSQL |
---|---|
Table or Index | Relation |
Row | Tuple |
Column | Attribute |
Data Block | Page (on storage) |
Page | Buffer (on memory) |
Transaction Log | WAL (Write Ahead Log) |
Instance | Cluster |
While there are terms like Tuple that are widely known (or they should be by database professionals!), others like WAL are pretty obscure outside the PostgreSQL world. WAL is the same as the transaction log in SQL Server or Archive log in Oracle. Another term that can cause some confusion is Cluster. PostgreSQL uses it to identify a cluster of databases under the same instance, while in SQL Server, it is effectively an instance.
It is important to notice that a cluster (instance) in PostgreSQL can have more than one database, just like SQL Server (and not Oracle). Still, performing queries from one database to another is impossible, even under the same cluster. To do so, we use Foreign Data Wrappers (same as linked servers). This will be discussed later in this article.
Features and Differences with SQL Server
Some of the features of PostgreSQL are common in all RDBMS, such as indexing, partitioning, replication, etc. There are some differences in how these are implemented in PostgreSQL. I want to outline the differences compared to SQL Server. It will not be a comprehensive list nor a deep dive into each topic.
Data Types – JSON Management
There are some differences in data types. For example, in PostgreSQL, we have timestamp with or without time zone, instead of datetime. Another example is if we want to specify the maximum possible value for a varchar or nvarchar, PostgreSQL uses varchar() or text instead of nvarchar(max). But the big difference is in the management of JSON data, as PostgreSQL probably has the best and most comprehensive functions in the entire RDBMS ecosystem.
In PostgreSQL, two data types are dedicated to JSON: json and jsonb. The first was developed at the beginning to store JSON semi-structured data, while the second data type, jsonb, is the go-to choice to query with JSON functions that column, as it can be indexed. Some additional operators can be used to search indexed jsonb data better.
The list of operators available is huge. For example, PostgreSQL has some comparison operators:
- Extract the JSON field with the given key
json -> text -> json
jsonb -> text -> jsonb
- Extracts JSON object field with the given key as text
json ->> text -> text
jsonb ->> text -> text
They can be used to filter or extract data from the JSON data column like this:
select order_details -> 'track' as track from orders where order_details ->> 'customer' like 'Astrid%';
All the various functions and operators can be seen in the official documentation of PostgreSQL.
Plus, I have plans for writing an article on this subject.
T-SQL vs. pg/plsql
PostgreSQL has its own SQL dialect and procedural language similar to Oracle PLSQL called pg/plsql, which is used in functions and procedures. Specificities include:
- CREATE TABLE IF NOT EXISTS – Gives the possibility of issuing a create table and skipping it to avoid an error if the table is already in the database/schema (There is the equivalent also for views, procedures, etc.).
- CREATE TABLE (LIKE TABLE) INCLUDING (options) – Create a table using another one as a template, including or not including constraints, identity columns, etc.
- CREATE TABLE… INHERITS – Offers a hierarchy in the tables. It was used for partitioning tables in the past (until version 10).
- TEMP TABLE ON COMMIT PRESERVE ROWS, DELETE ROWS, DROP – We are all familiar with the concept of a temporary table, which in PostgreSQL is almost the same as in SQL Server. Here, we have options (preserve, delete, and drop) that give more possibilities for what to do with the temp table.
- SELECT … FOR UPDATE WAIT or NOWAIT – A concept familiar to Oracle folks. Option to lock a row with a select for updating it in a second moment, waiting or not waiting if the row is already locked. Remember that PostgreSQL implements the concurrency without locking but with multi-versioning. More on that later!
- CREATE DATABASE using a template database that can be built ad hoc.
- Variables in functions and ad hoc queries – in line code blocks
- Procedures and functions differ somewhat from other RDBMS, notably SQL Server. More on this topic in upcoming articles.
- Prepared statements, PREPARE - Only prepared and functions/procedures queries have their plan cached. This was the topic of one a previous tip: PostgreSQL Prepared Statements to Enhance the Performance of Frequently Used Queries
All these topics will be developed in depth in upcoming articles.
User/Roles, Schemas, and Owner
In PostgreSQL, the concept of Users/Roles and Groups interlaps. Users are called Roles, and Groups are basically Roles without the Login option. One of the best practices is to assign privileges to groups and then add the membership on the group to the user (role).
The concept of owner in PostgreSQL is also different, as we can have ownership on a Schema, but that does not mean ownership of all the objects in the schema! Next are the Public role and the public schema, which are automatically accessible by each new user. Also, the public schema is the only one by default in the search path, which means that if we do not specify the schema when an object is created, that one will be created under the public schema. Since it is a broad subject, I introduced it in the tip Privileges and Roles in SQL Server, Oracle and PostgreSQL - Part 1, but we will dive deeper into this topic in future articles.
Views and Materialized Views
The concept of Views is the same as in SQL Server. But, PostgreSQL has Materialized Views like Oracle, similar to indexed views in SQL Server. There are some differences in syntax. For example, the availability of CREATE OR REPLACE VIEW. PostgreSQL does not return an error if we try to create an existing view; it will substitute it.
Another interesting concept is the updatable view that comes with some restrictions. It is only possible if the view is based on one table with no other joined tables and not on computed columns.
A Materialized View or MV has many available options, such as creating it with or without data in it and then refreshing it a second time with the Refresh materialized view command. Or the option CONCURRENTLY, to access it during a refresh, is possible only if a UNIQUE index is present on the MV.
I wrote a tip on Views and Materialized Views in SQL Server, Oracle, and PostgreSQL that addresses some of these concepts: SQL VIEW Comparison in SQL Server, Oracle and PostgreSQL.
Sequences and Identity Columns
In PostgreSQL, the Identity column is inherited from sequences, and before version 10, it was accomplished using the SERIAL parameter:
- Syntax GENERATED [Option] AS IDENTITY
- Option can be:
- ALWAYS - a value is created with every insert, and it is impossible to insert a value into this column. This behavior is similar to SQL Server.
- BY DEFAULT - in this case, the number is generated as a default, so a value can always be inserted.
- BY DEFAULT ON NULL - the number is generated only if a NULL value is used for that column.
A two-part comprehensive guide on Identity Columns is available for more information:
- SQL Identity Columns in SQL Server, Oracle and PostgreSQL - Part 1
- SQL Identity Column Gaps and Conversion in SQL Server, Oracle and PostgreSQL
Working with Date Time and Character Strings
Working with dates and strings in PostgreSQL is somehow more intuitive than with SQL Server and more SQL ANSI Standard. For example, we have the current_timestamp function instead of getdate(), but also CURRENT_DATE and CURRENT_TIME, NOW() (equal to current_timestamp). Additionally, PostgreSQL can add or subtract an interval of time from a date with a + or – instead of using a function like DATEADD. AGE rather than DATEDIFF and DATE_PART is slightly different, but there is also EXTRACT, which is ANSI Standard.
The way of concatenating strings differs: we have the pipe sign || or CONCAT instead of the + character. There are also differences in the data type VARCHAR(N) and TEXT instead of VARCHAR(MAX) and TEXT. Interestingly, we also have the availability of POSIX Regex expressions.
Additional articles on date and string functions include:
- SQL Server, Oracle and PostgreSQL Date Functions
- SQL String functions in SQL Server, Oracle and PostgreSQL
The Absence of a Scheduler
In PostgreSQL, there is no Scheduler. However, various solutions are available, such as using Linux cron and crontab or third-party software like pgAgent. This is an extension (to be discussed later) and is one free alternative, community maintained by EDB and similar to other RDBMS Agents. It is still based on cron and uses cron syntax. It is used with pgAdmin, the de facto standard GUI for PostgreSQL, either developed or maintained by EDB.
Foreign Data Wrappers (FDW)
In SQL Server, we use linked servers. But in PostgreSQL, we have Foreign Data Wrappers (FDW):
- Foreign Data Wrapper feature enables access to non-PostgreSQL data as well as to other databases in the same PostgreSQL cluster
- Acts like adapter/interface to different data sources from PostgreSQL
- Read non-relational data like CSV, JSON, XML
- All data appear as local PostgreSQL tables
Concurrency - MVCC Implementation – Vacuum and Vacuum Freeze
PostgreSQL implements Concurrency using Multi Version Concurrency Control (MVCC). It is similar to using READ_COMMITTED_SNAPSHOT in SQL Server, so there are no locks but various versions of the row at different times. However, how it is implemented is different. In this case, there is no snapshot but version numbers. MVCC depends on comparing transaction ID numbers: a row version with an insertion XID greater than the current transaction's XID is "in the future" and should not be visible to the current transaction. This introduces a further problem as the transaction IDs have limited size (32 bits), so a cluster that runs for a long time (more than 4 billion transactions) would suffer transaction ID wraparound! This causes a catastrophic data loss (actually, data is still there, but you cannot access it!). To avoid this, every table in the database must be vacuumed with option freeze at least once for every 2 billion transactions. Fortunately, this works automatically under the hood using AUTOVACUUM. There is much more to say on it, especially on VACUUM procedures, but I will leave it for further articles on this topic.
Partitioning
We all know the benefits of partitioning a large table to have smaller data sets to query using the partition key. In PostgreSQL, there are two different techniques used to partition a table: inheritance partitioning, which is the old method used before version 10, and declarative partitioning, which is similar to SQL Server.
Inheritance is a feature on tables that lets you create a hierarchy between tables. This allows data to be divided according to the user's choice, and triggers must be used to redirect inserted data to the appropriate partition. This is quite complicated as a lot of the work must be manually done to have a complete solution.
Instead, declarative partitioning can be done in three different ways:
- Range Partitioning: Defined via key column(s) with no overlap or gaps.
- List Partitioning: Explicitly listed for the partitioning scheme.
- Hash Partitioning: Specified by a modulus and a remainder for each partition.
I recently wrote an article explaining all the syntax and features of partitioning in PostgreSQL: PostgreSQL Partitioning Tables - Learn the Syntax, Features and Options.
PostgreSQL Extensions
PostgreSQL has a unique feature that, in many ways, enhances the capabilities
of the RDBMS: Extensions. In short, these are Additional Supplied Modules
in the "extension" directory that add functionalities to PostgreSQL.
Extensions are not included in the core database and may still be under development.
They are part of the contrib package and can be installed using
yum install postgresql16-contrib
.
There are a lot of different extensions offering many features. For example, all FDW are extensions. Another very useful extension is pg_stat_statements, a monitoring tool that helps a lot in the performance tuning of queries. I will write more also on this topic.
A complete list of all extensions is available by querying the system view
pg_available_extensions
.
Indexes – BRIN Type Index
Finally, a few words on indexes in PostgreSQL, a topic that will receive a lot more attention in future articles. For this introduction, we can say that PostgreSQL supports multicolumn and functional indexes and can use combined indexes. Moreover, we can also build filtered indexes in PostgreSQL called partial indexes. With the option INCLUDE, we can include other non-key columns in the index, just like in SQL Server.
Below is a quick list of all the index types available in PostgreSQL with a description:
Index | Description |
---|---|
B-tree | Equality and range queries on data that can be sorted (Default index type) |
Hash | Hash indexes store a 32-bit hash code derived from the value of the indexed column. Hence, such indexes can only handle simple equality comparisons. The query planner will consider using a hash index whenever an indexed column is involved in a comparison using the equal operator. |
GiST | Can be used depending on the indexing strategy (the operator class) |
SP-GiST | Like GiST, offers an infrastructure that supports various kinds of searches. |
GIN | Can handle values that contain more than one key; for example, arrays that are inverted indexes |
BRIN (Block Range Index) | Accelerates scanning of large tables by maintaining summary data about bloc ranges. Very small index size compared to B-Tree. The tradeoff is that you can't select a specific row, so they are not useful for all data sets. |
Next Steps
There is much more to say about other features of PostgreSQL, such as the replica, all the high availability parts, or installation and setup of PostgreSQL. These will be addressed in future articles with a deep analysis of each feature.
This article is a starting point or framework for all the tips that follow on PostgreSQL, so stay tuned. More will follow!
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2024-01-12