In this post, we'll talk about generating identifiers or primary keys for our database and what options we have.
Table of Contents
Of course, this discussion is for relational databases; for other types of databases, the answer could be very different.
1 – Which identifier to use in a database?
Many of us, myself included, have always wondered which type of ID we should use in our database. And until about 10 or 12 years ago, there really wasn’t any doubt, it was simple: we used an integer (Int) auto-increment ID, so that each insert would increase the value by 1. However, around 10/12 years ago, the trend started shifting in favor of UUIDs (or GUIDs in C#).
As with everything in programming, each option has its pros and cons, and here we’ll look at when each is best, depending on the scenario.
First, let’s look at what our options are.
2.1 – What is an auto-increment ID?
The first and clearest option: an int with auto-increment. It’s pretty straightforward, each new value that we insert is assigned a sequential numeric identifier that’s automatically incremented. By default, it starts at 1, but you can change that. Plus, if you have more than one table, each table starts at 1, meaning that ID 1 in the "cars" table refers to one record, while ID 1 in the "motorcycles" table refers to a different record.
2.2 – What is a UUID or unique identifier?
The second option is to use a UUID, which stands for Universally Unique Identifier. The key word here is unique, since it means that the ID will only exist once, not just in your system, but in the world. That’s because it uses the system’s current time and a random property from the executing machine (like its public IP, the MAC address, etc.).
For anyone unfamiliar with what a UUID looks like (version 4, the commonly used one), here is an example:
Image from https://segment.com/blog/a-brief-history-of-the-uuid/
Technically, a UUID can be repeated, but it’s extremely, extremely unlikely. Mathematically, the probability of a collision is 1 in 2.71 * 10^18, which is an astronomically large number. To have a 50% chance of collision, you would need to generate a billion UUIDs per second for 85 years.
So, you can rest easy. If it ever does happen (it has happened), you can always fix the issue manually and write a post about it.
For those of you using C# (like me), know that UUID and GUID are the same, the difference is that Microsoft implemented their own version with their own quirks.
3 – Pros and cons of using an auto-increment ID
Let’s start with the most obvious: auto-increment IDs are fast for querying. When your primary keys are unique IDs and you query by ID, the database knows exactly where that data is, it’s the fastest possible access.
You might think that UUIDs are also very fast for access, and in fact, PostgreSQL has a specific type for them. And yes, they’re fast, but depending on the amount of data in your table, you could start to notice performance issues. Here, I’m not talking about tables with a couple million records, I mean with billions of rows.
Auto-increment IDs are easy to implement, you set them up in the database, and the DB itself assigns the value. This can be both a pro and a con in some scenarios, especially if you need the new ID returned. It’s true that Entity Framework will return it, but depending on your DB (or ORM), you might need to do a select after the insert to get the ID.
Finally, if you have auto-increment IDs, it's usually easier to inspect the data. Let me explain: we've all had to check database values when something’s not working or making sense, and with numeric IDs it’s just easier to spot things, since they're numbers, not 16 totally random values.
Another huge downside is that by using numbers, it’s easy for attackers or malicious actors to figure out how to try to hack you.
In fact, this happened at two companies where I’ve worked. In this particular case, I worked for the parent company but we were in the same office.
Let’s say a generic user had access to the information of users from their tenant (like a social network, seeing your friends, etc.). These users didn’t see any IDs in the interface, but obviously, when you called the backend, the ID was in the payload, same for reading. Well, these attackers saw it was an auto-increment ID, changed the ID in the payload, and grabbed all the data for all users across all tenants. The backend checked that the caller had a valid token, but didn’t check the tenant or access level.
Luckily, it didn’t go any further, the attackers were two high school students who used it just to see their classmates’ grades 😂. The plus side is that the parent company I worked for invited both kids to spend two weeks, all expenses paid, at their offices in Norway with the pentesting team to see if it interested them. The company in question is a SaaS valued at 20 billion euros, so it was a pretty good offer.
4 – Pros and cons of using UUIDs
While querying by UUID isn’t extremely efficient, it’s not terrible either. With good indexing, the problem is mostly mitigated, so we'll call this a neutral downside.
One thing I’ll mention is that UUIDs are usually generated outside the database. Some databases (not all) have features to generate them during the insert, but often this responsibility is left to the code.
Some databases can use sequential UUIDs, letting the DB itself generate these IDs. The advantage is they don’t fragment storage, which is what causes query issues; if you’re going to do this, use auto-increment IDs, it’s my opinion.
Of course, the big benefit here is that UUIDs are almost impossible to guess, well, maybe not impossible, but extremely unlikely, so you shouldn’t have issues returning these values to users.
As I mentioned before, these are unique system identifiers, if you have distributed applications or even distributed databases, you can be sure that IDs will not ever collide. For example, imagine you have one database per tenant, the IDs for one tenant will NEVER be the same as those in another. It might sound silly, but I've seen situations where having 10 products with the same ID turns into a nightmare, especially if your logs aren’t great.
Not only does this approach help solve that type of problem, it also helps with replication and migration tasks.
5 – Use an auto-increment ID or a UUID as a primary key?
The million-dollar question: which one should you use? First, let’s say that if your app is small, you can completely ignore the performance issue. For example, this blog uses strings for lots of IDs, why? Because I have 200 records, so it doesn’t matter for performance.
The most important thing to know is whether you will return these values outside of your application or not. Generally, NOTHING that is sent outside your application, whether to the user interface, via email, in URLs, or in a payload, should be simple IDs, since sequential IDs are easy to guess. You could be exposed to attacks like the one described above. That particular case wasn’t a breach that destroyed a company, but it could have been.
Even OWASP recommends using complex IDs as a defensive programming measure. One option is to obfuscate the auto-increment ID. Obfuscating isn’t really secure, but it's much better than exposing the raw ID.
This isn’t the only time it’s happened at companies I've worked for, one time a developer was fired because the system allowed access to all client information, and Google even indexed it… I was the one who built the next version of the system, but without those problems, obviously 👀.
Personally, I would not expose auto-increment IDs if I’m revealing sensitive information. You might lose a tiny bit of performance, but you gain peace of mind, and it's not even noticeable.
5.1 – Combining numeric IDs and UUIDs
I’ve also seen setups where both types are combined: user-facing operations always return the UUID, while table relations internally use numeric IDs.
I’m not the biggest fan in the world, but it’s not a bad idea, especially if you have lots of joins between different tables in your DB.
5.2 – Squids in .NET
Finally, as an alternative, we have Squids, which is just a library that allows you to generate unique identifiers from numbers, making them safe for use in URLs.
Think about YouTube, for example, the video identifier is very likely that YouTube is using an auto-increment ID internally.
This identifier is purely visual and very easy to use, both for encoding and decoding. You just have to do the following:
var sqids = new SqidsEncoder<int>();
var id = sqids.Encode(1, 2, 3); // "86Rf07"
var numbers = sqids.Decode(id); // [1, 2, 3]
If you’re curious, check out their GitHub page, where there’s much more info, for example, you can configure your own character dictionary to make it harder to decrypt, block certain characters, avoid certain sequences, and more.
If there is any problem you can add a comment bellow or contact me in the website's contact form