A Universally Unique Identifier (UUID) is a specific form of identifier which can be safely deemed unique for most practical purposes. Two correctly generated UUIDs have a virtually negligible chance of being identical, even if they’re created in two different environments by separate parties. This is why UUIDs are said to be universally unique.
In this article, we’ll look at the characteristics of UUIDs, how their uniqueness works, and the scenarios where they can simplify resource identification. Although we’ll be approaching UUIDs from the common perspective of software that interacts with database records, they are broadly applicable to any use case where decentralized unique ID generation is required.
What Actually Is a UUID?
A UUID is simply a value which you can safely treat as unique. The risk of collision is so low that you can reasonably choose to ignore it altogether. You may see UUIDs referred to using different terms (GUID, or Globally Unique Identifier, is Microsoft’s preferred term) but the meaning and effect remain the same.
A true UUID is a unique identifier that’s generated and represented according to a standardized format. Valid UUIDs are defined by RFC 4122; this specification describes the algorithms that can be used to generate UUIDs that preserve uniqueness across implementations, without a central issuing authority.
The RFC includes five different algorithms which each use a different mechanism to produce a value. Here’s a brief summary of the available “versions”:
- Version 1 – Time-Based – Combines a timestamp, a clock sequence, and a value that’s specific to the generating device (usually its MAC address) to produce an output that’s unique for that host at that point in time.
- Version 2 – DCE Security – This version was developed as an evolution of Version 1 for use with Distributed Computing Environment (DCE). It is not widely used.
- Version 3 – Name-Based (MD5) – MD5 hashes a “namespace” and a “name” to create a value that’s unique for that name within the namespace. Generating another UUID with the same namespace and name will produce identical output so this method delivers reproducible results.
- Version 4 – Random – Most modern systems tend to opt for UUID v4 as it uses the host’s source of random or pseudo-random numbers to issue its values. The chances of the same UUID being produced twice are virtually negligible.
- Version 5 – Name-Based (SHA-1) – This is similar to Version 3 but it uses the stronger SHA-1 algorithm to hash the input namespace and name.
Although the RFC refers to the algorithms as versions, that does not mean you should always use Version 5 because it’s seemingly the newest. The one to choose depends on your use case; in many scenarios, v4 is chosen because of its random nature. This makes it the ideal candidate for simple “give me a new identifier” scenarios.
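As a rough illustration of how the versions differ in practice, here is a minimal sketch using Python’s standard uuid module (chosen for brevity; most languages ship an equivalent). The name "example.com" is an arbitrary sample input, not something from the RFC.

```python
import uuid

# Version 4: drawn from the operating system's random source.
random_id = uuid.uuid4()

# Version 1: timestamp, clock sequence and a node identifier
# (often derived from the MAC address).
time_id = uuid.uuid1()

# Versions 3 and 5: hash a namespace and a name, so the same inputs
# always produce the same UUID. "example.com" is an arbitrary sample name.
md5_id = uuid.uuid3(uuid.NAMESPACE_DNS, "example.com")
sha1_id = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")

print(random_id.version, time_id.version, md5_id.version, sha1_id.version)

# Name-based UUIDs are reproducible; random ones are not.
print(sha1_id == uuid.uuid5(uuid.NAMESPACE_DNS, "example.com"))
print(random_id == uuid.uuid4())
```

The name-based versions are useful when two systems need to derive the same identifier independently from a shared input, whereas v4 suits the plain "give me a new identifier" case.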
Generation algorithms emit a 128-bit unsigned integer. However, UUIDs are more commonly seen as hexadecimal strings and can also be stored as a binary sequence of 16 bytes. Here’s an example of a UUID string:
16763be4-6022-406e-a950-fcd5018633ca
The value is represented as five groups of hexadecimal digits (8, 4, 4, 4, and 12 characters long) separated by dash characters. The dashes are not a mandatory component of the string; their presence is down to historic details of the UUID specification, where the groups correspond to the fields of the original time-based layout. They also make the identifier much easier for human eyes to perceive.
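The different representations can be seen side by side with a short sketch, again assuming Python’s standard uuid module. It parses the example string above and shows that the hyphens are optional on input.

```python
import uuid

parsed = uuid.UUID("16763be4-6022-406e-a950-fcd5018633ca")

as_string = str(parsed)  # hyphenated text form
as_hex = parsed.hex      # the same digits without hyphens
as_bytes = parsed.bytes  # raw binary form
as_int = parsed.int      # a single unsigned integer

# Both spellings parse to the same value.
same_value = uuid.UUID(as_hex) == parsed

print(len(as_string), len(as_hex), len(as_bytes), as_int.bit_length() <= 128)
print(parsed.version, same_value)
```

The version is encoded in the value itself: the first digit of the third group (the 4 in 406e) marks this example as a version 4 UUID.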
UUID Use Cases
The principal use case for UUIDs is decentralized generation of unique identifiers. You can generate the UUID anywhere and safely consider it to be unique, whether it originates from your backend code, a client device, or your database engine.
UUIDs simplify determining and maintaining object identity across disconnected environments. Historically most applications used an auto-incrementing integer field as a primary key. When you created a new object, you couldn’t know its ID until after it had been inserted into the database. UUIDs let you determine identity much earlier on in your application.
Here’s a basic PHP example that shows the difference. Let’s look at the integer-based system first:
class BlogPost {
    public function __construct(
        public readonly ?int $Id,
        public readonly string $Headline,
        public readonly ?AuthorCollection $Authors = null
    ) {}
}

#[POST("/posts")]
function createBlogPost(HttpRequest $Request): void {
    $headline = $Request->getField("Headline");
    // The ID is unknown until the database assigns one on insert.
    $blogPost = new BlogPost(null, $headline);
}
We have to initialize the $Id property with null because we can’t know its actual ID until after it’s been persisted to the database. This is not ideal: $Id shouldn’t really be nullable, and making it so allows BlogPost instances to exist in an incomplete state.
Changing to UUIDs addresses the problem:
class BlogPost {
    public function __construct(
        public readonly string $Uuid,
        public readonly string $Headline,
        public readonly ?AuthorCollection $Authors = null
    ) {}
}

#[POST("/posts")]
function createBlogPost(HttpRequest $Request): void {
    $headline = $Request->getField("Headline");
    // The identifier is supplied by the application before any insert.
    $blogPost = new BlogPost("16763be4-...", $headline);
}
Post identifiers can now be generated within the application without risking duplicate values. This ensures object instances always represent a valid state and don’t need clunky nullable ID properties. The model makes it easier to handle transactional logic too; child records which need a reference to their parent (such as our post’s Author associations) can be inserted immediately, without a database round-trip to fetch the ID the parent was assigned.
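The point about child records is language-neutral, so here is a minimal sketch of it in Python. The PostAuthor class and the create_blog_post helper are hypothetical names introduced for illustration only; they are not part of the PHP example above.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class BlogPost:
    post_uuid: str
    headline: str


@dataclass(frozen=True)
class PostAuthor:
    # Hypothetical link record pointing back at its parent post.
    post_uuid: str
    author_name: str


def create_blog_post(headline: str, author_names: list[str]):
    # The parent's identity exists before anything touches the database,
    # so the child records can reference it straight away.
    post = BlogPost(str(uuid.uuid4()), headline)
    links = [PostAuthor(post.post_uuid, name) for name in author_names]
    return post, links


post, links = create_blog_post("Hello UUIDs", ["Ann", "Bob"])
print(all(link.post_uuid == post.post_uuid for link in links))
```

Both the post and its author links could then be written in a single transaction, since none of them is waiting on a database-assigned key.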
In the future, your blog application might move more logic into the client. Perhaps the frontend gains support for full offline draft creation, effectively creating BlogPost instances that are temporarily persisted to the user’s device. Now the client could generate the post’s UUID and transmit it to the server when network connectivity is regained. If the client subsequently retrieved the server’s copy of the draft, it could match it up to any remaining local state as the UUID would already be known.
UUIDs also help you combine data from various sources. Merging database tables and caches that use integer keys can be tedious and error-prone, because each source numbers its rows independently and the same key can refer to different records. UUIDs offer uniqueness not only within a table but across every table, database, and system that generates them. This makes them much better candidates for replicated structures and data that’s frequently moved between different storage systems.
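A small sketch makes the merge problem concrete. It assumes two hypothetical stores, "site A" and "site B", that each numbered their rows from 1; plain dictionaries stand in for the tables.

```python
import uuid

# Two independent stores that both started counting at 1.
site_a_int = {1: "Post from site A", 2: "Another post from A"}
site_b_int = {1: "Post from site B"}

# Merging by integer key silently overwrites site A's row 1.
merged_int = {**site_a_int, **site_b_int}

# The same rows keyed by independently generated UUIDs.
site_a_uuid = {uuid.uuid4(): title for title in site_a_int.values()}
site_b_uuid = {uuid.uuid4(): title for title in site_b_int.values()}
merged_uuid = {**site_a_uuid, **site_b_uuid}

print(len(merged_int), len(merged_uuid))
```

With integer keys you would need to renumber one side and rewrite every reference to it; with UUID keys the rows can simply be combined.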
Caveats When UUIDs Meet Databases
The benefits of UUIDs are quite compelling. However, there are several gotchas to watch for when using them in real systems. One big factor in favor of integer IDs is they’re easy to scale and optimize. Database engines can readily index, sort, and filter a sequence of numbers that only ever increases.
The same can’t be said for UUIDs. To begin with, UUIDs are considerably bigger than integers: 16 bytes in binary form, which is four times the size of a 4-byte integer, and 36 bytes when stored as a hyphenated string. For large datasets, this could be a significant consideration in itself. The values are also much trickier to sort and index, particularly in the case of the most common random UUIDs. Their random nature means they have no natural order, so new rows land at arbitrary positions in the index instead of being appended to the end. This will harm indexing performance if you use a UUID as a primary key.
These problems can compound in a well-normalized database that makes heavy use of foreign keys. Now you may have many relational tables, each containing references to your 36-byte UUIDs. Eventually the extra memory needed to perform joins and sorts could have a significant impact on your system’s performance.
You can partially mitigate the issues by storing your UUIDs as binary data. That means a BINARY(16) column instead of VARCHAR(36). Some databases such as PostgreSQL include a built-in UUID datatype; others like MySQL have functions (UUID_TO_BIN() and BIN_TO_UUID() in MySQL 8.0) that can convert a UUID string to its binary representation, and vice versa. This approach is more efficient but remember you’ll still be using extra resources to store and select your data.
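The binary storage approach can be tried out without a database server. This sketch assumes Python with its bundled sqlite3 module and an in-memory database; SQLite has no BINARY(16) type, so a BLOB column stands in for it, and the posts table is a hypothetical example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id BLOB PRIMARY KEY, headline TEXT NOT NULL)")

post_id = uuid.uuid4()

# Store the 16 raw bytes rather than the 36-character string.
conn.execute(
    "INSERT INTO posts (id, headline) VALUES (?, ?)",
    (post_id.bytes, "Hello UUIDs"),
)

row = conn.execute(
    "SELECT id, headline FROM posts WHERE id = ?",
    (post_id.bytes,),
).fetchone()

# Convert back to the familiar text form only at the application edge.
restored = uuid.UUID(bytes=row[0])
print(len(row[0]), len(str(restored)), restored == post_id)
```

The stored key is less than half the size of its string form, which adds up once the same value is repeated across indexes and foreign keys.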
An effective strategy can be to retain integers as your primary keys but add an extra UUID field for your application’s reference. Relational link tables could use the integer IDs to enhance performance while your code fetches and inserts top-level objects with UUIDs. It all comes down to your system, its scale, and your priorities: when you need decentralized ID generation and straightforward data merges, UUIDs are the best option, but you need to recognize the trade-offs.
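One possible shape for that hybrid layout is sketched below, again assuming Python’s sqlite3 module with an in-memory database. The posts and post_authors tables and their columns are hypothetical names chosen for the example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,      -- compact internal key used for joins
        uuid BLOB NOT NULL UNIQUE,   -- identifier exposed to the application
        headline TEXT NOT NULL
    );
    CREATE TABLE post_authors (
        post_id INTEGER NOT NULL REFERENCES posts(id),
        author_name TEXT NOT NULL
    );
""")

public_id = uuid.uuid4()
cursor = conn.execute(
    "INSERT INTO posts (uuid, headline) VALUES (?, ?)",
    (public_id.bytes, "Hello UUIDs"),
)
internal_id = cursor.lastrowid

conn.execute(
    "INSERT INTO post_authors (post_id, author_name) VALUES (?, ?)",
    (internal_id, "Ann"),
)

# The application looks the post up by UUID; the join runs on integers.
authors = conn.execute(
    """
    SELECT a.author_name
    FROM posts p
    JOIN post_authors a ON a.post_id = p.id
    WHERE p.uuid = ?
    """,
    (public_id.bytes,),
).fetchall()
print(authors)
```

Note the cost of this design: inserting the link row needs the integer key the database assigned, so the round-trip that pure UUID keys avoid is back for internal relations.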
Summary
UUIDs are unique values which you can safely use for decentralized identity generation. Collisions are possible but should be so rare they can be discarded from consideration. If you generated one billion version 4 UUIDs a second for roughly 86 years, the probability of encountering at least one duplicate would be around 50%, assuming sufficient entropy was available.
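The scale of that estimate can be checked with the standard birthday-bound approximation, p ≈ 1 − exp(−n² / (2 · 2^bits)). The sketch below is in Python and assumes version 4 UUIDs, which carry 122 random bits because 6 of the 128 are fixed for the version and variant; it also assumes a perfectly uniform random source.

```python
import math

RANDOM_BITS = 122  # version 4: 128 bits minus 6 fixed version/variant bits
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
PER_SECOND = 1_000_000_000  # one billion UUIDs generated every second


def collision_probability(count: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation for at least one duplicate."""
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-(count * count) / (2 * 2**bits))


for years in (1, 86, 100):
    generated = int(PER_SECOND * SECONDS_PER_YEAR * years)
    print(years, collision_probability(generated))
```

Under these assumptions the 50% mark falls at roughly 86 years of generation at that rate, and a full century lands near 61%. At the volumes real applications produce, the probability is vanishingly small.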
You can use UUIDs to establish identity independently of your database, before an insert occurs. This simplifies application-level code and prevents improperly identified objects from existing in your system. UUIDs also aid data replication by guaranteeing uniqueness irrespective of data store, device, or environment, unlike traditional integer keys that operate at the table level.
While UUIDs are now ubiquitous in software development, they are not a perfect solution. Newcomers tend to fixate on the possibility of collisions but this should not be your prime consideration, unless your system is so sensitive that uniqueness must be guaranteed.
The more apparent challenge for most developers concerns the storage and retrieval of generated UUIDs. Naively using a VARCHAR(36) (or stripping out the hyphens and using VARCHAR(32)) could cripple your application over time, as string keys are larger and slower to compare than binary values and most database indexing optimizations will be ineffective. Research the built-in UUID handling capabilities of your database system to ensure you get the best possible performance from your solution.