We live in a world of data. There’s even data about data, called metadata. And all this data needs to be stored in databases. Some database systems are basic tabular files, such as Excel spreadsheets. On the other end of the spectrum are ultra-high performance database systems used by massive social media platforms. Twitter users send roughly half a billion tweets per day, which works out to about 6,000 tweets per second. That’s a lot of data that needs to be stored as efficiently as possible.
Databases are typically broken down into tables. One table might store customer data while another table stores each customer’s orders. The connection between customers and their orders is formally known as a relationship. To minimize needlessly redundant data, a category of databases was created, called relational database management systems (RDBMS). An RDBMS is a system for managing tabular data using a defined structure and a query language. The structure is known as a schema, and the language used to interact with an RDBMS is typically the Structured Query Language (SQL).
When developing an RDBMS, it is critical that the database’s schema properly models how database tables relate to each other. The process of organizing the schema to remove redundancies is known as database normalization.
Database tables track unique rows of data using a primary key. A primary key is the equivalent of the integer row numbers in an Excel spreadsheet. An RDBMS should start off with normalized data, which means only keys are duplicated in separate tables; the rest of the business data should be unique to each table. Later, if there's a performance issue, the database can be optimized, as needed, by denormalizing the data.
Here are some basic practices to follow when developing a database schema for an RDBMS.
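The customer and order example can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production schema: the table names, column names, and sample rows are all assumptions made for the example. It shows the customer's details stored once, with each order repeating only the customer's key.

```python
import sqlite3

# Hypothetical schema: every table, column, and value here is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE Customer (
        id   INTEGER PRIMARY KEY,   -- plain integer key with no business meaning
        name TEXT NOT NULL,
        city TEXT NOT NULL
    );
    CREATE TABLE "Order" (          -- quoted because ORDER is an SQL keyword
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customer(id),
        item        TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO Customer (id, name, city) VALUES (1, 'Ada', 'San Diego')")
conn.executemany(
    'INSERT INTO "Order" (customer_id, item) VALUES (?, ?)',
    [(1, "keyboard"), (1, "monitor")],
)

# Because the city is stored once, a change of address touches exactly one row.
updated = conn.execute("UPDATE Customer SET city = 'Austin' WHERE id = 1").rowcount

# A join brings the customer's details back together with each order.
rows = conn.execute("""
    SELECT c.name, c.city, o.item
    FROM "Order" AS o
    JOIN Customer AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
print(updated, rows)
```

Had the city been copied into every order row, the same change would have required updating each of the customer's orders, which is the kind of redundancy normalization is meant to avoid.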
1. Primary Keys: Never build intelligence into a primary key. A primary key is simply an artifact of the database, and it should represent nothing more than a way to access a row in a database table (e.g., don't use an SSN as a primary key). A primary key that's a simple integer is highly efficient, since a computer can find and compare integers much faster than a string of nine characters (in the case of SSNs) or a 16-byte value that is usually written as a 36-character string (in the case of a universally unique identifier, better known as a UUID).
2. Table Names: Database table names should be singular (Employee, Order, Transaction, Statistic, etc.). They should be named for what each row in the table represents, not the entire collection. The reason is that, typically, there's a one-to-one mapping between a row in a database table and an object used in code. For example, in code, a variable referencing an Employee object should represent a single employee from the database, while a variable with a plural name, such as Employees, should represent a collection of objects such as an array or dictionary.
3. Lookup Tables: A lookup table is a simple, mostly static database table that's used to populate a list or collection, for example, the list of countries that a company ships to. Perhaps the company only ships to the U.S. and Canada. Later, when the company starts shipping to more countries, how does one update the pull-down menu of countries on the website or mobile app? With a lookup table, one simply adds another row to the table with the new country. Updating the database table is easier than changing the code, recompiling, and deploying. A lookup table should also have a column representing a sort order. This is done so the list can be displayed in a specific order with, say, the U.S. listed first, instead of Afghanistan, if most of the customers are located in America.
4. Compound Primary Keys: A database table should have a single primary key for a typical one-to-many relationship to another table. Sometimes, it's necessary to have a many-to-many relationship, for example, a Person table related to an Address table. A person might have multiple addresses (homes), and an address might belong to multiple people. When a many-to-many relationship is needed, a simple middle table is set up with only two columns, which contain the primary keys propagated from the two joining tables. One column holds the primary key of the Person and the other holds the primary key of the Address. Technically speaking, the two columns in the middle table are propagated foreign keys, and together they form the middle table's compound primary key.
I am not aware of a practical case where a table's primary key needs more than two columns. In cases where I have seen three (or more) primary key columns in a database table, I realized that the database designer didn't have a good understanding of relational databases. What that designer typically needed was a single primary key, plus indices on the other columns to optimize lookup speeds.
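The Person and Address relationship described in item 4 can be sketched as follows, again using Python's built-in sqlite3 module. The middle table's name (PersonAddress), the column names, and the sample rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE Person  (id INTEGER PRIMARY KEY, name   TEXT NOT NULL);
    CREATE TABLE Address (id INTEGER PRIMARY KEY, street TEXT NOT NULL);
    -- Hypothetical middle table: two foreign keys that together form its primary key.
    CREATE TABLE PersonAddress (
        person_id  INTEGER NOT NULL REFERENCES Person(id),
        address_id INTEGER NOT NULL REFERENCES Address(id),
        PRIMARY KEY (person_id, address_id)
    );
""")
conn.executemany("INSERT INTO Person VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO Address VALUES (?, ?)", [(10, "1 Main St"), (20, "9 Lake Rd")])
conn.executemany("INSERT INTO PersonAddress VALUES (?, ?)", [(1, 10), (1, 20), (2, 10)])

def addresses_of(person_id):
    """Return the streets linked to one person, via the middle table."""
    query = """
        SELECT a.street FROM Address AS a
        JOIN PersonAddress AS pa ON pa.address_id = a.id
        WHERE pa.person_id = ? ORDER BY a.id
    """
    return [street for (street,) in conn.execute(query, (person_id,))]

def residents_of(address_id):
    """Return the names linked to one address, via the middle table."""
    query = """
        SELECT p.name FROM Person AS p
        JOIN PersonAddress AS pa ON pa.person_id = p.id
        WHERE pa.address_id = ? ORDER BY p.id
    """
    return [name for (name,) in conn.execute(query, (address_id,))]

# The compound key rejects a second copy of the same (person, address) pair.
try:
    conn.execute("INSERT INTO PersonAddress VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(addresses_of(1), residents_of(10), duplicate_rejected)
```

Neither Person nor Address carries a column pointing at the other; the relationship lives entirely in the middle table, so either side can have any number of links.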
5. Number vs Varchar: Do not use a numeric type for data fields which won't be used for calculations. In other words, credit card numbers, phone numbers, SSNs, etc., should be string types (e.g., varchars) in a database’s schema. One specific problem I have encountered on a production system is when a developer stored the credit card security code (CSC) as a numeric data type. Although this credit card code is always numeric, it can contain an important leading zero, which a numeric type discards. When I saw my CSC repeatedly failing at checkout on an e-commerce website, I immediately knew the problem and confirmed it by reaching out to the database administrator (DBA).
Understanding these fundamental best practices will serve a programmer well in their efforts to create effective, efficient, and understandable code that works, whether for the data management of the mom-and-pop store down the street or the enormity of data in Twitter’s latest trending tweets.
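Two of the practices above lend themselves to a short sketch: the lookup table with a sort order (item 3) and storing digit strings as text (item 5). The example below uses Python's built-in sqlite3 module; the table names, column names, and sort values are assumptions made for illustration, and SQLite's handling of the numeric column stands in for whatever database the production system actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical lookup table; sort_order controls the display order.
    CREATE TABLE Country (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        sort_order INTEGER NOT NULL
    );
    -- Hypothetical table storing the same security code two ways for comparison.
    CREATE TABLE CardCheck (
        id          INTEGER PRIMARY KEY,
        csc_numeric INTEGER,
        csc_text    TEXT
    );
""")
conn.executemany(
    "INSERT INTO Country (name, sort_order) VALUES (?, ?)",
    [("Afghanistan", 100), ("Canada", 20), ("United States", 10)],
)

def country_menu():
    """Return country names in display order, as a pull-down menu would show them."""
    query = "SELECT name FROM Country ORDER BY sort_order, name"
    return [name for (name,) in conn.execute(query)]

menu_before = country_menu()
# Shipping to a new country is a data change, not a code change.
conn.execute("INSERT INTO Country (name, sort_order) VALUES ('Mexico', 30)")
menu_after = country_menu()

# Store the security code "012" in both columns, then read it back.
conn.execute("INSERT INTO CardCheck (csc_numeric, csc_text) VALUES (?, ?)", ("012", "012"))
as_number, as_text = conn.execute(
    "SELECT csc_numeric, csc_text FROM CardCheck"
).fetchone()

print(menu_before, menu_after)
print(repr(as_number), repr(as_text))
```

The integer column hands back the number 12, so a later comparison against the customer's "012" fails, while the text column returns the code exactly as entered.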
You can watch the full instructional video on the CANA YouTube channel here: https://www.youtube.com/watch?v=xnkJ7a1QI_E
Joe Moreno is a Director of Development at CANA Advisors. You can follow him at joemoreno.com or contact him via email at firstname.lastname@example.org.