Data storage is often oversimplified at the application development level. At the very least, whatever database management system (DBMS) is in use should consume and emit rows of data on command. However, when in use, the size of a database and the specific functionality overlaying application(s) often bring about new challenges and ideas. Of particular interest is the column store versus row store conundrum.
The bigger a database grows in relation to the access memory allocated to individual app nodes, the more focus is placed on how data is stored. In such cases, architectural changes to data stores can lead to impressive results. This brings us to the focus of this post—exploring how storing data on a column basis differs from the mainstream row storage approach. We’ll also review use cases that count when making the decision to use either column store versus row store.
Using a columnar data storage form inherits attributes to a database that were once thought impossible. We’ll get to these shortly. For now, let’s open by deconstructing the notion of column data storage.
Columnar Storage: The Concept
The best way to explain how column store works would be from the moment of data entry. Think of how you would store the details of a website in a database. SQL or NoSQL, the table or collection to fill classifies various attributes of any website into columns. These could be domain, owner, domain authority, Date published, and more.
Saving a batch of websites into such a database’s tables typically involves filling out each column with a single website’s details. You’d thus move from one column to the next. This is how you end up with a row store.
Reading such data follows the same format. Your applications would read the data from left to right as you would with a book.
In contrast, the column store approach acknowledges that each entity has columnar attributes and groups together the total column details. This allows entry into a table as though each column stood apart from the rest.
This method is self-indexing, which means that no extra storage is taken up when entering data.
Storing data this way comes with advantages and disadvantages. Let’s explore a few of each.
Advantages of Columnar Storage
Quicker Data Lookup Turnaround Times
When searching for data, every SQL command follows this syntax:
Select website_name from websites_table where domain like %.org%
When executed, this and many other variations of SQL queries analyze the entire database, from the first row down to the last. However, they only print out the column (website_name) values that match the attribute specified (.org domains). Keep in mind that all reading will happen following the same way data was saved: from left to right. This way of running searches wastes a lot of time clearing every column that doesn’t fit from the search results.
When a database is arranged in a column store basis, the search process is completely different. Once the column to look under is specified, the rest is intentionally overlooked. Already, this is a time-saving maneuver. When searching through the columnar database, every column value matching the query arguments is returned. Since the scan happens vertically, the first value found is temporarily assigned index 0. Matches that follow take on proceeding index values. This helps match any corresponding return-value pairs from different columns in the same table.
Saving Space Through Compression
The fact that every column validates data before storage implies uniform types from top to bottom. This is an opportunity for column store databases to use compression. This doesn’t happen with row store databases since any single block of data has as many data types as specified in the schema.
Another space-saving aspect in columnar databases results from data denormalization. Instead of having two tables that separate data about the same entities, columns can exist in the same space. This drops columns like foreign keys, which only got to be as a result of normalization.
Disadvantages of Columnar Storage
Slow Data Entry Process
The fact that data pours into the storage space on a column basis makes the process slow in comparison to row store. Small applications that take up data on a single line-item basis don’t show this disparity as much as data migration or upload processes. This delay is particularly noticeable when handling OLTP processes.
Low Interconnection of Results
When you’re looking for database search results that show interconnected columns of data, you’re best using row store methods. For example, the result set [website name, website URL, date published, last updated] connects most (if not all) columns in a database. This forms the same row that would have been entered the row store way. Fetching and displaying these sorts of results almost always happen faster with row store than column store databases.
You won’t find a database that supports both methods, at least for now. As such, if you had to pick one database storage approach between row and column, you’d have to consider specific use cases.
How to Choose: Column Store vs. Row Store
With the shortcomings of column store exposed, it makes sense why some microservice setups place it at the front end of their applications, while relational row store databases dominate the core function of said applications. Such a decision only comes to be after careful analysis of how much of a fit each option is in your situation. Let’s discuss a few such cases for both sides.
Row Store Use Cases
The following application areas make row store database options the best option.
Case 1: Transactional Applications
Transactional storage requirements are those that associate all columns in a table row to the state of a data object. MySQL databases are typically used for these applications. As an example, consider how the WordPress CMS handles data. Every object is somehow connected to the rest of the instance. So much that changes in the relational database affects even the front end.
Case 2: Modest Dataset Request Situations
Databases should have fixed views for users to access stored objects. When you run queries at any given time in the course of using row store databases, the results are often entire row details. That’s not to say that you can’t have entire pages of data spewed out from the universal. But such operations execute at optimal speed when the data to retrieve is stored in rows.
Column Store Use Cases
On the other end of the spectrum lie column store databases. These also have use cases for which they make perfect sense.
Case 1: Rapidly Scaling Datasets
You’re better off going the column store route if your database will inflate in a short period of time. This improves data agility for rapidly growing databases. The growing size doesn’t eat into query response speed.
Case 2: BI and Analytical Applications
Business intelligence and other analytical tools (OLAP) perform at their best when hooked to column store databases. The analytical nature often means drilling down to a single column data value and rendering trends over time. Such an operation should happen at real-time speed. This feat is only possible when less time is spent scouring an entire database each time dashboard frames refresh.
Applications known to use column store databases include Scalyr, a log management tool optimized for fast search and visualization of app metrics.
The Takeaway
Before you declare which database management system you should use with your apps, first consider how either column or row storage affects performance. Sometimes saving data on a column basis results in a better overall user experience. Yet sometimes, the left-to-right row store approach works best. Either way, both present many database options to choose from. The speed of search results, how quickly you intend to scale, and the interwoven nature (or degree) of your data all come into play. Better yet, you could find a use for both types and turn the column store versus row store argument on its head.
This post was written by Taurai Mutimutema. Taurai is a systems analyst with a knack for writing, which was probably sparked by the need to document technical processes during code and implementation sessions. He enjoys learning new technology and talks about tech even more than he writes.