Guest blog post by Charlie Silver, CEO of Algebraix Data. Originally entitled 'Data Algebra Does Big Data'.
Algebra is powerful. It enables people to solve for unknowns and frame problems in ways that are universally understandable. For the same reason, data algebra is powerful. Why? Because it can represent data – all data – mathematically.
What is Data Algebra (and when do I use it)?
Data algebra starts small. It designates the fundamental unit of data as a couplet, which you can think of as a value (for example, “28”) associated with a qualifier (for example, “countries”). The value alone has no meaning, but by attaching a qualifier – another item of data that reveals the meaning of the value – you have a couplet, a structure that is well-defined in mathematical terms and can readily be treated mathematically.
If you write the couplet as (28, countries), it might indicate that there are 28 countries in the European Union, as indeed there are. But it might not. It might indicate that there are 28 countries in NATO. Or that you have visited 28 different countries in the last decade, and so on.
In other words, to add context to the data you need to qualify the couplet again. And to store the data in a computer, you need to add another qualifier that says where the stored data is located, so you can retrieve it whenever you need to.
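As a rough illustration of the idea above, a couplet can be modeled as a small pair type whose value may itself be another couplet. This is only a sketch: the `Couplet` class, the `unwrap` helper, and the storage label `disk-block-17` are hypothetical names made up for this example, not part of any Algebraix product.

```python
from typing import Any, NamedTuple


class Couplet(NamedTuple):
    """A value paired with a qualifier that gives it meaning (hypothetical type)."""
    value: Any
    qualifier: Any


# The bare couplet from the text: 28 of something called "countries".
count = Couplet(28, "countries")

# Qualifying the couplet again adds context: which 28 countries?
eu = Couplet(count, "European Union")
nato = Couplet(count, "NATO")

# One more qualifier records where the data is stored (made-up location label).
stored = Couplet(eu, "disk-block-17")


def unwrap(c):
    """Return the raw value and its qualifiers, outermost qualifier first."""
    qualifiers = []
    while isinstance(c, Couplet):
        qualifiers.append(c.qualifier)
        c = c.value
    return c, qualifiers


print(unwrap(stored))
```

The two qualified couplets `eu` and `nato` share the same inner value but are distinct objects of data, which is the point of the second qualifier.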
Mathematically, the unit of data doesn’t get much more complicated than that. And from a mathematical perspective, it’s useful because it can rigorously define data. But if all you’re dealing with is a few items of data, or even a few hundred items, defining it mathematically is overkill. That amount can simply be written down in a document.
When the numbers go up, and the relationships between the various types of data get a little more complicated, applying math is still unnecessary – a spreadsheet can manage the task. Not only can a spreadsheet store larger amounts of data but it lets you manipulate the data in useful ways. Nowadays, a spreadsheet can easily accommodate 100,000 rows of data.
You can perform various mathematical operations on data in a spreadsheet, such as counting the occurrence of particular values, grouping it in various ways, adding up values, and more, but this is not the same as defining and manipulating it algebraically.
When it comes to graphical data – that is, data expressing specific relationships between data entities – a spreadsheet is less useful, even for relatively small volumes of data. But there’s another option: switching to a graph database. This approach lets you process graphical data in productive ways. In this sense, a graph database is not that different from a spreadsheet because both provide useful capability.
Think of these situations as managing “little data,” and the software that exists right now is good enough for using relatively small amounts of data productively.
Managing “Big Data”
Data algebra becomes applicable when data complexity and volumes start to sharply increase. That is, when you’re dealing with Big Data. For example, let’s say you want to select a set of data from a large database. Data algebra can define the data file precisely, and then define the query you want to run against the data precisely, and finally deliver the answer precisely – and do it all rapidly.
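To make "define the data precisely, then define the query precisely" a little more concrete, here is a minimal sketch that treats each record as a set of couplets and a query as ordinary set operations. The helper names `record`, `restrict`, and `project`, and the three sample rows, are assumptions chosen for illustration; this is not how any particular database or the Algebraix software is implemented.

```python
def record(**fields):
    """Build a record as a frozenset of (value, qualifier) couplets."""
    return frozenset((value, qualifier) for qualifier, value in fields.items())


# A data set is simply a set of records (illustrative rows only).
data = {
    record(country="France", bloc="EU"),
    record(country="Norway", bloc="NATO"),
    record(country="Spain", bloc="EU"),
}


def restrict(dataset, condition):
    """Keep the records that contain every couplet in `condition`."""
    return {r for r in dataset if condition <= r}


def project(dataset, qualifiers):
    """Keep only the couplets whose qualifier is in `qualifiers`."""
    return {frozenset(c for c in r if c[1] in qualifiers) for r in dataset}


# "Which countries are recorded under the EU bloc?" as restrict-then-project.
eu_names = project(restrict(data, record(bloc="EU")), {"country"})
print(eu_names)
```

Because both the data and the query are plain sets, the result is fully determined by set membership and subset tests, which is the kind of precision the paragraph above describes.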
These processes are what database software is designed to do, by employing statistical techniques and clever algorithms that try to determine the fastest way to get the data. However, there’s a limit to how much database software can do. As time goes by and data volumes get larger, older database products run into trouble because of assumptions that were made in their design. The nature of hardware changes. The speed of CPUs changes. The speed of memory changes. And storage changes (witness the recent emergence of solid-state storage). Older software has trouble keeping up with all these changes, so new database software has to be developed.
Today, there are well over 200 different database software products that run the gamut from very old to very new. Regardless of age, all these products are trying to solve the same problem: how to store and retrieve Big Data as quickly and efficiently as possible.
The Difference That Makes a Difference
If you try to write a job description for a database, it becomes clear that it has to solve multiple problems.
That’s a lot of difficult and important problems to solve, which is why using data algebra can make such a significant difference. By representing data algebraically, you can define everything in the computer domain mathematically, including the capacity and speed of hardware, the speed of software, the workloads being executed, the service level required for any given transaction, and so on.
Data algebra covers everything with mathematics, and this makes it possible to build software that is optimized for specific situations because you can prove it mathematically.
The fact is that even the most talented software engineers will be outdone by mathematics. Big Data may seem massive right now, but it’s actually in its early days. As data volumes and problems grow, a mathematical approach will become a necessity. Math already dominates in other spheres of engineering, and it is only a matter of time before it dominates the engineering of Big Data software. The algebra of data will become the foundation for the data economy.
Comment
I would reframe the contest as being between reductionist theorists or systems theorists.
"Math or Engineers -- who will solve the big data problem?" is an interesting proposition. Knowledge of engineering and/or maths does help. Based on my own experience, the most essential know-how required to address the big data problem is DOMAIN knowledge of the subject matter, coupled with software development expertise and database experience; otherwise big data projects will fail. This is my two cents.
There was an article, “Artificial intelligence, logic, & formalizing common sense” (1989). In the article, the need for logic in computer science was stated. There is a question. Who will write the algebra for data? An algebra is for analysis and not so for analytic.
Sione ... I appreciate your reply, but the data algebra discussed in this post isn't about matrix algebra.
Yes, the majority of techniques in machine learning involve matrix algebra.
Are there case studies to support the claims about the benefits of data algebra?
The use of algebra in predictive analytics and dimensionality reduction is a strong indication that the world of big data cannot isolate itself from mathematics. Dimensionality reduction relies on the principles of matrix algebra to project high-volume data into a low-dimensional space for proper visualization.
I agree with the title that algebra is a powerful tool in analytics, whether it's a machine learning, statistics, bioinformatics, engineering, or physics analytical task, where a large number of the techniques adopted rest on matrix factorization operations such as SVD, Cholesky, LU, QR, and eigendecomposition, which are core topics in algebra.