In the new era of Big Data and Data Sciences, it is vitally important for an enterprise to have a centralized data architecture aligned with business processes, which scales with business growth and evolves with technological advancements. A successful data architecture provides clarity about every aspect of the data, which enables data scientists to work with trustable data efficiently and to solve complex business problems. It also prepares an organization to quickly take advantage of new business opportunities by leveraging emerging technologies and improves operational efficiency by managing complex data and information delivery throughout the enterprise.
When compared with information architecture, system architecture, and software architecture, data architecture is relatively new. The role of Data Architects has also been nebulous and has fallen on the shoulders of senior business analysts, ETL developers, and data scientists. Nonetheless, I will use Data Architect to refer to those data management professionals who design data architecture for an organization.
When talking about architecture, we often think about the analogy with building architecture. A conventional building architect plans, designs, and reviews the construction of a building. The design process involves working with the clients to fully gather the requirements, understanding the legal and environmental constraints of the location, and working with engineers, surveyors and other specialists to ensure the design is realistic and within the budget. The complexity of the job is indeed very similar to the role of a data architect. However, there are a few fundamental differences between the two architect roles:
Given all these differences, a data architect could still learn from building architects and, in particular, take their top-down approach to improve data architecture design. In many organizations, there has been a lack of systematic, centralized, end-to-end data architecture designs. Below lists some of the main reasons:
With these shortfalls, we often see a company with disjointed data systems and gaps between teams and departments. The disparities lead to the poor performance of the systems with many hand-offs, a long time to troubleshoot when a production data issue arises, a lack of accountability to reach the right solution across systems, and a lack of capability to assess the impact of a change. Lastly, the disjointed systems could cause tremendous effort to analyze and research when migrated or re-engineered to the next-gen platform.
Given all these, a successful enterprise needs to have a top-down coherent data architecture designed based on the business processes and operations. In particular, just like what a building architect does, an enterprise data architect needs to build a blueprint at the conceptual and logical level first, before applying the technologies to the detailed application designs and implementations.
1. Conceptual Level Data Architecture Design based on Business Process and Operations
In modern IT, business processes are supported and driven by data entities, data flows, and business rules applied to the data. A data architect, therefore, needs to have in-depth business knowledge, including Financial, Marketing, Products, and industry-specific expertise of the business processes, such as Health, Insurance, Manufacturers, and Retailers. He or she can then properly build a data blueprint at the enterprise level by designing the data entities and taxonomies that represent each business domain, as well as the data flow underneath the business process. In particular, the following areas need to be considered and planned at this conceptual stage:
This conceptual level of design consists of the underlying data entities that support each business function. The blueprint is crucial for the successful design and implementation of Enterprise and System architectures and their future expansions or upgrades. In many organizations, this conceptual design is usually embedded in the business analysis driven by the individual project without guidance from the perspective of enterprise end-to-end solutions and standards.
2. Logical Level Data Architecture Design
This level of design is sometimes called data modeling by considering which type of database or data format to use. It connects the business requirements to the underlying technology platforms and systems. However, most organizations have data modeling designed only within a particular database or system, given the siloed role of the data modeler. A successful data architecture should be developed with an integrated approach, by considering the standards applicable to each database or system, and the data flows between these data systems. In particular, the following 5 areas need to be designed in a synergistic way:
The naming conventions and data integrity
The naming conventions for data entities and elements should be applied consistently to each database. Also, the integrity between the data source and its references should be enforced if the same data have to reside in multiple databases. Ultimately, these data elements should belong to a data entity in the conceptual design in the data architecture, which can then be updated or modified synergistically and accurately based on business requirements.
Data archival/retention policies
The data archival and retention policies are often not considered or established until every late-stage on Production, which caused wasted resources, inconsistent data states across different databases, and poor performance of data queries and updates. To enforce the data integrity, data architects should define the data archival and retention policy in the data architecture based on Operational standards.
Privacy and security information
Privacy and security become an essential aspect of the logical database design. While the conceptual design has defined which data component is sensitive information, the logical design should have the confidential information protected in a database with limited access, restricted data replication, particular data type, and secured data flows to protect the information.
Data Replication is a critical aspect to consider for three objectives: 1) High availability; 2) Performance to avoid data transferring over the network; 3) De-coupling to minimize the downstream impact. Excessive data replications, however, can lead to confusion, poor data quality, and poor performance. Any data replication should be examined by data architect and applied with principles and disciplines.
Data Flows and Pipelines
How data flows between different database systems and applications should be clearly defined at this level. Again, this flow is consistent with the flow illustrated in the business process and data architect conceptual level. Besides, the frequencies of the data ingestion, data transformations in the pipelines, and data access patterns against the output data should be considered in an integrated view in the logical design. For example, if an upstream data source comes in real-time, while a downstream system is mainly used for data access of aggregated information with heavy indexes (e.g., expensive for frequent updates and inserts), a data pipeline needs to be designed in between to optimize the performance.
3. Continuous Governance is the Key to the Success of Data Architecture.
As data architecture reflects and supports the business processes and flow, it is subject to change whenever the business process is changed. As the underlying database system is changed, the data architecture also needs to be adjusted. The data architecture, therefore, is not static but needs to be continuously managed, enhanced, and audited. Data governance, therefore, should be adopted to ensure that enterprise data architecture is designed and implemented correctly as each new project is being kicked off.
Within a successful data architecture, a conceptual design based on the business process is the most crucial ingredient, followed by a logical design that emphasizes consistency, integrity, and efficiency across all the databases and data pipelines. Once the data architecture is established, the organization can see what data resides where and ensure that the data is secured, stored efficiently, and processed accurately. Also, when one database or a component is changed, the data architecture can allow the organization to assess the impact quickly and guides all relevant teams on the designs and implementations. Lastly, the data architecture is a live document of the enterprise systems, which is guaranteed to be up-to-date and gives a clear end-to-end picture. In summary, a holistic data architecture that reflects the end-to-end business process and operations is essential for a company to advance quickly and efficiently while undergoing significant changes such as acquisitions, digital transformation, or migration to the next-gen platform.