Data Architecture: How to build the castle?

“Architecture is frozen music.” This famous quote is from the 18th-century writer Johann Wolfgang von Goethe. The statement reveals a quality of architecture as a creative discipline. Both architecture and music are wide open to interpretation; however, both intrinsically bind things in harmony. Data architecture is a symphony of data that is collected, stored, arranged, integrated, and put to use in corporations. We covered the definitions of data architecture, and the need for it, with a house analogy in my last blog, “Data Architecture – What is it and why should we care?”. In this article, we will look at how to put together the fortress of data architecture.

Architecture frameworks such as TOGAF, Zachman, and DoDAF offer us a method to think about systems and architecture. Although plenty of frameworks are available (consortium-developed, proprietary, defense industry, government, and open source), one should use them judiciously, because it is easy to do more than is necessary. Many research papers argue that EA frameworks are theoretical and impractical to carry out in full. With this in mind, experts agree that a foundational set of artifacts is needed to document data architecture. Organizations decide on this foundational set based on the potential value the artifacts provide and the investment required to create them. These artifacts are an integrated set of specifications that define data requirements, guide the integration and control of data assets, and align with the business’s information needs and strategy. An architect must ensure coherency and integrity across the artifacts created, whether diagrams, data models, or other documents.

DAMA DMBOK divides enterprise data architecture artifacts into three broad categories of specifications:

  1. The enterprise data model: The heart and soul of enterprise data architecture,
  2. The information value chain analysis: Aligns data with business processes and other enterprise architecture components, and
  3. Related data delivery architecture: Including database architecture, data integration architecture, data warehousing/business intelligence architecture, document content architecture, and meta-data architecture.

In practice, enterprise architecture includes data, process, application, technology, and business architecture. Business architecture may include goals, strategies, principles, projects, roles, and organizational structures. Process architecture covers processes (functions, activities, tasks, steps, flow, products) and events (triggers, cycles). Application architecture covers macro-level and micro-level application component architecture across the entire application portfolio, governing the design of components and interfaces, such as a service-oriented architecture (SOA).


To create the data architecture, one has to define business information needs. The core of any enterprise data architecture is an enterprise data model (EDM). The EDM is an integrated, subject-oriented data model defining the essential data created and consumed across the enterprise. Building the enterprise data model is the first step in establishing those needs and data requirements. Organizations cannot build an EDM overnight. Each strategic and enhancement project should contribute to building it piece by piece. Every project that touches the organization’s data assets should, within its limited scope, classify the inputs and outputs it requires. These details should list data entities, data attributes, and business rules. One can then organize them by business unit and subject area. Proper categorization and completeness are key to building the enterprise data model.
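As a rough illustration of building the EDM piece by piece, the sketch below merges the entities, attributes, and business rules that each project contributes into one shared model, grouped by subject area. The class names, project names, and sample entities are all hypothetical, made up for this example; this is not a DAMA-prescribed structure.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One business entity with its attributes and business rules."""
    name: str
    subject_area: str
    attributes: set = field(default_factory=set)
    business_rules: set = field(default_factory=set)


class EnterpriseDataModel:
    """Accumulates entity definitions contributed by individual projects."""

    def __init__(self):
        self._entities = {}  # entity name -> Entity

    def contribute(self, project, entities):
        """Merge one project's in-scope entities into the shared model."""
        for incoming in entities:
            known = self._entities.get(incoming.name)
            if known is None:
                # First project to mention this entity: store a copy.
                self._entities[incoming.name] = Entity(
                    incoming.name,
                    incoming.subject_area,
                    set(incoming.attributes),
                    set(incoming.business_rules),
                )
                continue
            if known.subject_area != incoming.subject_area:
                # Categorization conflicts are surfaced, not silently merged.
                raise ValueError(
                    f"{project}: {incoming.name} is already filed under "
                    f"{known.subject_area}"
                )
            known.attributes |= incoming.attributes
            known.business_rules |= incoming.business_rules

    def by_subject_area(self):
        """Return sorted entity names grouped by subject area."""
        grouped = defaultdict(list)
        for entity in self._entities.values():
            grouped[entity.subject_area].append(entity.name)
        return {area: sorted(names) for area, names in grouped.items()}

    def attributes_of(self, name):
        """Return the sorted attribute names known for one entity."""
        return sorted(self._entities[name].attributes)


# Two hypothetical projects, each describing only what is in its own scope.
edm = EnterpriseDataModel()
edm.contribute("crm-upgrade", [
    Entity("Customer", "Party", {"customer_id", "name"},
           {"customer_id is unique"}),
])
edm.contribute("billing-revamp", [
    Entity("Customer", "Party", {"customer_id", "billing_address"}),
    Entity("Invoice", "Finance", {"invoice_id", "customer_id", "amount"}),
])
print(edm.by_subject_area())
print(edm.attributes_of("Customer"))
```

Raising an error on a subject-area conflict is one possible design choice; it reflects the point that proper categorization matters as much as completeness.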

In Zachman Framework terms, the data model takes a different form at each level of perspective:

Planner View (Scope Contexts): A list of subject areas and business entities.

Owner View (Business Concepts): Conceptual data models showing the relationships between entities.

Designer View (System Logic): Fully attributed and normalized logical data models.

Builder View (Technology Physics): Physical data models optimized for constraining technology.

Implementer View (Component Assemblies): Detailed representations of data structures, typically in SQL Data Definition Language (DDL).

Functioning Enterprise: Implemented instances.

The enterprise data model by itself is not enough. The data model is part of the overall enterprise architecture. It is important to understand how data relates to business strategy, organization, process, application systems, and technology infrastructure. In forthcoming articles, we will go over EDM, information value chain analysis, data delivery architecture and some additional aspects of data architecture.


Data Architecture: What is it and Why should we care?

Basic tenets of Data architecture

Data Architecture, as understood by most in the industry, has many different definitions. Here is what Wikipedia says: “In information technology, data architecture is composed of models, policies, rules or standards that govern what data is collected, and how it is stored, arranged, integrated, and put to use in data systems and within the organization. Data is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture”. I would define Data Architecture as a discipline that deals with designing, constructing, and integrating an organization’s data assets so that they are optimized for the organization to run its business.

Data Architecture Vs House Architecture

Enterprise architecture is often compared with the architecture of a house, which defines the individual design elements of every nook and corner so the house can be built to specification. Similarly, data architecture is a design that defines the way data enters the organization, lives in its systems and applications, moves within the organization, and is consumed to run the business. It is a blueprint of the data design. Most enterprises deal with unmanageable data sprawl that continues to grow at tremendous speed. This “just a bunch of data”, or JBOD, is a major driving force behind the need for a data strategy and an enterprise data architecture. Data architecture is much like a road network for reaching the IT goal, and thus the business goal, while data strategy defines “how” to reach that goal. In principle, data architecture defines a framework that helps organize data ingestion, data storage and management, and data processing. The components of this architecture include data integration, DBMS selection, data modeling, performance and measurement, security and privacy, business intelligence, metadata, data quality, and data governance.

The analogy of house design (see the table below) works well in this context for understanding which components need attention when we talk about data architecture.

Data Architecture | House Architecture

Policies, Rules, and Standards | Code of house building
Policies, rules, and standards are the first things required to build the house. One can use existing industry-standard frameworks such as TOGAF, Zachman, etc. These are the guard rails for the full life cycle of the data within the organization.

Data Subject Areas (Inventories) | Naming the space, e.g. living room, kitchen
Naming each space helps to know what types and categories of data are being used in the organization. It is the inventory of the data space an organization has.

Data Models | House plan
The data model is the actual diagram of the various data entities at the conceptual, logical, and physical levels. These are the detail levels of data classification that help with collaborating on, implementing, and testing data specifications on the system.

Meta-data | Room specifications
This is the lowest and last level of detail about the data, describing the properties of the data being stored. This is where data context is added.

Integration | Utilities hookup
This is where data movement between systems is handled. An integration plan covers what data is transferred, how it is transferred, and how it is managed in flight.

Data | Residents of the house
Data is the resident of the house: it lives and moves, and is archived, deleted, and updated. A great data architecture plans for organic data growth for the foreseeable future.

So why do we need a road network to reach our goals? Well… because we all value order over chaos. We prefer following etiquette to having no rules of conduct. Okay… but what help will this “order” provide us, and how? This question is simple to answer but rather difficult to execute. The enterprise data architecture helps us onboard data quickly and deliver clean, trustworthy data at the speed required by patrons and the business. It also ensures that data is handled in a more secure and compliant way, which may be required by local laws and regulations. It makes it easier to incorporate new data types and technology, and it enables end-user self-service. The list goes on, but the cardinal point is that data architecture ties it all together.

However, it is unwieldy to build from scratch an enterprise data architecture that can meet all our needs. A more pragmatic approach is to build toward the future state of the architecture in each new strategic business initiative. More in the next issue…

Master Data Management: Cloud Vs on-premises

Cloud computing is no longer new to businesses. The growth of cloud services, cloud data, and cloud usage continues unabated. More and more organizations are adopting cloud computing infrastructure to exploit the inherent benefits of the cloud, which are not limited to greater business agility but extend to access to the best available technology of the time and to business efficiency, along with good economic value. Arguably, the cloud is more cleanly architected and free of baggage and legacy problems. However, cloud MDM has not gained enough momentum to be called a trend. According to Gartner, there are several inhibitors. Gartner claimed in its report “The impact of cloud-based Master Data Management Solutions”: “As a percentage of Gartner’s MDM inquiry calls with clients, interest in cloud-based MDM cluster around a maximum of approximately 6%. As the level of interests varies widely across client interactions, the rate of actual adoption is very likely to be far lower than 6%.”

We will explore these points in this article and analyze where we are on the cloud MDM roadmap. If you are on your MDM journey and at the crossroads of deciding whether cloud MDM is right for your organization, this article will help you find out.

So, what are the differences between the two offerings, cloud-based MDM and on-premises MDM? Cloud MDM offers the benefits of software as a service (SaaS) without the cost of the associated hardware and software, and with no maintenance to pay for. The cost of provisioning hardware is sometimes prohibitive enough for the business to ditch the big capital expenditure (CapEx) in favor of other, less critical investments. The subscription-based licensing model of the cloud helps organizations flatten their IT spending and adopt an operational expenditure (OpEx) model. One of the many benefits of SaaS is increased operational agility in deploying the technology, which lets the business perform activities sooner and realize organizational goals. It also saves the company time and effort in infrastructure administration if that is not the organization’s main strength. Many organizations would rather focus on their core business activities and the services they provide to their customers than on services they consume, such as MDM, and let expert professionals (cloud providers) manage those for them.

Architects and experts often argue about the security of the cloud and are reluctant to store sensitive proprietary data outside the company’s firewall. The argument that the cloud is inherently less secure could not be further from the truth. It is like arguing for keeping cash at home instead of in a bank. It is now commonly accepted that banks are more secure and have the resources to keep cash safe. Although there have been incidents at banks, no one argues that banks are less safe than one’s home. Today, cloud providers such as Amazon and Microsoft have deployed the best resources to make security their priority, more than any one company can afford to do. With the number of data hacks at an all-time high in 2016, incidents of on-premises attacks far outnumbered incidents of cloud attacks, and those cloud attacks had nothing to do with security vulnerabilities in the cloud itself. The well-known data breaches at Target in 2013 and Home Depot in 2014, as well as the Apple iCloud hack, came down to human error, which the cloud cannot protect one from. However, an effort should be made to select an established, certified, compliant cloud provider that takes security very seriously and is willing to go the extra mile to protect customers’ data.

Commissioning on-premises MDM requires companies to budget for the maximum projected hardware capacity. With cloud MDM, organizations can elastically scale processing, storage, and software licensing up and down. Any new SaaS application the organization requires can be fired up far more easily on the cloud than on-premises. Usually, enterprises need applications for customer relationship management (CRM), enterprise resource planning (ERP), and other back-office efforts. Most of the prominent applications, such as Salesforce CRM and NetSuite ERP, are already on the cloud. Cloud is where integration belongs. One may, however, argue that adopting cloud MDM would result in a complex set of point-to-point integrations of applications across firewalls if the organization lacks a mature data integration architecture. While this could be a concern, it is also a huge opportunity for the organization to think seriously about a centralized integration point (a governed data integration layer) for its cloud integrations if it wants to pursue a cloud-first strategy. However, a piecemeal approach would suit organizations that have a large and complex operational system landscape to deal with, whether on-premises, on the cloud, or a mix of both.

Cloud-based SaaS applications deliver new functionality on a regular basis; for example, Informatica MDM releases new versions twice a year. The organization may decide when to upgrade, and the new features become available overnight. This offers great savings potential through no-impact upgrades. However, backward compatibility with previous production releases may not be provided by the MDM vendor, or may be available only for a limited number of older releases. In that case, the organization should be prepared to move quickly to upgrade other integrated technologies to keep up with the newest version of cloud MDM. Updated technology and newer infrastructure will be at the organization’s disposal, transparently, as soon as they are available on the cloud.

Organizations that operate globally and have customers across the globe should keep local data regulations and security laws in mind when deciding on cloud MDM. Not all cloud providers offer cloud hosting outside the US. Some of them started thinking about it only recently, against the backdrop of the EU General Data Protection Regulation (GDPR) coming into effect in May 2018, but most of them are still getting ready. In addition, the October 2015 ruling of the European Court of Justice that invalidated the US-EU Safe Harbor agreement poses another problem for companies storing and transferring EU citizens’ data outside Europe. Hosting local customers’ Personally Identifiable Information (PII) within the country’s boundaries may be a requirement for your company. Analyzing the PII data elements your organization would like to keep in cloud MDM, and checking whether country-specific data residency requirements apply to them, will help determine your local cloud hosting requirements. If such requirements apply, the organization should find a cloud provider that offers locally hosted cloud MDM in that country or region. Some companies, such as Amazon, have been more proactive than others and have adopted the EU-US Privacy Shield framework, which allows US companies to meet the requirements of the European Court of Justice.
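The residency analysis described above can be sketched as a simple check. Everything in this example is an assumption made for illustration: the country labels are placeholders, and the rule table and PII classification are invented, not statements about any real law or provider. A real assessment needs legal review.

```python
# Hypothetical rule table: placeholder countries assumed, for illustration
# only, to require that residents' PII be hosted in-country.
RESIDENCY_REQUIRED = {"COUNTRY_A", "COUNTRY_B"}

# Hypothetical classification of master data attributes as PII.
PII_ATTRIBUTES = {"name", "email", "national_id", "date_of_birth"}


def local_hosting_needs(mastered_attributes, customer_countries):
    """Return the countries where a locally hosted cloud MDM is needed."""
    holds_pii = bool(PII_ATTRIBUTES & set(mastered_attributes))
    if not holds_pii:
        # No PII in the hub, so residency rules for PII do not apply.
        return set()
    return RESIDENCY_REQUIRED & set(customer_countries)


def uncovered_countries(needs, provider_regions):
    """Return the required countries the provider cannot host in."""
    return set(needs) - set(provider_regions)


needs = local_hosting_needs(
    mastered_attributes=["name", "email", "loyalty_tier"],
    customer_countries=["US", "COUNTRY_A", "COUNTRY_C"],
)
gaps = uncovered_countries(needs, provider_regions=["US", "COUNTRY_B"])
print(needs, gaps)
```

A non-empty `gaps` set would mean the candidate provider does not offer hosting everywhere it is required, which is the point at which a different provider, or the on-premises option, comes back into consideration.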

A holistic assessment of the drivers and inhibitors is necessary to decide whether cloud MDM is aligned with the organization’s long-term business and IT strategy. Most corporations, when in doubt, fall back on the on-premises option because it is tried and tested and has been around for years, if not decades. There is nothing wrong with that if your business sees high risks in adopting the cloud model. For further reading on how to develop an assessment scorecard based on a sensitivity matrix and how to develop “what if” scenarios, see Gartner’s report “Five factors for planning Cloud-Enables MDM”.

Lambda Architecture

Nathan Marz, who also created Apache Storm, came up with the term Lambda Architecture (LA). Although there is nothing Greek about it, I think it is called so primarily because of its shape. It is a data processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. LA is an approach to building stream processing applications on top of MapReduce, Storm, or similar systems. This architecture has become very popular in the big data space, with companies such as LinkedIn, Twitter, and Amazon.


The Lambda Architecture pattern addresses the problem of speed on big data. It suits applications where there are delays in data collection and in availability through dashboards, and where older data sets must remain valid for online processing in order to find behavioral patterns as users need. One of the basic requirements of LA is an immutable data store, which only appends data rather than applying the updates and deletes of CRUD operations. The downside of relying on this immutable data store is that batch processing over it is not real time. Although batch processing will get faster with time, the volume of data grows at the same pace, if not faster. BI and delivery-layer applications expect to access data in real time and cannot rely entirely on batch processing to finish.
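To make the idea of an immutable, append-only store concrete, here is a minimal in-memory sketch. The class and function names are hypothetical, and a real deployment would use a distributed log or file system rather than a Python list. A change is recorded as a new fact, and the current state is derived by replaying the history.

```python
class AppendOnlyLog:
    """Immutable master dataset: records are appended, never changed."""

    def __init__(self):
        self._records = []

    def append(self, key, value, timestamp):
        """Record a new fact; there is no update or delete operation."""
        self._records.append((timestamp, key, value))

    def records(self):
        # Return a tuple so callers cannot mutate the stored history.
        return tuple(self._records)


def current_view(log):
    """Derive the latest value per key by replaying the log in time order."""
    view = {}
    for _timestamp, key, value in sorted(log.records(), key=lambda r: r[0]):
        view[key] = value
    return view


log = AppendOnlyLog()
log.append("user-1:city", "Pune", timestamp=1)
log.append("user-2:city", "Oslo", timestamp=2)
# A change of address is a new fact, not an UPDATE of the old record.
log.append("user-1:city", "Austin", timestamp=3)
print(current_view(log))
```

Because nothing is overwritten, the full history stays available for recomputation, at the cost of having to replay it, which is where the batch latency discussed above comes from.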

The way it works is that an immutable sequence of records is captured and fed into the batch system and the stream processing system in parallel. The transformation logic is implemented twice, once in the batch system and once in the stream processing system. The results from both systems are then stitched together at query time to present the final answer.
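The flow just described can be sketched in a few lines. This is a toy, single-process illustration with hypothetical names, not how MapReduce or Storm are actually wired together. For brevity, one counting function stands in for the transformation logic on both paths, whereas in a real deployment that logic is typically written separately for each system.

```python
from collections import Counter


def count_views(events):
    """The transformation logic: page-view counts per page."""
    return Counter(page for _timestamp, page in events)


class LambdaCounter:
    """Toy batch layer, speed layer, and query-time merge in one object."""

    def __init__(self):
        self.master = []             # immutable, append-only record sequence
        self.batch_view = Counter()  # precomputed from all history, high latency
        self.speed_view = Counter()  # covers records the batch has not seen yet

    def ingest(self, timestamp, page):
        """Feed each record to the batch store and the stream path in parallel."""
        event = (timestamp, page)
        self.master.append(event)
        self.speed_view.update(count_views([event]))

    def run_batch(self):
        """Recompute the batch view from scratch, then expire the speed view."""
        self.batch_view = count_views(self.master)
        self.speed_view = Counter()

    def query(self, page):
        """Stitch both views together at query time."""
        return self.batch_view[page] + self.speed_view[page]


lc = LambdaCounter()
lc.ingest(1, "/home")
lc.ingest(2, "/home")
lc.run_batch()            # batch view now covers the first two records
lc.ingest(3, "/home")     # arrives after the batch run; only the speed view has it
lc.ingest(4, "/pricing")
print(lc.query("/home"), lc.query("/pricing"))
```

Expiring the speed view after each batch run keeps the two views from double counting. In this sketch that is trivial because the batch run is synchronous; in a real system the hand-off between the two layers is the delicate part.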

So why is there so much buzz about Lambda Architecture these days? Well… most likely because the data space is becoming more complex and business expectations of quick data insights have risen, so there is a need to build low-latency processing systems. What we have at our disposal is a scalable, high-latency batch system that can process historical data and a low-latency stream processing system that can process recent data. By merging the two, we can actually build a workable solution.