What is data governance?
Data governance includes the processes and policies that ensure data is in the proper condition to support business initiatives and operations. Modern organizations collect data from various sources at scale to enhance operations and service delivery. However, data-driven decision-making is effective only when data meets required quality and integrity standards.
Data governance determines roles, responsibilities, and standards for data usage. It outlines who can take what action, upon what data, using what methods, and in what situations. With more data being used to support artificial intelligence (AI) and machine learning (ML) use cases, it has become critical that all data usage meets regulatory and ethical requirements. Data governance balances data security with tactical and strategic objectives to ensure maximum effectiveness.
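The "who can take what action, upon what data, using what methods, and in what situations" framing can be pictured as a lookup against explicit rules. The following is a minimal Python sketch, not a description of any particular product: the `Policy` fields, the role and dataset names, and the deny-by-default choice are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """One hypothetical governance rule (all field names are illustrative)."""
    role: str      # who
    action: str    # what action
    dataset: str   # upon what data
    method: str    # using what method
    purpose: str   # in what situation


# Invented example policies; a real program would load these from a managed store.
POLICIES = {
    Policy("analyst", "read", "sales_orders", "sql", "reporting"),
    Policy("data_scientist", "read", "sales_orders", "notebook", "model_training"),
    Policy("data_engineer", "write", "sales_orders", "pipeline", "ingestion"),
}


def is_allowed(role: str, action: str, dataset: str, method: str, purpose: str) -> bool:
    """Deny by default: a request passes only if an explicit policy matches it."""
    return Policy(role, action, dataset, method, purpose) in POLICIES


analyst_can_report = is_allowed("analyst", "read", "sales_orders", "sql", "reporting")
analyst_can_write = is_allowed("analyst", "write", "sales_orders", "sql", "reporting")
```

In this sketch the analyst's reporting query is permitted and the write request is refused, because no rule grants it.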
Why is data governance important?
Data governance programs have historically been employed to lock down data in silos to prevent data leakage or misuse. However, the consequence of data silos is that legitimate users must navigate barriers to get access to data when they need it. Inadvertently, data-driven innovation gets stifled.
In a 2024 survey of 350 CDOs and CDO-equivalent roles, MIT CDOIQ found that 45% of Chief Data Officers identify data governance as a top priority. These data leaders want to establish a data governance framework that lets them make data available to the right people and applications when needed while keeping the data safe and secure with appropriate controls in place.
Balances access and control
You have two levers to make governance an enabler of innovation: access and control. The key to success is finding the right balance between the two—each organization's balancing point is different. When you exercise too much control, the data gets locked up in silos, and users are not able to access the data when they need it. This stifles creativity and leads to the creation of shadow IT systems that leave data out of date and unsecured. In contrast, when you provide too much access, data risks becoming unregulated across applications and data stores, increasing unauthorized access risk and impacting data quality.
Data governance processes balance access with control, giving users trust and confidence in the data. They promote appropriate discovery, curation, protection, and data sharing, encouraging innovation while safeguarding the data.
What are the benefits of data governance?
Data governance offers a structured framework for managing data across an organization. Here are some key benefits.
Improves data quality
Data governance establishes standards for data accuracy, completeness, and consistency. You get relevant, current, easy-to-interpret data that is trusted by all stakeholders. This high-quality data reduces errors and generates accurate and timely insights for strategic and operational decision-making.
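Standards for completeness and consistency become easier to enforce once they are expressed as measurable scores. The sketch below is one possible way to do that in Python. The record layout, field names, and allowed country codes are assumptions for illustration, and accuracy is left out because it usually needs a trusted reference source to compare against.

```python
def completeness(rows, required_fields):
    """Share of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) not in (None, "") for field in required_fields)
        for row in rows
    )
    return complete / len(rows)


def consistency(rows, field, allowed_values):
    """Share of rows whose value for `field` belongs to the agreed value set."""
    if not rows:
        return 0.0
    return sum(row.get(field) in allowed_values for row in rows) / len(rows)


# Hypothetical customer records with two missing emails and one off-standard code.
customers = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": "", "country": "us"},
    {"id": 3, "email": "c@example.com", "country": "DE"},
    {"id": 4, "email": None, "country": "FR"},
]

completeness_score = completeness(customers, ["id", "email"])
consistency_score = consistency(customers, "country", {"US", "DE", "FR"})
```

A governance team could track scores like these per data set over time and agree on thresholds with the data owners.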
Supports data-driven culture
An effective data governance strategy fosters a culture that values data, encouraging all employees to use and understand data in their work. It motivates business community participation and drives data integration across participating business areas. Alignment between data engineers and business users boosts the organization’s overall data literacy and analytical capabilities.
Increases operational efficiency
Data governance helps to determine the right operating model, especially the level of centralization and decentralization required. You can establish consistent data management practices that streamline operations. Clearly defined data ownership and access rights facilitate collaboration across departments, ensuring everyone works with the same, reliable data sources. Align efforts across teams to reduce duplication, lower operational costs, and improve productivity.
Supports regulatory compliance
Data governance frameworks take a proactive approach to risk management, ensuring that data practices align with legal and industry regulations. You can prevent unauthorized access by centrally defining policies for who can access or modify data. Data governance tools support compliance with privacy regulations to protect sensitive data.
Who builds data governance?
Building a robust data governance strategy requires many job functions.
Executive sponsors
They identify and establish data governance principles, standards, and policies across the organization. They also understand many business initiatives on the corporate roadmap and can help determine priorities to drive data governance activities.
Data stewards
They are from the business and are involved in the day-to-day details of projects. They help identify the data issues that are likely to cause challenges with targeted business initiatives. They also implement the data governance process in their projects and ensure data is managed appropriately. They monitor employee and customer compliance and escalate issues as they arise.
Data owners
They make policies about the data, including who should have access to it and under what circumstances, how to interpret and apply regulations, and key term definitions. They are also responsible for your data sets' technical administration and access controls.
Data engineers
They are from IT and select and implement the best data governance tools to secure data, integrate data from various sources, manage data quality, and find the right data.
What are the styles of data governance?
Your data governance program should balance centralization and decentralization (including self-service). Throughout your organization, you’ll have a mix of centralized, federated, and decentralized governance—again, depending on the business requirements. You should empower domain teams as much as possible while maintaining coherence across domains (such as the ability to link data together).
Centralized data governance
Central organizations are ultimately responsible for mission statements, policies, tool choices, and more. However, day-to-day actions are often pushed into lines of business (LOB).
Federated data governance
Federated data governance empowers individual business units or initiatives to operate in the way that best matches their needs. However, a smaller centralized team focuses on solving problems that repeat frequently, including enterprise-wide data quality tools, for example.
Self-serve or decentralized data governance
Each department does what it needs for the specific project while aligning with centralized policies. Each project reuses tools or processes from other projects wherever they are fit for use. As topics like data mesh (itself decentralized) increase in popularity, so does self-service data governance.
How does data governance work?
Data governance requires people, processes, and technology solutions across a range of capabilities.
Curate data at scale to limit data sprawl
Curating your data at scale means identifying and managing your most valuable data sources, including databases, data lakes, and data warehouses. You can limit the proliferation and transformation of critical data assets. Curating data also means ensuring that the right data is accurate, fresh, and free of sensitive information so users can have confidence in data-driven decisions and the data feeding applications.
Capabilities: data quality management, data integration, and master data management
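Two of the curation checks described above, freshness and absence of sensitive information, can be sketched as simple predicates. This is a rough Python illustration under stated assumptions: the two-day freshness window is arbitrary, and the email regular expression is a crude stand-in for real sensitive-data discovery, which typically relies on dedicated tooling.

```python
import re
from datetime import datetime, timedelta, timezone

# Deliberately simple pattern; it will miss many real-world cases.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_fresh(last_updated: datetime, max_age: timedelta, now: datetime) -> bool:
    """True if the data set was updated within the allowed window."""
    return now - last_updated <= max_age


def contains_email(values) -> bool:
    """True if any text value appears to contain an email address."""
    return any(EMAIL_PATTERN.search(value) for value in values)


# Hypothetical check time and update timestamps.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
recent_ok = is_fresh(datetime(2024, 1, 9, tzinfo=timezone.utc), timedelta(days=2), now)
stale_ok = is_fresh(datetime(2024, 1, 1, tzinfo=timezone.utc), timedelta(days=2), now)

notes_flagged = contains_email(["order 42 shipped", "contact bob@example.com"])
clean_flagged = contains_email(["order 42 shipped", "refund issued"])
```

Passing `now` explicitly keeps the freshness check deterministic, which makes it easier to test and to replay during an audit.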
Discover and understand your data in context
Understanding your data in context means that all users can discover and comprehend the meaning of their data so they can use it confidently to drive business value. With a centralized data catalog, data can be found easily, access can be requested, and data can be used to make business decisions.
Capabilities: data profiling, data lineage, and data catalogs
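To make the idea of a centralized catalog concrete, here is a minimal in-memory sketch in Python. The `CatalogEntry` and `DataCatalog` classes, the entry names, and the substring-based search are all hypothetical; production catalogs add lineage, profiling statistics, and access-request workflows on top of this basic lookup.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data set."""
    name: str
    owner: str
    description: str
    tags: list = field(default_factory=list)


class DataCatalog:
    """Toy catalog that supports registration and keyword search."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list:
        """Return sorted names of entries whose name, description, or tags match."""
        needle = keyword.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if needle in entry.name.lower()
            or needle in entry.description.lower()
            or any(needle in tag.lower() for tag in entry.tags)
        )


catalog = DataCatalog()
catalog.register(CatalogEntry("sales_orders", "sales-ops",
                              "Confirmed customer orders with revenue per line item",
                              ["sales", "finance"]))
catalog.register(CatalogEntry("web_clicks", "marketing",
                              "Raw clickstream events from the website",
                              ["marketing"]))
catalog.register(CatalogEntry("invoices", "accounting",
                              "Issued invoices", ["finance"]))

finance_datasets = catalog.search("finance")
```

Recording an owner on every entry gives users a clear contact when they need to request access.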
Protect and securely share your data with control and confidence
Protecting your data means striking the right balance between data privacy, security, and access. It’s essential to govern data access across organizational boundaries, using tools that are intuitive for both business and engineering users.
Capabilities: data lifecycle, data compliance, and data security
Reduce business risk and improve regulatory compliance
Reducing risk means understanding how data is being used and by whom. AWS services help you monitor and audit data access, including access through ML models, to help ensure data security and regulatory compliance. Machine learning also requires auditing transparency to ensure responsible use and simplified reporting.
Capabilities: usage auditing for data and ML
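Usage auditing comes down to recording who (or which model) touched which data and then reviewing that record. The Python sketch below assumes a hypothetical `AccessEvent` log format and an approved-principals mapping; neither reflects the schema of any specific AWS service.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessEvent:
    """Hypothetical audit record; a principal may be a person or an ML model."""
    principal: str
    dataset: str
    action: str


def access_counts(events):
    """Count accesses per (principal, dataset) pair for reporting."""
    return Counter((event.principal, event.dataset) for event in events)


def unapproved_accesses(events, approved):
    """Return events whose principal is not on the data set's approved list."""
    return [
        event for event in events
        if event.principal not in approved.get(event.dataset, set())
    ]


# Invented log entries and approvals for illustration.
events = [
    AccessEvent("analyst_amy", "sales_orders", "read"),
    AccessEvent("analyst_amy", "sales_orders", "read"),
    AccessEvent("churn_model_v2", "sales_orders", "read"),
    AccessEvent("intern_bob", "sales_orders", "read"),
]
approved = {"sales_orders": {"analyst_amy", "churn_model_v2"}}

counts = access_counts(events)
violations = unapproved_accesses(events, approved)
```

Treating models as principals in the same log keeps data auditing and ML auditing in one report.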
What are data governance best practices?
The key to effective data governance is to attach it to already-funded business initiatives. Make sure your team understands which data domains, sources, and elements are needed to support those initiatives.
- Build a data governance roadmap that shows support for targeted business initiatives. Then start to identify data overlap between chosen business initiatives.
- Identify applications and business intelligence use cases that the data needs to support and feed, including requirements for freshness and privacy.
- Understand what fit-for-purpose data looks like for each chosen business initiative.
- Sustain and expand by embedding governance in the enterprise operating model so data planning and implementation become a natural part of the operation of the organization.
- Organize the analytics community for self-service and consistency.
- Support artificial intelligence (AI) and machine learning (ML) with data governance and ML governance. Use the same data governance program but extend to feature stores and ML models.
How does data governance impact analytics, machine learning, and artificial intelligence?
Data governance plays a key role in data-heavy use cases.
Analytics governance
Analytics governance covers both governing data for use in analytics applications and governing the usage of analytics systems. Your analytics governance team can establish governance mechanisms, such as analytics report versioning and documentation. As always, keep track of regulatory requirements, establish company policy, and provide guardrails to the broader organization.
AI governance
AI governance applies many of the same data governance practices to AI/ML use cases. Data quality and integration must provide the data required for model training and production deployment (feature stores are one important aspect of this). Responsible artificial intelligence (AI) pays special attention to how sensitive data is used to build models. Additional AI governance capabilities include enabling people to participate in model building, deployment, and monitoring; documenting model training, versioning, and supported use cases; guiding ethical model use; and monitoring the model in production for accuracy, drift, overfitting, and underfitting.
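Monitoring a production model for drift can be illustrated with the population stability index (PSI), one of several statistics used to compare a live feature distribution against its training baseline. The article does not prescribe PSI; it is swapped in here as an example. The Python sketch below assumes equal-width bins derived from the baseline and a small floor to avoid taking the logarithm of zero, and the 0.25 alert level is only a commonly cited rule of thumb.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Compare two numeric samples using equal-width bins taken from `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps empty bins from producing log(0).
        return [max(count / len(values), 1e-6) for count in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: training baseline versus shifted production data.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
drift_alert = psi_shifted > 0.25
```

Identical samples give a PSI of zero, while the shifted sample scores well above the alert level because half of the baseline bins are empty in production.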
Generative AI requires additional data governance capabilities: data quality and integrity to support adapting foundation models for training and inference, governance of generative AI toxicity and bias, and foundation model operations (FMOps).
You can support AI/ML with the same data governance program. Data preparation is necessary to transform data into a form that AI/ML models can use for training and production inference—but the most efficient data preparation is the preparation you don’t have to do. Data scientists spend too much time preparing data for each use case—your data governance team can help alleviate this undifferentiated heavy lifting. In addition, data governance can oversee the creation of shaped feature stores for AI and ML use cases.
Finally, sensitive data must be protected appropriately so your team can mitigate the risks of using sensitive data to train the foundation models.
Much like analytics, you have to govern the use of AI/ML models you build or customize. Ideally, this should be closely associated with analytics governance, because that function will know how to support various business areas.
What are the main data governance challenges?
The most common strategic challenge for data governance is to align your program to business initiatives instead of proposing the value of data governance directly. For example, you might propose the value of making it easier for end users to find the data they’re looking for, or you might propose the value of resolving data quality issues. But these are solutions in search of a problem. If you do it this way, you’ll end up competing for funding and sponsorship with business initiatives you should be supporting. Instead, position data governance to support business initiatives. Every major business initiative requires data. Data governance should ensure the data is in the right condition to support the success of the business initiative. Don’t overlook reporting and auditing practices for how data governance supports these initiatives.
Another common strategic challenge is to avoid applying data governance too narrowly. A too-narrow definition could mean aligning the program with individual business areas or use cases without taking a wider view across business areas. A narrow definition could also mean defining data governance by only one or two capabilities. For example, having a data catalog does not constitute a data governance program.
What are the AWS offerings for data governance?
With end-to-end data governance on AWS, organizations have control over where their data sits, who has access to it, and what can be done with it at every step of the data workflow. Data governance with AWS helps organizations accelerate data-driven decisions by making it easy for the right people and applications to securely and safely find, access, and share the right data when they need it. You can curate data by automating data integration and data quality to limit the proliferation of data. You can discover and understand your data with centralized catalogs that boost data literacy. You can protect your data with precise permissions that let you share data with confidence.
You can reduce risk and improve regulatory compliance by monitoring and auditing data access.
- Amazon DataZone – unlock data across organizational boundaries with built-in governance
- AWS Glue – discover, prepare, and integrate all your data at any scale
- AWS Lake Formation – build, manage, and secure data lakes in days
- Amazon QuickSight – unified business intelligence at hyperscale
- Amazon SageMaker – build, train, and deploy machine learning models for use cases with fully managed infrastructure, tools, and workflows
- ML governance web page
- Amazon Bedrock – build and scale generative AI applications with foundation models (FMs)
- Amazon Macie – discover and protect sensitive data at scale
- Amazon Simple Storage Service (Amazon S3) access points – object storage built to retrieve any amount of data from anywhere
- AWS Data Exchange – easily find, subscribe to, and use third-party data in the cloud
- AWS Clean Rooms – create clean rooms in minutes to collaborate with your partners without sharing raw data
Get started with Data Governance on AWS by creating a free account today.