Technical Glossary
A centralized storage repository that holds vast quantities of raw data in its native format until needed for analysis, supporting structured, semi-structured, and unstructured data types at any scale. Data lakes decouple storage from compute, enabling diverse processing engines to query the same underlying data without transformation overhead. Schema-on-read approaches allow organizations to defer data modeling decisions while preserving full data fidelity. NIST and IEEE frameworks provide reference architectures for enterprise data lake governance, security, and lifecycle management.
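The schema-on-read idea in the entry above can be illustrated with a minimal sketch. The sample records, the `ORDER_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names chosen for illustration; real data lakes typically apply the schema through a query engine rather than hand-written code.

```python
import json

# Raw events are kept exactly as they arrived (hypothetical sample records).
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "ts": "2024-01-05"}',
    '{"user": "bo", "amount": 5, "extra": {"coupon": "X1"}}',
]

# A schema chosen by one consumer at read time: field name -> caster.
ORDER_SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto `schema` at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        })
    return rows


orders = read_with_schema(RAW_LINES, ORDER_SCHEMA)
print(orders)
```

The stored lines are never rewritten, so a second consumer could read the same data with a different schema (for example, one that keeps `extra`).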
An automated sequence of data processing steps that moves information from source systems through transformation, validation, and enrichment stages to target destinations such as data warehouses, analytics platforms, or operational systems. Pipelines implement extract-transform-load or extract-load-transform patterns with configurable scheduling, error handling, and data quality checkpoints. Modern implementations leverage streaming architectures for near-real-time processing alongside traditional batch workflows. NIST data engineering guidelines and IEEE publications address pipeline reliability, observability, and scalability requirements.
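As a rough sketch of the stages named above (extraction, transformation, a data quality checkpoint, and loading), the following uses plain functions. The sample rows, the function names, and the `max_reject_ratio` threshold are assumptions made for illustration, and scheduling is left out.

```python
def extract():
    """Stand-in for reading from a source system (hypothetical sample rows)."""
    return [
        {"sensor": "s1", "temp_c": "21.5"},
        {"sensor": "s2", "temp_c": "n/a"},
        {"sensor": "s3", "temp_c": "19.0"},
    ]


def transform(rows):
    """Cast readings to float; route unparseable rows to a reject list."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append({"sensor": row["sensor"], "temp_c": float(row["temp_c"])})
        except ValueError:
            rejected.append(row)
    return clean, rejected


def quality_checkpoint(clean, rejected, max_reject_ratio=0.5):
    """Stop the run if the share of rejected rows exceeds the threshold."""
    total = len(clean) + len(rejected)
    if total and len(rejected) / total > max_reject_ratio:
        raise RuntimeError("too many rejected rows; aborting load")
    return clean


def load(rows, target):
    """Append validated rows to the target (a list standing in for a table)."""
    target.extend(rows)


def run_pipeline(target):
    clean, rejected = transform(extract())
    load(quality_checkpoint(clean, rejected), target)
    return rejected


warehouse = []
rejected_rows = run_pipeline(warehouse)
print(len(warehouse), "loaded;", len(rejected_rows), "rejected")
```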
A comprehensive framework of processes, governance, policies, and technology for defining and managing an organization's critical data entities to provide a single, authoritative source of truth across all business systems. MDM programs address entity resolution, data matching, survivorship rules, and golden record creation for core domains including customers, products, locations, and financial hierarchies. Implementation models range from registry-style approaches to consolidated hub architectures. ISO 8000 and NIST data quality standards provide the governance foundation for enterprise MDM initiatives.
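Matching and survivorship, as described above, can be sketched in a few lines. The example assumes a deliberately simple match rule (normalized email) and one survivorship rule (the most recently updated non-empty value wins); the record layout and function names are hypothetical, and production MDM tools use far richer matching.

```python
from datetime import date

# Hypothetical records for the same customer from three source systems.
SOURCE_RECORDS = [
    {"source": "crm", "email": "ana@example.com", "phone": None,
     "updated": date(2024, 3, 1)},
    {"source": "billing", "email": "ana@example.com", "phone": "555-0100",
     "updated": date(2023, 11, 20)},
    {"source": "support", "email": "ANA@example.com ", "phone": "555-0199",
     "updated": date(2024, 1, 15)},
]


def match_key(record):
    """Assumed match rule: records with the same normalized email are one entity."""
    return record["email"].strip().lower()


def build_golden_records(records):
    """Cluster records by match key, then keep the newest non-empty phone."""
    clusters = {}
    for record in records:
        clusters.setdefault(match_key(record), []).append(record)

    golden = {}
    for key, cluster in clusters.items():
        newest_first = sorted(cluster, key=lambda r: r["updated"], reverse=True)
        phone = next((r["phone"] for r in newest_first if r["phone"]), None)
        golden[key] = {"email": key, "phone": phone}
    return golden


golden = build_golden_records(SOURCE_RECORDS)
print(golden)
```

The newest record (from `crm`) has no phone number, so the rule falls back to the next most recent source that does.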
The documented trail of data as it moves through an organization's systems, capturing its origins, transformations, dependencies, and consumption points throughout the data lifecycle. Data lineage provides visibility into how data is created, modified, aggregated, and consumed, enabling impact analysis, regulatory compliance, and root cause investigation for data quality issues. Automated lineage capture systems parse ETL code, query logs, and API interactions to build dynamic dependency graphs. W3C PROV-O ontology and ISO metadata standards define interoperable representations for data provenance tracking.
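The dependency graph and impact analysis mentioned above can be sketched as a breadth-first traversal. The dataset names in `LINEAGE` are made up for illustration, and the graph is written by hand here rather than parsed from ETL code or query logs.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets derived from it.
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "erp.orders": ["staging.orders"],
    "staging.customers": ["warehouse.dim_customer"],
    "staging.orders": ["warehouse.fact_sales"],
    "warehouse.dim_customer": ["report.revenue_by_region"],
    "warehouse.fact_sales": ["report.revenue_by_region", "report.daily_sales"],
}


def downstream(graph, start):
    """Return every dataset reachable from `start` (impact analysis)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


impacted = downstream(LINEAGE, "erp.orders")
print(impacted)
```

Running the same traversal over the reversed graph would answer the root-cause question instead: which upstream sources feed a given report.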
A decentralized sociotechnical approach to data architecture that organizes data ownership and architecture around business domains, treating data as a product managed by cross-functional domain teams. Data mesh principles include domain-oriented ownership, data as a product mindset, self-serve data infrastructure platforms, and federated computational governance. The architecture addresses scalability limitations of centralized data team bottlenecks by distributing accountability to domain experts. IEEE and ACM research publications explore implementation patterns and organizational transformation requirements for data mesh adoption.
Cryptographic methods and protocols used to protect data confidentiality by transforming plaintext into ciphertext using mathematical algorithms and encryption keys, rendering information unreadable to unauthorized parties. Encryption standards address data protection at rest, in transit, and in use through symmetric and asymmetric cryptographic schemes. Modern implementations employ AES, RSA, and elliptic curve cryptography with key management infrastructure for enterprise-scale deployments. NIST publishes widely adopted cryptographic standards and guidance, including FIPS 197, which specifies AES, and SP 800-175B, a guideline on using cryptographic mechanisms.
A data integration process that extracts data from heterogeneous source systems, applies transformation logic including cleansing, deduplication, and schema mapping, then loads the processed results into a target data store such as a data warehouse or operational data store. ETL workflows manage data type conversions, business rule application, referential integrity enforcement, and incremental change capture. Enterprise ETL platforms provide visual workflow designers, scheduling engines, and monitoring dashboards. IEEE and NIST publications address ETL performance optimization, error handling, and data quality assurance practices.
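A small end-to-end sketch of the extract, transform, and load steps described above, using only the Python standard library. The CSV content, the `customers` table, and the cleansing rules (trim names, drop rows without a name, keep the first row per id) are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source extract with padding, a duplicate, and a missing name.
SOURCE_CSV = """id,name,signup_date
1, Ana ,2024-01-05
2,Bo,2024-02-10
2,Bo,2024-02-10
3,,2024-03-01
"""


def extract(text):
    """Read the source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Cleanse names, convert ids to int, and drop duplicates and empty names."""
    seen, clean = set(), []
    for row in rows:
        key = int(row["id"])
        name = row["name"].strip()
        if not name or key in seen:
            continue
        seen.add(key)
        clean.append((key, name, row["signup_date"]))
    return clean


def load(rows, conn):
    """Write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT NOT NULL, signup_date TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(loaded)
```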
An organized inventory of an organization's data assets that provides searchable metadata, schema documentation, usage statistics, and quality metrics to enable data discovery and understanding across the enterprise. Data catalogs facilitate self-service analytics by indexing table structures, column descriptions, data classifications, and ownership information with automated profiling capabilities. Advanced catalogs incorporate machine learning for automated tagging, relationship inference, and usage-based recommendations. W3C DCAT vocabulary and ISO metadata standards define interoperable catalog structures for data asset registration and discovery.
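To make the notion of searchable metadata concrete, here is a toy in-memory catalog with keyword search. The `CatalogEntry` fields, the sample entries, and the `search` helper are hypothetical; they do not follow the DCAT vocabulary or any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    table: str
    owner: str
    description: str
    columns: dict  # column name -> column description
    tags: set = field(default_factory=set)


# Hypothetical catalog contents.
CATALOG = [
    CatalogEntry(
        "sales.orders", "sales-team", "One row per customer order",
        {"order_id": "Primary key", "customer_email": "Contact email"},
        {"pii"},
    ),
    CatalogEntry(
        "ops.sensors", "platform-team", "Raw IoT sensor readings",
        {"sensor_id": "Device identifier", "temp_c": "Temperature in Celsius"},
    ),
]


def search(catalog, term):
    """Return tables whose name, description, columns, or tags mention `term`."""
    term = term.lower()
    hits = []
    for entry in catalog:
        searchable = [entry.table, entry.description,
                      *entry.columns, *entry.columns.values(), *entry.tags]
        if any(term in text.lower() for text in searchable):
            hits.append(entry.table)
    return hits


print(search(CATALOG, "email"))
```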
A centralized analytical data store optimized for query performance and historical analysis, integrating data from multiple operational sources through structured schemas designed for reporting, business intelligence, and decision support workloads. Data warehouses employ dimensional modeling techniques including star and snowflake schemas to organize facts and dimensions for efficient aggregation and drill-down analysis. Modern cloud data warehouses provide elastic compute scaling, columnar storage, and semi-structured data support. IEEE and ACM research establishes the architectural foundations and optimization techniques for enterprise data warehousing.
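The star schema mentioned above can be sketched with one fact table and two dimension tables in an in-memory SQLite database. Table names, column names, and the sample rows are assumptions for illustration; SQLite is used only because it ships with Python, not because it is a typical warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds measures plus foreign keys into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 10.0), (1, 20240201, 15.0), (2, 20240101, 40.0);
""")

# Aggregate the facts by attributes taken from the dimensions.
revenue_by_category = conn.execute("""
SELECT p.category, d.year, SUM(f.revenue)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.year
ORDER BY p.category
""").fetchall()
print(revenue_by_category)
```

Adding `d.month` to the `SELECT` and `GROUP BY` clauses would be the drill-down step from yearly to monthly revenue.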
The set of practices, methodologies, and technologies used to measure, improve, and maintain the quality of organizational data assets throughout their lifecycle. DQM programs implement automated profiling, validation rules, anomaly detection, and remediation workflows integrated into data pipeline operations. Key dimensions include accuracy, completeness, uniqueness, timeliness, validity, and consistency, each measured against defined business rules. ISO 8000 and NIST data quality frameworks provide authoritative guidance for establishing enterprise data quality measurement and improvement programs.
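Three of the quality dimensions listed above (completeness, uniqueness, and validity) can be measured with very little code. The sample rows, the metric definitions as simple ratios, and the age rule are assumptions made for this sketch; real programs define these measures per business rule.

```python
# Hypothetical rows with one missing email, one duplicate id, one invalid age.
ROWS = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": "bo@example.com", "age": -5},
    {"id": 4, "email": "cy@example.com", "age": 41},
]


def completeness(rows, column):
    """Share of rows where `column` is not null."""
    return sum(row[column] is not None for row in rows) / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among all rows."""
    return len({row[column] for row in rows}) / len(rows)


def validity(rows, column, rule):
    """Share of rows whose value satisfies the business rule."""
    return sum(bool(rule(row[column])) for row in rows) / len(rows)


report = {
    "email_completeness": completeness(ROWS, "email"),
    "id_uniqueness": uniqueness(ROWS, "id"),
    "age_validity": validity(ROWS, "age", lambda age: 0 <= age <= 120),
}
print(report)
```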
The continuous processing and analysis of data in motion as it is generated, enabling real-time pattern detection, alerting, and decision-making without the latency of traditional batch processing cycles. Streaming analytics engines process unbounded data streams using windowed aggregations, event correlation, and stateful computations at sustained high throughput. Applications include fraud detection, network monitoring, IoT sensor analytics, and real-time personalization. IEEE and ACM publications define the theoretical foundations and architectural patterns for distributed stream processing systems.
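A windowed aggregation of the kind mentioned above can be sketched as a tumbling-window sum keyed by event time. The event tuples and the 60-second window are assumptions; the sketch iterates over a finite list and ignores late or out-of-order events, which real stream processors handle with additional machinery.

```python
from collections import defaultdict

# Hypothetical events: (timestamp in seconds, key, value).
EVENTS = [(0, "a", 5), (12, "b", 7), (59, "a", 1), (60, "a", 4), (125, "b", 2)]


def tumbling_window_sums(events, window_seconds):
    """Sum values per (window start, key) using fixed, non-overlapping windows."""
    sums = defaultdict(int)  # running state kept between events
    for ts, key, value in events:
        window_start = ts - ts % window_seconds
        sums[(window_start, key)] += value
    return dict(sums)


window_sums = tumbling_window_sums(EVENTS, 60)
print(window_sums)
```

The `sums` dictionary is the stateful part: each incoming event updates only the running total for its own window and key.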
A comprehensive set of policies, technical controls, and organizational measures that govern the collection, processing, storage, and sharing of personally identifiable information in compliance with applicable privacy regulations and ethical standards. Privacy frameworks implement data minimization, purpose limitation, consent management, and subject rights fulfillment across organizational systems. Technical controls include anonymization, pseudonymization, differential privacy, and access governance mechanisms. The NIST Privacy Framework and ISO/IEC 27701 provide structured approaches to privacy risk management and compliance demonstration.
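Of the technical controls listed above, pseudonymization is easy to sketch: a keyed hash replaces an identifier with a stable token. Using HMAC-SHA256 for this is one common approach, not something the entry prescribes; the key value and function name are hypothetical, and a real deployment would keep the key in a managed secret store.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; never hard-code a real key.
SECRET_KEY = b"demo-key-for-illustration"


def pseudonymize(value, key=SECRET_KEY):
    """Replace an identifier with a keyed, deterministic token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


token = pseudonymize("ana@example.com")
print(token)
```

Because the token is deterministic for a given key, records about the same person can still be joined, while recomputing or reversing the mapping requires the key.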
A structured representation of real-world entities, their attributes, and the relationships between them organized as a network of interconnected nodes and edges with semantic typing and contextual metadata. Knowledge graphs enable intelligent data integration, semantic search, recommendation engines, and natural language understanding by providing machines with contextual knowledge about domain concepts. Construction techniques include ontology modeling, entity extraction, relation linking, and graph embedding. W3C RDF, OWL, and SPARQL standards define the foundational specifications for knowledge graph representation and querying.
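The node-and-edge structure described above can be sketched as a set of subject-predicate-object triples with a simple pattern matcher. The triples and the `match` helper are illustrative assumptions; the sketch mimics the shape of an RDF triple pattern but does not use RDF or SPARQL tooling.

```python
# Hypothetical facts as (subject, predicate, object) triples.
TRIPLES = {
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "occupation", "Mathematician"),
    ("Alan Turing", "born_in", "London"),
    ("London", "capital_of", "United Kingdom"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


born_in_london = [s for s, _, _ in match(TRIPLES, p="born_in", o="London")]
print(born_in_london)
```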
A hybrid data architecture that combines the low-cost scalable storage of data lakes with the data management and ACID transaction capabilities traditionally associated with data warehouses, eliminating the need to maintain separate analytical systems. Lakehouse architectures implement open table formats with metadata layers that enable schema enforcement, time travel, and concurrent read-write access on object storage. Key technologies include Delta Lake, Apache Iceberg, and Apache Hudi for transactional data lake management. IEEE and ACM research explores the performance characteristics and architectural trade-offs of lakehouse implementations.
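Time travel, as mentioned above, follows from keeping an append-only log of commits and replaying it up to a chosen version. The `VersionedTable` class below is a toy model written for this glossary; it does not reflect the actual file layout or API of Delta Lake, Apache Iceberg, or Apache Hudi.

```python
class VersionedTable:
    """Toy table whose state is derived by replaying an append-only commit log."""

    def __init__(self):
        self._log = []

    def commit(self, adds=(), removes=()):
        """Record one atomic change and return its version number."""
        self._log.append({"add": list(adds), "remove": list(removes)})
        return len(self._log) - 1

    def snapshot(self, version=None):
        """Rebuild the rows as of `version` (latest when omitted)."""
        if version is None:
            version = len(self._log) - 1
        rows = []
        for entry in self._log[: version + 1]:
            rows = [row for row in rows if row not in entry["remove"]]
            rows.extend(entry["add"])
        return rows


table = VersionedTable()
v0 = table.commit(adds=[{"id": 1}, {"id": 2}])
v1 = table.commit(adds=[{"id": 3}], removes=[{"id": 1}])
print(table.snapshot(v0), table.snapshot())
```

Earlier versions stay readable because commits only append to the log and never rewrite previous entries.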
The systematic administration of data about data, encompassing technical metadata such as schemas and data types, business metadata including definitions and ownership, and operational metadata covering lineage, access logs, and processing statistics. Effective metadata management enables data discovery, impact analysis, governance enforcement, and regulatory compliance through centralized or federated metadata repositories. Automated metadata harvesting, classification, and enrichment reduce manual documentation burden while improving coverage and accuracy. ISO/IEC 11179 and W3C metadata vocabularies define interoperable standards for metadata registration and exchange across organizational boundaries.