Why Machine Learning Models Need Schema-on-Read

Schema-on-Read Vs Schema-on-Write

Schema-on-Read and Schema-on-Write are two fundamental approaches to structuring data in data management systems. A data management system is, by default, either Schema-on-Read or Schema-on-Write. Both approaches are designed to store data or prepare it for analytics.

What is Data Management?

In broad terms, data management is the set of processes by which an organization ingests, stores, categorizes, and maintains its data. Effective data management is crucial because data is an asset that helps in realizing valuable business insights, streamlining processes, increasing production output, decreasing costs, and maintaining machinery and equipment. When data management is not in place, the organization may suffer from locked, siloed data, poor data quality that leads to incorrect assumptions and flawed insights, sub-optimal use of BI applications, and wasteful data swamps.

Schema-on-Write is associated with Relational Database Schema

Databases have employed a Schema-on-Write paradigm for decades: the schema (table structure) is defined up front, and the data is then written to that schema as part of the write process. Once the data has been written to the schema, it is available for reading. Because the schema is applied as the data is written, the approach is called Schema-on-Write.

ETL into relational databases relies on Schema-on-Write. The tables must be created and the schemas configured before data can be ingested. Relational databases hold structured data whose structure is known in advance, so you can create tables accordingly, defining columns, data formats, and column relationships at the destination before the data is loaded and made available for analytical queries.
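To make the write-time constraint concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and the write_order helper are hypothetical and chosen only for illustration. The point is simply that the table must exist first, and a record that does not fit it is rejected when it is written.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")

# Schema-on-Write: the table structure must exist before any row is written.
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer TEXT NOT NULL,"
    " amount REAL NOT NULL)"
)


def write_order(record: dict) -> bool:
    """Try to write one record; return False if it does not fit the schema."""
    # Column names come from trusted sample data here; do not build SQL
    # this way from untrusted input.
    columns = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                f"INSERT INTO orders ({columns}) VALUES ({placeholders})",
                tuple(record.values()),
            )
        return True
    except sqlite3.Error:
        # Unknown columns or missing required values fail at write time.
        return False


ok = write_order({"order_id": 1, "customer": "acme", "amount": 19.5})
# A record with a field the schema does not know about cannot be written.
rejected = write_order(
    {"order_id": 2, "customer": "zen", "amount": 5.0, "coupon": "X1"}
)
# A record missing a required column is rejected as well.
incomplete = write_order({"order_id": 3, "customer": "bolt"})

rows = conn.execute("SELECT order_id, customer, amount FROM orders").fetchall()
```

Only the first record lands in the table. To accept the other two, someone would have to change the schema before loading them.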

Schema-on-Read is associated with the rise of Data Lakes

Schema-on-Read has come about in conjunction with the rise of data lakes, primarily for data science use cases and Machine Learning models. Here the raw data is first landed in its native form (structured and/or unstructured) with no imposed schema. The schema is applied only when the data is read, hence Schema-on-Read.

Schema-on-Read is the opposite of Schema-on-Write. With the Schema-on-Read approach, the schema is created only when the data is read, not before data ingestion. Data schemas are created while the ETL process is carried out. This enables raw, unstructured data to be stored as is. The huge growth in unstructured data and the high costs associated with the Schema-on-Write process have driven the development of the Schema-on-Read process.
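As a rough illustration of applying a schema only at read time, the sketch below keeps raw JSON lines exactly as they arrived and lets the reader decide which fields to take and how to type them. The sample records, the ORDER_SCHEMA mapping, and the read_with_schema function are all assumptions made for this example, not part of any particular product.

```python
import json

# Raw events landed exactly as the source produced them (hypothetical data).
# No schema was enforced at load time, so the records differ in shape.
RAW_LINES = [
    '{"id": "1", "customer": "acme", "amount": "19.50", "coupon": "X1"}',
    '{"id": "2", "customer": "zen", "amount": "5"}',
    '{"id": "3", "customer": "bolt"}',
]

# The schema lives with the reader: field name -> (type cast, default value).
ORDER_SCHEMA = {
    "id": (int, None),
    "customer": (str, ""),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and apply the schema only when reading."""
    for line in lines:
        raw = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = raw.get(field)
            # Missing fields fall back to the default instead of failing.
            row[field] = cast(value) if value is not None else default
        yield row


orders = list(read_with_schema(RAW_LINES, ORDER_SCHEMA))
```

All three records load, including the one with an extra field and the one with a missing amount. The cost is that any typing or cleanup work happens each time the data is read.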

Why Schema-on-Read scores over Schema-on-Write

Loading data with Schema-on-Read is much faster than with Schema-on-Write, since the schema does not need to be defined prior to loading. This is a huge advantage in a big data environment with lots of unstructured data. The Schema-on-Read process can scale up rapidly as required and can consume vast volumes of data quickly, since it does not depend on data modelers defining a rigid database structure first. In contrast, Schema-on-Write is a slow, time-consuming, resource-intensive process that is best suited to small volumes of structured data that are not likely to change.

Schema-on-Read Advantages for Machine Learning over Schema-on-Write

Schema-on-Write has worked well as the de facto standard for decades. So why would we want to change, and what are the advantages of Schema-on-Read, especially in areas like Machine Learning?

Machine Learning is part of the fast-growing AI environment and uses algorithms to analyze huge amounts of data. Machine Learning algorithms operate independently of time constraints or human bias and try to make sense of the data by computing thousands of data combinations. Machine Learning can provide hypotheses for important business queries and test them in mere seconds, fleshing out an accurate and comprehensive data narrative for the questioner.

Machine Learning Models work on raw, detailed Source Data and Schema-on-Read

Intelligent systems built on Machine Learning algorithms can learn from past experience or historical data, and raw data sets are integral for these algorithms to work effectively. In some systems, the original data is treated as no longer valuable once it has been transformed, but Machine Learning needs that original detail. The data platform must therefore be optimized for large quantities of raw data to support discovery-oriented analytics practices such as data exploration, data mining and Machine Learning. Cloud Data Lakes based on an object store like Amazon S3 or Azure Blob storage are well suited to ingesting raw data at scale and storing it very cost-effectively, and Schema-on-Read can be employed for data preparation for Machine Learning models.

Schema-on-Read requires no upfront schema to be defined when landing the data

If we consider data as a shared asset within an organization, it is used by many users, fulfilling many roles for many purposes. Schema-on-Write requires upfront definition and understanding of all current and future use cases. This understanding is required to create a ‘one size fits all’ schema for all the use cases. Typically, a ‘one size fits all’ approach works for everyone to some degree but is not a perfect fit for anyone.

This also requires understanding the data up front and then trying to accommodate future requirements, which may not always be entirely predictable. Machine Learning models generally cannot work well with aggregated or transformed data; raw data is critical for success, and that calls for Schema-on-Read.

With Schema-on-Read you can create schemas for specific use cases

Schema-on-Read does not impose a structure when landing the data; it uses the source or native format or schema. It enables each user, role, or purpose to define a schema that is specific to the use case, so the schema fits that use closely. It allows users to use and understand the data and derive value from it right away, before modelling it for use cases that may benefit from a single model. As new or updated use cases are discovered, new schemas can be created to meet them.

Schema-on-Read allows for Single Storage of Data for many use cases unlike Schema-on-Write

Schema-on-Read data is stored in its native format, and the schema is applied only when the data is read. This allows a single data set to be used for many different use cases while being stored only once. Schema-on-Write would require multiple schemas to be defined, and the data would probably be stored multiple times (perhaps once for each use case).
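The idea of one stored copy serving several use cases can be sketched as two readers over the same raw records. Everything here is hypothetical: the click events, the field names, and the reporting_view and feature_view functions are invented for illustration, with one reader shaped for reporting and the other for Machine Learning features.

```python
import json
from collections import defaultdict

# One raw copy of the data (hypothetical click events), stored a single time.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 1200, "ts": "2024-01-01T10:00:00"}',
    '{"user": "u1", "page": "/buy", "ms": 300, "ts": "2024-01-01T10:01:00"}',
    '{"user": "u2", "page": "/home", "ms": 800, "ts": "2024-01-01T11:00:00"}',
]


def reporting_view(lines):
    """Read-time schema for a reporting use case: views per page."""
    counts = defaultdict(int)
    for line in lines:
        counts[json.loads(line)["page"]] += 1
    return dict(counts)


def feature_view(lines):
    """Read-time schema for an ML use case: per-user numeric features."""
    totals = defaultdict(lambda: {"events": 0, "total_ms": 0})
    for line in lines:
        event = json.loads(line)
        totals[event["user"]]["events"] += 1
        totals[event["user"]]["total_ms"] += event["ms"]
    # Feature tuple per user: (event count, mean time on page in ms).
    return {
        user: (t["events"], t["total_ms"] / t["events"])
        for user, t in totals.items()
    }


page_counts = reporting_view(RAW_EVENTS)
user_features = feature_view(RAW_EVENTS)
```

Neither reader changes or duplicates the stored records. A third use case would only need a third reader.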

Schema-on-Read enables greater Agility of Data

One of the biggest benefits of Schema-on-Read is the agility of the data. By this we mean that data can be landed with minimal up-front effort and then consumed immediately. With Schema-on-Read, data load is very fast because the data does not need to comply with any internal schema for reading or parsing; loading is essentially just copying or moving files.

Performance with Schema-on-Read is no longer an issue with Cloud Technologies

Schema-on-Read has had its doubters in the past, often because of concerns about query performance compared to Schema-on-Write. With cloud technologies, however, this is no longer the issue it perhaps once was. On the cloud, compute is available on demand and concurrency can scale with it, which enables faster querying.

BryteFlow leverages both constructs – Schema-on-Read and Schema-on-Write

BryteFlow Ingest allows for quick ingestion of data using log-based CDC (Change Data Capture). After the initial full extract, it transfers only the changes in the data (the deltas) to Amazon S3 in source format. Ingestion can be scheduled in near real-time to support scale. For large volumes of data, it is easier to synchronize frequently than to run one mega extract and load. Change Data Capture also helps with speed when data is required in near real-time for operational reporting. The data is automatically consolidated on Amazon S3: inserts, updates and deletes are applied and merged with the data already there. The data can then be used instantly as a replica of the source for Machine Learning, whether the source is SAP, Oracle, SQL Server, MySQL, Postgres, or any other source.
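The general idea of merging deltas into a replica can be sketched in a few lines. This is not BryteFlow's implementation, and the change-record layout (op, key, row) and the apply_changes function are assumptions made for illustration; real log-based CDC formats differ by tool and source.

```python
# Hypothetical change records; real log-based CDC formats differ by tool
# and by source database.
def apply_changes(snapshot: dict, changes: list) -> dict:
    """Merge a batch of deltas into a keyed snapshot, in arrival order."""
    merged = dict(snapshot)  # leave the previous snapshot untouched
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Upsert: changed columns overwrite the existing values for the key.
            merged[key] = {**merged.get(key, {}), **change["row"]}
        elif op == "delete":
            merged.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Result of the initial full extract, keyed by primary key.
initial_extract = {
    1: {"name": "acme", "city": "Sydney"},
    2: {"name": "zen", "city": "Perth"},
}

# Only the changes since the extract are transferred afterwards.
deltas = [
    {"op": "update", "key": 1, "row": {"city": "Melbourne"}},
    {"op": "insert", "key": 3, "row": {"name": "bolt", "city": "Hobart"}},
    {"op": "delete", "key": 2},
]

replica = apply_changes(initial_extract, deltas)
```

After the merge, the replica reflects the update, the insert and the delete, while only three small change records had to be moved instead of the full table.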

Instant Access to Raw Data on Amazon S3

BryteFlow’s CDC to S3 allows you to use the Schema-on-Read approach, so that you don’t sink all your time into preparing modeled data. Users can access data on Amazon S3 as soon as it lands, which gives Data Scientists instant access to raw data for their analysis. They can use Amazon Athena or Redshift Spectrum to query the data on S3.

BryteFlow Ingest ensures that the data is extracted across the source at a consistent, configurable point in time. This mitigates the temporal inconsistencies that can result from extracting data from a source system at different times. BryteFlow TruData provides automated data reconciliation, validating data for completeness.

An Intuitive GUI to Model Raw Data on Amazon S3

As users get more familiar with the data, data models become imperative; they provide a way of consolidating and governing business access across the organization. For this, BryteFlow provides an intuitive GUI in BryteFlow Blend to model raw data on Amazon S3 and create data assets that can then be used across the organization in a Schema-on-Write fashion. These assets can be shared by Data Scientists as well as other users across the organization.

Because Amazon S3 is scalable, cheap storage for your data, it allows you to use both approaches, potentially one for Machine Learning and the other for Business Intelligence Reporting and Analytics. By combining the Schema-on-Read and Schema-on-Write approaches, BryteFlow provides the agility and flexibility for multiple data initiatives and use cases in the organization.