Why Machine Learning models need Schema-on-Read

Schema-on-Read vs Schema-on-Write

Databases have employed a Schema-on-Write paradigm for decades: the schema, or structure, is defined up front, and data is then written to that schema as part of the write process. Only once the data has been written to the schema is it available for reading, hence the name Schema-on-Write.
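As a minimal sketch of Schema-on-Write, the snippet below uses an in-memory SQLite database (table and column names are illustrative, not from any particular system): the `orders` table must be defined before a single row can be written, and every write must conform to it.

```python
import sqlite3

# Schema-on-Write: the table structure must exist before any data is written.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, region TEXT)")

# Writes must conform to the schema defined above.
conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (1, 99.50, "EU"))

# Reads simply use the schema that was fixed at write time.
rows = conn.execute("SELECT order_id, amount FROM orders").fetchall()
print(rows)  # [(1, 99.5)]
```

If the source data later gains a new field, the table definition has to change before that field can be stored, which is exactly the rigidity the rest of this post contrasts against.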

Schema-on-Read has come about in conjunction with the rise of data lakes, primarily for data science use cases and machine learning models. Here the raw data is first landed in its native form (structured and/or unstructured) with no imposed schema. Only once the data is read is the schema applied, hence Schema-on-Read.
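To illustrate, here is a small sketch of Schema-on-Read in plain Python (the field names and the in-memory "file" are hypothetical): raw line-delimited JSON is landed exactly as it arrives, and the structure, including field choice and types, is applied only at read time.

```python
import io
import json

# Raw data is landed as-is: line-delimited JSON with no imposed schema.
# Note the second record is missing a field, and numbers arrive as strings.
raw_landing = io.StringIO(
    '{"order_id": "1", "amount": "99.50", "region": "EU", "notes": "gift"}\n'
    '{"order_id": "2", "amount": "12.00"}\n'
)

# Schema-on-Read: field selection and typing happen here, at read time.
def read_orders(fh):
    for line in fh:
        rec = json.loads(line)
        yield {
            "order_id": int(rec["order_id"]),
            "amount": float(rec["amount"]),
            "region": rec.get("region", "UNKNOWN"),  # default applied on read
        }

orders = list(read_orders(raw_landing))
print(orders[0]["amount"])  # 99.5
```

Nothing about the landed file had to be declared in advance; a different reader could interpret the very same bytes with a completely different schema.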

Schema-on-Write has worked well as the de facto standard for decades, so why would we want to change, and what are the advantages of Schema-on-Read?

This blog post explains the differences between the two approaches and why Schema-on-Read is preferred over Schema-on-Write for Machine Learning.

Machine Learning models work on raw, detailed source data

Intelligent systems built on machine learning algorithms have the capability to learn from past experience, that is, from historical data.

Raw data sets are integral for Machine Learning algorithms to work effectively. In some systems, the original data is no longer valuable once it has been transformed. The data platform must therefore be optimized for large quantities of raw data to support discovery-oriented analytics practices such as data exploration, data mining and machine learning. Cloud data lakes based on an object store like Amazon S3 or Azure Blob Storage are a perfect environment in which to ingest raw data with unlimited scalability and store it very cost-effectively, and Schema-on-Read can then be employed to prepare that data for Machine Learning models.

No upfront schema to be defined when landing the data…

If we consider data as a shared asset within an organisation, it is used by many users, fulfilling many roles for many purposes. Schema-on-Write requires up-front definition and understanding of all current and future use cases, because that understanding is needed to create a 'one size fits all' schema. Typically, a 'one size fits all' approach can work for everyone but is a perfect fit for no one.

It also requires understanding the data up front and then trying to accommodate future requirements, which are not always predictable.

For Machine Learning models, aggregated or transformed data will not do: raw data is critical for success.

Schemas created for specific use cases…

Schema-on-Read does not impose a structure when landing the data; it retains the source's native format or schema. It enables each user, role or purpose to define a schema specific to the use case, so the schema fits that use perfectly. Users can understand the data and derive value from it instantly, before modelling it for use cases that may benefit from a single model. As new or updated use cases are discovered, new schemas can be created to meet them.
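A small sketch of this idea, using hypothetical order records: the same raw, landed data is read twice with two different schemas, one producing numeric feature pairs for a machine-learning use case and one producing per-region totals for a reporting use case.

```python
import json

# The same raw, landed records: no schema was imposed at write time.
raw = [
    '{"order_id": 1, "amount": 120.0, "region": "EU", "ts": "2023-01-05"}',
    '{"order_id": 2, "amount": 80.0, "region": "US", "ts": "2023-01-06"}',
    '{"order_id": 3, "amount": 40.0, "region": "EU", "ts": "2023-01-07"}',
]
records = [json.loads(r) for r in raw]

# Schema 1: a machine-learning view (id, numeric feature).
ml_view = [(r["order_id"], r["amount"]) for r in records]

# Schema 2: a reporting view (totals by region) over the same records.
report_view = {}
for r in records:
    report_view[r["region"]] = report_view.get(r["region"], 0.0) + r["amount"]

print(ml_view)      # [(1, 120.0), (2, 80.0), (3, 40.0)]
print(report_view)  # {'EU': 160.0, 'US': 80.0}
```

Neither view required the other to exist, and a third schema could be layered on later without touching the stored data.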

Single storage of data…

With Schema-on-Read, data is stored in its native format and the schema is applied only when read. This allows a single data set, stored only once, to serve many different use cases. Schema-on-Write would require multiple schemas to be defined, and the data would likely be stored multiple times, perhaps once for each use case.

Agility of data…

One of the biggest benefits of Schema-on-Read is the agility of the data: data can be landed with minimal up-front effort and then consumed immediately.


Schema-on-Read has had its doubters in the past, with doubts often centred on performance concerns (when compared to Schema-on-Write). However, with cloud technologies this is no longer the issue it perhaps once was: concurrency is effectively unlimited in the cloud, with compute available on demand.

How BryteFlow helps to transition between both approaches…

The BryteFlow product allows quick ingestion of data by transferring only the changes in the data, or deltas, to Amazon S3 in the source format. Ingestion can be scheduled in near real time, both for scale (with large volumes of data it is easier to synchronize frequently than to run one mega extract and load) and for speed (when data is required in near real time for operational reporting). The data is automatically consolidated on Amazon S3, with inserts, updates and deletes applied, so it can be used instantly as a replica of the source, whether that source is SAP, Oracle, SQL Server or another system, for Machine Learning.

This allows you to use the Schema-on-Read approach, so that you don't sink all your time into preparing modelled data. Users can access data instantly on Amazon S3, giving Data Scientists immediate access to raw data for their analysis.

BryteFlow ensures that the data is extracted at a consistent, configurable point in time across the source, mitigating the temporal inconsistencies that can result from extracting data from a source system at different times.

As users get more familiar with the data, data models become imperative; they form a way of consolidating and governing business access across the organisation. In this case, BryteFlow provides an intuitive GUI over your raw Amazon S3 data, allowing you to model your data and create data assets that can then be used across the organisation in a Schema-on-Write fashion. These can be shared by Data Scientists and other users across the organisation as well.

Amazon S3, being scalable, cheap storage for your data, allows you to keep both approaches: potentially one for machine learning and the other for business intelligence reporting and analytics.
By combining both methods, BryteFlow provides the agility and flexibility to support multiple data initiatives and use cases across the organisation.