Draft version 0.3a – Arofan Gregory, Lisa Seaburg. This paper was written last summer by Arofan with input from Lisa Seaburg; it grew out of discussions the authors have had over many years.
There is sometimes confusion about the terminology used within the IT industry, especially when it comes to the topic of anything related to data. The hype around Big Data is so intense that sometimes we are blinded to the utility of less trendy technologies and information sources. While Big Data is important, it is only one piece of the puzzle. Often, the key issues around the management and use of data are the same, regardless of whether the data in question is Big Data or comes from more traditional legacy sources.
Historically, many enterprise-level applications focused on documents as the key granular element in their systems; today, it is the data contained in documents, and not the documents themselves, which are often the most important organizational asset. And yet, many organizations have a difficult time realizing this asset in the face of document-bound legacy systems.
There is a difference between Data Engineering, Data Science, and Statistical Analysis, all terms often heard in connection with Big Data. These roles are explained here, and their relationship is clarified. There are also many different challenges which face Data Engineers in the performance of their function. This paper presents some of these challenges, and what Data Engineering solutions might look like.
In the 1970s and early 1980s, most technology applications treated the file as the basic object handled by computer systems. In the later 1980s and 1990s, the focus shifted first to the idea of documents, which might be contained in several files, and then to the idea of structured documents, using technologies such as XML. In the latter case, granular detail was added inside the document, so that documents could be processed programmatically.
By the late 1990s, the design and implementation of systems using such structured documents was termed Document Engineering. There was a heavy emphasis on XML and related technologies, but implementations in Java, relational and object databases, Web applications, etc. were also included in the term.
Today, we are moving beyond the notion of Document Engineering. Documents – although still critically important – are only one way in which the data can be accessed and used. Now, we are looking at the many challenges we face as the data itself becomes the most important granular level of information which can be leveraged, but also needs to be managed.
Thus was born the term Data Engineering. Although the term largely comes out of the Big Data space, there is far more to the domain of Data Engineering than simply working with Big Data; using the term as a synonym for Big Data work is too narrow. The infographic at http://i2.wp.com/www.analyticsvidhya.com/wp-content/uploads/2015/10/infographic.jpg gives a good idea of the distinction between Data Scientists, Data Engineers, and Statisticians.
There are many sources of useful information to drive the functions of any enterprise, each presenting unique challenges. This paper explores some of the less-well-documented aspects of Data Engineering, to show how it can help enterprises best leverage all of their data assets.
Files, Documents, and Data
The distinction between files, documents, and data is a fundamental one. As described above, the initial unit of information for early computer systems was the file – the ability to link a set of files into a meaningful whole came along much later. The use of structured markup technologies such as the Standard Generalized Markup Language (SGML), the eXtensible Markup Language (XML), HTML (itself an implementation of SGML/XML), and others was an attempt to bridge the gap between documents and the data they contained. It is possible to use these technologies to describe data, rather than documents, but more often they are used as a way of describing only the structure of the document, rather than its content.
To give an example from the aerospace industry, the standard SGML/XML document type definitions produced by the ATA include an element named “PGBLK”. This is an artifact which was very useful when aircraft maintenance manuals were used in the form of loose-leaf binders containing paper pages, which were frequently updated. In terms of the actual data contained in the page blocks, this element is meaningless. Even the older versions of HTML were limited to describing headings, paragraphs, and lists, which are not a description of the data contained, but only of the web page’s presentational structure (later Web-based technologies have solved this problem).
There is a fairly nuanced but important distinction between documents and data: data has semantics, metadata (information about the data), and links to other related data with a machine-actionable description of the type of relationship – all at the level of each individual piece of data. A data item can be managed independent of the context (that is, the document) where it is used. Documents themselves may have several component parts, but generally they are managed, and have metadata, only at the document level. Documents of any kind can be understood as “views” into the pool of data, showing the required data inside a presentational framework of some kind, whether that is a document template, a mobile app, or anything in between. The view has an existence apart from the data which it uses, and on which it depends, and so can be managed like a normal document today (if needed).
In one sense, it is very difficult to draw an absolute distinction between documents and data – they are intimately related. What is most important is whether the data contained in documents can be used as data: this is dictated by systems and applications. A data-driven application (or, more often, a metadata-driven application) is one in which the data is machine actionable (that is, processes can be conducted without the intervention of humans in an efficient way, replacing manual systems). This requires a fine-grained handle on both the structure and the purpose of the data, typically well below the document level. Often, large amounts of metadata are required.
Another strong requirement of data-driven systems and applications is good data management. Centralized repositories (or virtual centralized repositories, across distributed data stores) are a requirement, and the practice of identification, versioning, and ownership management across the enterprise at the level of the data item is also critical. The phrase “single source of truth” is often heard when talking about modern data management systems – there should be only one canonical copy of any data or metadata item/object, and it should be used by reference everywhere else (at least notionally – it could exist in automatically-updated, read-only caches in many applications simultaneously, but it must have a clear owner or set of owners with the privileges to edit and create new versions).
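The "single source of truth" idea described above can be sketched in code. The following is a minimal, illustrative registry (all class and item names are invented for this example, not taken from any real product): each data item has exactly one canonical copy with an owner and a version, other systems hold only references (identifiers), and only the owner may create new versions.

```python
# Illustrative sketch of a single-source-of-truth registry.
# Each item has one canonical copy; edits by the owner create new versions;
# consumers dereference an ID rather than holding their own copy.

class DataRegistry:
    def __init__(self):
        self._items = {}  # item_id -> list of version records

    def register(self, item_id, value, owner):
        """Create the canonical copy of a data item, as version 1."""
        if item_id in self._items:
            raise ValueError(f"{item_id} already registered")
        self._items[item_id] = [{"value": value, "owner": owner, "version": 1}]

    def resolve(self, item_id, version=None):
        """Dereference an item ID to its current (or a specific) version."""
        versions = self._items[item_id]
        return versions[-1] if version is None else versions[version - 1]

    def update(self, item_id, new_value, editor):
        """Only the owner may edit; an edit produces a new version."""
        current = self._items[item_id][-1]
        if editor != current["owner"]:
            raise PermissionError(f"{editor} does not own {item_id}")
        self._items[item_id].append(
            {"value": new_value, "owner": current["owner"],
             "version": current["version"] + 1})

registry = DataRegistry()
registry.register("part-0042-torque-spec", "35 Nm", owner="tech-pubs")
registry.update("part-0042-torque-spec", "38 Nm", editor="tech-pubs")
current = registry.resolve("part-0042-torque-spec")
```

Note that older versions remain resolvable, which is what lets read-only caches elsewhere stay consistent with the canonical copy.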
The Technology Picture Today
There are many different technologies involved with the practice of Data Engineering, including SGML, XML, and the related family of technologies (XSLT and XSL-FO for transformation; XPath for navigating XML documents; XQuery for querying XML documents and systems, etc.). Also in this picture are databases, both of the traditional relational and object-oriented kind, but also including the Big Data database platforms such as Hadoop and others. Systems for providing indexing, registration, and repository functionality are also important in this work. Web technologies – including the Web of Linked Data/Semantic Web technologies based on RDF and related standards – have become important. Statistical analysis tools such as SAS, SPSS, and R often interface with Data Engineering platforms, even though they are essentially the tools of statisticians.
The structure of different types of data is still important to Data Engineering, as the ability to automate processes is reliant on a knowledge of that structure. Data Engineering requires both a knowledge of the structure of the data and of its meaning or content. Tools such as UML class diagrams are often used to model data, so that developers and others can easily understand what the data they are working with looks like, for both semantics and structure.
It should be noted that many Document Engineering technologies and tools can be used to support Data Engineering activities, especially regarding the platform technologies such as databases, the family of markup technologies (SGML, XML and related), and Web technologies. The trick is to implement systems in such a way that the data-aware functionality of these tools is used, as opposed to more traditional document-centric applications.
Big Data: Data Science and Data Engineering
It is important to understand that the function of Data Scientists and Data Engineers is different. Data Scientists are involved in the examination of data to determine how it can be meaningfully used. They start with a specific goal – a “research question” in the terms of some communities – and then identify which parts of the data are relevant to creating a data set which can be analyzed to provide answers to this research question. The next step is to identify how the selected parts of the raw input data must be manipulated in order to create such an analysis data set.
The result of this activity is an algorithm (also known as a “statistical method” or “experimental design” in some domains) which can be applied to the raw inputs to produce a meaningful data set for analysis. It is not the final analysis of the data – that is the work of a statistician or analyst – but the creation of a tool which can be applied on a repeat basis to different raw input data from the same source or sources, to produce a series of data sets for final analysis.
The Data Engineer acts in a supporting role to this activity, implementing data stores, analyzing data structures so that the data can be stored and accessed correctly, and building systems which the Data Scientist can use to explore the data and develop a useful algorithm. From the Data Engineer’s perspective, the Data Scientist is a user of the systems they build.
Statistics, Quantitative Analysis, and Data Engineering
The role of the Statistician or Analyst is different again from those of the Data Scientist and the Data Engineer. The Statistician or Analyst (the terms are used synonymously here) examines the data produced by applying the algorithm to a set of raw input data (the analysis data set) and produces a result which is fit for purpose (typically an aggregated data set which can be displayed as tables, charts, graphs, etc.). The end result will be used to inform decision making by the business or organization which posed the initial research question.
Again, Data Engineers act in a supporting capacity for these users, providing them with analysis data sets in a form which is appropriate to the tools used by the analysts (software packages such as SAS, SPSS, Stata, R, etc.). They also provide tools for finding, storing, and accessing analysis data. Similarly, they are responsible for creating tools which allow analysts to store the results of their analysis. (There will also be mechanisms for finding and accessing the results, but these will serve a different type of business user.) From the Data Engineer’s perspective, the Statistician or Analyst is a user of the systems they build.
Other Data Engineering Users
Although the roles described above are focused mainly on those involved in quantitative analysis in support of business decisions made by executives, there are many other applications for the use of organizational data assets. In many legacy (and production) systems, documents contain a wealth of information which can be treated as data for the purposes of automation.
One good example of this is the use of information involved in operations or manufacture, which then needs to become the content of technical publications. If we look at maintenance and repair operations (MRO) in the aerospace industry, we see many examples of data which are reused, but which are locked inside documents marked up only to expose the structure for publication. Aircraft manufacturers publish Airplane Maintenance Manuals (AMM) to accompany the aircraft they sell; during maintenance, the content of these manuals is reduced to individual tasks – task cards – to be performed and signed off as maintenance operations are performed. The management and reuse of this information is difficult, as the manual and the task cards live as separate (even if virtual) documents. If one is changed as the result of a configuration change, for example, there is a strong possibility that the two will get out of sync. If managed in a single place, as data, rather than as two documents, there is no possibility of them getting out of sync – only the production of managed versions and variants of the original data.
Here, the users (as seen from the Data Engineer’s perspective) are the aircraft maintenance technicians and the technical writers. This may not be a sexy new big data application, but it does demonstrate a solution to an existing problem; the key is that what have historically been seen as documents are being managed (and structured) as data. The data themselves are the fields within the AMM and the task cards – assuming that they have been marked up/stored as data, and not only hidden in document structures.
Many other parallel examples exist. If we look at the documents which are the product of supply-chain transactions in any industry (i.e., Purchase Orders, Change Orders, Invoices, Shipping Notices, etc.) we find many fields which could potentially supply us with a rich set of information to be mined for reuse. One example here is performance metrics: how well did a particular shipper perform in terms of timeliness? What type of goods were they shipping, and in what volume? To what locations? How do these factors combine to allow the selection of the best performer? These types of metrics can be anything from very simple to very sophisticated if an overall body of transactions is examined, and may inform decisions being made well below the executive level. But they involve having access to particular bits of data which today are often stored (or treated) as business documents, rather than as data assets.
The users here may well be mid-level managers responsible for making day-to-day decisions and assessments, rather than top-level executives asking for reports on specific aspects of the business.
Another example of data re-use has to do with publication: what used to be published as monolithic documents in the form of PDF, for example, is now presented in multiple different ways: through dynamic apps for tablet computers and mobile phones; on web sites; as monolithic or modular documents. Each type of publication asks for a slightly different set of data, combined and presented in different ways. Although existing XML and Web technologies can handle the presentational reuse, other applications (such as dynamic ones) require that the content itself be modified to suit the delivery platform.
If publications are seen as views into the centralized, managed data, rather than as documents, these problems can be handled much more easily. Such an approach also provides great flexibility as new publication platforms emerge in future. This is not to say that published views are ephemeral – if important, they can be managed separately, as publications. But the central managed data will not go out of sync, even though it may be versioned and used in variant forms.
For these types of applications, the publishers of data become the user from the perspective of the Data Engineer.
How it All Fits Together
To illustrate the interaction between the Data Engineer, the Data Scientist, and the Statistician/Analyst, we will provide an illustrative scenario. In this scenario, raw input data will be sourced from somewhere – in our example, sensor data, combined with parts information and maintenance logs from inside the organization and from the aircraft manufacturers. The raw input data is not yet in a useful form for the purposes of supporting decision-making. These raw inputs will be explored, transformed for analysis according to some algorithm into a useful data set, and will then be analyzed to produce a final result. It is this final result which is useful to the organization conducting this process (typically in the form of graphs, tables, charts, or other data visualization). In our case, the question to be explored regards parts failure in different fleets of aircraft: does the same part perform differently with an A310 than it does with an equivalent Boeing aircraft? (There are many related questions regarding parts failure analysis.)
This is the type of quantitative analysis process which is often used by researchers, corporations, and government agencies, each for their own specific purposes, to inform executive decision-making.
This process is presented in the form of a BPMN diagram below. For those unfamiliar with BPMN, the square boxes represent process steps, the solid arrows the process flow. The cylinders are data stores, and the dotted arrows with open heads associate the data stores with the process steps which use them and the data they contain. The dotted arrows with closed triangular heads indicate messages being sent.
The raw input data will exist in some form, which will be extracted and then loaded into a storage system such as a database. A notification is then sent to the Data Scientist. The creation and implementation of such a system is the job of the Data Engineer. From here, it will be accessed by the Data Scientist, who will work with the data experimentally until an algorithm for extracting a suitable data set for analysis can be identified (the Explore Inputs and Develop Algorithm steps). This results in a notification being sent back to the Data Engineers that the algorithm is ready for use.
The Data Engineers will have developed a system for running the algorithm against the input data, producing an analysis data set, which is then itself stored in a database or repository (the Analysis Data Store). When requested by the Statistician or Analyst, an Analysis Data Set is extracted from the Analysis Data Store, and prepared for use in whatever form the Statistician or Analyst needs (SAS, SPSS, Excel, etc.). These are the Extract Analysis Data and Transform Analysis Data steps, also performed by the Data Engineer; the transformed data is saved in the Analysis Data Store, and a message is sent to the Analyst that the data is ready for them in the appropriate format.
Now provided with an Analysis Data Set, the Statistician or Analyst can process and model this data, producing a final result. The final result is then taken and stored by the Data Engineers in a database or other repository (the Analysis Results Store) from where it can be accessed by the decision makers. We end our process flow there, although there would be additional processes downstream which use the final results – they would be published to decision-makers, or otherwise accessed/used in various ways.
Note that these different roles are not job titles – they are simply roles performed by someone (or by an automated system implemented by the Data Engineers). In many organizations today, the function of Data Scientist and Statistician/Analyst are performed by the same individuals, and the Data Engineering piece is performed by IT staff. This in no way alters the utility of separating the roles conceptually.
The importance of this separation is that the identification of algorithms and methods is not a function of data management. The analysis of data is not a function of data management. For these functions to happen, however, there must be a data management function, and one which involves the implementation of systems for data management. This is the job of Data Engineers. Some of the challenges of managing data – and some solutions – are presented below.
Data Management Challenges and Solutions
There are many different data management functions, depending on each particular context: different organizations have different management needs as regards the use of data, and also have different types of data. What we present here is a list of typical functions which occur across many different organizational contexts.
In each case, there are Data Engineering solutions which can address the needs of each function. These are discussed here only in general terms – many of the solutions are common IT practices which may be sufficient for the use of data – a good example of this is normal SQL database Extraction, Transformation, and Load (ETL) which is well-known within many IT departments today. ETL provides highly structured data loaded into a good storage repository, where it can be accessed, queried, and used. In other cases, solutions are not as common or well-understood, especially when dealing with non-traditional types of data, and require new technologies (Big Data is a good example).
Data must come from a source, whether it is from instruments such as sensors or from human-driven activities such as writing, recording, surveys and polls, etc. If these activities are to produce meaningful and useful data, then the raw inputs need to be stored in some useful system, and their inherent formats and structures must be understood. Data collection may involve transformations, to make the data storable. Data collection may also produce a large amount of useful metadata (see Understanding Data, below).
A good example of data collection is traditional relational database ETL: the data is loaded into a known structure with a high degree of granularity, and may have been subjected to validation and cleaning processes. It is queryable and readable, with each bit of data identified and potentially versionable.
It should be noted that the many different types of data collection all have different requirements – what is known about a census (one form of data collection) is very different from what might be known about data coming from a medical sensor. What they have in common is that both types of data require specific metadata in order to make the collected data meaningful.
There are many types of data storage: relational databases, object databases (including databases made specifically to handle XML), file servers, RDF triple and quad stores (for use in the Web of Linked Data/Semantic Web), non-SQL databases and Big Data platforms, etc.
Another type of data storage is the distributed storage system, where multiple data stores are made to behave as a single database, using a centralized index or catalog. This type of storage system is often referred to as a registry or registry-repository. When designing systems, it is modern practice either to centralize the storage of data in a single data warehouse for the entire enterprise, or to create a virtual centralized data store using a registry.
When implementing systems which allow for a “single source of truth” – that is, having only one canonical instance of any data item, and using it by reference elsewhere – identification of data items and the higher-level objects which reference them becomes critical. The challenge is that different systems use their own identifiers, but cannot necessarily work with the identifiers exposed by other systems, even though a universal system of identifiers across the enterprise is needed. Another challenge is that legacy documents often do not assign sufficient identifiers to treat the data they contain as a set of identified data items.
These challenges can and have been met by many different types of repositories and registries, but within organizations and systems today, they typically are not. Identification of data is, however, a critical part of data management, and therefore Data Engineering.
Something at which many organizations fail is the appropriate versioning of their data items and data in general. While the versioning of documents is handled well by many systems (especially document management systems such as Documentum) they are not necessarily good at managing lower-level versioning, especially when multiple variants of data items (that is, data items which extend or modify similar data items) are in use. Versioning implies that there must be one or a set of owners of the data, and that they control any changes made to data items, producing new versions.
There are many solutions to this problem, but it is an essential part of good data management. As such, it is an essential part of Data Engineering.
All of the different people who interact with data – authors, users, developers, etc. have a need to understand what the data are – they need the documentation which tells them what they are working with. For some types of data – quantitative data sets, for example – documentation is sufficient for the data users. For other types of data, the term for the needed information is metadata. Metadata are the pieces of information we need to know about the data in order to use it effectively, and for the purpose of automation. Many systems today hold metadata at the document level and at the file level. To fully leverage the potential of the data, however, the metadata too must exist at the data item level. This can be a challenge, as in some cases the needed metadata simply do not exist, and must be created, which can be a resource-intensive proposition.
There are answers to this problem, however – it is often possible to capture metadata programmatically as the data themselves are collected or created. The idea of “upstream metadata capture” is one that has been implemented effectively for some types of data. Another useful approach is metadata mining, in which programs can extract metadata from the data themselves. Again, although sometimes difficult, the existence of good metadata at the data item level can be a very powerful ingredient for the automation of data management processes. As such, it is important to the Data Engineer’s function.
In all cases – and especially in large organizations with a lot of data – one challenge users face is the location of the data they need. If good storage systems are in place, with appropriate identification and versioning, it is possible to index, classify, and catalog data to make this task easier. Many existing technologies (search engines such as ElasticSearch and Solr, for example) can help with this requirement. There are also cataloging tools which make the deployment of enterprise portals easier. Other approaches based on RDF and Semantic Web technologies can also help in the location of needed data. The correct solution here is very much dependent on the type of data in question. The ability to search for, navigate, and locate needed data is a critical part of the overall functionality provided by good data management systems.
Once located, data needs to be accessed. This is relatively easy, except for those cases where data is sensitive, and access must be restricted only to those with the correct privileges to see the data. Many access-control solutions exist, so this is not a difficult requirement to meet, but it must always be taken into account. In many cases, data will be manipulated in order to make it safe for general use, a process which may involve loss of data.
There is more to data access than just access control, however. While data should be managed in a single, canonical form, there is also a requirement that it be provided in a format which is useful to the user. For a data analyst, this might be in the form of databases used by analysis packages (SAS, SPSS, Stata, R, Excel, etc.). The ability to provide data in useful forms requires good, data-item level metadata.
There are solutions to this problem, but they are not yet well-known, and do not always work well – this is an emerging part of the Data Engineering space. In some cases (as with image files) the needed transformations are well-established. In other cases (quantitative data, word-processing formats) the challenge may be more difficult, as transformations are not always lossless. Regardless, providing access to data in a useful form is a requirement for good data management systems.
Cleaning, Editing, and Processing Data
There are many places within a data management system where automated or programmed processing takes place. This may involve data validation on input, data editing and checking, data cleaning (to identify duplicates, missing data, violation of rules, etc.) and many other functions. For raw input data coming from external sources, or even from disparate sources within the enterprise, data validation and cleaning are almost always necessary.
Again, many solutions exist to provide this functionality: the XML-related technology Schematron is a good example, but many of the data analysis packages can also be used to provide this functionality. There are also many data warehousing tools and techniques for data validation. The ability to validate and clean data programmatically can be a major efficiency gain, but it requires the existence of well-managed data and metadata.
One of the major benefits of managing data is that it can be reused, providing a high degree of consistency and a consequent rise in the overall quality of the data. A centralized (or virtually centralized) data store with appropriate identification and versioning policies provides a solid platform for data reuse. The reuse of data is a key weapon in the Data Engineer’s armory, as it provides immediate and tangible benefits. Technologies such as XML are designed to support reuse, making this relatively easy to implement.
Publishing and Visualizing Data
It is always the case that – in order to support data-driven decision making – a data management system must provide a useful front end for those interacting with the data. There are many good tools for data visualization: publishing technologies for text and tables, graphing and charting packages for quantitative aggregate data, Geographic Information Systems (GIS) for spatial data, etc.
Often, enterprises already have good solutions for publishing their data. The benefit of using a Data Engineering approach to this functionality is that it becomes easier to provide additional ways of viewing data as technology advances: providing suitable data views for mobile devices, for example, instead of the PDFs which work well for larger screens; or providing a queryable interface so that users can assemble and configure the view they want or need, instead of relying on one which is pre-designed. This flexibility in the ability to publish data is a benefit of using a Data Engineering approach.
As technology has advanced from being file-oriented to being document-oriented, we have seen the development of some powerful tools and approaches, especially when we consider the structured documents made possible by such technologies as XML. Now, however, we have moved beyond even the structured documents provided by a Document Engineering approach, into a world where data, big and small, is the central point of focus. We are no longer looking at documents which have been marked up only with structural tags, but at documents into which tags have also been injected to identify the data they contain, allowing it to be identified, versioned, and managed. This is the domain of the Data Engineer.
In order to take advantage of this paradigm shift, it is most important to recognize the difference in how we organize, manage, and use our data, and also to recognize the different roles involved in leveraging the potential benefits. A Data Engineering approach is one that embraces the power of good, granular data management, and provides systems that are less expensive to maintain, more flexible in terms of how data is delivered to users, more efficient through improved automation, and supportive of better-informed decision-making.