1.3 Data Management - Foundations and Concepts

Data management is the practice of collecting, organizing, preparing, protecting and storing data so that it can be used efficiently, securely and cost-effectively in decision-making. In modern society, data of many types is generated in large volumes, from a variety of sources and at unprecedented speed, so a robust data management solution is needed to extract meaningful and enduring value from it. Data management also eases data migration and transformation and supports regulatory compliance.

1.3.1 Data management process

The data management process is made up of the following components:

  • Data collection is the process of gathering data on the variables of interest for a particular study from the relevant data sources. It typically involves sampling (see Chapter 4).

  • Data organization involves integrating different types of data, such as structured and unstructured data (see Chapter 2). When the integrated data is kept in a central repository, this process is also referred to as data warehousing.

  • Data preparation involves cleaning and transforming raw data into a form that is suitable for further processing and analysis. It is important for identifying and removing errors and duplicates and for filling in missing values, which improves the accuracy and quality of the data. Data preparation is also known as data wrangling (see Chapter 3).

  • Data governance involves, among other things, the processes and practices used to ensure data protection, security and privacy. Data protection includes safeguarding the data and restoring important information in the event of, say, a data breach. Data security refers to safeguarding the data against theft, corruption and unauthorized access. Data privacy refers to safeguarding the collection, use and disclosure of personal and sensitive data in order to comply with policy and regulation, such as the Protection of Personal Information Act (POPIA) of South Africa.

  • Data storage involves the retention of the data for future access. In modern society, data is usually stored in a digital format, for example in an SQL database or a spreadsheet. The data files are kept on a personal computer or, in the case of large volumes of data (so-called Big Data), on remote servers, an arrangement commonly known as cloud storage.
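
The data preparation step described above can be illustrated with a minimal Python sketch. The records, the field names and the choice to fill a missing age with the mean of the known ages are all assumptions made for illustration; Chapter 3 covers data wrangling properly.

```python
from statistics import mean

# Hypothetical raw records: one duplicate, one missing age, inconsistent name formatting.
raw_records = [
    {"id": 1, "name": " Thandi ", "age": 34},
    {"id": 2, "name": "Sipho", "age": None},
    {"id": 1, "name": " Thandi ", "age": 34},
    {"id": 3, "name": "lerato", "age": 28},
]

def prepare(records):
    """Clean names, drop records with a repeated id, and fill missing ages with the mean."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen_ids:  # duplicate: keep only the first occurrence
            continue
        seen_ids.add(rec["id"])
        cleaned.append({
            "id": rec["id"],
            "name": rec["name"].strip().title(),  # standardize whitespace and capitalization
            "age": rec["age"],
        })
    # Fill missing ages with the mean of the known ages (one simple imputation choice).
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    fill_value = round(mean(known_ages)) if known_ages else None
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill_value
    return cleaned

prepared = prepare(raw_records)
for row in prepared:
    print(row)
```

After this step the duplicate record is gone, the names share one format and the missing age has been filled in. Whether mean imputation is appropriate depends on the study; it is used here only because it is easy to follow.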

1.3.2 The benefits of data management in modern society

The core benefits of an effective data management system include:

  • Availability and visibility – effective data management increases the visibility of the data by making it easily accessible. This in turn enables more frequent data-driven decision-making.

  • Reliability – a good data management system supports accurate decision-making by keeping the data reliable and up to date.

  • Security – a good data management system protects the data against loss and against ransomware-type data breaches. Moreover, it ensures that the data is used ethically and within the bounds of policy and regulation.

  • Scalability – a good data management system supports repeatable data queries that build upon each other, which keeps the data up to date and reduces inconsistent or duplicated queries.

1.3.3 The challenges of data management in modern society

As is the case with any useful strategy, there are challenges to effective data management. These include, among others:

  • The size (or volume) of the data collected

    As mentioned in Section 1.1, data is everywhere. Given the size of the data generated today, traditional storage devices with storage capacity of up to gigabytes (GB) are no longer enough. We need data storage infrastructure with capacity up to terabytes (TB: ~1000 GB) and petabytes (PB: ~1000 TB).

  • The speed (or velocity) at which the data is generated

    Since data is collected continuously, second by second, we need sophisticated infrastructure to apply changes quickly and keep the data up to date for future analysis.

  • The variety (or integration) of the data

    Data can come from multiple sources (e.g. social media and drones), in multiple types (e.g. structured and unstructured data) and in multiple formats (e.g. text and video). This requires sophisticated infrastructure for data integration.

  • The veracity (or quality) of the data

    The variety of the data, coupled with the speed at which it is generated, raises concerns about the accuracy and consistency of the information. Duplicates and errors can easily enter the data.

  • Changing rules and regulations

    The storage and use of data must comply with personal data protection rules and regulations, which change over time, while also guarding against cyber-attacks.

  • Data security and privacy

    Organizations must protect sensitive data, comply with data regulations and prevent unauthorized access, while still keeping the data accessible for rapid data-driven decision-making.
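
To make the storage units mentioned under the volume challenge concrete, the short Python sketch below converts a byte count into decimal units (steps of 1000, as in the text). The workload figures (5 000 records per day, 2 MB per record, kept for 10 years) are hypothetical and chosen only to show how quickly storage needs pass the gigabyte range.

```python
# Decimal (SI) storage units, matching the ~1000x steps quoted in the text.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]

def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value at or above 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

# Hypothetical workload: 5 000 new records per day at 2 MB each, kept for 10 years.
records_per_day = 5_000
bytes_per_record = 2 * 1000**2
total_bytes = records_per_day * bytes_per_record * 365 * 10

print(human_readable(total_bytes))  # 36.5 TB
```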

1.3.4 Strategies for data management

The following strategies can address some of the major data management challenges that organizations face in modern society:

  • Data security and access control

    Develop a multi-layered data security system that has a robust role-based access control and data encryption system with clear audit trails.

  • Data integration improvement

    Develop a robust ETL (Extract, Transform and Load) process to extract data from various sources and transform it into a standardized format which can be loaded into a central storage system.

  • Data quality improvement

    Develop data validation rules, data profiling processes (such as analyzing and assessing the data to gain insights into its consistency and completeness) and error detection and correction procedures.

  • Data storage and cost optimization

    Implement a tiered data storage solution with a balance between on-premises and cloud in order to optimize the use of storage and reduce data processing costs.

  • Data speed optimization

    Develop a data indexing system for quick data access. Implement caching¹ strategies to reduce the number of database queries and improve response times.
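
The integration and speed strategies above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the two source extracts, their field names, the table layout and the function names are all hypothetical, and an in-memory SQLite database stands in for the central storage system.

```python
import sqlite3
from functools import lru_cache

# Hypothetical source extracts with inconsistent field names and formats.
branch_rows = [{"client_id": "001", "full_name": "thandi m", "balance": "1,250.50"}]
app_rows = [{"id": 2, "name": "Sipho K", "balance_cents": 99900}]

def transform_branch(row):
    """Convert a branch record to the standard (id, name, balance) layout."""
    return (int(row["client_id"]), row["full_name"].title(),
            float(row["balance"].replace(",", "")))

def transform_app(row):
    """Convert a mobile-app record to the standard (id, name, balance) layout."""
    return (row["id"], row["name"].title(), row["balance_cents"] / 100)

# Extract and transform: bring both sources into one standardized format.
standardized = [transform_branch(r) for r in branch_rows]
standardized += [transform_app(r) for r in app_rows]

# Load: write into a central store (an in-memory SQLite database here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
conn.executemany("INSERT INTO clients VALUES (?, ?, ?)", standardized)
# Index the column used for lookups so that searches can avoid a full table scan.
conn.execute("CREATE INDEX idx_clients_name ON clients (name)")
conn.commit()

@lru_cache(maxsize=128)
def balance_for(name):
    """Return the balance for a client name; repeated calls are served from the cache."""
    row = conn.execute("SELECT balance FROM clients WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None

print(balance_for("Thandi M"))   # first call queries the database
print(balance_for("Thandi M"))   # second call is answered from the cache
print(balance_for.cache_info())  # shows one hit and one miss
```

A cache trades freshness for speed: if a balance changes in the database, the cached value is stale until the cache is cleared (here with `balance_for.cache_clear()`), so caching suits data that is read far more often than it is updated.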

1.3.5 Exercises to Section 1.3

Question 1

What is data management?

Question 2

Data quality refers to:

i.  The security and privacy of the data.
ii. The accuracy and consistency of the data.
iii. The speed at which the data is collected.
iv. The size of the data.

Question 3

The National Health Laboratory Services experienced a data breach which led to delays in processing laboratory tests across public health facilities in Gauteng.

  1. Which one of the following actions should be taken to ensure that the data can be quickly recovered should this happen in the future?

    1. Improve data privacy

    2. Improve data protection

    3. Improve data storage

    4. Improve data security

    5. Optimize data integration

  2. Which one of the following actions should be taken to ensure that patients' personal information is not compromised should this happen in the future?

    1. Improve data privacy

    2. Improve data protection

    3. Improve data security

    4. Improve data storage

    5. Optimize data integration

Question 4

For each of the following scenarios, what do you think is likely to become a challenge in data management? Motivate your answer.

a. TymeBank, a digital bank, is reportedly on-boarding up to 5000 customers every day (that is, about 150000 customers every month).

b. A startup bank insurance (bancassurance) company collects its clients' data from cameras in their homes, smart watches and banking transactions.

c. A major South African bank and a life insurance company decided to merge their businesses. When it came to combining their clients' data, they failed to notice duplicates because of different data formats.

d. The University of Pretoria and University of Johannesburg are collaborating on a clinical trial for a new cancer drug. Researchers on both sides tend to store sensitive data on their personal computers and they use different names for the data files.

e. Researchers at the United Nations Framework Convention on Climate Change are conducting an environmental monitoring study. Their sensors collect data in different formats, they have no backup system for their data and there is no access control on the data.


  1. Caching is the temporary storage of data so that future requests for that data can be served faster.