1.7 Data analysis – Foundations and Concepts

Data analysis is the process of systematically using statistical techniques to explore, clean, transform and model data with the purpose of discovering useful information which can then be used to support decision-making.

1.7.1 The Data Analysis process

Data analysis can be viewed as a sequence of steps that combines some of the data literacy skills we have already discussed in this book. This process can be summarized as follows:

  1. Define the study

    The first and most important step in the data analysis process is to clearly define the problem that the analysis aims to address by stating its objectives and the specific questions it aims to answer.

  2. Data collection

    Once the study is defined, the next step is to collect the relevant data. This can be done using various methods to be discussed in Chapter 4. The choice of method used to collect data depends on the nature of the problem and the questions being asked.

  3. Data cleaning

    After collecting the data, the next step is to clean it. This involves identifying and rectifying errors, missing values and inconsistencies in the data. Data cleaning will be further discussed and practically demonstrated in Chapter 2.

    Note: The second and third steps are part of data management. See Section 1.3 for more details.

  4. Data exploration

    After cleaning the data, we conduct a preliminary analysis to understand the characteristics of the data. See Section 1.5 for more details on how this is done in practice.

  5. Data transformation

    Following from the results of the exploration phase of the data analysis, the data is prepared for analysis by encoding categorical variables, scaling (normalizing or standardizing) some numerical variables and, if necessary, handling outliers.

  6. Data modelling (analysis)

    Now the data is ready for the actual analysis. This step involves using statistical and mathematical techniques on the data to discover patterns, relationships, similarities or trends.

  7. Interpretation and visualization

    After the data analysis, the next step is to interpret the obtained results and present them in a manner that can be easily understood. This can be done through visualization.

1.7.2 Approaches to Data Analysis

There are different types of data analysis and each one serves a unique purpose. The choice of which one to use will depend on the nature of your study and what kind of questions you seek to answer.

  1. Descriptive analysis is used to describe and summarize the collected data to understand what happened in the past. For example, a university might use descriptive analysis to find out how many first-year students passed last year.
  2. Diagnostic analysis follows descriptive analysis by going a step further to explain or diagnose why something happened. For example, if the number of first-year students that passed last year dropped significantly, diagnostic analysis can be used to find out why.
  3. Predictive analysis is used to forecast or predict what might happen in the future based on historical data. For example, a university can use predictive analysis to predict the number of first-year students that will pass next year. In other words, based on what has happened in the past we can find out what could happen in the future.
  4. Prescriptive analysis is used to recommend a course of action to take in order to reach a desired outcome. For example, if a university wants to increase the pass rate for first-year students, a prescriptive analysis might suggest the best course of action for reaching this outcome.

1.7.3 Uses of data analysis in modern society

Data analysis is a very important data literacy skill that finds application across various fields and domains, such as:

  • Marketing research, where it can be used to help businesses understand market trends and consumer preferences, and to identify opportunities for product development.

  • Medical diagnosis, where it can be used to interpret medical images (e.g. MRI scans) and to assist in the early detection of disease.

  • Medical drug discovery, where it is used by pharmaceutical companies, such as Johnson & Johnson, to develop drugs by conducting clinical trials and testing their effectiveness.

  • Fraud detection, where it can be used by banks to identify unusual transaction patterns and detect fraudulent activities.

  • Risk management, where it is used by financial services companies to assess a client’s credit risk (i.e. whether they will be able to repay a loan) and to model risks in the forex and stock markets.

  • Quality control, where it is used to monitor and control the quality of products on the production line.

  • Social science research, where it is used to analyze survey data to study overall human behavior and sentiment.

  • Recommendation systems, where it is used by platforms such as Spotify and Netflix to recommend music or shows that you might like based on the content that you viewed in the past.

  • Environmental monitoring, where geographical (remote sensing) data is used to monitor ecological changes such as deforestation, water quality and air pollution.

1.7.4 Data analysis techniques

There are many techniques used in data analysis, and each one serves a unique purpose and application. In Section 1.5, we discussed data exploration, the most basic technique for data analysis. In this section, we will briefly discuss some of the most commonly used and emerging techniques.

  1. Correlation analysis

    Correlation analysis is a technique used to understand the linear relationship between two or more numerical variables. A simple measure that is commonly used to describe this relationship is the correlation coefficient, usually denoted by \(r\). The correlation coefficient quantifies the strength of the linear relationship between two numerical variables. The value of \(r\) always lies between \(-1\) and \(1\). Values close to \(-1\) or \(+1\) indicate a strong linear relationship, while values close to zero, say between \(-0.5\) and \(0.5\), indicate a weak linear relationship. As an example of the use of the correlation coefficient, suppose we want to know the strength of the relationship between a student’s matric final mark and their final mark at the end of their first year of study at a university. Given a correlation coefficient of \(r=0.93\), we can say that there is a strong positive linear relationship between a student’s matric final mark and their first-year final mark. In other words, a larger matric final mark is strongly associated with a larger first-year final mark. Finally, note that correlation measures the linear association between two numerical variables and not necessarily causality. In other words, a high correlation between two variables does not mean that changes in one variable cause changes in the other.
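To make the idea concrete, the sketch below computes \(r\) for a small set of hypothetical (matric mark, first-year mark) pairs using NumPy; the marks are invented for illustration.

```python
import numpy as np

# Hypothetical (matric final mark, first-year final mark) pairs for six students.
matric = np.array([62, 70, 75, 80, 85, 91])
first_year = np.array([58, 66, 74, 77, 86, 90])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(matric, first_year)[0, 1]
```

Because the invented marks rise almost in lockstep, \(r\) comes out close to \(+1\), indicating a strong positive linear relationship.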

  2. Regression analysis

    Regression analysis is a statistical technique used to understand the dependence of one variable, known as a dependent variable, on one or more other variables, known as independent variables. It is commonly used for predictive analysis. A simple and widely applicable approach to regression analysis is the least squares line. Consider two numerical variables \(x\) and \(y\) which are assumed to follow a straight line pattern. The relationship between \(x\) and \(y\) can be described using a straight line given by the equation

\[\begin{equation} y=A+Bx \tag{1.1} \end{equation}\]

where \(y\) is the dependent variable, assumed to depend on \(x\), known as the independent variable. The term \(A\) is the y-intercept and \(B\) is the slope, which represents the change in \(y\) for every unit increase in \(x\). For a given sample of \((x,y)\) points, we can obtain an estimate of (1.1), given by,

\[\begin{equation} \hat{y}=a+bx \tag{1.2} \end{equation}\]

where \(a\) and \(b\) are estimates of \(A\) and \(B\), respectively, obtained by the method of least squares. Hence, equation (1.2) is known as the least-squares regression line.

Equation (1.2) can be used to:

  • describe the dependence of \(y\) on \(x\) allowing us to learn more about the process that produces \(y\).
  • comment on the type of linear pattern between \(x\) and \(y\) (whether it is positive, \(b>0\), negative, \(b<0\), or absent, \(b=0\)).
  • measure the influence that \(x\) has on \(y\) based on the magnitude of the value of \(b\).
  • predict the future value of \(y\) for a given value of \(x\).

As an example of the use of the least-squares line for regression analysis, consider the least-squares line \(\hat{y}=60+5x\) estimated from a data set on student population (in 1000s), \(x\), and quarterly pizza sales (in R1000s), \(y\), for a sample of 10 restaurants located near university campuses. The following points can be made about the fitted least-squares line:

  • \(b=5>0\) which implies that as student population, \(x\), increases, quarterly sales increase.

  • \(a=60\), which means that for a restaurant with no nearby student population (that is, \(x=0\)), the predicted quarterly sales are R60 000.

Lastly, we can use the least-squares line to predict the quarterly sales for a given size of the student population. For \(x=16\), representing 16000 students, the quarterly sales are predicted to be \(\hat{y}=60+5(16)=140\) or \(R140000\).
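The estimates \(a\) and \(b\) can be computed directly. The sketch below uses NumPy's `polyfit` on illustrative data invented so that the fitted line matches \(\hat{y}=60+5x\) from the example above.

```python
import numpy as np

# Illustrative data: student population (in 1000s) and quarterly sales (in R1000s),
# invented so that the least-squares fit matches y-hat = 60 + 5x.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

# polyfit with degree 1 returns the slope b and intercept a of the least-squares line.
b, a = np.polyfit(x, y, 1)

# Predict quarterly sales for a campus with 16 000 students (x = 16).
y_hat = a + b * 16
```

With this data the fit gives \(a=60\) and \(b=5\), so the prediction for \(x=16\) agrees with the hand calculation above.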

  3. Cluster analysis

    Cluster analysis is used to group a set of objects or entities in such a way that objects in the same group (or cluster) are more similar to one another than to objects in other clusters. It is commonly used in recommendation systems and market research to find consumers with similar preferences.
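The best-known clustering algorithm is k-means. The minimal sketch below implements it with NumPy on hypothetical customer data (age, monthly spend in R1000s); the data and the choice of two clusters are assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """A minimal k-means sketch: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid, then nearest-centroid labels.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# Two well-separated groups of hypothetical customers (age, monthly spend in R1000s).
group_a = np.array([[20, 1.0], [22, 1.2], [21, 0.9]])
group_b = np.array([[60, 8.0], [62, 8.5], [59, 7.8]])
labels, _ = kmeans(np.vstack([group_a, group_b]), k=2)
```

Because the two groups are well separated, the algorithm recovers them exactly: the first three customers share one label and the last three share the other.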

  4. Dimension reduction

    As the name implies, this technique is used to reduce a large number of variables to a smaller set of variables in such a way that the remaining variables capture as much as possible of the information in the original variables. It is commonly used in conjunction with some of the techniques already mentioned, such as cluster analysis for medical diagnosis.
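A standard dimension-reduction technique is principal component analysis (PCA). The sketch below performs PCA via the singular value decomposition on synthetic data in which the third variable is an exact combination of the first two, so two components capture essentially all of the information.

```python
import numpy as np

# Synthetic data: 3 measured variables, but the third is a combination of the
# first two, so two principal components capture (almost) everything.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X = np.column_stack([x1, x2, x1 + x2])

# PCA via the singular value decomposition of the centred data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per component

# Project onto the first two principal components: 3 variables -> 2.
X_reduced = Xc @ Vt[:2].T
```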

  5. Hypothesis testing

    This technique is used to make inferences about population characteristics (such as the mean) using sample data. It is commonly used in the control and monitoring of quality in a production process.
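As a concrete sketch in the quality-control setting, the one-sample t statistic below measures how far a sample mean lies from a hypothesised mean of 500 ml on a bottling line; the fill volumes are hypothetical.

```python
import math

# Hypothetical fill volumes (ml) sampled from a production line; we test
# H0: the true mean is 500 ml against H1: it is not.
sample = [498.2, 501.1, 499.5, 500.4, 497.9, 500.8, 499.0, 500.2]
n = len(sample)
mean = sum(sample) / n
# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# The one-sample t statistic: how many standard errors the sample mean
# lies from the hypothesised mean of 500.
t = (mean - 500) / (sd / math.sqrt(n))
```

Here \(|t|\) is small (well under typical critical values), so this sample gives no reason to doubt that the line is filling to 500 ml on average.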

  6. Time series analysis

    This technique is used for the analysis of time-series data, that is, data collected over time. It is commonly used for forecasting (predictive analysis) and for understanding trends over time.
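One of the simplest time-series forecasting methods is the moving average: the forecast for the next period is the mean of the last few observations. The monthly sales figures below are hypothetical.

```python
# A minimal moving-average forecast: the forecast for the next period is the
# mean of the last `window` observations.
def moving_average_forecast(series, window=3):
    return sum(series[-window:]) / window

monthly_sales = [100, 104, 109, 113, 118, 122]
forecast = moving_average_forecast(monthly_sales)  # mean of 113, 118, 122
```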

  7. Sentiment analysis

    This technique is used to extract the emotional tone (negative, positive or neutral) behind text data. It is commonly used to understand customer feedback.
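A toy lexicon-based sketch of sentiment analysis: count positive and negative words and compare the counts. The word lists are invented for illustration and are far smaller than a real sentiment lexicon.

```python
# Illustrative word lists (not a real sentiment lexicon).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}

def sentiment(text):
    """Classify text as positive, negative or neutral by counting lexicon words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```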

  8. Spatial data analysis

    This technique is used for the analysis of geographical (remote sensing) data, that is, data with a spatial component (e.g. geographic location represented by coordinates). A prominent use of this technique is in disease tracking, where it is used to identify hotspots.
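A minimal spatial-analysis sketch: use the haversine formula to count reported cases within 50 km of a clinic, flagging a possible hotspot. The coordinates and the 50 km radius are assumptions for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    R = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# Hypothetical reported cases (latitude, longitude); count those within
# 50 km of a clinic to flag a possible hotspot.
clinic = (-26.20, 28.04)  # roughly Johannesburg
cases = [(-26.25, 28.10), (-26.10, 27.95), (-29.85, 31.02)]  # last is near Durban
nearby = sum(haversine_km(*clinic, *c) <= 50 for c in cases)
```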

1.7.5 Computational tools and software for data analysis

Modern data analysis is carried out using sophisticated computational tools and software that cater for different needs and levels of expertise. These include:

  1. Python

    Python is a popular high-level, general-purpose programming language that can be used for a variety of tasks, including data analysis. It is relatively easy to learn and has a range of libraries (such as pandas, NumPy and Matplotlib) that make it a favorite data analysis and visualization tool among data analysts and data scientists.

  2. R programming language

    R is an open-source (free) programming language developed specifically for statistical computing and visualization. Over time, R has evolved to offer a wide set of capabilities, such as statistical software development and scientific text editing. It is popular among statisticians because it features tools for hypothesis testing, correlation analysis and regression analysis, among many others.

  3. SQL (Structured Query Language)

    SQL is a language for managing and manipulating data that is stored in databases.
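The sketch below runs a typical SQL query through Python's built-in sqlite3 module against a small in-memory database; the `transactions` table and its contents are hypothetical.

```python
import sqlite3

# A small in-memory database with a hypothetical `transactions` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (client TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [("A", 120.0), ("A", 80.0), ("B", 300.0)])

# A typical SQL query: total spend per client, largest first.
rows = conn.execute(
    "SELECT client, SUM(amount) AS total "
    "FROM transactions GROUP BY client ORDER BY total DESC"
).fetchall()
```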

    Note that it is not necessary to know programming in order to do data analysis. The following are the most popular non-programming data analysis tools used in industry:

  4. Excel

    Microsoft Excel is a spreadsheet application that is widely used for data analysis because of its ease of use. It offers a range of features for data collection (the Sampling tool), exploration (the Descriptive Statistics tool), modelling (the Regression tool) and visualization (the Charts tools).

  5. SAS (Statistical Analysis System)

    SAS is an advanced, license-based software suite developed specifically for statistical data analysis and visualization. It is made up of procedures that can perform tasks such as data exploration (PROC MEANS), hypothesis testing (PROC TTEST) and many others.

  6. Power BI

    Power BI is a powerful business analytics tool developed by Microsoft. It enables users to create interactive visualizations with self-service business intelligence capabilities. Power BI is used to transform raw data into useful insights that are easy to understand through dashboards and reports.

  7. Tableau

    Tableau is a business analytics tool used to create interactive and shareable dashboards that show trends, variations and densities for important day-to-day business metrics through charts and graphs.

1.7.6 Exercises to Section 1.7

Question 1

For each of the following case studies, specify which type of data analysis is appropriate:

a. Eskom wants to reduce overall electricity waste and improve the stability of the national electricity grid. To achieve this, they want to propose energy-saving strategies to its customers.

b. Absa has noticed a decrease in their corporate clients over time. They want to know what could be behind this.

c. In order to inform her decision on how much warm clothing to bring on her trip to Essen, Germany in January, Renate wants to know what the average temperature will be in Essen in January.

d. In order to inform their decision on the interest rate at their next meeting, the South African Reserve Bank wants to know what the inflation rate will be in the next 12 months.

e. A general practitioner (GP) wants to know how many COVID-19 patients she treated between 2020 and 2022.

Question 2

For each of the following case studies, specify which data analysis technique(s) is/are appropriate:

a. Given past data on electricity consumption, Eskom wants to determine the amount of electricity that will be consumed in the next winter season.

b. A plant physiologist wants to understand the dependence of plant growth on factors such as water availability, temperature and soil nutrient levels.

c. Suppose that you want to study the response of a plant to changes in temperature, drought and salinity.

d. A biochemist wants to classify proteins with similar structural characteristics.

e. The World Health Organization (WHO) wants to identify geographic hotspots for the monkeypox disease.

f. The WHO wants to understand public opinion about a new strain of virus from social media posts.

g. In order to inform their decision on the interest rate, the South African Reserve Bank (SARB) wants to forecast the average value of the rand per US dollar in the next 12 months.