2.1 Data Sources

Knowing where a dataset comes from, and how it was collected, is crucial for determining whether it will allow us to derive the insights we need.

Example: Ahmed is a consumer scientist working for Freshdale, a South African dairy company. His job is to determine whether Freshdale should invest more money into producing milk, cheese or yoghurt. For each of the datasets below, discuss if it is appropriate to answer this question.

A spreadsheet listing the current production of milk and yoghurt from the farms that supply Freshdale.
A dataset on trends in lactose intolerance in South Africa.
Results from a survey on dairy product consumption in South Africa conducted in 2024, which targeted the rural population in KwaZulu-Natal.
Results from a survey on milk, cheese and yoghurt consumption in Angola conducted in 2024, targeting the entire population.
Results from a survey on dairy product consumption among Freshdale’s consumers conducted in 2021.

Possible answers:

The spreadsheet will be informative regarding the current production of dairy products. However, this will not tell Ahmed what people are actually consuming. It may thus not be helpful to determine the demands of the market.
A dataset on lactose intolerance would not be useful, since lactose-intolerant people are not (typically) included in the market for dairy products. (Lactose-free dairy products are becoming increasingly common, but determining the market need for these products is not the question Ahmed is trying to answer.)
Although this information is quite recent and was collected in South Africa, the survey targeted the rural population in KwaZulu-Natal only. The needs of this community may differ from the needs of the suburban and urban populations in KwaZulu-Natal, and from those of the rural, urban and suburban populations in other provinces.
Although this survey is recent and targets the entire population, including urban, rural and suburban communities, Ahmed would have to account for the fact that the dairy market in Angola may be quite different from the dairy market in South Africa.
A survey targeted at Freshdale’s consumers would certainly provide the most relevant information out of all of the above options. However, it should be taken into account that the survey is a few years old, and may not reflect the needs of the market in 2025.
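The reasoning in the answers above can be written down as a simple checklist. The sketch below is only an illustration: the class SurveyInfo, the function fitness_concerns, the "representative" flag (a judgement made by the analyst) and the two-year age limit are all assumptions made for this example, not an established method. It covers the three surveys only, since they come with a stated year and location.

```python
from dataclasses import dataclass


@dataclass
class SurveyInfo:
    """Basic facts about a survey (hypothetical structure for illustration)."""
    name: str
    country: str
    year: int
    population: str
    representative: bool  # analyst's judgement: does the sample reflect the target market?


def fitness_concerns(survey, target_country="South Africa",
                     target_year=2025, max_age_years=2):
    """Return a list of reasons the survey may not fit the research question."""
    concerns = []
    if survey.country != target_country:
        concerns.append(f"collected in {survey.country}, not {target_country}")
    if not survey.representative:
        concerns.append(f"covers only {survey.population}")
    age = target_year - survey.year
    if age > max_age_years:  # assumed cut-off; choose one that suits your study
        concerns.append(f"{age} years old")
    return concerns


surveys = [
    SurveyInfo("KwaZulu-Natal survey", "South Africa", 2024,
               "the rural population of KwaZulu-Natal", False),
    SurveyInfo("Angola survey", "Angola", 2024,
               "the entire population", True),
    SurveyInfo("Freshdale consumer survey", "South Africa", 2021,
               "Freshdale's consumers", True),
]

for s in surveys:
    print(s.name, "->", fitness_concerns(s) or ["no concerns flagged"])
```

A checklist like this only flags concerns; deciding whether a flagged dataset is still usable remains a judgement for the researcher.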

In the above example, the data sources were all examples of secondary data sources. The next sections will introduce primary, secondary and tertiary data sources, provide examples of these data sources, and discuss scenarios in which they are appropriate. The last section in this chapter will present examples of where to find open data online.

2.1.1 Primary Data Sources

Primary data sources are collected first-hand for the purpose of a specific study. Examples of primary data sources include:

Surveys: A survey that is created by a researcher to answer a specific question. In the example of Freshdale, Ahmed could create a primary data source by conducting his own survey on dairy consumption.
Interviews: In-depth interviews with individuals to determine their opinions and experiences.
Experiments: Laboratory experiments that are tailored to answer a specific question.
Field Data: Observations of data in a natural environment, such as data on animal movement and behaviour.
Direct Measurements: Direct measurements involving the use of a specific tool, like a thermometer to measure patient temperatures, a GPS device to capture the locations of mineral deposits, or a sensor that measures chlorophyll in leaves.

Can you think of other examples of primary data sources?

Remember: A primary data source is always created by the researcher themselves to answer a specific question or questions.

Primary data source pros and cons:

Pros:

Data is specifically suited to the research question
Data is current and up-to-date
Data is accurate and reliable (assuming the researcher followed correct data collection protocols)

Cons:

Data collection can be expensive
Data collection can take a long time
If you as a researcher create a primary data source through data collection, you are responsible for its accuracy, reliability, and the ethics of data collection!

2.1.2 Secondary Data Sources

Secondary data sources are pre-collected by a different researcher for a purpose other than the current research question. In the Freshdale example, the surveys and datasets that Ahmed could choose from were collected by other researchers, in different years and different countries, to answer their own questions.

Consider the survey on dairy product consumption in the rural population of KwaZulu-Natal. Let us say that Ahmed’s colleague Amina created that dataset. For Amina, it was a primary data source, since she collected the data herself in order to answer a specific question. Ahmed did not collect that dataset, and would be using the dataset to answer a different question. Thus, it would be a secondary data source for Ahmed.

Examples of secondary data sources include:

Open datasets provided on websites like Kaggle, the United Nations Office for the Coordination of Humanitarian Affairs (OCHA), and others.
Government datasets and databases (e.g. the census collected by StatsSA).
Market research datasets providing insights into consumer trends, from companies like McKinsey.
Data provided in academic papers.
Social media data, such as tweets scraped from X.
Historical data from archives and records.

Can you think of other examples of secondary data sources?

Remember: A secondary data source was created by another researcher to answer a different research question.

Discuss: If Ahmed conducts a survey on dairy consumption in South Africa in 2025, is it a primary or secondary data source? If Ahmed then uses this data in 2026 to analyse only milk consumption, is it a primary or secondary data source?

Secondary data source pros and cons:

Pros:

Obtaining data is easy
Some datasets could be available for free (e.g. government and open data)
It does not take long to obtain data
Data is usually provided in a usable format

Cons:

Data is not specifically suited to the research question
Data is from the past (you will need to evaluate its relevance in the present)
The accuracy, reliability and ethical nature of the data cannot be controlled by you as the researcher. Although you can clean the data, you cannot go back and change how it was collected.

Note: Secondary data is not guaranteed to be clean, accurate, reliable or ethical. As a researcher using the data, it is still your responsibility to check these aspects before using the data in your own research.
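As a rough illustration of what such checks might look like in practice, the sketch below counts a few common problems in a small made-up dataset. The data, the column names and the function quality_report are all hypothetical, and the checks shown (missing values, impossible negative values, repeated respondent IDs, outdated records) are only a starting point. Whether the data was collected ethically cannot be checked in code and has to be assessed from the documentation of the source.

```python
import csv
import io

# Made-up survey extract, used only for illustration.
RAW = """respondent_id,province,year,litres_milk_per_week
1,KwaZulu-Natal,2024,3.5
2,KwaZulu-Natal,2024,
3,Gauteng,2024,-1
3,Gauteng,2024,-1
4,Western Cape,2019,2.0
"""


def quality_report(rows, value_field, id_field, year_field, min_year):
    """Count basic data quality problems in a list of row dictionaries."""
    seen_ids = set()
    report = {"missing": 0, "negative": 0, "duplicate_ids": 0, "too_old": 0}
    for row in rows:
        value = row[value_field].strip()
        if value == "":
            report["missing"] += 1        # no answer recorded
        elif float(value) < 0:
            report["negative"] += 1       # consumption cannot be negative
        if row[id_field] in seen_ids:
            report["duplicate_ids"] += 1  # same respondent appears again
        seen_ids.add(row[id_field])
        if int(row[year_field]) < min_year:
            report["too_old"] += 1        # older than the chosen cut-off year
    return report


rows = list(csv.DictReader(io.StringIO(RAW)))
report = quality_report(rows, "litres_milk_per_week", "respondent_id",
                        "year", min_year=2023)
print(report)
```

The cut-off year of 2023 is an arbitrary choice for this example; an appropriate value depends on the research question.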

2.1.3 Tertiary Data Sources

Tertiary data sources summarise and describe the information contained in primary and secondary data sources. They can provide a very useful starting point to study, understand and research a particular topic. However, they cannot be used to answer research questions in the same way as primary and secondary data sources.

Examples of tertiary data sources include:

Textbooks that explain and summarise existing information on a particular topic. (These notes are an example of a tertiary data source!)
Abstracts that summarise the contents of a research paper or thesis.
Review articles that present and discuss the latest research on a particular topic.
Encyclopedias that provide an overview of certain topics.
Databases and search engines like Google Scholar or library databases that contain links to books and papers on specific topics.

Can you think of other tertiary data sources?

Remember: A tertiary data source is a summary of primary or secondary data sources.

In the Freshdale example, Ahmed might want to read a review paper discussing the latest research on dairy consumption.

2.1.4 Where Can I Find Data?

The list below describes some sources of open international data.

Kaggle: A popular platform for data science enthusiasts that hosts a wide range of datasets on topics like social media, climate change, and more. It also includes competitions and collaboration opportunities for machine learning projects. URL: https://www.kaggle.com/
Data.gov: A comprehensive portal managed by the U.S. government offering datasets across various fields like health, education, and transportation. It provides data in formats like CSV, JSON, and XLSX to promote transparency and innovation. URL: https://data.gov/
World Bank Open Data: Provides access to global development data, including indicators on poverty, health, education, and economics. Tools for visualisation and analysis are also available. URL: https://databank.worldbank.org/
Google Trends: Allows exploration of search trends across time and regions, which can be useful for understanding consumer behaviour or tracking specific phenomena. URL: https://trends.google.com/trends/
OpenStreetMap: Offers free geographic data like maps, street layouts, and points of interest, suitable for research in urban planning and transportation. URL: https://www.openstreetmap.org/
UN Data: A platform with diverse datasets on global issues like population, education, and health, curated by the United Nations. URL: https://data.un.org/
Climate Data Online (NOAA): Focused on climate-related datasets, this platform includes historical weather data, marine data, and more for researchers and policymakers. URL: https://www.ncei.noaa.gov/cdo-web/
European Data Portal: For those interested in European datasets, this platform aggregates data on demographics, economics, and other regional statistics. URL: https://data.europa.eu/en

The list below describes some sources of open African data.

World Bank Open Data - Africa Development Indicators: This database provides a detailed collection of macroeconomic, sectoral, and social indicators for 53 African countries, with data spanning multiple decades. Topics include poverty, education, and social development. It is a rich resource for monitoring development programs in the region. URL: https://databank.worldbank.org/source/africa-development-indicators
African Data Portal by NBER: The National Bureau of Economic Research hosts a portal specifically for datasets on Sub-Saharan Africa, systematising public-use data. This includes diverse economic, demographic, and health data that are accessible for researchers. URL: https://www.nber.org/research/data/portal-public-use-datasets-sub-saharan-africa
Humanitarian Data Exchange (HDX): Managed by the UN Office for the Coordination of Humanitarian Affairs, HDX provides open datasets on humanitarian issues across Africa. Topics include conflict, health, food security, and migration. URL: https://data.humdata.org/
Global Biodiversity Information Facility (GBIF): GBIF offers biodiversity data with extensive records on species found in Africa. This resource supports research in conservation, climate change, and ecosystem services. URL: https://www.gbif.org/