A Closer Look at COVID-19 Primary Data

Posted June 17, 2020 in Cutter Business Technology Journal

We have all witnessed the COVID-19 pandemic affect not only human health but also the economy, on a massive scale. In the US alone, as of 31 May 2020, over 100,000 people have died of COVID-19, and the economic impact of COVID-19 on the US economy has been reported thus far as US $2.14 trillion. There is no possibility of a vaccine or definite cure for COVID-19 in the short term; indeed, it is estimated that a vaccine will not be ready before 2021 (at the earliest). Strict social distancing has been the primary solution to contain the spread and the impact of the virus. However, social distancing clearly disrupts the economy and our livelihoods in a dramatic way. In such an alarming situation, efficient management of the pandemic is key. Such management includes:

  • Instead of strict social distancing with a lockdown in all locations, selectively identify any hotspots for pandemic recurrence and enforce lockdown in a confined area. This will restrict the economic hardship to the hotspot region without greatly impacting areas that have not seen a recurrence of viral infections.

  • Dynamically manage hospital and other care resources by, for example, arranging for isolation beds, intensive care units (ICUs), and ventilators based on the resurgence of critically ill patients.

  • Selectively open closed-down areas so that the economy can gradually come back to normal.

  • Prescribe the appropriate level of social distancing at local and regional levels, depending on the extent of viral infections.

However, doing all this requires getting data, analyzing the data, and building models. This presents two challenges: (1) collecting the data and (2) building models using the data. In a recent Executive Update, we discussed the existing approaches and techniques employed to address these challenges. 

There are two types of data involving COVID-19: primary data and secondary data. As we explore in this Advisor, primary data relates directly to the pandemic and measures outcomes. Examples of such data include: (1) number of COVID-19-infected people; (2) number of COVID-19 patients admitted to hospital; (3) number of COVID-19 patients in ICU; and (4) number of COVID-19-related deaths. 

Number of COVID-19-Infected People

This data is heavily dependent on the extent of testing. It has been demonstrated across the world that more testing leads to the identification of more infected people. Thus, examining this data without also looking at the actual number of tests performed is of limited use. In addition, testing bias must be considered. Due to the unavailability of test kits in most places, testing is generally being done only on those who report COVID-19-like symptoms. As people infected with COVID-19 may not show any common symptoms, the number of positive cases reported underestimates the actual infection spread. This underestimation is corroborated by the recent antibody testing of 3,000 random New York residents, which estimated that over 20% of the New York City population have antibodies to COVID-19, indicating that a much higher number of people than previously identified had been infected (and most probably had not been tested for infection). US state-level data is available at The New York Times (NYT) website, and county-level data is typically available through state healthcare divisions. For example, the Florida Health Department is releasing state and county-level data. Various research groups such as those from the University of Washington, the University of Florida, and Carnegie Mellon University are accumulating additional state-level data.
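The point about reading case counts alongside test counts can be illustrated with a small calculation. The following is a minimal Python sketch, not part of the authors' analysis: the DailyReport type, the positivity_rate function, and the two regions with their figures are hypothetical and chosen only to show why a raw case count can mislead.

```python
from dataclasses import dataclass


@dataclass
class DailyReport:
    """Hypothetical record of one region's testing results."""
    region: str
    tests_performed: int
    positive_cases: int


def positivity_rate(report: DailyReport) -> float:
    """Share of performed tests that came back positive."""
    if report.tests_performed <= 0:
        raise ValueError("tests_performed must be positive")
    return report.positive_cases / report.tests_performed


# Illustrative, made-up numbers: Region B reports more cases than
# Region A, but it also ran ten times as many tests.
reports = [
    DailyReport("Region A", tests_performed=1_000, positive_cases=200),
    DailyReport("Region B", tests_performed=10_000, positive_cases=500),
]

for r in reports:
    print(f"{r.region}: {r.positive_cases} cases, "
          f"{positivity_rate(r):.0%} of tests positive")
```

In this made-up example, Region B has the larger case count but the lower share of positive tests, which is the kind of difference that case counts alone would hide. The ratio does not correct for testing bias: if only symptomatic people are tested, it still says little about infection in the wider population.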

Number of COVID-19 Patients Admitted to Hospital, Number of COVID-19 Patients in ICU

Hospital admission and ICU admission data come from hospital records, which provide data that is more accurate than the reported number of infected people. The challenge here is data aggregation, especially across multiple hospital systems. In some cases, multiple hospital systems in a region (such as Florida) are sharing this information. However, in the absence of government directives, different hospital systems are reporting this data in different formats, making data aggregation at the country level and across multiple states difficult. Groups at the University of Washington are making data on the number of hospital admissions and ICU admissions at the state level available. Still, accuracy will depend on how frequently hospitals are sharing and exchanging data. If some hospitals are not exchanging data on a regular basis, then the consolidated data will not reflect the true scenario on the ground.
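The aggregation problem described above has two parts: records arrive in different formats, and some may be out of date. The Python sketch below shows one possible way to handle both, under stated assumptions. The two input formats, the field names, the hospital names, and the two-day staleness threshold are all hypothetical; real hospital systems will differ.

```python
from datetime import date, datetime


def from_system_a(record):
    """Hypothetical format A: flat keys and an ISO date string."""
    return {
        "hospital": record["facility"],
        "admitted": record["covid_admitted"],
        "icu": record["covid_icu"],
        "reported_on": date.fromisoformat(record["as_of"]),
    }


def from_system_b(record):
    """Hypothetical format B: nested counts and a US-style date string."""
    return {
        "hospital": record["name"],
        "admitted": record["counts"]["admitted"],
        "icu": record["counts"]["icu"],
        "reported_on": datetime.strptime(record["date"], "%m/%d/%Y").date(),
    }


def aggregate(records, as_of, max_age_days=2):
    """Sum counts across hospitals and list those with stale reports."""
    totals = {"admitted": 0, "icu": 0}
    stale = []
    for r in records:
        totals["admitted"] += r["admitted"]
        totals["icu"] += r["icu"]
        if (as_of - r["reported_on"]).days > max_age_days:
            stale.append(r["hospital"])
    return totals, stale


# Illustrative, made-up records in the two assumed formats.
records = [
    from_system_a({"facility": "Hospital A", "covid_admitted": 40,
                   "covid_icu": 10, "as_of": "2020-05-30"}),
    from_system_b({"name": "Hospital B",
                   "counts": {"admitted": 25, "icu": 5},
                   "date": "05/26/2020"}),
]

totals, stale = aggregate(records, as_of=date(2020, 5, 31))
print(totals, "stale reports from:", stale)
```

Reporting the stale hospitals next to the totals, rather than silently dropping or including them, lets a reader of the consolidated figure judge how closely it reflects the situation on the ground.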

Number of COVID-19-Related Deaths

In most situations, COVID-19-related deaths data is based on the number of deaths reported by hospitals. However, the NYT reports that the total number of deaths since the start of the pandemic is much higher than the comparable numbers over that same period in the last few years. The number of reported deaths attributed to COVID-19 does not account for these excess deaths, meaning that the number of reported COVID-19 deaths is biased toward hospital-admitted COVID-19 patients. Some US counties and states have started reporting “probable cases” of death from COVID-19 along with confirmed cases, but as of yet, there is no standardization.
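The excess-deaths comparison described above amounts to simple arithmetic. The sketch below is one simple way to express it in Python, assuming the baseline is a plain average of the same period in prior years; it is not the NYT's method, and the function name and all figures are hypothetical.

```python
from statistics import mean


def excess_deaths(observed_total, prior_year_totals, reported_covid_deaths):
    """Compare all-cause deaths with a prior-year baseline.

    Returns (baseline, excess, unattributed), where `unattributed` is
    the part of the excess not covered by reported COVID-19 deaths.
    """
    baseline = mean(prior_year_totals)
    excess = observed_total - baseline
    unattributed = excess - reported_covid_deaths
    return baseline, excess, unattributed


# Illustrative, made-up numbers for one region over the same period.
baseline, excess, unattributed = excess_deaths(
    observed_total=15_000,
    prior_year_totals=[10_000, 10_200, 9_800],
    reported_covid_deaths=3_500,
)
print(f"baseline={baseline}, excess={excess}, unattributed={unattributed}")
```

A positive unattributed figure does not by itself show that those deaths were caused by COVID-19; it only indicates that reported COVID-19 deaths do not explain the full rise over the baseline.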

Public Data Sources and Dashboards

Several public sources of these four types of data are available, along with multiple geographic information system (GIS) dashboards developed in the last few weeks to present this data visually for easy comprehension. These dashboards can help identify the macro-level spread of COVID-19. Examples of such data sources and dashboards include:

  • US Centers for Disease Control and Prevention (CDC). The CDC provides consolidated data as reported by the health departments of US states and territories. The data is updated daily.

  • Johns Hopkins University. Johns Hopkins provides an interactive dashboard on confirmed cases and deaths in absolute, as well as per capita, numbers. The data is available down to the county level.

  • University of Washington. Washington was the first US state to report a COVID-19 case, and the University of Washington built the first set of prediction models. The university continues to predict positive case and death rates and has added hospital resource (bed and ICU bed) usage prediction.

  • The COVID Tracking Project. Launched by The Atlantic, this site also provides data about the number of people tested and hospitalization details (e.g., number hospitalized, in the ICU, on a ventilator) wherever available, in addition to the confirmed death count from COVID-19.

  • The New York Times. The NYT provides similar case and death data, but it counts cases by the location where people are treated, which is not necessarily the patients’ area of residence.

[For more from the authors on this topic, see “A Data-Driven Approach to Managing COVID-19.”]

About The Author
Kaushik Dutta
Kaushik Dutta is a Professor, Department Chair, and Muma Fellow in the Department of Information Systems and Decision Sciences, Muma College of Business, University of South Florida. He has 22 years’ professional and research experience in the field of enterprise IT infrastructure, data analytics, and big data systems. Dr. Dutta’s current interest is in the area of mobile advertisement, healthcare, and the application of blockchain in enterprise…
Arindam Ray
Arindam Ray is a doctoral student in the Department of Information Systems and Decision Sciences, Muma College of Business, University of South Florida. Mr. Ray has more than 20 years' experience in the IT industry helping global Fortune 500 organizations in their digital transformation journey. He can be reached at