COVID-19 Analysis with Python

Updated: Jun 24

Create a complete COVID-19 report using Python and its powerful data science packages





Python is a powerful general-purpose programming language that is easy to learn and offers data scientists a wide variety of tools and packages. During the pandemic, I decided to analyze data on the novel coronavirus.


In this article, I am going to walk you through the steps I undertook for this analysis with visuals and code snippets.


Steps involved in Data Analysis:


1. Importing required packages

2. Gathering Data

3. Transforming Data to our needs (Data Wrangling)

4. Exploratory Data Analysis (EDA) and Visualization


Step - 1: Importing required Packages

Importing the required packages is the starting point of any data analysis program in Python. As mentioned, Python provides a wide variety of packages for data scientists. In this analysis, I used two of its most popular data science packages, Pandas and NumPy, for Data Wrangling and EDA. For Data Visualization, I used Plotly (interactive charts) and Matplotlib. Importing packages in Python is simple:


These are the primary packages for our data analysis. We will need a few more, which we will import in Step 2. Yay! We have finished our first step.


Step - 2: Gathering Data


The most important element of a clean data analysis is quality data. For this analysis, I collected data from several sources for better accuracy.

Our primary dataset is extracted from Esri (a GIS company that hosts regularly updated coronavirus data) using a query URL. Follow the code snippets to extract the data from Esri:
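The original snippet and its query URL are not reproduced here, so the following is only a sketch. It assumes the query URL returns JSON with a "features" list in which each item has an "attributes" dictionary, and that the field names look like those in the sample payload. The function names, the field names, and the sample numbers are all illustrative assumptions.

```python
import pandas as pd
import requests


def fetch_payload(query_url: str, timeout: float = 30.0) -> dict:
    """Send a GET request to the query URL and return the decoded JSON body."""
    response = requests.get(query_url, timeout=timeout)
    response.raise_for_status()  # raise an exception on HTTP error codes
    return response.json()


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Turn the 'features' list of a query response into one row per feature."""
    rows = [feature["attributes"] for feature in payload.get("features", [])]
    return pd.DataFrame(rows)


# Hypothetical payload with made-up numbers, mimicking the assumed response shape.
sample_payload = {
    "features": [
        {"attributes": {"Country_Region": "Country A", "Province_State": None,
                        "Last_Update": 1593000000000, "Confirmed": 100,
                        "Deaths": 5, "Recovered": 60, "Active": 35}},
        {"attributes": {"Country_Region": "Country B", "Province_State": "State X",
                        "Last_Update": 1593000000000, "Confirmed": 250,
                        "Deaths": 10, "Recovered": 120, "Active": 120}},
    ]
}
raw_df = payload_to_frame(sample_payload)
```

With a real query URL, the same table would come from `payload_to_frame(fetch_payload(query_url))`.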


Requests is a Python package for sending HTTP requests. In this code, I used it to call the query URL provided by Esri, which returns the data as JSON. We are now ready to do some Data Wrangling! (Note: We will import several more datasets in Step 4 of our analysis.)


Step - 3: Data Wrangling


Data Wrangling is the process of cleaning and transforming data to fit our needs. The raw extracted data is not ready for analysis, so we have to transform it first. Here's the code for our Data Wrangling:
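The original wrangling code is not reproduced here. As a sketch of what such a step could involve, the example below assumes the raw table has a "Last_Update" column in epoch milliseconds, a "Province_State" column with gaps, and four count columns. The column names, helper names, and sample values are assumptions made for illustration.

```python
from datetime import datetime, timezone

import pandas as pd

COUNT_COLUMNS = ["Confirmed", "Deaths", "Recovered", "Active"]


def epoch_ms_to_datetime(value):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    if pd.isna(value):
        return pd.NaT
    return datetime.fromtimestamp(value / 1000, tz=timezone.utc)


def wrangle(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: readable dates, no missing names or counts."""
    df = raw.copy()
    df["Last_Update"] = df["Last_Update"].apply(epoch_ms_to_datetime)
    df["Province_State"] = df["Province_State"].fillna("")
    # Treat missing counts as zero and store them as integers
    df[COUNT_COLUMNS] = df[COUNT_COLUMNS].fillna(0).astype(int)
    return df


# Hypothetical raw rows with made-up numbers
raw = pd.DataFrame({
    "Country_Region": ["Country A", "Country B"],
    "Province_State": [None, "State X"],
    "Last_Update": [1593000000000, 1593000000000],
    "Confirmed": [100, 250],
    "Deaths": [5, None],
    "Recovered": [60, 120],
    "Active": [35, 130],
})
clean = wrangle(raw)
```

Treating a missing count as zero is a simplification; depending on the source, dropping such rows may be the safer choice.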



Note that we have imported a new Python module, 'datetime', which is part of the standard library and helps us work with dates and times in a dataset. Now, get ready to see the big picture of our analysis: 'EDA and Data Visualization'.


Step - 4: Exploratory Data Analysis and Data Visualization


This process is quite long as it is the heart and soul of data analysis. So, I've divided this process into three steps:


a. Ranking countries and provinces (based on COVID-19 aspects)

b. Time Series on COVID-19 Cases

c. Classification and Distribution of cases


Ranking countries and provinces


Using our previously extracted data, we are going to rank countries and provinces by confirmed, death, recovered, and active cases through some EDA and Visualization. Follow the code snippets for the upcoming visuals. (Note: Every visualization is interactive, and you can hover over the charts to see their data points.)


Part 1 - Ranking Most affected countries


i) Top 10 Confirmed Cases Countries:


The following code will produce a plot ranking the top 10 countries based on confirmed cases.



ii) Top 10 Death Cases Countries:


The following code will produce a plot ranking the top 10 countries based on death cases.



iii) Top 10 Recovered Cases Countries:


The following code will produce a plot ranking the top 10 countries based on recovered cases.



iv) Top 10 Active Cases Countries:


The following code will produce a plot ranking the top 10 countries based on active cases.



Part 2 - Ranking most affected States of largely affected Countries:


EDA for ranking states in largely affected Countries:

We extract state-level data for the USA, Brazil, India, and Russia because these are the countries most affected by COVID-19. Now, let's visualize it!
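The extraction described above could be sketched as a filter on the country followed by a ranking of its states. This assumes the cleaned table has "Country_Region" and "Province_State" columns; the function name, the placeholder country and state names, and the numbers are hypothetical.

```python
import pandas as pd


def top_states(df: pd.DataFrame, country: str,
               metric: str = "Confirmed", n: int = 5) -> pd.DataFrame:
    """Return the n states of one country with the highest value of a metric."""
    in_country = df[df["Country_Region"] == country]
    # Rows without a state name are country-level records, so skip them
    with_state = in_country.dropna(subset=["Province_State"])
    ranked = with_state.nlargest(n, metric)
    return ranked[["Province_State", metric]].reset_index(drop=True)


# Hypothetical data with made-up numbers
cases = pd.DataFrame({
    "Country_Region": ["A", "A", "A", "B", "B"],
    "Province_State": ["A1", "A2", "A3", "B1", None],
    "Confirmed": [40, 90, 10, 70, 5],
})
states_a = top_states(cases, "A", n=2)
```

The function can be called once per country, which gives one small table for each of the charts that follow.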


Visualization of Most affected states in largely affected Countries:


i) Most affected States in the USA:


The following code will produce a plot ranking the top 5 most affected states in the United States of America.



ii) Most affected States in Brazil:


The following code will produce a plot ranking the top 5 most affected states in Brazil.



iii) Most affected States in India:


The following code will produce a plot ranking the top 5 most affected states in India.



iv) Most affected States in Russia:


The following code will produce a plot ranking the top 5 most affected states in Russia.



Time Series on COVID-19 Cases


To perform time series analysis on COVID-19 cases, we need a new dataset, which can be downloaded from the WHO dashboard: https://covid19.who.int/

After following the link above, you will land on the WHO dashboard. The download button is at the bottom right of the map. From there, you can download the dataset and save it to your files. Good work! We fetched our data! Let's import it:
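The original import code is not reproduced here. A minimal sketch, assuming the downloaded file is a CSV with a "Date_reported" column plus country and case-count columns, could read it with pandas. The column names below are assumptions about the file, and the sample uses made-up numbers in place of the real download.

```python
import io

import pandas as pd


def load_who_data(source) -> pd.DataFrame:
    """Read the downloaded CSV (a path or file-like object) and tidy it."""
    df = pd.read_csv(source)
    # Header names may carry stray spaces, so strip them
    df.columns = df.columns.str.strip()
    df["Date_reported"] = pd.to_datetime(df["Date_reported"])
    return df


# Hypothetical file content standing in for the downloaded dataset
sample_csv = io.StringIO(
    "Date_reported, Country, New_cases, Cumulative_cases, New_deaths, Cumulative_deaths\n"
    "2020-06-23,Country A,10,100,1,5\n"
    "2020-06-24,Country A,20,120,2,7\n"
    "2020-06-23,Country B,5,50,0,1\n"
    "2020-06-24,Country B,15,65,1,2\n"
)
who_df = load_who_data(sample_csv)
```

For the real file, pass the path where the download was saved instead of the in-memory sample.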


From the above-extracted dataset, we are going to perform two types of time series analysis, 'COVID-19 cases Worldwide' and 'Most affected countries over time'.


i) COVID-19 cases worldwide:


EDA for COVID-19 cases worldwide:
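The original EDA code is not reproduced here. Assuming the imported table has one row per country per day, worldwide figures can be obtained by summing over all countries for each date. The column names, function name, and sample numbers are assumptions for illustration.

```python
import pandas as pd

VALUE_COLUMNS = ["New_cases", "Cumulative_cases", "New_deaths", "Cumulative_deaths"]


def worldwide_totals(who_df: pd.DataFrame) -> pd.DataFrame:
    """Sum every country's figures into one worldwide row per date."""
    totals = who_df.groupby("Date_reported", as_index=False)[VALUE_COLUMNS].sum()
    return totals.sort_values("Date_reported").reset_index(drop=True)


# Hypothetical per-country rows with made-up numbers
who_df = pd.DataFrame({
    "Date_reported": pd.to_datetime(
        ["2020-06-23", "2020-06-24", "2020-06-23", "2020-06-24"]),
    "Country": ["Country A", "Country A", "Country B", "Country B"],
    "New_cases": [10, 20, 5, 15],
    "Cumulative_cases": [100, 120, 50, 65],
    "New_deaths": [1, 2, 0, 1],
    "Cumulative_deaths": [5, 7, 1, 2],
})
world = worldwide_totals(who_df)
```

The resulting table has one row per date, which is the shape the four worldwide charts need.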

a) Cumulative cases worldwide:


The following code produces a time series chart of cumulative cases worldwide right from the beginning of the outbreak.


b) Cumulative death cases worldwide:


The following code produces a time series chart of cumulative death cases worldwide right from the beginning of the outbreak.


c) Daily new cases worldwide:


The following code produces a time series chart of daily new cases worldwide right from the beginning of the outbreak.


d) Daily death cases worldwide:


The following code produces a time series chart of daily death cases worldwide right from the beginning of the outbreak.


ii) Most affected countries over time:


EDA for Most affected countries over time:

Note that we have extracted data for the USA, Brazil, India, Russia, and Peru, as they are among the countries most affected by COVID-19.
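The extraction described above could be sketched as a filter followed by a pivot, so that each selected country becomes its own column indexed by date. The column names are assumptions, the country names are placeholders, and the numbers are made up.

```python
import pandas as pd


def country_series(who_df: pd.DataFrame, countries: list, metric: str) -> pd.DataFrame:
    """Return a date-indexed table with one column per selected country."""
    subset = who_df[who_df["Country"].isin(countries)]
    return subset.pivot_table(index="Date_reported", columns="Country",
                              values=metric, aggfunc="sum")


# Hypothetical per-country rows with made-up numbers
who_df = pd.DataFrame({
    "Date_reported": pd.to_datetime(
        ["2020-06-23", "2020-06-24"] * 3),
    "Country": ["Country A", "Country A", "Country B", "Country B",
                "Country C", "Country C"],
    "Cumulative_cases": [100, 120, 50, 65, 7, 9],
})
wide = country_series(who_df, ["Country A", "Country B"], "Cumulative_cases")
```

The country names must match the spelling used in the dataset, which may differ from everyday names, so it is worth checking the unique values of the country column first.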


a) Most affected Countries' Cumulative cases over time


The following code will produce a time series chart of the most affected countries' cumulative cases right from the beginning of the outbreak.


b) Most affected Countries' cumulative death cases over time:


The following code will produce a time series chart of the most affected countries' cumulative death cases right from the beginning of the outbreak.


c) Most affected Countries' daily new cases over time:


The following code will produce a time series chart of the most affected countries' daily new cases right from the beginning of the outbreak.


d) Most affected Countries' daily death cases:


The following code will produce a time series chart of the most affected countries' daily death cases right from the beginning of the outbreak.


Case Classification and Distribution


Here we are going to analyze how COVID-19 cases are distributed. For this, we need a new dataset, available on Kaggle: https://www.kaggle.com/imdevskp/corona-virus-report


i) WHO Region-Wise Case Distribution:


For this analysis, we are going to use the 'country_wise_latest.csv' file, which is included in the downloaded Kaggle dataset. The following code produces a pie chart representing the distribution of cases across WHO regions.




ii) Most affected Countries' case distribution:


For this analysis, we are going to use the same 'country_wise_latest.csv' dataset which we imported for the previous analysis.


EDA for Most affected countries' case distribution:


The following code will produce a pie chart representing the case classification on Most affected Countries.
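The original code is not reproduced here. One possible way to prepare the data for such a chart, assuming the file has "Deaths", "Recovered", and "Active" columns, is to reshape one country's row into a small table of case types and their shares. The function name, column names, and numbers are hypothetical.

```python
import pandas as pd

STATUS_COLUMNS = ["Deaths", "Recovered", "Active"]


def case_breakdown(latest: pd.DataFrame, country: str) -> pd.DataFrame:
    """Reshape one country's row into a table of case types and their shares."""
    row = latest.loc[latest["Country/Region"] == country, STATUS_COLUMNS].iloc[0]
    breakdown = row.rename("Cases").rename_axis("Status").reset_index()
    breakdown["Share (%)"] = (breakdown["Cases"] / breakdown["Cases"].sum() * 100).round(1)
    return breakdown


# Hypothetical rows with made-up numbers
latest = pd.DataFrame({
    "Country/Region": ["Country A", "Country B"],
    "Deaths": [5, 10],
    "Recovered": [60, 120],
    "Active": [35, 120],
})
breakdown_a = case_breakdown(latest, "Country A")
```

The resulting table has one row per case type, so its "Status" and "Cases" columns can serve as the names and values of a pie chart.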



iii) Most affected continents' Negative case vs Positive case percentage composition:


For this analysis, we need a new dataset. https://ourworldindata.org/coronavirus-source-data Follow this link to get our next dataset.


EDA for Negative case vs Positive case percentage composition:


The following code will produce a pie chart illustrating the percentage composition of Negative cases and Positive cases in most affected Continents.
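The original code is not reproduced here, and the author's exact method is not known. The sketch below is one rough interpretation: it assumes the dataset has "continent", "location", "total_tests", and "total_cases" columns, and it treats confirmed cases as positive tests and the remaining tests as negative. That is a crude approximation, since one person can be tested several times. The function name and numbers are hypothetical.

```python
import pandas as pd

CUMULATIVE = ["total_tests", "total_cases"]


def outcome_shares(owid: pd.DataFrame) -> pd.DataFrame:
    """Estimate positive and negative test percentages per continent."""
    complete = owid.dropna(subset=["continent"] + CUMULATIVE)
    # Both columns are cumulative, so the maximum per location is its latest value
    per_location = complete.groupby(
        ["continent", "location"], as_index=False)[CUMULATIVE].max()
    by_continent = per_location.groupby("continent")[CUMULATIVE].sum()
    by_continent["positive_pct"] = (
        by_continent["total_cases"] / by_continent["total_tests"] * 100).round(1)
    by_continent["negative_pct"] = (100 - by_continent["positive_pct"]).round(1)
    return by_continent


# Hypothetical rows with made-up numbers; one row lacks a test count
owid = pd.DataFrame({
    "continent": ["X", "X", "X", "Y", "Y"],
    "location": ["Loc A", "Loc A", "Loc B", "Loc C", "Loc C"],
    "total_tests": [800, 1000, 500, 200, None],
    "total_cases": [80, 100, 50, 50, 60],
})
shares = outcome_shares(owid)
```

Each row of the result holds the two percentages for one continent, which can be drawn as a two-slice pie chart per continent.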



Conclusion


Hurrah! We have successfully created our own COVID-19 report with Python. If you missed any of the steps above, the full code for this analysis is provided below. There is much more you can do with Python and its powerful packages, so keep exploring and create your own reports and dashboards. You can find many useful data science resources online, for example on edX, Coursera, and Udemy. Never stop learning. I hope you find this article useful and informative.


Happy Analyzing!


Full code:


