Ignacio Darago

Visualizing COVID-19 Data

In this post we show some visualizations of COVID-19 data obtained from the Buenos Aires Datasets webpage. The dataset is updated daily and contains information on new cases and deaths in Buenos Aires, together with extra details such as age and neighborhood.

The goal is to show how to work with this data, together with some geographical data, to obtain some nice visualizations using Bokeh.

The dataset

The dataset downloaded from the Buenos Aires Datasets webpage looks like this:

import pandas as pd
import numpy as np

df = pd.read_csv("./Datasets/casos_covid19.csv")
df.head() # Shows the first five rows

  numero_de_caso fecha_apertura_snvs fecha_toma_muestra fecha_clasificacion provincia barrio comuna genero edad clasificacion fecha_fallecimiento fallecido fecha_alta tipo_contagio
0 15399546 24JUN2021:00:00:00.000000 27JUN2021:00:00:00.000000 27JUN2021:00:00:00.000000 Buenos Aires nan nan femenino 53 confirmado nan nan 08JUL2021:00:00:00.000000 Comunitario
1 15420990 24JUN2021:00:00:00.000000 24JUN2021:00:00:00.000000 24JUN2021:00:00:00.000000 CABA PARQUE PATRICIOS 4 femenino 61 confirmado nan nan 08JUL2021:00:00:00.000000 Comunitario
2 15426848 24JUN2021:00:00:00.000000 28JUN2021:00:00:00.000000 28JUN2021:00:00:00.000000 Buenos Aires nan nan femenino 39 confirmado nan nan 08JUL2021:00:00:00.000000 Comunitario
3 15476146 25JUN2021:00:00:00.000000 25JUN2021:00:00:00.000000 25JUN2021:00:00:00.000000 Buenos Aires nan nan masculino 42 confirmado nan nan 08JUL2021:00:00:00.000000 Comunitario
4 15494419 25JUN2021:00:00:00.000000 25JUN2021:00:00:00.000000 25JUN2021:00:00:00.000000 CABA RECOLETA 2 femenino 74 confirmado nan nan 08JUL2021:00:00:00.000000 Comunitario

Let’s say that we want to understand the number of cases and the number of deaths by date. First, we keep only the confirmed cases. Second, we convert the date strings, which look like 24JUN2021:00:00:00.000000, into proper datetime values.

df = df[df["clasificacion"]=="confirmado"] # Keeps only the confirmed cases
df = df.dropna(subset=["fecha_apertura_snvs"]) # Keeps only the data with date

MONTHS = {"JAN": "01", "FEB": "02", "MAR": "03", "APR": "04", "MAY": "05", "JUN": "06",
          "JUL": "07", "AUG": "08", "SEP": "09", "OCT": "10", "NOV": "11", "DEC": "12"}

def transform_date(date):
    # "24JUN2021:00:00:00.000000" -> "24-06-2021"
    return date[:2] + "-" + MONTHS[date[2:5]] + "-" + date[5:9]

df["fecha_apertura_snvs"] = df["fecha_apertura_snvs"].apply(transform_date)
df["fecha_apertura_snvs"] = pd.to_datetime(df["fecha_apertura_snvs"], format="%d-%m-%Y")
df = df.sort_values(by="fecha_apertura_snvs") # sort_values returns a new table, so we assign it back
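As an alternative to the manual month lookup, pandas can usually parse these strings directly. The sketch below assumes that every value starts with a nine-character date such as 24JUN2021, and that the code runs under an English locale, since the %b directive matches locale-dependent month abbreviations. The two sample values are copied from the table above.

```python
import pandas as pd

# Two sample values in the format used by the dataset.
raw = pd.Series(["24JUN2021:00:00:00.000000", "08JUL2021:00:00:00.000000"])

# Keep only the leading DDMONYYYY part, then parse day, abbreviated month and year.
parsed = pd.to_datetime(raw.str[:9], format="%d%b%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

If the locale cannot be relied on, the explicit lookup table is the safer choice.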

Now we can use groupby to count the number of cases per date (we count the unique case ID, numero_de_caso).

cases_by_date = df.groupby("fecha_apertura_snvs", as_index=False).count()
cases_by_date = cases_by_date[["fecha_apertura_snvs","numero_de_caso"]] # We keep only the dates and number of cases 
cases_by_date = cases_by_date.rename(columns={"fecha_apertura_snvs":"date", "numero_de_caso":"cases"}) # We rename those two columns to date and cases
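One detail worth knowing: count() returns the number of non-null values in each column, which is why we pick numero_de_caso, a column that is presumably always filled. The toy table below (made-up values, not the real dataset) shows the difference with size(), which counts rows regardless of missing values.

```python
import pandas as pd

# Made-up rows: two cases on the first date, one on the second.
toy = pd.DataFrame({
    "date": ["2021-06-24", "2021-06-24", "2021-06-25"],
    "case_id": [1, 2, 3],
    "barrio": ["RECOLETA", None, None],
})

per_column = toy.groupby("date").count()  # non-null values per column
per_row = toy.groupby("date").size()      # rows per group

print(per_column)
print(per_row)
```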

To count the number of deaths, we filter on the column fallecido and again use groupby.

deaths_by_date = df[df["fallecido"]=="si"].groupby("fecha_apertura_snvs", as_index=False).count()
deaths_by_date = deaths_by_date[["fecha_apertura_snvs","numero_de_caso"]] # We keep only the dates and number of deaths
deaths_by_date = deaths_by_date.rename(columns={"fecha_apertura_snvs":"date", "numero_de_caso":"deaths"}) # We rename those two columns to date and deaths

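We can put this data together by merging the two tables. We use a left join so that dates with cases but no deaths are kept, with the number of deaths set to zero:

data_by_date = pd.merge(cases_by_date, deaths_by_date, on="date", how="left")
data_by_date["deaths"] = data_by_date["deaths"].fillna(0) # Dates without deaths get zero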
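The join type matters here. pd.merge performs an inner join unless told otherwise, so a date that appears in only one of the two tables is dropped from the result. The sketch below, on made-up numbers, compares an inner join with a left join that keeps every date with cases and fills the missing deaths with zero.

```python
import pandas as pd

# Made-up daily counts: there are no deaths recorded on 2021-06-25.
cases = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-24", "2021-06-25", "2021-06-26"]),
    "cases": [10, 12, 9],
})
deaths = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-24", "2021-06-26"]),
    "deaths": [1, 2],
})

inner = pd.merge(cases, deaths, on="date")  # drops 2021-06-25
left = pd.merge(cases, deaths, on="date", how="left").fillna({"deaths": 0})

print(inner)
print(left)
```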

Finally, we can add a 7-day moving average. This helps us see the trend by smoothing out the daily variation we’ll see in the plot.

data_by_date["casesMA"] = data_by_date["cases"].rolling(window=7).mean()
data_by_date["deathsMA"] = data_by_date["deaths"].rolling(window=7).mean()
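Note that rolling(window=7) yields NaN for the first six rows, because a full window of seven values is not yet available, and that the window counts rows rather than calendar days. Here is a small sketch on made-up numbers; the min_periods variant is an optional alternative, not what the code above does.

```python
import pandas as pd

# Made-up daily counts for eight consecutive days.
daily = pd.Series([7, 14, 21, 28, 35, 42, 49, 56])

ma = daily.rolling(window=7).mean()                         # NaN until 7 values exist
ma_partial = daily.rolling(window=7, min_periods=1).mean()  # averages whatever is available

print(ma.tolist())
print(ma_partial.tolist())
```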

We have all that we need to create our first plot! We build it with Bokeh, a visualization library for Python.

We can plot the number of cases by date

and the number of deaths by date

Clicking on an entry in the legend hides the corresponding series.
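The plotting code itself is not reproduced in this post. As a rough idea of what it involves, here is a minimal sketch of such a plot with Bokeh, run on made-up numbers rather than the real data_by_date table. The function name make_time_plot, the colors and the sizes are illustrative assumptions, not necessarily what was used for the plots in this post. The click_policy setting is what makes a legend entry hide its glyph when clicked.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure


def make_time_plot(data, column, title):
    """Daily values as bars plus the 7-day moving average as a line."""
    source = ColumnDataSource(data)
    p = figure(x_axis_type="datetime", width=800, height=350, title=title)
    # On a datetime axis, bar width is given in milliseconds: 0.8 of a day.
    p.vbar(x="date", top=column, width=0.8 * 24 * 60 * 60 * 1000,
           source=source, color="lightsteelblue", legend_label="Daily")
    p.line(x="date", y=column + "MA", source=source,
           color="navy", line_width=2, legend_label="7-day average")
    p.legend.click_policy = "hide"  # clicking a legend entry hides its glyph
    return p


# Made-up data with the same column names as data_by_date.
toy = pd.DataFrame({
    "date": pd.date_range("2021-06-01", periods=10, freq="D"),
    "cases": [5, 8, 6, 9, 12, 10, 7, 11, 13, 9],
})
toy["casesMA"] = toy["cases"].rolling(window=7).mean()

plot = make_time_plot(toy, "cases", "Cases by date")
```

Calling show(plot), with show imported from bokeh.plotting, opens the result in a browser.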

Geographic datasets

Just like we use Pandas to analyze datasets, we can use GeoPandas to analyze geographic datasets. Moreover, we can find a map of Buenos Aires together with its division into neighborhoods and census data by area in the Buenos Aires Data webpage.

What do these geographic datasets look like? Essentially like a usual Pandas dataset, with an extra geometry column, which is used to draw the shape of each row.

import geopandas as gpd

neighborhoods = gpd.read_file("./Datasets/barrios.geojson")
census = gpd.read_file("./Datasets/caba_radios_censales.geojson")
neighborhoods.head() # Shows the first five neighborhoods

  barrio comuna perimetro area geometry
0 CHACARITA 15 7724.85 3.11571e+06 POLYGON ((-58.4528200492791 -34.5959886570639, … ))
1 PATERNAL 15 7087.51 2.22983e+06 POLYGON ((-58.4655768128541 -34.5965577078058, … ))
2 VILLA CRESPO 15 8131.86 3.61598e+06 POLYGON ((-58.4237529813037 -34.5978273383243, … ))
3 VILLA DEL PARQUE 11 7705.39 3.3996e+06 POLYGON ((-58.4946097568899 -34.6148652395239, … ))
4 ALMAGRO 5 8537.9 4.05075e+06 POLYGON ((-58.4128700313089 -34.6141162515854, … ))

As you can see, the column geometry contains the longitude-latitude coordinates of the points that make up the outline of each neighborhood.

Now we will aggregate our original table by neighborhood and merge it together with this table, so that we keep the number of cases by neighborhood and the shape of each neighborhood in the same table. Also, we will be able to find out the population of each neighborhood from the census data.

population_by_neighborhood = census[["BARRIO","POBLACION"]].groupby("BARRIO").sum()
neighborhoods = neighborhoods.join(on="barrio", other=population_by_neighborhood) # Aggregate data of population by neighborhood

neighborhoods = neighborhoods.join(on="barrio",other=df.dropna(subset=["barrio"]).groupby("barrio").count()["numero_de_caso"]) # Aggregate data of cases by neighborhood
neighborhoods = neighborhoods.join(on="barrio",other=df.dropna(subset=["barrio","fallecido"]).groupby("barrio").count()["fallecido"]) # Aggregate data of deaths by neighborhood

neighborhoods["case_density"] = neighborhoods["numero_de_caso"]/neighborhoods["POBLACION"]
neighborhoods["death_density"] = neighborhoods["fallecido"]/neighborhoods["POBLACION"]

neighborhoods = neighborhoods.rename(columns={"barrio":"neighborhood", "numero_de_caso":"cases", "POBLACION":"population", "fallecido":"deaths"}) # Rename the columns for clarity

This is all we need to display the data on a map.

The following two maps show the number of COVID-19 cases per 1000 people in each neighborhood, and the number of deaths per 1000 people in each neighborhood (that is, case_density and death_density multiplied by 1000).
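The map code is likewise not shown here. One common way to draw such a map in Bokeh is to pass GeoJSON to a GeoJSONDataSource and draw it with patches. The sketch below uses two made-up squares instead of the real neighborhood shapes so that it runs on its own; with the table built above, the GeoJSON string would come from neighborhoods.to_json(). The column name cases_per_1000 (for instance case_density multiplied by 1000), the palette and the function names are assumptions for illustration.

```python
import json

from bokeh.models import GeoJSONDataSource, LinearColorMapper
from bokeh.palettes import Viridis256
from bokeh.plotting import figure


def make_choropleth(geojson, field, title):
    """Color each polygon in the GeoJSON string by the value of `field`."""
    values = [f["properties"][field] for f in json.loads(geojson)["features"]]
    mapper = LinearColorMapper(palette=Viridis256, low=min(values), high=max(values))
    p = figure(title=title, match_aspect=True, width=600, height=500)
    # GeoJSONDataSource exposes polygon outlines as the columns "xs" and "ys".
    p.patches("xs", "ys", source=GeoJSONDataSource(geojson=geojson),
              fill_color={"field": field, "transform": mapper},
              line_color="white")
    return p


def square(name, x, y, rate):
    """A made-up unit-square feature standing in for a neighborhood."""
    ring = [[x, y], [x + 1, y], [x + 1, y + 1], [x, y + 1], [x, y]]
    return {"type": "Feature",
            "properties": {"neighborhood": name, "cases_per_1000": rate},
            "geometry": {"type": "Polygon", "coordinates": [ring]}}


toy_geojson = json.dumps({
    "type": "FeatureCollection",
    "features": [square("A", 0, 0, 120.0), square("B", 1, 0, 85.5)],
})

toy_map = make_choropleth(toy_geojson, "cases_per_1000", "Cases per 1000 people")
```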

Finally, we can also plot the number of new cases and deaths in each neighborhood by month, and see how it changes with time. For this, we will need to play with our data a bit.

We can keep track of the month and year, and aggregate the number of cases corresponding to each neighborhood during each period using the pivot table function.

df["month-year"] = df["fecha_apertura_snvs"].apply(lambda x: x.month_name()[:3]+str(x.year)) # Keeps track of month and year
df["cases"] = 1 # Artificial column: made to aggregate with pivot_table
monthly_data = df.pivot_table(index="barrio", columns="month-year", values="cases", aggfunc="sum") # Recent pandas versions warn about aggfunc=np.sum
monthly_data = monthly_data.reset_index()

In this way we obtain a table like this:

barrio Apr2020 Apr2021 Aug2020 Aug2021 Dec2020 Feb2021 Jan2021 Jul2020 Jul2021 Jun2020 Jun2021 Mar2020 Mar2021 May2020 May2021 Nov2020 Oct2020 Sep2020
AGRONOMIA 4 285 100 55 57 91 156 70 95 28 167 3 114 5 252 38 72 115
ALMAGRO 45 3373 1607 306 727 1009 1578 1391 1142 716 1774 17 1364 179 3261 548 987 1456
BALVANERA 54 3955 2297 319 757 938 1590 2271 1156 1326 2090 42 1549 328 3954 575 1061 1784
BARRACAS 23 2674 1617 148 491 479 824 1825 653 1883 1380 3 858 470 2605 275 504 992
BELGRANO 60 3008 1065 418 875 1011 2089 852 1057 274 1568 30 1463 110 2582 477 848 1127

where the column for each period has the number of cases in the corresponding neighborhood.
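Because the month-year labels are plain strings, the columns above are sorted alphabetically (Apr2020, Apr2021, Aug2020, and so on) rather than chronologically. If chronological order matters, one option is to pivot on a monthly period instead. Here is a sketch on made-up rows; the column names date and period are hypothetical.

```python
import pandas as pd

# Made-up cases: three in ALMAGRO, one in RECOLETA.
toy = pd.DataFrame({
    "barrio": ["ALMAGRO", "ALMAGRO", "RECOLETA", "ALMAGRO"],
    "date": pd.to_datetime(["2020-04-03", "2021-04-10", "2020-08-21", "2020-08-02"]),
})
toy["period"] = toy["date"].dt.to_period("M")  # monthly period, e.g. 2020-04
toy["cases"] = 1

monthly = toy.pivot_table(index="barrio", columns="period", values="cases",
                          aggfunc="sum", fill_value=0)
print(monthly)
```

The period columns sort by date, so April 2021 comes after August 2020. If string labels are needed later, monthly.columns.astype(str) converts them.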

We can now merge this table with the neighborhoods one, so that we have the number of cases by month for each neighborhood, together with the geographical data.

cases_by_month = neighborhoods.merge(monthly_data, left_on="neighborhood", right_on="barrio") # The key column was renamed to neighborhood above

This is all we need to do our final plot.

There are still a lot of things we can do with these datasets and these tools: we can visualize the histogram of age for COVID-19 cases or deaths in each neighborhood, or look at the correlation between cases or deaths and population density (since the area of each neighborhood is in the dataset!), just to name a few.

If you liked the plots and would like to see the code, you can find it on my GitHub page!