In this post we are going to show some visualizations of COVID-19 data obtained from the Buenos Aires Datasets webpage. This dataset is updated daily and records new cases and deaths in Buenos Aires, together with some extra details such as age and neighborhood.
The goal is to show how to work with this data, together with some geographical data, to obtain some nice visualizations using Bokeh.
The dataset
The dataset downloaded from the Buenos Aires Datasets webpage looks like this:
| | numero_de_caso | fecha_apertura_snvs | fecha_toma_muestra | fecha_clasificacion | provincia | barrio | comuna | genero | edad | clasificacion | fecha_fallecimiento | fallecido | fecha_alta | tipo_contagio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15399546 | 24JUN2021:00:00:00.000000 | 27JUN2021:00:00:00.000000 | 27JUN2021:00:00:00.000000 | Buenos Aires | nan | nan | femenino | 53 | confirmado | nan | nan | 08JUL2021:00:00:00.000000 | Comunitario |
| 1 | 15420990 | 24JUN2021:00:00:00.000000 | 24JUN2021:00:00:00.000000 | 24JUN2021:00:00:00.000000 | CABA | PARQUE PATRICIOS | 4 | femenino | 61 | confirmado | nan | nan | 08JUL2021:00:00:00.000000 | Comunitario |
| 2 | 15426848 | 24JUN2021:00:00:00.000000 | 28JUN2021:00:00:00.000000 | 28JUN2021:00:00:00.000000 | Buenos Aires | nan | nan | femenino | 39 | confirmado | nan | nan | 08JUL2021:00:00:00.000000 | Comunitario |
| 3 | 15476146 | 25JUN2021:00:00:00.000000 | 25JUN2021:00:00:00.000000 | 25JUN2021:00:00:00.000000 | Buenos Aires | nan | nan | masculino | 42 | confirmado | nan | nan | 08JUL2021:00:00:00.000000 | Comunitario |
| 4 | 15494419 | 25JUN2021:00:00:00.000000 | 25JUN2021:00:00:00.000000 | 25JUN2021:00:00:00.000000 | CABA | RECOLETA | 2 | femenino | 74 | confirmado | nan | nan | 08JUL2021:00:00:00.000000 | Comunitario |
Let’s say that we want to understand the number of cases and the number of deaths by date. First of all, we only keep the rows for confirmed cases. Second, we need to parse the date strings (such as 24JUN2021:00:00:00.000000) into proper datetime values.
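As a minimal sketch of these two steps, the snippet below builds a tiny stand-in table and applies them with pandas. The choice of `fecha_clasificacion` as the date to parse, the new column name `fecha`, and the non-confirmed label `descartado` are assumptions made for illustration; the toy values are not real records.

```python
import pandas as pd

# Toy rows mimicking a few columns of the dataset (illustrative values only).
df = pd.DataFrame({
    "numero_de_caso": [1, 2, 3],
    "fecha_clasificacion": [
        "27JUN2021:00:00:00.000000",
        "24JUN2021:00:00:00.000000",
        "28JUN2021:00:00:00.000000",
    ],
    # "descartado" is an assumed label for a non-confirmed case.
    "clasificacion": ["confirmado", "descartado", "confirmado"],
})

# Step 1: keep confirmed cases only.
confirmed = df[df["clasificacion"] == "confirmado"].copy()

# Step 2: parse strings like "27JUN2021:00:00:00.000000" into datetimes.
confirmed["fecha"] = pd.to_datetime(
    confirmed["fecha_clasificacion"], format="%d%b%Y:%H:%M:%S.%f"
)
```

Passing an explicit `format` avoids leaving pandas to guess at this unusual layout.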
Now we can group the rows by date and count the number of cases, counting unique values of the case ID column numero_de_caso.
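A sketch of that count, assuming the parsed date lives in a column called `fecha` (a name chosen here, not one from the dataset) and that the result column is called `casos`:

```python
import pandas as pd

# Toy confirmed cases; case 102 appears twice to show why we count unique IDs.
cases = pd.DataFrame({
    "numero_de_caso": [101, 102, 102, 103],
    "fecha": pd.to_datetime(
        ["2021-06-24", "2021-06-24", "2021-06-24", "2021-06-25"]
    ),
})

# One row per date, with the number of distinct case IDs on that date.
cases_by_date = (
    cases.groupby("fecha")["numero_de_caso"]
    .nunique()
    .rename("casos")
    .reset_index()
)
```

Using `nunique` rather than `count` means a case ID that shows up more than once on the same date is counted once.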
To count the number of deaths, we need to filter on the column fallecido, and again group by date to count them.
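The same pattern works for deaths. In this sketch we assume a death is marked with the string `"si"` in `fallecido` (the sample rows above only show `nan`, so that marker is an assumption), and that deaths are grouped on the same hypothetical `fecha` column:

```python
import pandas as pd

# Toy confirmed cases; "si" as the death marker is an assumption.
cases = pd.DataFrame({
    "numero_de_caso": [201, 202, 203, 204],
    "fecha": pd.to_datetime(
        ["2021-06-24", "2021-06-24", "2021-06-25", "2021-06-26"]
    ),
    "fallecido": ["si", None, "si", None],
})

# Keep only the rows flagged as deceased, then count unique cases per date.
deaths = cases[cases["fallecido"] == "si"]
deaths_by_date = (
    deaths.groupby("fecha")["numero_de_caso"]
    .nunique()
    .rename("fallecidos")
    .reset_index()
)
```

If you would rather date each death by when it happened, you could group on the parsed `fecha_fallecimiento` instead.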
We can put this data together by merging the two tables on the date.
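One way to do the merge, sketched on toy tables. The left join and the zero fill are our choices here (they keep every date that has cases, even when nobody died that day); the column names `fecha`, `casos` and `fallecidos` are hypothetical.

```python
import pandas as pd

cases_by_date = pd.DataFrame({
    "fecha": pd.to_datetime(["2021-06-24", "2021-06-25", "2021-06-26"]),
    "casos": [10, 12, 9],
})
deaths_by_date = pd.DataFrame({
    "fecha": pd.to_datetime(["2021-06-24", "2021-06-26"]),
    "fallecidos": [1, 2],
})

# Left join keeps all dates with cases; dates without deaths get NaN.
daily = cases_by_date.merge(deaths_by_date, on="fecha", how="left")

# Replace the missing death counts with 0 and restore the integer dtype.
daily["fallecidos"] = daily["fallecidos"].fillna(0).astype(int)
```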
Finally, we can add a 7-day moving average. This will help us better see the trend by smoothing out the daily variation we’ll see in the plot.
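A small sketch of the moving average with pandas `rolling`, on made-up daily counts (the column names are hypothetical):

```python
import pandas as pd

daily = pd.DataFrame({
    "fecha": pd.date_range("2021-06-01", periods=8, freq="D"),
    "casos": [7, 14, 21, 28, 35, 42, 49, 56],
})

# Trailing 7-day window: each value averages that day and the six before it.
# The first six rows are NaN because the window is not yet full.
daily["casos_ma7"] = daily["casos"].rolling(window=7).mean()
```

This assumes one row per calendar day with no gaps; if some dates are missing, a time-based window such as `rolling("7D", on="fecha")` may be a better fit.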
We have all that we need to create our first plot! We build it with Bokeh, a visualization library for Python.
We can plot the number of cases by date, and the number of deaths by date.
Clicking on an entry in the legend hides the corresponding line.
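Here is a rough sketch of how such a plot could be put together in Bokeh. The helper name `build_line_plot`, the column names, the colors and the toy numbers are all ours; only the general idea (daily values plus a 7-day average, with a clickable legend) comes from the post.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure


def build_line_plot(daily, value_col, avg_col, title):
    """Daily values and their moving average on a datetime axis."""
    source = ColumnDataSource(daily)
    p = figure(x_axis_type="datetime", title=title)
    p.line("fecha", value_col, source=source,
           color="lightsteelblue", legend_label="Daily")
    p.line("fecha", avg_col, source=source,
           color="navy", line_width=2, legend_label="7-day average")
    # Clicking a legend entry toggles the visibility of its line.
    p.legend.click_policy = "hide"
    return p


# Toy data (illustrative values only).
daily = pd.DataFrame({
    "fecha": pd.date_range("2021-06-01", periods=10, freq="D"),
    "casos": [5, 8, 6, 9, 12, 10, 7, 11, 13, 9],
})
daily["casos_ma7"] = daily["casos"].rolling(window=7).mean()

p = build_line_plot(daily, "casos", "casos_ma7", "Confirmed cases by date")
```

To display it, call `show(p)` after `from bokeh.plotting import show`. The same helper can be reused for deaths by passing the corresponding columns.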
Geographic datasets
Just like we use Pandas to analyze datasets, we can use GeoPandas to analyze geographic datasets. Moreover, we can find a map of Buenos Aires together with its division into neighborhoods and census data by area on the Buenos Aires Data webpage.
What do these geographic datasets look like? Essentially like a usual Pandas dataset, with an extra geometry column, which will be used to draw the shape of each row.
| | barrio | comuna | perimetro | area | geometry |
|---|---|---|---|---|---|
| 0 | CHACARITA | 15 | 7724.85 | 3.11571e+06 | POLYGON ((-58.4528200492791 -34.5959886570639, … )) |
| 1 | PATERNAL | 15 | 7087.51 | 2.22983e+06 | POLYGON ((-58.4655768128541 -34.5965577078058, … )) |
| 2 | VILLA CRESPO | 15 | 8131.86 | 3.61598e+06 | POLYGON ((-58.4237529813037 -34.5978273383243, … )) |
| 3 | VILLA DEL PARQUE | 11 | 7705.39 | 3.3996e+06 | POLYGON ((-58.4946097568899 -34.6148652395239, … )) |
| 4 | ALMAGRO | 5 | 8537.9 | 4.05075e+06 | POLYGON ((-58.4128700313089 -34.6141162515854, … )) |
As you can see, the column geometry contains the longitude-latitude coordinates of the points that outline the shape of each neighborhood. Note the order: longitude comes first, then latitude.
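To make the structure concrete, here is a small sketch that builds a GeoDataFrame by hand from two made-up squares. The neighborhood names and coordinates are invented for illustration; with the real file you would typically load the table with `geopandas.read_file` instead.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two made-up square "neighborhoods"; vertices are (longitude, latitude).
neighborhoods = gpd.GeoDataFrame(
    {"barrio": ["BARRIO A", "BARRIO B"], "comuna": [1, 2]},
    geometry=[
        Polygon([(-58.45, -34.60), (-58.44, -34.60),
                 (-58.44, -34.59), (-58.45, -34.59)]),
        Polygon([(-58.44, -34.60), (-58.43, -34.60),
                 (-58.43, -34.59), (-58.44, -34.59)]),
    ],
    crs="EPSG:4326",  # plain longitude/latitude coordinates
)

# The outline of a polygon is a sequence of (x, y) = (lon, lat) pairs.
first_vertex = neighborhoods.geometry.iloc[0].exterior.coords[0]
```

Apart from the geometry column, the object behaves like a regular Pandas table, so filtering, grouping and merging work as usual.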
Now we will aggregate our original table by neighborhood and merge it together with this table, so that we keep the number of cases by neighborhood and the shape of each neighborhood in the same table. Also, we will be able to find out the population of each neighborhood from the census data.
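A sketch of the aggregation and merge on toy data. We assume here that the census data has already been reduced to one row per neighborhood with a hypothetical `poblacion` column; the population figures and the placeholder geometry strings are made up. Plain Pandas tables are used to keep the example small.

```python
import pandas as pd

# Toy confirmed cases with their neighborhood.
cases = pd.DataFrame({
    "numero_de_caso": [1, 2, 3, 4, 5],
    "barrio": ["ALMAGRO", "ALMAGRO", "RECOLETA", "ALMAGRO", "RECOLETA"],
})
# Stand-in for the geographic table (geometry shown as placeholder text).
neighborhoods = pd.DataFrame({
    "barrio": ["ALMAGRO", "RECOLETA"],
    "geometry": ["<polygon of ALMAGRO>", "<polygon of RECOLETA>"],
})
# Hypothetical per-neighborhood population (made-up numbers).
census = pd.DataFrame({
    "barrio": ["ALMAGRO", "RECOLETA"],
    "poblacion": [2000, 1000],
})

# Number of distinct cases per neighborhood.
cases_by_barrio = (
    cases.groupby("barrio")["numero_de_caso"]
    .nunique()
    .rename("casos")
    .reset_index()
)

# Attach case counts and population to the geographic table.
merged = (
    neighborhoods
    .merge(cases_by_barrio, on="barrio", how="left")
    .merge(census, on="barrio", how="left")
)

# Cases per 1000 inhabitants.
merged["casos_por_1000"] = 1000 * merged["casos"] / merged["poblacion"]
```

With GeoPandas, keeping the GeoDataFrame on the left side of `merge` means the result is still a GeoDataFrame with its geometry intact.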
This is all we need to display the data on the map.
The following two maps show the number of COVID-19 cases per 1000 people in each neighborhood, and the number of deaths per 1000 people in each neighborhood.
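One possible way to draw such a map in Bokeh is to feed the GeoDataFrame to a `GeoJSONDataSource` and color the patches by value. This is only a sketch: the helper name `build_map`, the palette, the column `casos_por_1000` and the toy squares are our assumptions, not the exact code behind the maps.

```python
import geopandas as gpd
from shapely.geometry import Polygon
from bokeh.models import ColorBar, GeoJSONDataSource, LinearColorMapper
from bokeh.palettes import Viridis256
from bokeh.plotting import figure


def build_map(gdf, column, title):
    """Choropleth of `column`, one patch per row of the GeoDataFrame."""
    source = GeoJSONDataSource(geojson=gdf.to_json())
    mapper = LinearColorMapper(
        palette=Viridis256,
        low=float(gdf[column].min()),
        high=float(gdf[column].max()),
    )
    p = figure(title=title, match_aspect=True)
    # "xs" and "ys" are the polygon coordinates exposed by the GeoJSON source.
    p.patches("xs", "ys", source=source,
              fill_color={"field": column, "transform": mapper},
              line_color="white", line_width=0.5)
    p.add_layout(ColorBar(color_mapper=mapper), "right")
    return p


# Two made-up squares with made-up rates.
gdf = gpd.GeoDataFrame(
    {"barrio": ["BARRIO A", "BARRIO B"], "casos_por_1000": [1.5, 2.0]},
    geometry=[
        Polygon([(-58.45, -34.60), (-58.44, -34.60),
                 (-58.44, -34.59), (-58.45, -34.59)]),
        Polygon([(-58.44, -34.60), (-58.43, -34.60),
                 (-58.43, -34.59), (-58.44, -34.59)]),
    ],
    crs="EPSG:4326",
)

p = build_map(gdf, "casos_por_1000", "Cases per 1000 people")
```

Calling the same helper with a deaths-per-1000 column would give the second map.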
Finally, we can also plot the number of new cases and deaths in each neighborhood by month, and see how it changes with time. For this, we will need to play with our data a bit.
We can keep track of the month and year, and aggregate the number of cases corresponding to each neighborhood during each period using the pivot table function.
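A sketch of this step with `pivot_table`, assuming a parsed date column named `fecha` and a period label built as abbreviated month plus year (for example `Apr2020`); the column name `periodo` is ours and the rows are toy data.

```python
import pandas as pd

cases = pd.DataFrame({
    "numero_de_caso": [1, 2, 3, 4],
    "barrio": ["ALMAGRO", "ALMAGRO", "ALMAGRO", "BELGRANO"],
    "fecha": pd.to_datetime(
        ["2020-03-15", "2020-04-02", "2020-04-20", "2020-04-05"]
    ),
})

# Label each case with its month and year, e.g. "Apr2020".
cases["periodo"] = cases["fecha"].dt.strftime("%b%Y")

# One row per neighborhood, one column per period, counting distinct cases.
by_month = cases.pivot_table(
    index="barrio",
    columns="periodo",
    values="numero_de_caso",
    aggfunc="nunique",
    fill_value=0,
)
```

Because the period labels are plain strings, the columns come out in alphabetical rather than chronological order, so you may want to reorder them before plotting.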
In this way we obtain a table like this:
| barrio | Apr2020 | Apr2021 | Aug2020 | Aug2021 | Dec2020 | Feb2021 | Jan2021 | Jul2020 | Jul2021 | Jun2020 | Jun2021 | Mar2020 | Mar2021 | May2020 | May2021 | Nov2020 | Oct2020 | Sep2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AGRONOMIA | 4 | 285 | 100 | 55 | 57 | 91 | 156 | 70 | 95 | 28 | 167 | 3 | 114 | 5 | 252 | 38 | 72 | 115 |
| ALMAGRO | 45 | 3373 | 1607 | 306 | 727 | 1009 | 1578 | 1391 | 1142 | 716 | 1774 | 17 | 1364 | 179 | 3261 | 548 | 987 | 1456 |
| BALVANERA | 54 | 3955 | 2297 | 319 | 757 | 938 | 1590 | 2271 | 1156 | 1326 | 2090 | 42 | 1549 | 328 | 3954 | 575 | 1061 | 1784 |
| BARRACAS | 23 | 2674 | 1617 | 148 | 491 | 479 | 824 | 1825 | 653 | 1883 | 1380 | 3 | 858 | 470 | 2605 | 275 | 504 | 992 |
| BELGRANO | 60 | 3008 | 1065 | 418 | 875 | 1011 | 2089 | 852 | 1057 | 274 | 1568 | 30 | 1463 | 110 | 2582 | 477 | 848 | 1127 |
where the column for each period has the number of cases in the corresponding neighborhood.
We can now merge this table with the neighborhoods table, so that we have the number of cases by month for each neighborhood, together with the geographical data.
This is all we need to make our final plot.