Event: Open Data Analysis with R

An event that is part of the Open Data Day initiative

Authors

Tatyane Paz Dominguez

Haydée Svab

Beatriz Milz

Ana Carolina Moreno

Ana Paula Rocha

Publication date

March 26, 2023

Introduction

On March 18, 2023, R-Ladies São Paulo held the event “Open Data Analysis with R - Open Data Day.” The activity took place on a Saturday, during the morning and afternoon, with 6 hours of activities.

Insper, a non-profit institution dedicated to teaching and research, once again supported the group by providing the space for the event. Curso-R also supported the event by providing two teaching assistants to help participants. The main objective was to explain what open data is and why it matters. We also wanted to offer a hands-on activity in which participants explored open data and assessed the current state of public data access across different levels of government and topics.

What is Open Data Day?

Open Data Day is an annual celebration of open data worldwide, organized and supported by the Open Knowledge Foundation. If you want to know more, check the official Open Data Day website and the project page on the Open Knowledge Brasil website.

Main activity

With about 40 people attending, we structured the event in four blocks. The first block was instructional and featured a sequence of brief presentations: what the R-Ladies São Paulo community is, what Open Data Day is, and what open data is.

The objective was to make people feel free to work with open data, so we split the class into small working groups. Each group worked on a specific dataset alongside a teaching assistant with experience analyzing it, who guided the group in exploring and understanding the data. The second block therefore consisted of the following activities:

  • An explanation of how the activity would work;

  • A brief talk (about 5 minutes) by each teaching assistant on their area of expertise and on the dataset the group would work on, to help participants identify with the subject;

  • The separation of participants into groups according to their interests. Everyone could choose the topic that interested them most, and there was no need to resize or redistribute groups.

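The topics and respective teaching assistants were:

  • Educational data: Ana Carolina Moreno;

  • Electoral data: Cecília do Lago;

  • Environmental data on fires: Bianca Muniz;

  • Prison data: Thandara Santos;

  • Violence data: Fernanda Peres;

  • Employment data: Ana Paula.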

In addition, teaching assistants were available to help with questions about R:

We created a Google Document so the groups could take and share notes.

Then, in the third block, the groups worked on importing, understanding, and starting to explore the open data of their respective themes. Since many people said they had no experience creating data visualizations and wanted to get started, co-organizer Beatriz Milz gave a short live-coding presentation on using the esquisse package to generate data visualizations with ggplot2.
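For readers who want to try the same tool, here is a minimal sketch of how it can be launched. It is not the code shown at the event: the built-in `mtcars` dataset and the name `dados_exemplo` are stand-ins chosen for illustration, and the call is guarded so that it only runs in an interactive session with the package installed.

```r
# Small built-in dataset, used here only as a stand-in for a group's data
dados_exemplo <- mtcars

# esquisser() opens a point-and-click interface for building ggplot2 charts,
# so it only makes sense in an interactive session (for example, in RStudio)
if (interactive() && requireNamespace("esquisse", quietly = TRUE)) {
  esquisse::esquisser(data = dados_exemplo)
}
```

The interface can also display the ggplot2 code behind the chart, which can then be copied into a script and adjusted by hand.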

In the fourth and last block, the groups presented some of their difficulties, lessons learned, and results (some even showed the visualizations they had made!). The presentations also echoed topics raised in the initial talks on open data: for example, some groups reported dealing with an outdated database, while others pointed out that the data were aggregated and could be made available at a greater level of detail.

The experience was outstanding, and both organizers and participants showed that they enjoyed the activity a lot (especially the children!). Each group had between two and six people in addition to the teaching assistants, which allowed for more individualized support when answering questions.

Strengthening the community

Two interesting points to highlight are the collaborative coffee and the Gugudadados Space.

Collaborative coffee

We set up a collaborative coffee table with items purchased by the organizing group (with money from the grant offered by OKBR) and food brought by participants. This way, people could get coffee and something to eat at any time during the event. Keeping the coffee available throughout the event works well for three reasons: (i) it respects the pace of the groups, who can take breaks as their work progresses; (ii) it welcomes participants who, for health reasons, cannot go many hours without eating; and (iii) it welcomes participants who, due to socio-economic conditions, are unable to have a meal during the lunch break. Given the nature of the R-Ladies group, it is important to provide a welcoming environment in which everyone can enjoy the event without worrying about having something to eat during the day. The collaborative coffee is also a way to encourage people to get to know each other!

Gugudadados Space

We created the Gugudadados Space to make it easier for people responsible for kids and babies (e.g., mothers, fathers, and caregivers) to participate. One baby and four children between 7 and 10 years old attended this event. With the money from the grant offered by OKBR, we were able to hire a children's recreation professional to stay in the Gugudadados Space (a room next to the event room, on the same floor) throughout the activity. The R-Ladies organizers also brought toys, drawings, markers, games, and temporary tattoos to entertain the children.

Results from the groups during the event

Educational data (T.A. Ana Carolina Moreno):

The T.A. presented the group with R code for analyzing data from the Census of Basic Education. The code had been used to produce a series of special reports on the impact of the pandemic on early childhood education, which aired in November 2022. The T.A. also prepared a presentation on educational data to help the group understand the different sources and databases available on education.

The data used can be obtained from the INEP website.

Electoral data (T.A. Cecília do Lago):

This group explored electoral data for the 2022 elections, made available by the TSE. The group wanted to find out how many candidates received one or zero votes in the 2022 election.
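A count of candidates with one or zero votes could be sketched along the following lines. This is not the group's code. The data frame is a tiny made-up example, and the column names (`candidato`, `zona`, `votos`) are assumptions for illustration rather than the actual column names in the TSE files. The sketch also assumes one row per candidate per electoral zone, so votes are summed per candidate before filtering.

```r
library(dplyr)

# Toy data for illustration only: these are NOT real TSE results,
# and the column names are hypothetical
votacao <- data.frame(
  candidato = c("A", "A", "B", "B", "C", "D"),
  zona      = c(1, 2, 1, 2, 1, 1),
  votos     = c(900, 620, 0, 1, 0, 37)
)

# Add up the votes of each candidate across all zones
votos_por_candidato <- votacao %>%
  group_by(candidato) %>%
  summarise(total_votos = sum(votos), .groups = "drop")

# Keep only the candidates with one or zero votes in total
poucos_votos <- votos_por_candidato %>%
  filter(total_votos <= 1)

# Number of candidates with one or zero votes
nrow(poucos_votos)
```

Summing before filtering matters here: a candidate with zero votes in one zone may still have votes elsewhere.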

The data used can be obtained from the TSE’s Open Data Portal.

Environmental data on Fires (T.A. Bianca Muniz):

This group explored data on fires from INPE’s BDQueimadas system. The system allows anyone to download data covering up to one year at a time. The T.A. prepared a presentation about how to download and import this dataset. The group exported data from 2020 to 2022, and the graph below shows the number of fires per month by the biome where they occurred. The graph shows that the biomes with the largest number of fires are the Amazon and the Cerrado, and that they show seasonal patterns. For example, the highest peak of fires in the Amazon between 2020 and 2022 occurred in the second half of 2022.
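Because each download covers at most one year, a period such as 2020 to 2022 means combining several exported files before counting. The sketch below illustrates that idea under stated assumptions: it writes three tiny, made-up CSV files to a temporary folder to stand in for the real exports, and the column names `DataHora` and `Bioma` are assumptions that may not match the actual BDQueimadas files.

```r
library(dplyr)

# Stand-in for the yearly exports: three tiny, made-up CSV files written
# to a temporary folder (NOT real BDQueimadas data)
pasta <- file.path(tempdir(), "bdqueimadas_exemplo")
dir.create(pasta, showWarnings = FALSE)

exemplo <- list(
  "2020" = data.frame(
    DataHora = c("2020-08-15 17:30:00", "2020-09-02 16:10:00"),
    Bioma    = c("Amazon", "Cerrado")
  ),
  "2021" = data.frame(
    DataHora = c("2021-08-03 18:00:00", "2021-08-20 17:45:00"),
    Bioma    = c("Amazon", "Amazon")
  ),
  "2022" = data.frame(
    DataHora = c("2022-09-10 16:30:00", "2022-09-12 17:00:00"),
    Bioma    = c("Amazon", "Cerrado")
  )
)

for (ano in names(exemplo)) {
  arquivo <- file.path(pasta, paste0("focos_", ano, ".csv"))
  write.csv(exemplo[[ano]], arquivo, row.names = FALSE)
}

# Read every yearly file and stack them into a single data frame
arquivos <- list.files(pasta, pattern = "\\.csv$", full.names = TRUE)
focos <- arquivos %>%
  lapply(read.csv, stringsAsFactors = FALSE) %>%
  bind_rows()

# Count fire records per biome and month
focos_por_mes <- focos %>%
  mutate(mes = format(as.Date(DataHora), "%Y-%m")) %>%
  count(Bioma, mes, name = "n_focos")

focos_por_mes
```

A table in this shape (one row per biome and month) can be passed to ggplot2 to draw one line per biome.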

The data used can be obtained from the INPE - BD QUEIMADAS website.

Prison data (T.A. Thandara Santos):

This group explored data from SISDEPEN - Statistical Data of the Brazilian Penitentiary System. These data come from the Prison Information Form, answered electronically every six months by government employees.

One difficulty reported by the group is that the data are available only aggregated by prison unit rather than by individual, which makes several fundamental analyses of the prison population in Brazil impossible. Another area for improvement is the lack of standardization in the answers recorded in the database, which undermines the reliability of the data.

The data can be obtained from the National Secretariat for Penal Policies website.

Violence data (T.A. Fernanda Peres):

This group explored public data on violence from SINAN (Sistema de Informação de Agravos de Notificação, the Notifiable Diseases Information System), filtering for occurrences involving only adults and removing self-inflicted violence (for example, suicide). The group found that the public data on violence in DataSUS were outdated. The dataset for 2020 was incomplete, so the group explored data for 2019. The group generated a series of graphs, such as the one below, showing that women make up the largest number of victims. In addition, it is noteworthy that the aggressor is most often male (whether the victim is a woman or a man).

Another thing pointed out by the group is that when the victims were women, the aggressor tended to be someone they knew, such as a spouse, ex-spouse, boyfriend, or ex-boyfriend. On the other hand, among men, the most common aggressor is a stranger.
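The filtering and counting steps described for this group could be sketched roughly as follows. This is not the group's code. The data frame is a made-up toy example, the column names are hypothetical (the real files use their own coded fields), and treating "adult" as age 18 or older is an assumption made for illustration.

```r
library(dplyr)

# Toy data for illustration only: NOT real SINAN records,
# and the column names are hypothetical
notificacoes <- data.frame(
  idade         = c(34, 16, 45, 27, 52, 30),
  autoprovocada = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  sexo_vitima   = c("F", "F", "M", "M", "F", "F"),
  sexo_agressor = c("M", "M", NA, "M", "M", "F")
)

# Keep adults only (assumed here to mean 18 or older)
# and drop self-inflicted violence
adultos <- notificacoes %>%
  filter(idade >= 18, !autoprovocada)

# Count notifications by sex of the victim and sex of the aggressor
contagem <- adultos %>%
  count(sexo_vitima, sexo_agressor, name = "n_notificacoes")

contagem
```

The resulting table has one row per combination of victim and aggressor sex, which is a convenient shape for a bar chart.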

This data can be downloaded from DATASUS. The T.A. gave a presentation on how to get this data.

Employment data (T.A. Ana Paula):

The T.A. gave a presentation on how to import this data. The group explored two Caged (General Register of Employed and Unemployed) series made available by the Central Bank: the total number of jobs and the number of jobs in the manufacturing industry (which processes raw materials), both from 2020 to 2023. The data can be obtained with the GetBCBData package, which makes it possible to retrieve up-to-date data aggregated by month/year using the time series ID.

# Loading packages
library(tidyverse)
library(GetBCBData)

# Importing data
dados_sgs <- GetBCBData::gbcbd_get_series(
  id = c('NCaged' = 28763, 'NCaged_IndTransf' = 28766),
  first.date = '2020-01-01',
  last.date = Sys.Date(),
  format.data = 'wide'
)

# Looking at the content
glimpse(dados_sgs)
#> Rows: 37
#> Columns: 3
#> $ ref.date         <date> 2020-01-01, 2020-02-01, 2020-03-01, 2020-04-01, 2020…
#> $ NCaged           <dbl> 37938640, 38155900, 37860843, 36879150, 36480704, 364…
#> $ NCaged_IndTransf <dbl> 6924768, 6961283, 6918309, 6705220, 6600422, 6593132,…

# Pivoting to long format to prepare the dataset for ggplot2
dados_sgs_longo <- dados_sgs %>% 
  pivot_longer(cols = c(NCaged, NCaged_IndTransf))

# Creating a graph
ggplot(dados_sgs_longo) +
 aes(x = ref.date, y = value, colour = name) +
 geom_line(show.legend = FALSE) +
 scale_color_hue(direction = 1) +
 theme_minimal() +
 facet_wrap(vars(name), scales = "free_y")

To learn more about this database, consult the Central Bank of Brazil Time Series Management System website.

Information about participants

We want to increase diversity at our events, so in this edition we reserved a percentage of spots for three groups:

- BIPOC (Black, Indigenous, and people of color);

- mothers;

- women and other gender minorities.

In this event, 54 people signed up, and 37 participated. Below are some graphs showing information about the diversity of the participants.

There is still a lot of work to do to include underrepresented groups at events, but compared to pre-pandemic events, we are making progress. It is essential to expand publicity for upcoming events, especially among BIPOC (Black, Indigenous, and people of color), mothers, and trans and non-binary people. It is also important to look for venues where we can hold events in peripheral neighborhoods.

Difficulties

The main difficulty in organizing the event was the short time available to publicize it, since the date and place were confirmed only six days in advance. Although the room had capacity for 100 people, only 40 had enough time to organize themselves and register as participants.

However, the presence of more than 40 people, including members of the organizing team, was enough to carry out the activity, and those who came were genuinely interested in the topic.

Testimonials

In addition to the account offered by the organizing team, we would also like to share two testimonials from people who participated in the event:

Tatiana Peixoto:

Hello, I’m Tatiana, a 42-year-old Black cis woman, an environmental engineer, and passionate about studying. I signed up for the event not only for the training but also in search of optimism, to build the strength to enter the job market again, as I spent a long time in the academic environment. On March 18, 2023, I understood that R-Ladies is much more than an introductory programming event. This team brings a new look at programming, removing blockages and obstacles for all those who want to follow a beautiful path in data science. The event is of excellent quality. I felt welcomed, cared for, and very well treated. I would definitely attend other R-Ladies events. The Insper team was also very kind in receiving us. I hope that R-Ladies SP will continue with new events and new projects because they will make a lot of people happy, just as they made me happy.

Juliana Soprani:

I take this opportunity to reinforce my congratulations on the event! It was really relevant!!! Not only technically, in learning data analysis and the use of R, but also on the issue of welcoming, triggering a sense of belonging to a group or purpose. For someone starting in the data area, or in a career transition, as in my case, it was very important to see so many women from different areas, ages, and backgrounds working with data and R and engaged in expanding access to the tool for minorities. I am grateful for the opportunity and hope to contribute to health data analysis in the future.

Looking forward to upcoming meetings.

Support

OKBR’s financial support was essential: it enabled the purchase of items for the coffee break and stickers, as well as the hiring of a children's recreation professional.

The rooms offered by Insper were crucial for the event to take place. The building is easily accessible by public transport. The meeting took place in a large space with internet access, tables, comfortable chairs, and easy access to a restaurant for lunch. The slide that is part of the building’s facilities is also a hit and provides one of the most joyful experiences for the children who stay at the Gugudadados Space.

Curso-R also supported the event, providing two R teachers to assist in the activity and being available to help with questions from the participants.

Team

This event was only possible with the collaboration of several people. Therefore, here is a list of people who participated in the various stages of organizing the event:

The event would not have been the same without your collaboration. We greatly appreciate your participation!

In addition, we also thank everyone who participated!

Next events

This was the first time we held an event with the idea of working in “groups”, and it is certainly a format that worked well (the participants indicated that they preferred it to expository lectures). We intend to organize other events in this format!

The next R-Ladies São Paulo event is scheduled for May, with a theme yet to be defined. If you are interested in participating, we recommend following our social media!