Working with the Report-level Data

The original version of the MMAD is coded at the level of reports, which can be aggregated to the event level in different ways. As an example, we show how to generate the pre-aggregated event-level data we distribute along with the report-level version. Users should modify this procedure according to their needs.

We load the CSV version of the data. Adjust the filename to match the version you are using.

reports <- read.csv("reports.csv")

We load the dplyr package, which provides the grouping and summarising functions used for the aggregation.

library(dplyr)

When aggregating data (for example, computing the average number of participants across different reports), R functions such as mean() and max() return a missing value (NA) as soon as one of the input values is missing (for example, if one of the reports has no information about the number of participants). This is not the behavior we want. Setting na.rm = TRUE alone is not enough either: if all values are missing, mean() then returns NaN and max() returns -Inf with a warning. We therefore introduce slightly modified aggregation functions that return NA only if all the input values are NA, and otherwise use the existing information.

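# Maximum of the non-missing values; NA only if every value is missing.
max.new <- function(v) {
    if (all(is.na(v))) { return(NA) } else { return(max(v, na.rm = TRUE)) }
}

# Mean of the non-missing values; NA only if every value is missing.
mean.new <- function(v) {
    if (all(is.na(v))) { return(NA) } else { return(mean(v, na.rm = TRUE)) }
}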
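As a quick check of why the all-NA test is needed, the following base R sketch shows what the standard functions return on their own. The input vectors are made-up values for illustration, not MMAD records.

```r
# Made-up participant counts: one of three reports lacks a value.
v_partial <- c(100, NA, 300)
# Made-up case in which no report provides a value.
v_all_missing <- c(NA_real_, NA_real_)

# Default behavior: a single missing value makes the result NA.
mean_default <- mean(v_partial)

# With na.rm = TRUE, the available values are used.
mean_narm <- mean(v_partial, na.rm = TRUE)

# With na.rm = TRUE and nothing left to aggregate, the results are
# NaN and -Inf rather than NA (max() also issues a warning).
mean_all_missing <- mean(v_all_missing, na.rm = TRUE)
max_all_missing  <- suppressWarnings(max(v_all_missing, na.rm = TRUE))
```

The modified functions avoid the last two results by checking for an all-missing input first and returning NA in that case.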

Now we can perform the aggregation using the functions from the dplyr package. Reports are combined into one event if they share the same city (the location variable), date (event_date) and side. The additional grouping variables (cowcode, asciiname, longitude, latitude) are not necessary for the aggregation, but are included so that they are carried over to the event-level data. Because they are part of the grouping, this assumes they take a single value for each city. The aggregation uses the new functions defined above. Modify this statement if you need alternative aggregation rules.

# One row per event: reports with the same city (location), date and side.
events_new <- reports %>%
    group_by(cowcode, location, asciiname, longitude, latitude, event_date, side) %>%
    summarise(
        numreports               = n(),                         # number of reports per event
        maxscope                 = max.new(scope),
        maxpartviolence          = max.new(part_violence),
        maxsecengagement         = max.new(sec_engagement),
        mean_avg_numparticipants = mean.new(avg_numparticipants)
    )
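As an illustration of an alternative aggregation rule, the sketch below takes the median instead of the mean of the reported participant numbers. The helper median_or_na and the data frame toy_reports are hypothetical and exist only for this example: the values are made up, and only a subset of the grouping variables is used to keep it short. The .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper following the same pattern as max.new and mean.new:
# NA only if every input is NA, otherwise the median of the observed values.
median_or_na <- function(v) {
    if (all(is.na(v))) NA_real_ else median(v, na.rm = TRUE)
}

# Made-up reports for illustration only; these are not MMAD records.
toy_reports <- data.frame(
    location            = c("A", "A", "A", "B", "B"),
    event_date          = c("2010-01-01", "2010-01-01", "2010-01-01",
                            "2010-01-02", "2010-01-02"),
    side                = c(1, 1, 1, 1, 1),
    avg_numparticipants = c(100, NA, 5000, NA, NA)
)

toy_events <- toy_reports %>%
    group_by(location, event_date, side) %>%
    summarise(
        numreports                 = n(),
        median_avg_numparticipants = median_or_na(avg_numparticipants),
        .groups = "drop"   # return an ungrouped data frame (dplyr >= 1.0.0)
    )
```

In this toy data, the event in location A is built from three reports, two of which contain a participant number, while the event in location B has no participant information at all and therefore receives NA.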