2. Summarizing categorical data

Introduction

This vignette builds upon the sample data for São Miguel (introduced in the previous vignette) to demonstrate the use of exactextractr with categorical land cover data. A sample of the CORINE 2018 raster dataset is included in exactextractr.

As in the previous vignette, the following packages are used:

library(exactextractr)
library(dplyr)
library(sf)
library(raster)

Loading the sample data

First, we load the CORINE land cover data and the concelho boundaries.

clc <- raster(system.file('sao_miguel/clc2018_v2020_20u1.tif',
                     package = 'exactextractr'))
concelhos <- st_read(system.file('sao_miguel/concelhos.gpkg',
                                 package = 'exactextractr'),
                     quiet = TRUE)

The land cover class descriptions are provided in a separate DBF file. We read this in to a data frame, then use levels() to associate the class descriptions with the raster.

clc_classes <- foreign::read.dbf(system.file('sao_miguel/clc2018_v2020_20u1.tif.vat.dbf',
                                             package = 'exactextractr'),
                                 as.is = TRUE) %>%
  dplyr::select(value = Value,
                landcov = LABEL3)

levels(clc) <- list(data.frame(ID = clc_classes$value,
                               landcov = clc_classes$landcov))

This association provides us with a way to look up the description for a given ID. Alternatively, we can relate the values using merge or a `dplyr join.

factorValues(clc, c(2, 18, 24))
#>                      landcov
#> 1 Discontinuous urban fabric
#> 2                   Pastures
#> 3          Coniferous forest

Summarizing land cover classifications

One of the most basic questions we might ask is which land cover classification is predominant in each concelho. We can do this with the built-in mode summary operation. The minority and variety operations are also applicable to categorical data and provide the least-common classification and number of distinct classifications, respectively.

landcov_mode <- exact_extract(clc, concelhos, 'mode', 
                              append_cols = 'name', progress = FALSE) %>%
  inner_join(clc_classes, by=c(mode = 'value'))

name	landcov
Lagoa	Land principally occupied by agriculture, with significant areas of natural vegetation
Nordeste	Coniferous forest
Ponta Delgada	Pastures
Povoação	Broad-leaved forest
Ribeira Grande	Land principally occupied by agriculture, with significant areas of natural vegetation
Vila Franca do Campo	Land principally occupied by agriculture, with significant areas of natural vegetation

Summary functions

While mode provides a handy way to see the most common land cover category, we need to write a custom summary function if we want to see the frequency of different land cover types in an area.

Summary functions are called once per feature from the input sf object. They can return either:

a scalar, in which case the return value of exact_extract will be a vector whose entries correspond with the rows of the input sf object, or
a data frame, in which case exact_extract will return a rowwise combination of the data frames for each feature. If the data frame returned by the summary function will have than a single row, it is useful for some identifying information to be included in the returned data frame.

If we are going to perform typical data frame operations on the raster values and coverage fractions, it can be more convenient for the summary function to accept a single data frame argument, instead of separate arguments for the cell values and coverage fractions. This behavior can be enabled with the summarize_df argument.

Using this method, we can calculate the fraction of each concelho that is covered by each land cover category:

landcov_fracs <- exact_extract(clc, concelhos, function(df) {
  df %>%
    mutate(frac_total = coverage_fraction / sum(coverage_fraction)) %>%
    group_by(name, value) %>%
    summarize(freq = sum(frac_total))
}, summarize_df = TRUE, include_cols = 'name', progress = FALSE)

Here we use the include_cols argument here to include the name column from concelhos in the data frame passed to the summary function. (Although the value of name will be the same for all rows in df, we include name in the group_by expression so that it is not removed by summarize.) Other similar arguments include include_xy to get the cell center coordinates, include_area to get the cell area, and include_cell to get the cell index used by the raster package.

This provides us with a correspondence between each numeric land cover category and its frequency in each concelho:

head(landcov_fracs)
#> # A tibble: 6 × 3
#> # Groups:   name [1]
#>   name  value    freq
#>   <chr> <dbl>   <dbl>
#> 1 Lagoa     2 0.0630 
#> 2 Lagoa     3 0.0136 
#> 3 Lagoa     7 0.00659
#> 4 Lagoa     9 0.00637
#> 5 Lagoa    12 0.170  
#> 6 Lagoa    18 0.154

Joining this table to clc_classes, we can associate the descriptions with the numeric types and view the three most common land cover classes in each concelho:

landcov_fracs %>%
  inner_join(clc_classes, by = 'value') %>%
  group_by(name) %>%
  arrange(desc(freq)) %>%
  slice_head(n = 3) %>%
  mutate(freq = sprintf('%0.1f%%', 100*freq)) %>%
  knitr::kable()

name	value	freq	landcov
Lagoa	21	26.8%	Land principally occupied by agriculture, with significant areas of natural vegetation
Lagoa	12	17.0%	Non-irrigated arable land
Lagoa	20	15.6%	Complex cultivation patterns
Nordeste	24	24.3%	Coniferous forest
Nordeste	18	22.5%	Pastures
Nordeste	21	14.2%	Land principally occupied by agriculture, with significant areas of natural vegetation
Ponta Delgada	18	29.9%	Pastures
Ponta Delgada	21	26.5%	Land principally occupied by agriculture, with significant areas of natural vegetation
Ponta Delgada	12	11.3%	Non-irrigated arable land
Povoação	23	24.2%	Broad-leaved forest
Povoação	18	20.1%	Pastures
Povoação	21	16.7%	Land principally occupied by agriculture, with significant areas of natural vegetation
Ribeira Grande	21	25.1%	Land principally occupied by agriculture, with significant areas of natural vegetation
Ribeira Grande	18	22.7%	Pastures
Ribeira Grande	12	16.4%	Non-irrigated arable land
Vila Franca do Campo	21	36.0%	Land principally occupied by agriculture, with significant areas of natural vegetation
Vila Franca do Campo	18	15.5%	Pastures
Vila Franca do Campo	27	15.0%	Moors and heathland

Similarly, we can find the top land covers by area:

landcov_areas <- exact_extract(clc, concelhos, function(df) {
  df %>%
    group_by(name, value) %>%
    summarize(area_km2 = sum(coverage_area) / 1e6)
}, summarize_df = TRUE, coverage_area = TRUE, include_cols = 'name', progress = FALSE)

name	area_km2	landcov
Lagoa	12.243526	Land principally occupied by agriculture, with significant areas of natural vegetation
Lagoa	7.749522	Non-irrigated arable land
Lagoa	7.129778	Complex cultivation patterns
Nordeste	24.662020	Coniferous forest
Nordeste	22.832392	Pastures
Nordeste	14.404089	Land principally occupied by agriculture, with significant areas of natural vegetation
Ponta Delgada	69.843899	Pastures
Ponta Delgada	61.752290	Land principally occupied by agriculture, with significant areas of natural vegetation
Ponta Delgada	26.315411	Non-irrigated arable land
Povoação	25.798260	Broad-leaved forest
Povoação	21.438728	Pastures
Povoação	17.821336	Land principally occupied by agriculture, with significant areas of natural vegetation
Ribeira Grande	45.394998	Land principally occupied by agriculture, with significant areas of natural vegetation
Ribeira Grande	40.902179	Pastures
Ribeira Grande	29.625931	Non-irrigated arable land
Vila Franca do Campo	28.120871	Land principally occupied by agriculture, with significant areas of natural vegetation
Vila Franca do Campo	12.081576	Pastures
Vila Franca do Campo	11.681901	Moors and heathland

Summarizing population land cover

One extension of the analysis above is to see which land covers are associated with human population in a given concelho. Is the population primary urban or rural?

As described in the previous vignette, the population density raster provides the most robust results in the presence of partially-covered pixels.

pop_density <- raster(system.file('sao_miguel/gpw_v411_2020_density_2020.tif',
                                  package = 'exactextractr'))

We are able to perform this analysis because the CORINE sample distributed with exactextractr has been reprojected from its native Lambert Equal Area projection into geographic coordinates consistent with GPW. Otherwise, working with multiple rasters in different projections requires transformation to a common grid using tools such as raster::projectRaster or the gdalwarp command-line utility.

Very little about the call to exact_extract requires changing to incorporate population. We set weights = pop_density and, since we are using the summarize_df option, a column called weight will be added to the data frame passed to the summary function. We also set coverage_area = TRUE so that we can multiply the density by the covered pixel area to get a population count.

landcov_pop_areas <- exact_extract(clc, concelhos, function(df) {
  df %>%
    group_by(name, value) %>%
    summarize(pop = sum(coverage_area * weight / 1e6)) %>%
    mutate(pop_frac = pop / sum(pop))
}, weights = pop_density, coverage_area = TRUE,
   summarize_df = TRUE, include_cols = 'name')

Looking at the highest-population land cover type in each concelho, we can can see that the western/central concelhos of Lagoa, Ponta Delgada, Ribeira Grande, and Vila Franca do Campo have a more urban population than Nordeste or Povoação to the east.

name	landcov	pop	pop_frac
Lagoa	Discontinuous urban fabric	6548	0.415
Nordeste	Complex cultivation patterns	1280	0.282
Ponta Delgada	Discontinuous urban fabric	27224	0.381
Povoação	Complex cultivation patterns	1568	0.261
Ribeira Grande	Discontinuous urban fabric	13497	0.374
Vila Franca do Campo	Discontinuous urban fabric	4442	0.377