The rbind_pages function is used to combine a list of data frames into a single data frame. This is often needed when working with a JSON API that limits the amount of data per request. If we need more data than what fits in a single request, we need to perform multiple requests that each retrieve a fragment of data, not unlike pages in a book. In practice this is often implemented using a page parameter in the API. The rbind_pages function can be used to combine these pages back into a single dataset.

rbind_pages(pages)

Arguments

pages

a list of data frames, each representing a page of data

Details

The rbind_pages function uses vctrs::vec_rbind() to bind the pages together. This generalizes base::rbind() in two ways:

  • Not every column has to be present in each of the individual data frames; missing columns are filled with NA values.

  • Data frames can be nested (can contain other data frames).

Examples

library(jsonlite)

# Basic example
x <- data.frame(foo = rnorm(3), bar = c(TRUE, FALSE, TRUE))
y <- data.frame(foo = rnorm(2), col = c("blue", "red"))
rbind_pages(list(x, y))
#>          foo   bar  col
#> 1 -1.8218177  TRUE <NA>
#> 2 -0.2473253 FALSE <NA>
#> 3 -0.2441996  TRUE <NA>
#> 4 -0.2827054    NA blue
#> 5 -0.5536994    NA  red
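
# Nested example (an illustrative sketch, not part of the original examples).
# The column 'info' is itself a data frame, and its 'age' column is present
# only on the second page. The names p1, p2, id, info, name and age are
# made up for illustration.
p1 <- data.frame(id = 1:2)
p1$info <- data.frame(name = c("a", "b"))
p2 <- data.frame(id = 3L)
p2$info <- data.frame(name = "c", age = 30)
nested <- rbind_pages(list(p1, p2))
nrow(nested)
names(nested$info)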

# \donttest{
baseurl <- "https://projects.propublica.org/nonprofits/api/v2/search.json"
pages <- list()
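for (i in 0:20) {
  # Announce the page before requesting it, so a failed request is easy to locate
  message("Retrieving page ", i)
  mydata <- fromJSON(paste0(baseurl, "?order=revenue&sort_order=desc&page=", i))
  # API pages are numbered from 0, R lists from 1
  pages[[i + 1]] <- mydata$organizations
}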
#> Retrieving page 0
#> Retrieving page 1
#> Retrieving page 2
#> Retrieving page 3
#> Retrieving page 4
#> Retrieving page 5
#> Retrieving page 6
#> Retrieving page 7
#> Retrieving page 8
#> Retrieving page 9
#> Retrieving page 10
#> Retrieving page 11
#> Retrieving page 12
#> Retrieving page 13
#> Retrieving page 14
#> Retrieving page 15
#> Retrieving page 16
#> Retrieving page 17
#> Retrieving page 18
#> Retrieving page 19
#> Retrieving page 20
organizations <- rbind_pages(pages)
nrow(organizations)
#> [1] 525
colnames(organizations)
#>  [1] "ein"           "strein"        "name"          "sub_name"     
#>  [5] "city"          "state"         "ntee_code"     "raw_ntee_code"
#>  [9] "subseccd"      "has_subseccd"  "have_filings"  "have_extracts"
#> [13] "have_pdfs"     "score"        
# }