
streetnamer

Lifecycle: experimental

The goal of streetnamer is to facilitate the matching of street names to their Wikidata identifiers.

This is an early release. Some parts of the interface work, others don’t, and we are aware that there are still plenty of issues to fix. But if you are tolerant of glitches, it will mostly work.

You have been warned.

A hosted version of this interface is available online. You can check it out there, contribute, and retrieve data.

On coding conventions

Technically, some things have been developed in line with best practices, with attention to the modularisation of the web interface (components can be run and tested independently); others have been written as quick and ugly hacks to get things working.

Again, you have been warned.

Long term, the goal for streetnamer is to have a fully documented, consistently modularised, largely customisable, and easy-to-deploy Shiny app. Ideally, the interface would also be available in languages other than English. We are still quite far from that goal.

Installation

You can install the development version of streetnamer with:

install.packages("tidywikidatar") # On CRAN. Developed by yours truly: https://github.com/EDJNet/tidywikidatar/
# remotes::install_github("EDJNet/tidywikidatar") 

remotes::install_github("giocomai/latlon2map") # required dependency not on CRAN

remotes::install_github("EDJNet/streetnamer")

This package relies heavily on tidywikidatar.

Since all three packages (streetnamer, latlon2map, and tidywikidatar) are developed concurrently, each handling a separate group of tasks, at this stage updates affecting the app may land in any of them. Hence, if anything does not work as expected, please update all three packages before reporting an issue.
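A quick way to do so is to re-install the development version of each:

remotes::install_github("EDJNet/tidywikidatar")
remotes::install_github("giocomai/latlon2map")
remotes::install_github("EDJNet/streetnamer")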

How does it work?

To get a preview of what the interface looks like, you can run the following code chunks and then sn_run_app().

Keep in mind that OpenStreetMap data for the whole country are downloaded when you first select a city, so be prepared to wait many minutes. Municipality-level data are cached and retrieved efficiently afterwards.

library("streetnamer")
library("latlon2map")
library("tidywikidatar")
options(timeout = 60000) # big timeout, as big downloads needed 

ll_set_folder(path = fs::path(fs::path_home_r(),
                              "R",
                              "ll_data"))
#> /home/g/R/ll_data
sn_set_data_folder(fs::path(fs::path_home_r(),
                            "R",
                            "sn_data"))

# tidywikidatar cache
tw_set_cache_folder(path = fs::path(fs::path_home_r(),
                            "R",
                            "tw_data"))

## or in a temporary folder for testing

# tw_set_cache_folder(path = fs::path(tempdir(),
#                                     stringi::stri_rand_strings(n = 1, length = 24)))
#             

tw_create_cache_folder(ask = FALSE)

## if using RStudio, I'd suggest you open the app in your default browser
## rather than in RStudio's viewer, by enabling the following option

# options(shiny.launch.browser = .rs.invokeShinyWindowExternal)

# sn_run_app()

Function naming conventions

streetnamer has two main types of functions:

  • a set of functions used to facilitate processing, which can be used from the command line or internally by the Shiny app: they all start with sn_ followed by a verb, e.g. sn_get_lau_street_names()
  • a set of functions that are in effect Shiny modules (see below). They typically start with mod_sn_ and are currently not exported (as is customary for non-exported functions, they can be accessed with the triple colon :::, e.g. streetnamer:::mod_sn_street_info_app).
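For instance, the two styles look as follows (the exact signature of sn_get_lau_street_names() is not shown in this README, so the gisco_id argument below is an assumption based on the other functions used here):

# exported helper: sn_ followed by a verb
# street_names_df <- sn_get_lau_street_names(gisco_id = "IT_022205")

# non-exported Shiny module, accessed with the triple colon
# streetnamer:::mod_sn_street_info_app(street_name = "Belvedere San Francesco",
#                                      gisco_id = "IT_022205")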

Shiny modules

In order to facilitate development, as well as to allow integration of component parts of this app in spin-off projects, key components of the Shiny app have been developed as modules and can be tested independently.

Module that shows info about Wikidata

streetnamer:::mod_sn_street_info_app(street_name = "Belvedere San Francesco",
                                     gisco_id = "IT_022205")

Module for showing

Module for exporting data

What happens in the background

The selectors at the top and on the left allow you to pick a municipality, and then a street.

When you click on a street name, a set of options for adding data about that street name appears. These are the choices provided by this module:

streetnamer:::mod_sn_street_info_app(street_name = "Belvedere San Francesco",
                                     gisco_id = "IT_022205")

All the choices made in this interface are transformed into a data frame, which is written to a database:

sn_write_street_named_after_id(
  gisco_id = "IT_022205",
  country = "IT",
  street_name = "Belvedere San Francesco",
  person = TRUE,
  named_after_id = "Q676555",
  gender = "male",
  category = "religion",
  tag = "",
  checked = TRUE,
  session = "testing",
  append = TRUE,
  overwrite = FALSE,
  disconnect_db = TRUE
)
#> # A tibble: 1 × 14
#>   gisco_id  street_name      country named_after_id person gender category tag  
#>   <chr>     <chr>            <chr>   <chr>           <int> <chr>  <chr>    <chr>
#> 1 IT_022205 Belvedere San F… IT      Q676555             1 male   religion ""   
#> # ℹ 6 more variables: checked <int>, ignore <int>, named_after_n <int>,
#> #   named_after_custom_label <chr>, session <chr>, time <dttm>


street_info_df <- sn_get_street_named_after_id(
  gisco_id = "IT_022205",
  street_name = "Belvedere San Francesco",
  country = "IT"
)

street_info_df %>% 
  dplyr::distinct(gisco_id, .keep_all = TRUE) %>% 
  tidyr::pivot_longer(cols = dplyr::everything(),
                      names_to = "type",
                      values_to = "value",
                      values_drop_na = FALSE,
                      values_transform = as.character) %>% 
  print(n = 100)
#> # A tibble: 14 × 2
#>    type                     value                    
#>    <chr>                    <chr>                    
#>  1 gisco_id                 "IT_022205"              
#>  2 street_name              "Belvedere San Francesco"
#>  3 country                  "IT"                     
#>  4 named_after_id           "Q676555"                
#>  5 person                   "1"                      
#>  6 gender                   "male"                   
#>  7 category                 "religion"               
#>  8 tag                      ""                       
#>  9 checked                  "1"                      
#> 10 ignore                    <NA>                    
#> 11 named_after_n            "1"                      
#> 12 named_after_custom_label  <NA>                    
#> 13 session                  "testing"                
#> 14 time                     "1683793655.75202"

Each time the “confirm” button is clicked, a new row is added to the database. Hence, when you process the data you need to decide which criterion to use for keeping rows, e.g. the most recent row, or the most frequently confirmed one.
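For instance, a minimal sketch of keeping only the most recent entry per street, using the time column shown above (the choice of criterion is entirely up to you):

# keep only the most recent confirmation for each street
# note: streets dedicated to more than one person have more than one valid row
street_info_df %>% 
  dplyr::group_by(gisco_id, street_name) %>% 
  dplyr::slice_max(order_by = time, n = 1, with_ties = FALSE) %>% 
  dplyr::ungroup()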

This set of data supports a number of special cases and different degrees of information that can be shared:

Done:

  • data can be confirmed at either the country or the city level
    • we expect data to be valid if confirmed at the country level, but especially with common surnames (or, e.g., common names of saints, where one city has places dedicated to a locally born but globally less famous saint) it may be useful to check data at the city level
    • when checking whether a street is tagged, this can effectively be done by filtering on either the gisco_id column or the country column
  • it is possible to ignore a given street name
    • in OpenStreetMap it is relatively common to have streets without a proper street name, mostly because they are improperly tagged (e.g. just a number, or a hyphen), or because they have descriptive names that are not actually street names (e.g. “access ramp to hospital”). These should simply be ignored and not included in the count of total streets.
    • this is expressed via the ignore column, with expected values of either 1 (TRUE) or 0 (FALSE)
  • it is possible to confirm that a street name is not named after a human, without adding anything else
    • this is useful because in some use cases the main point of interest is humans, and requiring a Wikidata identifier would needlessly prolong checking times
    • this is expressed via the person column, with expected values of either 1 (TRUE) or 0 (FALSE)
  • it is possible to state that a street is named after more than one person/individual
    • this is achieved with a column recording how many entities the street is dedicated to, named_after_n. When reading the data, if named_after_n is more than 1, then more than one row of data is expected to be found. It is the responsibility of those who read the data to deal with potential inconsistencies (see the sketch after this list).
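A minimal sketch of how the ignore and person flags can be used when reading the data back (column names as in the output printed above):

# keep only checked streets that are named after humans,
# dropping rows flagged as not being proper street names
street_info_df %>% 
  dplyr::filter(is.na(ignore) | ignore == 0) %>% 
  dplyr::filter(person == 1, checked == 1)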

To do:

Deduplication

  • add the Wikidata identifier of the actual street - this can be useful, as a number of properties are associated with it, possibly including different values for “named after” with qualifiers when street names have changed
    • this is achieved with a separate column, wikidata_street_id. This should always be considered in combination with a given municipality.

Caching

Rather than adopting a separate caching infrastructure, streetnamer relies on the caching infrastructure of tidywikidatar. In brief, it generates separate tables with non-conflicting names in the same database used by tidywikidatar (be it a local SQLite database or another ODBC-compliant database server).

Deployed shiny app

Given that Shiny Server limits access to environment variables, for the deployed app a database connection must be passed directly to sn_run_app(); it cannot simply be set before startup (which works fine when running the app locally).
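A minimal sketch, assuming sn_run_app() accepts the connection through a connection argument (the argument name, the SQLite backend, and the file path are assumptions here; adapt them to your deployment):

# create the connection explicitly, rather than relying on environment variables
# con <- DBI::dbConnect(RSQLite::SQLite(), "/srv/shiny-server/streetnamer/sn_cache.sqlite")
# sn_run_app(connection = con)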

A workflow for mass processing outside of the shiny interface

N.B.: we used this approach, and it eventually worked, but lots of attention needs to be paid to the way files are shared.

First, as usual, you need to set up the folders where data will be stored

library("streetnamer")
library("latlon2map")
library("tidywikidatar")
options(timeout = 60000) # big timeout, as big downloads needed 

ll_set_folder(path = fs::path(fs::path_home_r(),
                              "R",
                              "ll_data"))
#> /home/g/R/ll_data

sn_set_data_folder(fs::path(fs::path_home_r(),
                            "R",
                            "sn_streetnamer_data"))

sn_create_data_folder(ask = FALSE)

# tidywikidatar cache
tw_set_cache_folder(path = fs::path(fs::path_home_r(),
                            "R",
                            "tw_streetnamer_data"))

tw_create_cache_folder(ask = FALSE)

tw_enable_cache(SQLite = TRUE)

Then, let’s say we want to find out to whom streets are dedicated in Berlin. We can find a full list of municipalities with ll_get_lau_eu():

ll_get_lau_eu() %>% 
  sf::st_drop_geometry() %>% 
  dplyr::filter(stringr::str_detect(string = LAU_NAME, pattern = "Berlin"))
#> ℹ © EuroGeographics for the administrative boundaries
#> # A tibble: 10 × 9
#>    GISCO_ID   CNTR_CODE LAU_ID LAU_NAME POP_2021 POP_DENS_2 AREA_KM2  YEAR FID  
#>    <chr>      <chr>     <chr>  <chr>       <int>      <dbl>    <dbl> <int> <chr>
#>  1 DE_072330… DE        07233… Berling…      222       61.7     3.60  2021 DE_0…
#>  2 DE_110000… DE        11000… Berlin,…  3664088     4109.    892.    2021 DE_1…
#>  3 DE_120600… DE        12060… Bernau …    40908      391.    105.    2021 DE_1…
#>  4 DE_160610… DE        16061… Berling…     1225      105.     11.7   2021 DE_1…
#>  5 DE_120643… DE        12064… Neuenha…    18832      968.     19.5   2021 DE_1…
#>  6 DE_120644… DE        12064… Rüdersd…    16025      229.     70.1   2021 DE_1…
#>  7 DE_120674… DE        12067… Schönei…    12899      755.     17.1   2021 DE_1…
#>  8 CH_CH4801  CH        CH4801 Berling…      890      228.      3.90  2021 CH_C…
#>  9 FR_57064   FR        57064  Berling       270       85.5     3.16  2021 FR_5…
#> 10 IT_017015  IT        017015 Berlingo     2731      595.      4.59  2021 IT_0…

Berlin’s gisco_id is: DE_11000000
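If you prefer to extract it programmatically from the table above, a sketch like the following should return "DE_11000000" (the population filter is just a convenient way to single out the city of Berlin among similarly named municipalities):

ll_get_lau_eu() %>% 
  sf::st_drop_geometry() %>% 
  dplyr::filter(stringr::str_detect(string = LAU_NAME, pattern = "Berlin"),
                CNTR_CODE == "DE") %>% 
  dplyr::slice_max(POP_2021, n = 1) %>% 
  dplyr::pull(GISCO_ID)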

The first step is to get the streets. The first time you run this, it will likely take a long time due to downloading and filtering, but results are cached automatically.

current_city <- "DE_11000000"
current_city_streets_sf <- ll_osm_get_lau_streets(gisco_id = current_city,
                                                  unnamed_streets = FALSE)
ggplot2::ggplot() +
  ggplot2::geom_sf(data = ll_get_lau_eu(gisco_id = current_city)) +
  ggplot2::geom_sf(data = current_city_streets_sf ) 
#> ℹ © EuroGeographics for the administrative boundaries

Now we’ll want to find to whom each street is dedicated.

If you have no other source of information, a good starting point is the following. Notice that this will take a long time the first time you run it (possibly a few hours for very big cities), but will work almost instantly afterwards thanks to local caching.

sn_search_named_after(gisco_id = current_city)

However, here are some common usage patterns. For example, rather than relying on the web interface, it may be quicker to check data in a spreadsheet. The following function exports data to a local subfolder (by default, sn_data) and stores csv files with all street names, along with automatic guesses of who each street is dedicated to (the same can also be exported to geojson by setting the export_format parameter).

For ease of processing, files with humans and non-humans will be stored separately.

sn_get_details_by_lau(gisco_id = current_city,
                      export_format = "csv",
                      additional_properties = NULL, # you probably don't need so many details at this stage
                      manual_check_columns = TRUE)

For convenience, if you want to have all municipalities of a country processed in order of population size, you can use sn_get_details_by_lau().

You can then fix the data in the spreadsheet by ticking the tick_if_wrong column with an x, and then filling in the columns whose names start with fixed_ (all others will be ignored).

More specifically:

  • tick_if_wrong: expected to be either x or empty. Since this package is mostly focused on humans, it expects the humans files to be checked most thoroughly: if the tick_if_wrong column is left empty for a given row, it will be assumed that the automatic matching is right. On the contrary, in the non_humans files, rows without a tick in the tick_if_wrong column will simply be ignored.
  • fixed_human: if a given row has a tick (typically, x), it means that the row refers to a human. If left empty, it does not refer to a human.
  • fixed_named_after_id: if left empty, it is assumed that the Wikidata identifier is not known. If given, it must correspond to a Wikidata Q identifier, such as Q539.
  • fixed_sex_or_gender: if left empty, no particular assumption will be made. If the Wikidata identifier is given, this can mostly be left empty, as the information will be derived from there. If given, it should be one of the options available in the online interface, or their shortened forms: female (f), male (m), other (o), uncertain (u).
  • fixed_category: can typically be left empty
  • fixed_n_dedicated_to: if left empty, assumed to be one. This can be used to express that a street is dedicated to more than one person: in that case, the row should be duplicated as many times as needed, and the same number included in each row of fixed_n_dedicated_to.

Recently produced files may also include the following columns:

  • named_after_custom_label: this can be used when a full, clean name of the person a street is dedicated to can be inferred, or is otherwise known, but no Wikidata identifier is available. Additional useful details can be added within brackets after the name.
  • fixed_ignore: if left empty, no assumption will be made. If ticked, it will be assumed that the row does not refer to a proper street.
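As an illustration, a manually fixed humans file might contain rows like these (a sketch with made-up street names; only the column names and the Q539 example come from the description above):

tibble::tribble(
  ~street_name,     ~tick_if_wrong, ~fixed_human, ~fixed_named_after_id, ~fixed_sex_or_gender, ~fixed_n_dedicated_to,
  "Beispielstraße", "x",            "x",          "Q539",                "",                   "",
  "Musterweg",      "",             "",           "",                    "",                   ""
)
# first row: the automatic match was wrong; the street is dedicated to a human
# identified by the Wikidata item Q539 (gender left empty, as it can be derived from Wikidata)
# second row: no tick in tick_if_wrong, so the automatic matching is assumed to be correct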

After a file is processed, it can be re-read and stored in the local database, or re-uploaded to the web interface.

Let us assume that we have stored the fixed files for Berlin in sn_data_fixed/Germany:

current_fixed_files_v <- fs::dir_ls(path = fs::path("sn_data_fixed", "Germany"), recurse = TRUE, type = "file", glob = "*.csv")

Here is the data frame summarising all confirmed information we have in those previously exported tables:

current_city_confirmed_df <- purrr::map_dfr(.x = current_fixed_files_v,
                                            .f = function(x) {
  sn_import_from_manually_fixed(input_df = x,
                                return_df_only = TRUE)
})
current_city_confirmed_df

For context: setting the parameter return_df_only to TRUE returns the data without storing it; setting it to FALSE stores it in the local database, from where it can be read with the following command.

sn_get_street_named_after_id(gisco_id = current_city)

Either way, current_city_confirmed_df should now include all confirmed humans as well as the custom fixed non-humans.

current_city_confirmed_df

The easiest way to get this data in a format that can easily be shared is to use sn_export_checked().

I will spell out parameters for clarity, but you may well be happy with the defaults.

output_df <- sn_export_checked(
  gisco_id = current_city,
  source = "fixed_csv",  # this could be set to database 
  include_image_credits = TRUE, # useful if you plan to use images, but time consuming, as this implies a separate API call
  unlist = TRUE,  # needs to be set to TRUE for CSV, but better set to FALSE if doing further processing in R
  # additional_properties = c("P39", "P509", "P140", "P611", "P411", "P241", "P410", "P97", "P607", "P27", "P172") # this is if you want more properties
  export_folder = "sn_data_export", # here is where you'll find your files if you export them
  export_format = "csv" # can also be "geojson". Leave it to NULL if you do not intend to export
)
output_df

Some summary stats:

NB: consider that a single street can be dedicated to more than one human, and that some entities (fictional characters, deities, etc.) are not humans, but may have a defined gender.

summary_df <- tibble::tribble(
  ~name, ~value,
  "gisco_id", unique(output_df$gisco_id),
  "municipality_name", ll_get_lau_eu(gisco_id = unique(output_df$gisco_id), silent = TRUE) %>% 
    dplyr::pull(LAU_NAME),
  "total_streets", current_city_streets_sf %>% 
    sf::st_drop_geometry() %>% 
    dplyr::distinct(name) %>% 
    nrow() %>% 
    scales::number(),
  "total_streets_named_after_humans", output_df %>% 
    dplyr::filter(as.logical(person), as.logical(checked)) %>% 
    dplyr::distinct(street_name) %>% 
    nrow() %>% 
    scales::number(),
  "total_streets_named_after_male", output_df %>% 
    dplyr::filter(gender_label_combo == "male") %>% 
    dplyr::distinct(street_name) %>% 
    nrow() %>% 
    scales::number(),
  "total_streets_named_after_female", output_df %>% 
    dplyr::filter(gender_label_combo == "female") %>% 
    dplyr::distinct(street_name) %>% 
    nrow() %>% 
    scales::number(),
  "total_streets_named_after_other_gender", output_df %>% 
    dplyr::filter(gender_label_combo == "other") %>% 
    dplyr::distinct(street_name) %>% 
    nrow() %>% 
    scales::number(),
  "total_streets_named_after_more_than_1_n", output_df %>% 
    dplyr::filter(is.na(named_after_n) == FALSE, named_after_n > 1) %>% 
    dplyr::distinct(street_name) %>% 
    nrow() %>% 
    scales::number(),
  "total_streets_named_after_human_with_qid", output_df %>% 
    dplyr::filter(as.logical(person), as.logical(checked), is.na(named_after_id) == FALSE) %>% 
    nrow() %>% 
    scales::number(),
  "total_streets_named_after_human_without_qid", output_df %>% 
    dplyr::filter(as.logical(person), as.logical(checked), is.na(named_after_id) == TRUE) %>% 
    nrow() %>% 
    scales::number(),
  "total_streets_named_after_human_with_unknown_gender", output_df %>% 
    dplyr::filter(as.logical(person), as.logical(checked), is.na(gender_label_combo) == TRUE) %>% 
    nrow() %>% 
    scales::number()
)

print(summary_df, n = 100)

And a quick summary map:

streets_combo_sf <- current_city_streets_sf %>% 
  dplyr::rename(street_name = name) %>% 
  dplyr::left_join(output_df, by = "street_name")

ggplot2::ggplot() +
  ggplot2::geom_sf(data = ll_get_lau_eu(gisco_id = current_city, silent = TRUE)) +
  ggplot2::geom_sf(data = streets_combo_sf %>% 
                     dplyr::filter(is.na(gender_label_combo)),
                   color = "lightgray") +
  ggplot2::geom_sf(data = streets_combo_sf %>% 
                     dplyr::filter(is.na(gender_label_combo) == FALSE),
                   mapping = ggplot2::aes(color = gender_label_combo)) +
  ggplot2::scale_color_viridis_d() +
  ggplot2::theme_minimal()

Data sources

  • OpenStreetMap data (© OpenStreetMap contributors) as kindly made available by Geofabrik

Desired features

It should be possible to deal with the following circumstances:

  • streets that are on OSM
  • streets that are available on other lists, but not on OSM
  • streets with or without a Wikidata identifier
  • streets that are or are not named after a person
  • different streets that are actually the same street (deduplication)
  • entries that are not streets / irrelevant
  • a single street with more than one Wikidata identifier (e.g. dedicated to two individuals)
  • adding a tag to each street (perhaps a free tag from a controlled vocabulary, e.g. to mark streets related to some issue that would not emerge from the relevant Wikidata identifier)

On naming things

OpenStreetMap groups all sorts of roads, streets, squares, and paths under the confusing label of “highway”. Within this package, the generic word used in functions and documentation is “streets”, as the package is expected to be used chiefly in reference to urban centres.

This package relies on different packages and data sources, hence maintaining full consistency in the naming of data columns is not always straightforward.

As a rule, the following column naming conventions should be found across outputs from this package:

  • street_name: the full street name, as it appears on OpenStreetMap (the legacy column name, possibly still found in some outputs, was name)
  • named_after_id: the Wikidata identifier of the person or entity a street has been named after (legacy names, used inconsistently, were id or wikidata_id)
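When combining current outputs with older ones that still use the legacy names, a quick rename keeps things consistent (legacy_df below is a hypothetical data frame using the old column names):

# hypothetical legacy output using the old column names
legacy_df <- tibble::tibble(name = "Belvedere San Francesco",
                            wikidata_id = "Q676555")

# rename to the current conventions
legacy_df %>% 
  dplyr::rename(street_name = name,
                named_after_id = wikidata_id)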

Contributing

Suggestions and contributions are welcome; they can be discussed via GitHub issues.

Copyright and credits

This package has been created by Giorgio Comai, data analyst and researcher at OBCT/CCI, within the scope of EDJNet, the European Data Journalism Network.

It is distributed under the MIT license.