Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify response parsing #572

Merged
merged 43 commits into from
Jul 19, 2021
Merged

Simplify response parsing #572

merged 43 commits into from
Jul 19, 2021

Conversation

llrs
Copy link
Member

@llrs llrs commented Apr 22, 2021

Simplify object parsing towards #558

Steps:

  • Internal test for objects
  • Simplify and understand code
  • Provide helpers

tweets_to_tbl_ is doing the internal tests and transforming the data.frame from X columns to Y columns for all these functions:

function/endpoint From To
get_retweets 24 73
get_timeline_user
lookup_tweets
lookup_users
get_favorites_user
get_timeline
lists_status
get_my_timeline
search_users
search_tweets
post_message
search_fullarchive
search_30day
tweet_shot

*Numbers from the test cases.

Probably a list of endpoints that supply each kind of object will be much useful.

@llrs llrs changed the title Use new helper function to simplify double calls to has_name_ Simplify response parsing Apr 22, 2021
@llrs llrs mentioned this pull request Apr 28, 2021
@llrs
Copy link
Member Author

llrs commented Apr 30, 2021

Updating the table:

function/endpoint From To
get_retweets 24 73
get_timeline_user 31 73
lookup_tweets 26 73
lookup_users 44 20
get_favorites_user 30 73 + 1
get_timeline 31 73
lists_status 31 73
get_my_timeline 32 73
search_users 44 20
search_tweets 31 73

On some cases there is less columns when the content is parsed than when not (lookup_users, search_users)! This is due to the users_with_tweets function, but apparently the API is returning lots of data from deprecated arguments but misses some data from supported values.
I'll start improving the user-object

The 73 + 1 is the user added when parsed to know which user has liked which tweet.

@llrs
Copy link
Member Author

llrs commented May 5, 2021

Documenting progress, about blocking problems at the moment:

  • Tweets, and all responses, are now parsed based on the objects they contain.
  • The latest changes trigger a warning: "Setting row names on a tibble is deprecated.", although row names are not used in any place to parse tweets. It only happens on do.call("rbind", ) from tweets_with_users and users_with_tweets.
  • For instance out <- get_favorites(users, n = 20) triggers an error z[seq_len(nro), ] <- y : replacement has length zero on xpdrows.data.frame. Which is surprising and couldn't understand why or how it is triggered. I think it is related to having data.frames inside data.frames inside data.frames.
  • The new data.frame has much less columns and different names to keep consistent with the API, so functions (and the corresponding test) that depend on certain columns fail.
  • Due to the above problems, it hasn't been tested if the tweets object has the same columns regardless of the endpoint called (Is this desired?) .

@llrs llrs marked this pull request as ready for review June 27, 2021 15:40
@llrs
Copy link
Member Author

llrs commented Jun 27, 2021

Many changes, but still need to add a test for #574 (and didn't check for #575 but should be fixed too)

  • Added some test for different type of tweets so that rtweet returns the same information
  • The columns are on the same order
  • Localization and other objects are now correctly identified/used
  • Some reorganizing on network_data

Maybe some columns can be further nested or hidden on an attribute.
There isn't any helper to identify replies, comments, quotes and simply status.
Moving away of tibbles as nested tibbles cannot be rbinded without a warning and would mean adding a new direct dependency tidyverse/tibble#898

@llrs
Copy link
Member Author

llrs commented Jun 29, 2021

@hadley Maybe you can give a look at this PR. Probably the best way to start is on the tweet function and the new functions to parse twitter objects.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some great improvements overall! Questions/comments below.

R/entities_objects.R Show resolved Hide resolved
R/entities_objects.R Show resolved Hide resolved
R/entities_objects.R Outdated Show resolved Hide resolved
R/entities_objects.R Outdated Show resolved Hide resolved
# <https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/entities#hashtags>
hashtags <- function(x) {
if (NROW(x) == 0) {
return(data.frame(text = NA, indices = I(list(NA)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct that a zero row input yields a one row output?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the reason is to not omit anything for later easier parsing and linking between hasthags and tweets. There is also the problem when a user hasn't tweeted if the information about the tweets of that users and other users are collected it would break the tweet_data extraction see #574 for such a report.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think you always want to return a data frame here, but are you sure it should have one row and not zero rows?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'm sure. What is your concern? Filling the object with mostly empty data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just generally reasoning about this function. I don't understand how all the pieces fit together but this feels like a code smell.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some tests and it seems that mixing NAs and data.frames would be correctly handled by rbind unless there are nested data.frames. I'll wait a bit to merge the PR but I can't think of any workaround that won't complicate more the workflow.

Tests
``` r
l <- list(a = NA, b = data.frame(a = "b", b = "a"), c = data.frame(a = c("C", "D"), b = c("B", "C")))
do.call(rbind, l)
#>        a    b
#> a   <NA> <NA>
#> b      b    a
#> c.1    C    B
#> c.2    D    C
l <- list(a = data.frame(a = "b", b = "a"), b = NA, c = data.frame(a = c("C", "D"), b = c("B", "C")))
do.call(rbind, l)
#>        a    b
#> a      b    a
#> b   <NA> <NA>
#> c.1    C    B
#> c.2    D    C
l <- list(a = data.frame(a = "b", b = "a"), b = NA, c = data.frame(a = "C", b = "B"))
do.call(rbind, l)
#>      a    b
#> a    b    a
#> b <NA> <NA>
#> c    C    B
l <- list(a = data.frame(a = "b", b = "a"), b = data.frame(a = c("C", "D"), b = c("B", "C")), c = NA)
do.call(rbind, l)
#>        a    b
#> a      b    a
#> b.1    C    B
#> b.2    D    C
#> c   <NA> <NA>
library("tibble")
l <- list(a = NA, b = tibble(a = "b", b = "a"), c = tibble(a = c("C", "D"), b = c("B", "C")))
do.call(rbind, l)
#> # A tibble: 4 x 2
#>   a     b    
#> * <chr> <chr>
#> 1 <NA>  <NA> 
#> 2 b     a    
#> 3 C     B    
#> 4 D     C
l <- list(a = tibble(a = "b", b = "a"), b = NA, c = tibble(a = c("C", "D"), b = c("B", "C")))
do.call(rbind, l)
#> # A tibble: 4 x 2
#>   a     b    
#> * <chr> <chr>
#> 1 b     a    
#> 2 <NA>  <NA> 
#> 3 C     B    
#> 4 D     C
l <- list(a = tibble(a = "b", b = "a"), b = NA, c = tibble(a = "C", b = "B"))
do.call(rbind, l)
#> # A tibble: 3 x 2
#>   a     b    
#> * <chr> <chr>
#> 1 b     a    
#> 2 <NA>  <NA> 
#> 3 C     B
l <- list(a = tibble(a = "b", b = "a"), b = tibble(a = c("C", "D"), b = c("B", "C")), c = NA)
do.call(rbind, l)
#> # A tibble: 4 x 2
#>   a     b    
#> * <chr> <chr>
#> 1 b     a    
#> 2 C     B    
#> 3 D     C    
#> 4 <NA>  <NA>


l <- list(a = NA, b = data.frame(a = I(data.frame(c = 1)), b = "a"), c = data.frame(a = I(data.frame(c = c(2, 3))), b = c("B", "C")))
do.call(rbind, l)
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
l <- list(a = data.frame(a = I(data.frame(c = 1)), b = "a"), b = NA, c = data.frame(a = I(data.frame(c = c(2, 3))), b = c("B", "C")))
do.call(rbind, l)
#> Warning: non-unique values when setting 'row.names': '1', '2'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
l <- list(a = data.frame(a = I(data.frame(c = 1)), b = "a"), b = NA, c = data.frame(a = I(data.frame(c = c(2))), b = "B"))
do.call(rbind, l)
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
l <- list(a = data.frame(a = I(data.frame(c = 1)), b = "a"), b = data.frame(a = I(data.frame(c = c(2, 3))), b = c("B", "C")), c = NA)
do.call(rbind, l)
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
l <- list(a = NA, b = tibble(a = I(data.frame(c = 1)), b = "a"), c = tibble(a = I(data.frame(c = c(2, 3))), b = c("B", "C")))
do.call(rbind, l)
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
l <- list(a = tibble(a = I(data.frame(c = 1)), b = "a"), b = NA, c = tibble(a = I(data.frame(c = c(2, 3))), b = c("B", "C")))
do.call(rbind, l)
#> Warning: non-unique values when setting 'row.names': '1', '2'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
l <- list(a = tibble(a = I(data.frame(c = 1)), b = "a"), b = NA, c = tibble(a = I(data.frame(c = c(2))), b = "B"))
do.call(rbind, l)
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
l <- list(a = tibble(a = I(data.frame(c = 1)), b = "a"), b = tibble(a = I(data.frame(c = c(2, 3))), b = c("B", "C")), c = NA)
do.call(rbind, l)
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed

Created on 2021-07-13 by the reprex package (v2.0.0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something, but I don't see you creating an zero row data frames in your tests? But I don't really understand how all of this data flows together, and there's likely to be something upstream that means my suggestion wouldn't work.

R/graph-network.R Outdated Show resolved Hide resolved
R/graph-network.R Outdated Show resolved Hide resolved
R/utils.R Outdated
@@ -100,7 +104,7 @@ maybe_n <- function(x) {
}

is_testing <- function() {
identical(Sys.getenv("TESTTHAT"), "true")
requireNamespace("testthat", quietly = TRUE) && identical(Sys.getenv("TESTTHAT"), "true")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That check shouldn't be needed since only testthat should be setting the envvar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should because later on there are some function calling testthat function inside the package (skip_*). I swapped the order though so that if the variable is not present the package won't be loaded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a good idea to change is_testing() because it's a standard helper you'd generally expect to be copied directly from testthat. Where is skip() being used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two cases where this happens; on http.R lines 309-312:

handle_rate_limit <- function(x, api, retryonratelimit = NULL, verbose = TRUE) {
  if (is_testing()) {
    testthat::skip("Rate limit exceeded")
  }
....

and auth.R lines 273-275

no_token <- function() {
  if (is_testing()) {
    testthat::skip("Auth not available")
...

I could move the check outside the function, but I thought it was more accurately to leave it inside it.

tests/testthat/test-entities_objects.R Outdated Show resolved Hide resolved
tests/testthat/test-extractors.R Show resolved Hide resolved
@llrs
Copy link
Member Author

llrs commented Jul 6, 2021

Many thanks for all the comments. Will try to reply and address them soon (1 or 2 weeks probably).

@mkearney
Copy link
Collaborator

mkearney commented Oct 20, 2021

I'm moving this comment to #558 because it's more generally about the changes and less about implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants