Recovering tweet contents from temp files after using search_30day() #388

Closed

AltfunsMA opened this issue Jan 26, 2020 · 6 comments

AltfunsMA commented Jan 26, 2020

First of all, thank you for writing this amazing package. I recently "lost" quite a few tweets that were nonetheless stored as temp files by the query functions. I'm trying to recover their contents without success.

Thank you for any help!

Problem

Recovering tweet content from the temp files stored by search_30day() as .rds is difficult: the response body should be JSON, but it is stored as a raw vector and is not easily extracted.

Expected behavior

The content in the temp objects should be parsed.

Reproduce the problem

library(jsonlite)

response_object <- readRDS("~/tw/search30_tmp/20200123181732-1.rds")

str(response_object)
#> List of 10
#>  $ url        : chr "https://api.twitter.com/1.1/tweets/search/30day/Testing.json?query=%28climate%20OR%20%22global%20warming%22%20O"| __truncated__
#>  $ status_code: int 200
#>  $ headers    :List of 11
#>   ..$ content-encoding         : chr "gzip"
#>   ..$ content-length           : chr "74647"
#>   ..$ content-type             : chr "application/json; charset=utf-8"
#>   ..$ date                     : chr "Thu, 23 Jan 2020 07:17:31 GMT"
(snip)
#>  $ content    : raw [1:673974] 7b 22 72 65 ...
(snip)
#>  $ handle     :Class 'curl_handle' <externalptr> 
#>  - attr(*, "class")= chr "response"

raw_content <- response_object$content

tweets <- toJSON(raw_content)

str(tweets)
#>  'json' chr "[\"eyJyZXN1bHRzIjpbeyJjcmVhdGVkX2F0IjoiV2VkIEphbiAyMiAwNzoxMzozOSArMDAwMCAy\\nMDIwIiwiaWQiOjEyMTk4ODA3OTEwNjIzM"| __truncated__

text <- rawToChar(raw_content, multiple = TRUE)

paste(text[1:50], collapse = "")
#> [1] "{\"results\":[{\"created_at\":\"Wed Jan 22 07:13:39 +00"

rtweet version

0.6.9

@igorbrigadir

That looks like the entire response object, in which case you should be able to read it in the same way as an httr response, for example: https://datascienceplus.com/accessing-web-data-json-in-r-using-httr/

library(httr)

# The saved object is an httr response, so content() can extract the body as text
jsonResponseText <- content(response_object, as = "text")
jsonResponseText
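
From there, one way to get to the tweets themselves is to parse that text with jsonlite (a small sketch; it assumes the body is the premium search JSON with a results field, which the output in the issue suggests):

library(jsonlite)

# fromJSON() simplifies the "results" array into a data frame of tweets
parsed <- fromJSON(jsonResponseText)
tweets_df <- parsed$results
head(tweets_df$text)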


hadley commented Mar 4, 2021

Not sure how you got these files in the first place, but I'm pretty sure recovering data from them is out of scope for rtweet.

hadley closed this as completed Mar 4, 2021

AltfunsMA commented Mar 5, 2021

@hadley, thanks for revisiting all these issues. The files are generated by default by the rtweet functions that are used with the Premium tier Twitter API: search_30day and search_fullarchive. From the explanation of the safedir argument in those functions:

Name of directory to which each response object should be saved. If the directory doesn't exist, it will be created. If NULL (the default) then a dir will be created in the current working directory. To override/deactivate safedir set this to FALSE.

The intended purpose seems to be to help recover data that has been purchased rather than downloaded for free.
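
For illustration, files like the ones in my example above get written by a call along these lines (just a sketch of the documented interface; tk is a token created beforehand and the directory is only an example):

library(rtweet)

# Each response object is written to safedir as a .rds file as soon as it arrives
tweets <- search_30day(
  "#rstats",
  n = 1000,
  env_name = "my_project",
  safedir = "~/tw/search30_tmp",
  token = tk
)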

But the user is left on their own with a raw Twitter response object of a kind they don't have to deal with anywhere else in rtweet, where the default is to get a nice tibble with everything parsed.

I think this could be satisfactorily addressed by adding some pointers to the documentation of those functions, starting with @igorbrigadir's suggestion above. I have since become a bit more proficient at handling these objects, so I had forgotten about this; happy to investigate a pithy way of pointing people in the right direction if you think that's useful.


hadley commented Mar 5, 2021

I think I'd hold off on this because this code is likely to be refactored in the future, and safedir has been temporarily removed because I couldn't figure out what its purpose was. Could you explain what you want to get out of this feature?


AltfunsMA commented Mar 7, 2021

I understand it as an easy fail-safe to prevent data (and funds) loss.

rtweet::search_30day("#rstats", n = 100000, env_name = "my_project", token = tk) uses up 200 queries at the paid tier level (the paid premium endpoints return up to 500 tweets per request), and if all goes well I get a nicely formatted 100K x 90 tibble. Perfect! I can pretty much start the analysis!

However, if for some reason the function does not return, I've wasted as many queries as the function has managed to carry out (maybe all of them!), which can be a big chunk of my allowance.

The safedir argument ensures that every time I get a response it is stored somewhere, so it is not wasted.

One may consider the risk of search_30day failing to be very low, but testing that is risky in itself. Again, one could advise users to keep n in each call to search_30day at a lower level and take other precautions against wasting queries (roughly the pattern sketched just below); but those are not necessarily going to be any better than what's already implemented with safedir.
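
A rough sketch of that kind of precaution, assuming search_30day's fromDate/toDate arguments ("YYYYMMDDHHmm" strings) and the usual parsed columns (created_at, status_id); tk is a token created beforehand:

library(rtweet)

chunks <- list()
oldest <- NULL  # toDate for the next request; NULL means "now"

for (i in 1:10) {
  chunk <- search_30day("#rstats", n = 10000, env_name = "my_project",
                        toDate = oldest, token = tk)
  saveRDS(chunk, sprintf("rstats_chunk_%02d.rds", i))  # persist each chunk immediately
  if (nrow(chunk) == 0) break
  chunks[[i]] <- chunk
  # Next request: only tweets older than the oldest one already collected
  # (the boundary minute may be re-requested, hence the de-duplication below)
  oldest <- format(min(chunk$created_at), "%Y%m%d%H%M", tz = "UTC")
}

tweets <- dplyr::bind_rows(chunks)
tweets <- tweets[!duplicated(tweets$status_id), ]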

Having safedir and a suitable format_tweets() function that takes the folder of Twitter response objects and returns the aforementioned tibble would give someone who just needs a bit of Twitter data a lot of peace of mind (a rough sketch follows at the end of this comment). Even a couple of instructions on how to do the work of format_tweets() would be great, I reckon.

Those are my two cents. I hope they make sense! I'm far from a Twitter API guru myself, but happy to help out if I can.
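
To be concrete, this is roughly what I imagine format_tweets() doing; just a sketch leaning on jsonlite and dplyr, not a proposal for the actual implementation:

library(jsonlite)

# Sketch: read every saved response in a safedir and return one data frame
format_tweets <- function(safedir) {
  files <- list.files(safedir, pattern = "\\.rds$", full.names = TRUE)
  results <- lapply(files, function(f) {
    resp <- readRDS(f)
    # Each premium search response body has a "results" field;
    # flatten = TRUE unnests things like the user object into columns
    fromJSON(rawToChar(resp$content), flatten = TRUE)$results
  })
  # Responses may have slightly different columns, so let bind_rows() fill the gaps
  dplyr::bind_rows(results)
}

tweets <- format_tweets("~/tw/search30_tmp")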


hadley commented Mar 8, 2021

@AltfunsMA due to other changes, rate-limited pagination will always return early — you'll get a warning but no error.

Otherwise, I'm not sure whether it's the job of search_tweets() to do this. The problem is that there are many types of error, and it's really up to you as the user of rtweet to decide what to do with them (there's some more discussion along those lines at #339 (comment)). That said, dealing with HTTP errors is tricky, so maybe there should be some standard way to save progress for long-running paginated queries. I'll make a note to consider this further.
