Recovering tweet contents from temp files after using search_30day() #388

Closed

AltfunsMA opened this issue Jan 26, 2020 · 6 comments

AltfunsMA commented Jan 26, 2020

First of all, thank you for writing this amazing package. I recently "lost" quite a few tweets that were nonetheless stored as temp files by the query functions. I'm trying to recover their contents without success.

Thank you for any help!

Problem

Recovering tweet content from the temp files stored by search_30day() as .rds is difficult: the response body should be JSON, but it is stored as a raw vector and is not easily extracted.

Expected behavior

The content in the temp objects should be parsed.

Reproduce the problem

library(jsonlite)

response_object <- readRDS("~/tw/search30_tmp/20200123181732-1.rds")

str(response_object)
#> List of 10
#>  $ url        : chr "https://api.twitter.com/1.1/tweets/search/30day/Testing.json?query=%28climate%20OR%20%22global%20warming%22%20O"| __truncated__
#>  $ status_code: int 200
#>  $ headers    :List of 11
#>   ..$ content-encoding         : chr "gzip"
#>   ..$ content-length           : chr "74647"
#>   ..$ content-type             : chr "application/json; charset=utf-8"
#>   ..$ date                     : chr "Thu, 23 Jan 2020 07:17:31 GMT"
(snip)
#>  $ content    : raw [1:673974] 7b 22 72 65 ...
(snip)
#>  $ handle     :Class 'curl_handle' <externalptr> 
#>  - attr(*, "class")= chr "response"

raw_content <- response_object$content

tweets <- toJSON(raw_content)

str(tweets)
#>  'json' chr "[\"eyJyZXN1bHRzIjpbeyJjcmVhdGVkX2F0IjoiV2VkIEphbiAyMiAwNzoxMzozOSArMDAwMCAy\\nMDIwIiwiaWQiOjEyMTk4ODA3OTEwNjIzM"| __truncated__

text <- rawToChar(raw_content, multiple = TRUE)

paste(text[1:50], collapse = "")
#> [1] "{\"results\":[{\"created_at\":\"Wed Jan 22 07:13:39 +00"

rtweet version

0.6.9

@igorbrigadir

That looks like the entire response object, in which case you should be able to read it in the same way as an httr response, for example: https://datascienceplus.com/accessing-web-data-json-in-r-using-httr/

library(httr)

# The saved object is an httr response, so content() can extract the body as text
jsonResponseText <- content(response_object, as = "text")
jsonResponseText
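
From there, one way to get to the tweets themselves is to parse that text with jsonlite (a small sketch; it assumes the body is the premium search JSON with a results field, which the output in the issue suggests):

library(jsonlite)

# fromJSON() simplifies the "results" array into a data frame of tweets
parsed <- fromJSON(jsonResponseText)
tweets_df <- parsed$results
head(tweets_df$text)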


hadley commented Mar 4, 2021

Not sure how you got these files in the first place, but I'm pretty sure recovering data from them is out of scope for rtweet.

hadley closed this as completed Mar 4, 2021

AltfunsMA commented Mar 5, 2021

@hadley, thanks for revisiting all these issues. The files are generated by default by the rtweet functions that are used with the Premium tier Twitter API: search_30day and search_fullarchive. From the explanation of the safedir argument in those functions:

Name of directory to which each response object should be saved. If the directory doesn't exist, it will be created. If NULL (the default) then a dir will be created in the current working directory. To override/deactivate safedir set this to FALSE.

The intended purpose seems to be to help recover data that has been purchased rather than downloaded for free.
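
For illustration, files like the ones in my example above get written by a call along these lines (just a sketch of the documented interface; tk is a token created beforehand and the directory is only an example):

library(rtweet)

# Each response object is written to safedir as a .rds file as soon as it arrives
tweets <- search_30day(
  "#rstats",
  n = 1000,
  env_name = "my_project",
  safedir = "~/tw/search30_tmp",
  token = tk
)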

But the user is left on their own with a raw Twitter response object of a kind they don't have to deal with anywhere else in rtweet, where the default is to get a nice tibble with everything parsed.

I think this could be satisfactorily addressed by adding some pointers to the documentation of those functions, starting with @igorbrigadir's suggestion above. I have since become a bit more proficient at handling these objects, so I had forgotten about this; happy to investigate a pithy way of pointing people in the right direction if you think that's useful.


hadley commented Mar 5, 2021

I think I'd hold off on this because this code is likely to be refactored in the future, and safedir has been temporarily removed because I couldn't figure out what its purpose was. Could you explain what you want to get out of this feature?


AltfunsMA commented Mar 7, 2021

I understand it as an easy fail-safe to prevent data (and funds) loss.

rtweet::search_30day("#rstats", n = 100000, env_name = "my_project", token = tk) uses up 200 queries at the paid tier level (the paid premium endpoints return up to 500 tweets per request), and if all goes well I get a nicely formatted 100K x 90 tibble. Perfect! I can pretty much start the analysis!

However, if for some reason the function does not return, I've wasted as many queries as the function has managed to carry out (maybe all of them!), which can be a big chunk of my allowance.

The safedir argument ensures that every time I get a response it is stored somewhere, so it is not wasted.

One may consider the risk of search_30day failing to be very low, but testing that is risky in itself. Again, one could advise users to keep n in each call to search_30day at a lower level and take other precautions against wasting queries (roughly the pattern sketched just below); but those are not necessarily going to be any better than what's already implemented with safedir.
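
A rough sketch of that kind of precaution, assuming search_30day's fromDate/toDate arguments ("YYYYMMDDHHmm" strings) and the usual parsed columns (created_at, status_id); tk is a token created beforehand:

library(rtweet)

chunks <- list()
oldest <- NULL  # toDate for the next request; NULL means "now"

for (i in 1:10) {
  chunk <- search_30day("#rstats", n = 10000, env_name = "my_project",
                        toDate = oldest, token = tk)
  saveRDS(chunk, sprintf("rstats_chunk_%02d.rds", i))  # persist each chunk immediately
  if (nrow(chunk) == 0) break
  chunks[[i]] <- chunk
  # Next request: only tweets older than the oldest one already collected
  # (the boundary minute may be re-requested, hence the de-duplication below)
  oldest <- format(min(chunk$created_at), "%Y%m%d%H%M", tz = "UTC")
}

tweets <- dplyr::bind_rows(chunks)
tweets <- tweets[!duplicated(tweets$status_id), ]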

Having safedir and a suitable format_tweets() function that takes the folder of Twitter response objects and returns the aforementioned tibble would give someone who just needs a bit of Twitter data a lot of peace of mind (a rough sketch follows at the end of this comment). Even a couple of instructions on how to do the work of format_tweets() would be great, I reckon.

Those are my two cents. I hope they make sense! I'm far from a Twitter API guru myself, but happy to help out if I can.
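
To be concrete, this is roughly what I imagine format_tweets() doing; just a sketch leaning on jsonlite and dplyr, not a proposal for the actual implementation:

library(jsonlite)

# Sketch: read every saved response in a safedir and return one data frame
format_tweets <- function(safedir) {
  files <- list.files(safedir, pattern = "\\.rds$", full.names = TRUE)
  results <- lapply(files, function(f) {
    resp <- readRDS(f)
    # Each premium search response body has a "results" field;
    # flatten = TRUE unnests things like the user object into columns
    fromJSON(rawToChar(resp$content), flatten = TRUE)$results
  })
  # Responses may have slightly different columns, so let bind_rows() fill the gaps
  dplyr::bind_rows(results)
}

tweets <- format_tweets("~/tw/search30_tmp")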


hadley commented Mar 8, 2021

@AltfunsMA due to other changes, rate-limited pagination will always return early — you'll get a warning but no error.

Otherwise, I'm not sure whether it's the job of search_tweets() to do this. The problem is that there are many types of error, and it's really up to you as the user of rtweet to decide what to do with them (there's some more discussion along those lines at #339 (comment)). That said, dealing with HTTP errors is tricky, so maybe there should be some standard way to save progress for long-running paginated queries. I'll make a note to consider this further.
