What is size and limit? What decides how many results I get? #79

FerusAndBeyond · 2020-08-17T14:07:34Z

If I do a query such as gen = api.search_submissions(score=">100", limit=1000) then I get 100 results. How do I get as many as I specify?


reagle · 2020-08-27T11:47:25Z

Coincidence! I asked the same thing about Pushshift here: https://www.reddit.com/r/pushshift/comments/ih66b8/difference_between_size_and_limit_and_are_they/

Jabb0 · 2020-11-28T00:38:05Z

Hi,
I tried a simple call to get 1000 entries (limit=1000) from a subreddit:
list(api.search_submissions(subreddit="worldnews", limit=1000))
Like @FerusAndBeyond, I only get 100 results and then it stops.

I have investigated the source code and found a possible issue.
This might be related to #63 and #47 as well.

PushshiftAPI.py, lines 197 to 218

def _handle_paging(self, url):
    limit = self.payload.get('limit', None)
    #n = 0
    while True:
        if limit is not None:
            if limit > self.max_results_per_request:
                self.payload['limit'] = self.max_results_per_request
                limit -= self.max_results_per_request
            else:
                self.payload['limit'] = limit
                limit = 0
        elif 'ids' in self.payload:
            limit = 0
            if len(self.payload['ids']) > self.max_results_per_request:
                err_msg = "When searching by ID, number of IDs must be fewer than the max number of objects in a single request ({})."
                raise NotImplementedError(err_msg.format(self.max_results_per_request))
        self._add_nec_args(self.payload)

        yield self._get(url, self.payload)

        if (limit is not None) & (limit == 0):
            return

This loop tries to perform as many requests as needed to retrieve all of the desired data. The meaning of "limit" in PSAW differs from the limit of the Pushshift API: PSAW issues multiple fetches to get close to the requested total, whereas the Pushshift API treats the value only as a suggestion for the current request. PSAW therefore works out how many batches of "max_results_per_request" size are needed and passes each batch size to the Pushshift API as "limit".

The issue is that nothing checks whether the API actually returns "max_results_per_request" entries. The current default is 1000, which used to be the maximum the API would return per request; the maximum is now 100. As a result, the API returns only 100 entries while PSAW assumes it received 1000.
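To make the arithmetic concrete, here is a small simulation of the batch bookkeeping in the code quoted above. It is only a sketch: fake_api is a hypothetical stand-in for a single Pushshift request, server_cap is an assumed name for the per-request maximum, and no network calls are made.

```python
def fake_api(requested, server_cap=100):
    """Hypothetical stand-in for one request: returns at most server_cap items."""
    return list(range(min(requested, server_cap)))


def naive_paging(limit, max_results_per_request):
    """Mirror the quoted bookkeeping: assume every batch comes back full."""
    results = []
    while True:
        if limit > max_results_per_request:
            batch = max_results_per_request
            limit -= max_results_per_request
        else:
            batch = limit
            limit = 0
        results.extend(fake_api(batch))
        if limit == 0:
            return results


# Stale default (1000): a single request is made and the server caps it.
stale = naive_paging(limit=1000, max_results_per_request=1000)
# Batch size matching the server cap: ten requests, each fully served.
matched = naive_paging(limit=1000, max_results_per_request=100)
print(len(stale), len(matched))
```

With the stale batch size the loop stops after one request, because it has already subtracted the full 1000 from its remaining budget.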

My suggestion: check whether the API returned the expected number of entries and, if not, increase the "limit" variable by the missing amount. I already have some code for that and will post it tomorrow.
For now, setting api = PushshiftAPI(max_results_per_request=100) works around the issue.
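A minimal sketch of that idea follows, assuming a hypothetical fetch(offset, requested) callable in place of the real request. This is not PSAW's code: the offset argument is a simplification, and how the real client advances between requests is not modeled here. The point is only that the loop counts what actually came back instead of what it asked for.

```python
def make_fake_fetch(total_available, server_cap=100):
    """Hypothetical stand-in for the API: serves at most server_cap items per call."""
    data = list(range(total_available))

    def fetch(offset, requested):
        return data[offset:offset + min(requested, server_cap)]

    return fetch


def paged_search(fetch, limit, max_results_per_request=1000):
    """Keep requesting until `limit` items are collected or the source runs dry."""
    results = []
    while len(results) < limit:
        remaining = limit - len(results)
        batch = fetch(len(results), min(remaining, max_results_per_request))
        if not batch:  # nothing left upstream, so stop instead of looping forever
            break
        results.extend(batch)  # count what was actually returned
    return results


plenty = paged_search(make_fake_fetch(5000), limit=1000)
scarce = paged_search(make_fake_fetch(250), limit=1000)
print(len(plenty), len(scarce))
```

Even with max_results_per_request left at 1000, the first call collects the full 1000 items, and the second stops cleanly at 250 when the source has no more.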

Also: why does the code use (limit is not None) & (limit == 0) rather than (limit is not None) and (limit == 0)?
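For what it's worth, the two spellings give the same answer for this particular check, since both operands are plain booleans and comparing None with == is harmless. The difference is that & always evaluates both sides while "and" short-circuits. The sketch below illustrates this; the helper names are made up for the example.

```python
def is_done_bitwise(limit):
    # Same expression as in the quoted code: both operands are always evaluated.
    return (limit is not None) & (limit == 0)


def is_done_logical(limit):
    # Short-circuits: `limit == 0` is skipped when limit is None.
    return (limit is not None) and (limit == 0)


# For an equality test the two forms agree on every input tried here.
for value in (None, 0, 5):
    assert is_done_bitwise(value) == is_done_logical(value)


def positive_bitwise(value):
    return (value is not None) & (value > 0)


def positive_logical(value):
    return (value is not None) and (value > 0)


# With an ordering comparison, only the short-circuiting form is safe for None.
print(positive_logical(None))
try:
    positive_bitwise(None)
except TypeError as exc:
    print("bitwise version raised:", exc)
```

So & is not a bug in the quoted check, but "and" is the more conventional and more defensive choice.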

Without "limit" PSAW will ignore "max_results_per_request" and just return whatever the API defaults to.
EDIT: Not the case. This is handled already.

I hope my analysis helps :)

Jabb0 mentioned this issue Nov 28, 2020

Made the download aware of the actual returned batch size when using limit #88

Merged

dmarx closed this as completed Jan 31, 2021
