This repository has been archived by the owner on Feb 7, 2024. It is now read-only.

What is size and limit? What decides how many results I get? #79

Closed
FerusAndBeyond opened this issue Aug 17, 2020 · 2 comments

Comments

@FerusAndBeyond

If I do a query such as gen = api.search_submissions(score=">100", limit=1000) then I get 100 results. How do I get as many as I specify?

@reagle

reagle commented Aug 27, 2020

Coincidence! I asked the same thing about Pushshift here: https://www.reddit.com/r/pushshift/comments/ih66b8/difference_between_size_and_limit_and_are_they/

@Jabb0
Contributor

Jabb0 commented Nov 28, 2020

Hi,
I tried a simple call to get 1000 entries (limit=1000) from a subreddit:
list(api.search_submissions(subreddit="worldnews", limit=1000))
and, like @FerusAndBeyond, I only get 100 results before it stops.

I have investigated the source code and found a possible issue.
This might be related to #63 and #47 as well.

PushshiftAPI.py lines 197 to 218

def _handle_paging(self, url):
    limit = self.payload.get('limit', None)
    #n = 0
    while True:
        if limit is not None:
            if limit > self.max_results_per_request:
                self.payload['limit'] = self.max_results_per_request
                limit -= self.max_results_per_request
            else:
                self.payload['limit'] = limit
                limit = 0
        elif 'ids' in self.payload:
            limit = 0
            if len(self.payload['ids']) > self.max_results_per_request:
                err_msg = "When searching by ID, number of IDs must be fewer than the max number of objects in a single request ({})."
                raise NotImplementedError(err_msg.format(self.max_results_per_request))
        self._add_nec_args(self.payload)

        yield self._get(url, self.payload)

        if (limit is not None) & (limit == 0):
            return

This loop tries to issue as many requests as are needed to retrieve all of the desired data. In PSAW, "limit" means something different than it does in the Pushshift API: PSAW makes multiple fetches to reach the desired total, whereas the Pushshift API treats "limit" only as a suggestion for the current request. PSAW therefore works out how many batches of "max_results_per_request" size are needed and passes each batch size to the Pushshift API as "limit".

The issue is that PSAW never checks whether "max_results_per_request" entries are actually returned by the API. The current default is 1000, which used to be the maximum the API would return per request. That maximum is now 100, so the API returns only 100 entries while PSAW assumes 1000 were returned.
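To make the mismatch concrete, here is a minimal simulation of the batching logic quoted above. It is only a sketch: fake_api and paged_fetch are hypothetical names, the 100-item server cap is the value reported in this thread, and the fake API does not page by timestamp the way the real one does, so only the counts are meaningful.

```python
def fake_api(requested, server_cap=100):
    """Hypothetical stand-in for one request: returns at most server_cap items."""
    return list(range(min(requested, server_cap)))


def paged_fetch(limit, max_results_per_request):
    """Mirror the quoted batching logic, which assumes every batch comes back full."""
    results = []
    requests_made = 0
    while True:
        if limit > max_results_per_request:
            batch = max_results_per_request
            limit -= max_results_per_request
        else:
            batch = limit
            limit = 0
        results.extend(fake_api(batch))
        requests_made += 1
        if limit == 0:
            return results, requests_made


# Stale default (1000): one request is planned, but the server only sends 100.
stale, stale_requests = paged_fetch(limit=1000, max_results_per_request=1000)
# Matching the server cap (100): ten requests of 100 items each.
fixed, fixed_requests = paged_fetch(limit=1000, max_results_per_request=100)

print(len(stale), stale_requests)   # 100 1
print(len(fixed), fixed_requests)   # 1000 10
```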

My suggestion: check whether the API returned the expected number of entries and, if it did not, increase the "limit" variable by the missing amount. I already have some code for that and will post it tomorrow.
For now, setting api = PushshiftAPI(max_results_per_request=100) works around the issue.
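One way the suggested check could look is sketched below. This is not PSAW code and not the patch mentioned above. fetch_with_check, fetch_page, and make_capped_source are hypothetical names, and the source is a fake that hands out at most 100 items per call. The idea is to count what actually came back instead of assuming each batch was full.

```python
def fetch_with_check(fetch_page, limit, max_results_per_request=1000):
    """Request pages until `limit` items are collected or a page comes back empty.

    fetch_page(n) is a hypothetical callable that returns up to n items.
    """
    collected = []
    remaining = limit
    while remaining > 0:
        page = fetch_page(min(remaining, max_results_per_request))
        if not page:
            break  # the source has nothing more to give
        collected.extend(page[:remaining])
        # Recompute from what was really received, not from what was requested.
        remaining = limit - len(collected)
    return collected


def make_capped_source(total_items, server_cap=100):
    """Hypothetical source that returns at most server_cap items per call."""
    state = {"next": 0}

    def fetch_page(n):
        start = state["next"]
        end = min(start + min(n, server_cap), total_items)
        state["next"] = end
        return list(range(start, end))

    return fetch_page


results = fetch_with_check(make_capped_source(5000), limit=1000)
short = fetch_with_check(make_capped_source(250), limit=1000)
print(len(results), len(short))
```

With this shape, a stale max_results_per_request only costs extra round trips; the caller still receives the requested number of items whenever the source has that many.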

Also: Why is there (limit is not None) & (limit == 0) and not (limit is not None) and (limit == 0)?
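On the & versus and question, a small sketch of the general Python behavior may help. The function names are hypothetical. For this particular expression both forms give the same answer, because limit == 0 is safe to evaluate even when limit is None. The difference is that & always evaluates both sides, while and short-circuits, which matters as soon as the right-hand side can fail on None.

```python
def done_bitwise(limit):
    # Both operands are always evaluated; & then combines the two bools.
    return (limit is not None) & (limit == 0)


def done_logical(limit):
    # The right operand is skipped when limit is None (short-circuit).
    return (limit is not None) and (limit == 0)


# For the check used in the paging loop the two forms agree.
agree = all(done_bitwise(v) == done_logical(v) for v in (None, 0, 5))

# With an ordering comparison the forms diverge: None > 0 raises TypeError.
logical_ok = (None is not None) and (None > 0)  # short-circuits to False
try:
    (None is not None) & (None > 0)
    bitwise_raised = False
except TypeError:
    bitwise_raised = True

print(agree, logical_ok, bitwise_raised)
```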

Without "limit" PSAW will ignore "max_results_per_request" and just return whatever the API defaults to.
EDIT: Not the case. This is handled already.

I hope my analysis helps :)
