
fix: improve ollama workflow from CI #53

Merged · 3 commits into main from baxen/remove-ollama-workflow · Sep 25, 2024
Conversation

@baxen (Collaborator) commented Sep 24, 2024

It'd be great to run this in CI; however, the smaller model sizes end up very flaky at consistently using tools, and using larger model sizes makes this take a long time. I think we can revisit this when we get a more consistent small tool-use model.

@codefromthecrypt (Contributor) commented:

I'm not sure the flake is something we want to skip, though. Both failures weren't due to the LLM, but to something causing it to look for the openai model rather than the model that's passed. This seems like something to investigate, right?

{"error":{"message":"model \"gpt-4o-mini\" not found, try pulling it first","type":"api_error","param":null,"code":null}}

@baxen (Collaborator, Author) commented Sep 24, 2024

> I'm not sure the flake is something we want to skip, though. Both failures weren't due to the LLM, but to something causing it to look for the openai model rather than the model that's passed. This seems like something to investigate, right?
>
> {"error":{"message":"model \"gpt-4o-mini\" not found, try pulling it first","type":"api_error","param":null,"code":null}}

Yup, you're totally right - that's where I went first too. The changes to test_integration.py here fix this problem, which was showing up in some PRs. Now that it's fixed, though, I'm finding locally that, when running small models, the test_tools function fails pretty often:

AssertionError: assert 'gandalf' in 'the most famous wizard in the lord of the rings canon is aragorn breezy, the son of king elrond and grandson of legol...n he assists merry and pippin in obtaining elwater, the treasured horse that saved aragorn from the hands of the orcs.'
AssertionError: assert 'hello exchange' in 'sure! please provide me with the filename.'

We could consider passing through a seed and temperature to get more consistency here, though!
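
For illustration, a minimal sketch of that idea against Ollama's OpenAI-compatible endpoint; the model, prompt, and URL below are placeholder assumptions, not this repo's actual configuration:

```python
import requests

# Hypothetical sketch: pin seed and temperature so repeated test runs
# sample the model more deterministically. Model, prompt, and URL are
# placeholders, not the values used in this workflow.
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",  # Ollama's OpenAI-compatible API
    json={
        "model": "mistral-nemo",
        "messages": [
            {"role": "user", "content": "Who is the most famous wizard in Lord of the Rings?"}
        ],
        "seed": 42,          # fixed seed: repeatable sampling across runs
        "temperature": 0.0,  # low temperature: fewer creative detours
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```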

@codefromthecrypt (Contributor) commented:

So do we feel deleting the tests is better than using a larger model, like qwen2.5, or just not overriding it and using mistral-nemo? Larger ones take 4-5 minutes total, but on the other hand that's more telling about the integration story.

The reason I ask is that all models can hallucinate or give an imprecise response; the likelihood is just lower with larger ones. I had chatted with @michaelneale about whether it would be better to use a higher-level retry concept than HTTP 5xx handling, etc., and then handle the unexpected results. Deleting the test doesn't change that, is what I mean, and we could at least learn whether the default model is routinely problematic.
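
To make that concrete, here's a hypothetical sketch of such a higher-level retry: retry on unexpected content, not just on transport errors. ask_model and is_expected are placeholder names for this example, not exchange APIs:

```python
from typing import Callable

# Hypothetical sketch of retrying on *content*, not just HTTP 5xx.
# ask_model and is_expected are placeholders, not exchange APIs.
def ask_with_retry(
    ask_model: Callable[[str], str],
    is_expected: Callable[[str], bool],
    prompt: str,
    attempts: int = 3,
) -> str:
    reply = ""
    for _ in range(attempts):
        reply = ask_model(prompt)  # one full LLM round trip
        if is_expected(reply):     # semantic check, e.g. "did it use the tool?"
            return reply
    return reply  # surface the last unexpected reply to the failing assertion
```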

@michaelneale (Collaborator) commented:

I am ok with either temporarily disabling it or making it the larger one (if that is worth it) - I guess @baxen may be interested: is it worth a 4-minute concurrent test run for a larger LLM to test things a bit more end to end?

@codefromthecrypt (Contributor) commented:

The other food for thought is that if/when we have an integration test in goose, there's possibly less pressure here. It will still be that 5-minute overhead to run, though. Personally, I have projects whose integration tests take a lot longer than 5 minutes, but on the other hand that doesn't make this part acceptable.

@michaelneale (Collaborator) commented:

We can also do integration tests with a GH secret and an API call, but if it can work in similar time with a local model, that's "nice"?

@codefromthecrypt (Contributor) commented:

> We can also do integration tests with a GH secret and an API call, but if it can work in similar time with a local model, that's "nice"?

Only from protected branches, though, right (not forks, as that would let folks steal the secrets)? I guess since most changes are from the same team, testing public providers on those types of branches is better than not testing them.

That said, I think we can explore that independently of the local tests, which eventually need to be solved even if they aren't prioritized versus, I guess, openai.

@michaelneale (Collaborator) commented:

Yeah, ideally we wouldn't - GH redacts secrets, etc., but there's no way to be 100% sure with forks (and we have to bless runs anyway).

@michaelneale (Collaborator) commented:

Is there a way to run this without it holding back the checks? The larger model does make the check slower, but exchange doesn't change that much, does it? I think it may still be worthwhile.
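
One standard way to do that, sketched below with an assumed job name and test command, is GitHub Actions' job-level continue-on-error, which keeps a red run from failing the workflow:

```yaml
# Hypothetical sketch: the job name and test command are assumptions;
# continue-on-error is a standard GitHub Actions job-level field.
jobs:
  ollama-integration:
    runs-on: ubuntu-latest
    continue-on-error: true  # a failure here won't mark the run as failed
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/test_integration.py  # assumed invocation
```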

@codefromthecrypt (Contributor) left a comment

👍 to removing this, as it seems more folks are interested in it, and it conserves time for other things we need to look at. Plus, it unblocks some changes I have that were waiting on a decision here. We can revisit this later and meanwhile test ad hoc.

@codefromthecrypt (Contributor) commented:

#54 swaps back to the default model, which we can use until/if we merge the deletion of the workflow.

@baxen force-pushed the baxen/remove-ollama-workflow branch from 593b9cf to 9ee368f on September 25, 2024 at 00:18
@baxen (Collaborator, Author) commented Sep 25, 2024

@codefromthecrypt @michaelneale, take a look at the updates? I think - let's confirm - that setting the seed like this might fix the issue while still letting us use the small, fast model.

@michaelneale (Collaborator) commented:

I think the problem was more that the summarizer would default to gpt-4o-mini - was there also some additional flake? If so, yeah, this looks good as well.

@baxen force-pushed the baxen/remove-ollama-workflow branch from 9ee368f to f5a693b on September 25, 2024 at 00:22
@michaelneale (Collaborator) left a comment

Yep, I think this is good.

@michaelneale changed the title from "fix: Remove ollama workflow from CI" to "fix: improve ollama workflow from CI" on Sep 25, 2024
@codefromthecrypt (Contributor) left a comment

coolio!

tests/test_integration.py: review thread resolved
Co-authored-by: Adrian Cole <64215+codefromthecrypt@users.noreply.github.com>
@michaelneale (Collaborator) commented:

Yes, this works well: #59. I tested it on top of @lukealvoeiro's changes and it seems good - I think we can move ahead with this.

@baxen merged commit 9feb7ec into main on Sep 25, 2024 (5 checks passed)
@michaelneale deleted the baxen/remove-ollama-workflow branch on September 25, 2024 at 02:52
@codefromthecrypt (Contributor) commented:

thanks @baxen!
