Was excited to see something about reinforcement learning as I'm working on training an agent to play a game, but apparently all reinforcement learning nowadays is for LLMs.
ag8 12 hours ago [-]
Yeah, for better or worse, the way the median startup interfaces with AI these days is through an LLM API, and that's what all the workflows are built around, so that's what we're targeting. Though, depending on what you're trying to do, I wouldn't discount the use of starting with a pretrained model—there was that famous result from 2022 that showed that pretraining a model on _Wikipedia_ made training on Atari games more than twice as efficient [0]; these days, LLMs have huge amounts of priors about the real world that make them great starting points for a surprisingly diverse set of tasks (e.g. see the chemistry example in our video!)
This is really neat! Didn’t realize it could be this simple to run RL on models. Quick question: How would I specify the reward function for tool use? or is this something you automatically do for me when I specify the available tools and their uses?
ag8 6 hours ago [-]
Thanks! Our goal is to make RL "just work", with completely automated GPU provisioning, algorithm selection, and SFT warm-up, while giving people the ability to switch away from the defaults if they want to.
The way tools currently work in the beta: you add tools via MCP to the configuration, and they get passed to the model as additional context. The model may then choose to call a tool during inference; the tool is automatically invoked and its output is returned as a tool message. If you really want to, you could parse the tool output as part of the reward calculation, but I expect you'd usually base the reward just on the model's completion. I could give more details if there's a specific tool setup you're envisioning!
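In rough pseudocode, the loop looks something like this (a simplified sketch, not our exact API):

    import json

    # Simplified sketch of the rollout loop (not the exact API). `model.generate`
    # is a hypothetical interface returning an assistant message dict with optional
    # "tool_calls"; `tools` maps tool names to callables.
    def run_episode(model, tools, prompt, reward_fn):
        messages = [{"role": "user", "content": prompt}]
        while True:
            reply = model.generate(messages)
            messages.append(reply)
            if not reply.get("tool_calls"):       # no tool call -> final completion
                break
            for call in reply["tool_calls"]:      # the tool is called automatically...
                result = tools[call["name"]](**json.loads(call["arguments"]))
                messages.append({"role": "tool",  # ...and returned as a tool message
                                 "tool_call_id": call["id"],
                                 "content": str(result)})
        return reward_fn(reply["content"])        # reward based on the completion only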
-_- 5 hours ago [-]
To add to this, you can currently parse tool calls manually in your environment's step function, but we'll be rolling out a UI that makes this easier soon.
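Roughly like this (sketch only; the exact environment interface may differ, and `tools`/`score` stand in for your own tool registry and scoring logic):

    import json
    import re

    class ToolEnv:
        """Gym-style environment that parses tool calls out of the completion."""

        def __init__(self, tools, score):
            self.tools = tools      # dict: tool name -> callable
            self.score = score      # callable: final completion -> reward

        def step(self, completion: str):
            match = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.S)
            if match:
                call = json.loads(match.group(1))   # e.g. {"name": ..., "arguments": {...}}
                observation = self.tools[call["name"]](**call["arguments"])
                return str(observation), 0.0, False, {}    # not done yet, no reward
            return None, self.score(completion), True, {}  # final answer: score it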
nextworddev 13 hours ago [-]
Is there any credence to the view that these startups are basically DSPy wrappers?
-_- 13 hours ago [-]
DSPy is great for prompt optimization but not so much for RL fine-tuning (their support is "extremely EXPERIMENTAL"). The nice thing about RL is that the exact prompts don't matter so much. You don't need to spell out every edge case, since the model will get an intuition for how to do its job well via the training process.
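For reference, prompt optimization in DSPy looks roughly like this (rough sketch; the trainset and metric here are toy placeholders):

    import dspy
    from dspy.teleprompt import BootstrapFewShot

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    qa = dspy.Predict("question -> answer")

    def metric(example, pred, trace=None):
        # Loose string match against the gold answer.
        return example.answer.lower() in pred.answer.lower()

    trainset = [dspy.Example(question="2+2?", answer="4").with_inputs("question")]
    optimized_qa = BootstrapFewShot(metric=metric).compile(qa, trainset=trainset)

That tunes the prompt and few-shot demos while the weights stay frozen; RL fine-tuning updates the weights themselves.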
nextworddev 13 hours ago [-]
Isn’t the latest trend in RL mostly about prompt optimization, as opposed to full fine-tuning?
ag8 12 hours ago [-]
prompt optimization is very cool, and we use it for certain problems! The main goal with this launch is to democratize access to "the real thing"; in many cases, full RL allows you to get the last few percent in reliability for things like complex agentic workflows where prompt optimization doesn't quite get you far enough.
There are also lots of interesting possibilities, such as RLing a model on a bunch of environments and then prompt optimizing it on each specific one, which seems way better than, like, training and hot-swapping many LoRAs. In any case, _someone_ ought to provide a full RL API, and we're here to do that well!
nextworddev 12 hours ago [-]
Thanks. Is this mainly for verifiable tasks, or any general task?
ag8 12 hours ago [-]
It's for any task that has an "eval": usually that means verifiable tasks, or tasks that can be judged by LLMs (e.g. see [0]). There's also been recent work such as BRPO [1] and similar approaches to give more and more "non-verifiable" tasks verifiable rewards!
There needs to be some way of automatically assessing performance on the task, though this could be with a Python function or another LLM as a judge (or a combination!)
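For example, a reward function mixing a programmatic check with an LLM judge could look something like this (sketch only; `llm_judge` stands in for whatever judge model call you use):

    import re

    def reward(prompt: str, completion: str) -> float:
        # Verifiable part: did the model produce an answer in the expected format?
        format_ok = 1.0 if re.search(r"Answer:\s*\S+", completion) else 0.0

        # Non-verifiable part: ask a judge model for a 0-1 quality score.
        # `llm_judge` is a stand-in for your judge model call.
        judge_score = llm_judge(
            f"Rate this response to the prompt below from 0 to 1.\n"
            f"Prompt: {prompt}\nResponse: {completion}"
        )

        return 0.3 * format_ok + 0.7 * judge_score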
ART is also great, though since it's built on top of Unsloth it's geared towards single-GPU QLoRA training. We use 8 H100s as a standard, so we can handle larger models and full-parameter fine-tunes.
omneity 6 hours ago [-]
Interesting, do you have benchmarks on FFT vs QLoRA for RL?
[0]: https://arxiv.org/abs/2201.12122
[0]: https://runrl.com/blog/funniest-joke
[1]: https://arxiv.org/abs/2506.00103