There has been a lot of talk around teach mode, and for many (most?) users this was the main reason to get an R1. Based on how the current “LAM” operations are implemented, I feel that expectations are perhaps too high, which can only result in disappointment.
For example I have seen suggestions on this forum to teach the R1 to control external USB connected equipment and robots.
My expectations are a lot more modest (but could still be too high). I expect Teach Mode to have limitations when it is released, most notably:
1. Support is likely to be limited to applications on websites
2. Output from websites is likely to be limited to text, or just a confirmation of success/failure
3. Teach mode will probably only be available for selected apps/websites
4. Once taught, the created integrations will need maintenance; they will fail at some point, some more frequently than others
Ad 1: Android and iOS need specific controls implemented that are not there for the existing integrations, so native app support is currently likely to be in an experimental stage at best.
Ad 2: The current integrations have specific apps implemented in RabbitOS to present the output from the apps. This lets you not only control, e.g., Spotify, but also listen to the songs and see album covers. These OS “apps” were built specifically for the existing integrations. Teach mode is unlikely to include a UI builder running on the R1 to create such apps and handle the data returned from the app/site. So if you teach Rabbit to use your favourite weather forecasting site, don’t expect to see that website’s visualisation of the forecast on the R1.
Ad 3: The risk of malicious use of AI-controlled apps was recognized by the rabbit team early on, and a whitelist has been suggested. This also makes technical sense, because the variation in user interfaces on the web is mind-boggling and not everything will be controllable using (AI-generated or static) Playwright scripts. There may also be regional limitations, where sites throw additional warnings or even block access when reached from a US-based IP address.
Ad 4: The AI is taught based on an existing UI. If the UI changes, the taught integration can fail, and this can happen at any time.
I like to have realistic expectations when it comes to new features or services. If you expect additional limitations, that would be great to know about and anticipate. Also, if you believe I am too conservative with my expectations, please challenge me!
Is it really the case that when the UI changes, the LAM no longer knows where to click? That surprises me. I originally understood from the keynote that the model would indeed be able to recognize where the “ok” button is, no matter where it is located. Otherwise, it would just be like a kind of macro recorder. Is that what we’re facing, or are we going to get a truly autonomous, self-recognizing LAM?
Note that I’m just sharing my expectations here, based on the information I have seen so far and a mix of logic and gut feeling of what is realistic.
To elaborate: for websites, “LAM” uses Playwright as the technology to control web pages (see explanation). From what I understand, Playwright scripts can be set up to cope with some types of UI changes. Ergo, how robust a rabbit “trained” through teach mode is to UI changes will depend on the type of change to the page and on the design of the (possibly AI-generated) scripts. In other words: I expect to see changes that a trained rabbit will not be able to cope with.
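To make that concrete, here is a minimal sketch (my own illustration, not anything rabbit has published) of how the way a Playwright script locates elements determines its robustness. The weather site URL, selectors and element names below are all hypothetical:

```ts
// Minimal sketch of script robustness to UI changes; NOT rabbit's implementation.
// The site URL, selectors and element names are hypothetical.
import { chromium } from 'playwright';

async function getForecast(city: string): Promise<string | null> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    await page.goto('https://weather.example.com');

    // Brittle: tied to the exact DOM structure and CSS classes; a cosmetic
    // redesign that renames ".search-box__input" breaks the taught task.
    // await page.fill('div.header > form .search-box__input', city);

    // More resilient: semantic locators keyed to what a user actually sees,
    // which tend to survive styling and layout changes.
    await page.getByRole('searchbox', { name: /city|location/i }).fill(city);
    await page.getByRole('button', { name: /search/i }).click();

    // Text-only output, in line with expectation 2 above.
    return await page.getByRole('heading', { name: /forecast/i }).textContent();
  } finally {
    await browser.close();
  }
}
```

Even the more resilient version still fails if the flow itself changes, e.g. the site adds a cookie wall or moves search behind a login, and that is the kind of change I expect a taught rabbit not to cope with.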
To address the doubts about teach mode, cooperation between the Rabbit team and application and website developers is crucial. Only through the joint effort of both parties can teach mode come close to fully achieving its stated goals.
But for a young company, the cost of these partnerships may be unaffordable.
So I’m very skeptical about how teach mode will be implemented.
I think the official approach is to let the user show the AI how to operate the software by way of demonstration. While they do, scripts can be generated in the background. This minimizes the cost and solves the problem of the UI no longer working after an update: if there is an error after a UI update one day, the user only needs to demonstrate the action to the AI once more, and the script is then updated in the background to fix the problem. However, this approach has basically deviated completely from the original intention of artificial intelligence.
This solution seems rather expedient. While it may provide a practical solution to the immediate problem of UI changes, it doesn’t really harness the power and potential of AI as much as we thought. Ideally, AI should be able to adapt and learn independently, without relying on the actions that users constantly demonstrate. But in the absence of a more complex solution, this may be a stopgap measure to maintain at least some level of functionality.
And I also wish that the rabbit really had an autonomous mind and not just a script.
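For what it’s worth, a recorder of that kind is not far-fetched: Playwright already ships one (`npx playwright codegen`), and a “demonstrate once, re-record on failure” loop could look roughly like the sketch below. This is purely my own illustration of the approach described above, not rabbit’s implementation; `TaughtTask` and its fields are made up:

```ts
// Hypothetical sketch of "demonstrate once, re-record when the UI breaks".
// NOT rabbit's implementation; TaughtTask and runTaughtTask are made-up names.
import { chromium, Page } from 'playwright';

// A taught task is stored as a replayable script produced from a recorded
// demonstration (e.g. something like Playwright codegen output).
interface TaughtTask {
  name: string;
  run: (page: Page) => Promise<string>; // the generated script
}

async function runTaughtTask(task: TaughtTask): Promise<string> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    return await task.run(page);
  } catch (err) {
    // The recorded selectors no longer match the changed UI. Rather than the
    // AI "figuring it out", the user is asked to demonstrate the task again,
    // and the stored script is regenerated from that new recording.
    throw new Error(`Task "${task.name}" failed; re-demonstration needed (${err})`);
  } finally {
    await browser.close();
  }
}
```

Which is exactly the concern raised above: this is closer to a macro recorder with a recovery path than to an autonomous agent.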
[Although rabbit officials have said that before you raise these questions, you have to prove that you have a degree or a job in AI, otherwise anything you say is nonsense.]
I partly agree with this; if website developers actively add layers designed to counter automated control, this will obviously frustrate “LAM” access. But part of the flexibility of using the UI (and not an API) is that it can be set up (taught) without the need for detailed collaboration with the website developers.
The expectation that the AI will autonomously interpret and adjust even after major UI changes is what I mean by high expectations. If that were in place, imo there would not be a need to do any training in the first place; you would just have to specify your intent and have the AI figure out how to do it.
I wonder if such cooperation could be achieved with open-source applications like GIMP, Inkscape, Blender, or OpenOffice, and whether, having done so, “teach” would then be available to everyone. I have made txt lists but have not found a way to turn them into a database that the R1 can read.
From the presentation Jesse and Simon gave a few months back on Discord, it appears that the models you train will become pros at accomplishing the idea of your task, not just at using a particular webpage. For instance, Jesse trained a LAM to look up images of Elon Musk on Google. The model was then recalled and asked to look up images of something else (I forgot what it was, but let’s just say it was the Eiffel Tower). Because of how the model was trained, Google was not required for the action, and it actually chose to use Bing, since that was the more popular search engine where Jesse was at the time of the meeting. The LAM was able to easily navigate Bing instead of Google and bring up images of the requested item instead of images of Elon Musk. We will train the models to accomplish the idea of the task, not the specific task itself, which makes these models FAR more useful and impressive!