Gemini analyzes the scene and replies with a function call such as “click,” “type,” or “scroll,” which the client executes. Then you send back a fresh screenshot and URL, and the cycle repeats until ...