I do agree with your earlier observation that the target of that mount seems wrong. I would think that the volume mount should be -v /root/.ssh:/root/.ssh, but instead it's -v /root.ssh:/.ssh
Sorry, clarifying: the agent-services entry in the docker-compose file seems to add a single worker to the services queue.
Actually, that's wrong: this is really the current volume mount:
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',
Could changing these values to /root/.ssh work? Do you know which user within the docker image ClearML is using?
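I.e., my guess at the fix (untested) would be something like:
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/root/.ssh',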
Haha, that was a total gotcha for me. Yeah, a lot just wasn't even getting run due to the #!/bin/bash part.
Anyway, wow! I finally got the precious console logs you were after; here they are:
2023-05-06 00:19:21
User aborted: stopping task (3)
2023-05-06 00:19:21
Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7....
Let's see. The task log? I think this is it.
And for the session
clearml-session --queue sessions --docker python:3.9
Disclaimer: I'm not familiar enough with the ClearML codebase to vouch for the quality of this PR, although it is short, which is typically good. The feature we're interested in is the ability to specify the subnet_id.
Here's a screenshot of a session where I first try to clone as ssm-user, but it fails; then I change to root and it succeeds.
Oh! System tags! That would definitely have been a better way to do it. We ended up querying for tasks in the "DevOps" project with the name "Interactive Session"
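For posterity, the query looks roughly like this with the SDK -- a sketch assuming Task.query_tasks and that session tasks carry the "interactive" system tag:

from clearml import Task

# what we did: match on project + task name
session_ids = Task.query_tasks(project_name="DevOps", task_name="Interactive Session")

# the system-tags way (assumption: clearml-session tasks are tagged "interactive")
session_ids = Task.query_tasks(task_filter={"system_tags": ["interactive"]})
print(session_ids)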
Hey @<1523701482157772800:profile|AnxiousSeal95> ! I think ClearML's orchestrator is a great fit for ad-hoc experimentation, but not for (event-triggered) batch inference jobs that need to be relied on in production.
I'd only feel comfortable supporting pipelines that serve end users on a tool that is known for that, e.g. Metaflow, Dagster, or Airflow--mainly because those tools emphasize good monitoring and integration with the wider data ecosystem.
I don't see it as an argument in Task.init or Task.execute_remotely.
Is there some way we could programmatically list all current ClearML sessions? We need a way to do that, maybe with the clearml-session CLI, in order to populate the VS Code extension menu.
I've also tried running a clearml-agent daemon directly on my Mac (not in Docker), serving the sessions queue for the ClearML server that is running in Docker. When I do that, it consistently fails with a different error, something to do with mounting a volume.
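The invocation was along these lines (from memory, so the exact flags may differ):
clearml-agent daemon --queue sessions --docker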
But I actually wish the interface were more like the apiserver.conf file--specifically, that you can define hard-coded login credentials in the file in advance. Except I wish you could define API keys this way (or some other way):
auth {
    # Fixed users login credentials
    # No other user will be able to login
    fixed_users {
        enabled: true
        pass_hashed: false
        users: [
            {
                username: "test"
                password: "test"
                ...
@<1523701070390366208:profile|CostlyOstrich36> Oh that's smart. Is that to make sure no transactions happen during the backup? Would there be a risk of ongoing or pending tasks somehow getting corrupted if you shut the server down?
^^^ For my own notes: this is the web request made by the frontend to create a set of credentials
Oh my goodness. Thank you! I'd seen that before, but for some reason it didn't register I could run that with VS Code...
But this config should almost never need to change!
Host clearml-session
    HostName localhost
    User root
    Port 8022
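With that in ~/.ssh/config, connecting is just:
ssh clearml-session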
I did a quick local experiment and observed that credentials created from the UI indeed become invalid if you delete the ClearML volumes.
- starting docker-compose locally
- creating a set of credentials from the UI
- hard-coding those credentials into the docker-compose file
- restarting
- the agent-services container started up and successfully became a registered worker
- I killed the docker-compose and deleted the volume folders
- restarted the docker-compose (with the same hard-coded...
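By "hard-coding" I mean roughly this in the agent-services section -- a sketch with placeholder values; CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY are the variables the agent reads:

  agent-services:
    environment:
      CLEARML_API_ACCESS_KEY: "<access key created in the UI>"
      CLEARML_API_SECRET_KEY: "<matching secret key>"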
You know, you could probably add some immortal containers to the docker-compose.yml that use images with mongodump and the ES equivalent installed.
The container(s) could have a bash script with a while loop in it that sleeps for 30 minutes and then does a backup. If you installed the AWS CLI inside, it could even take care of uploading to S3.
I like this idea, because docker-compose.yml could make sure that if the backup container ever dies, it would be restarted.
As opposed to using CRON or something 🤣
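Something like this, say -- a sketch where the image tag, the mongo hostname, and the paths are assumptions to adapt:

  backup-mongo:
    image: mongo:4.4            # assumption: matches the server's mongo version
    restart: unless-stopped     # compose revives it if it ever dies
    volumes:
      - ./backups:/backups
    # $$ escapes $ for docker-compose; dump every 30 minutes, forever
    entrypoint: sh -c 'while true; do mongodump --host mongo --out /backups/$$(date +%F-%H%M); sleep 1800; done'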
I took a stab at writing an automated trigger to handle this. The goal is: anytime a pipeline succeeds or fails, let AWS know so that the input records can be placed onto a retry queue (or not)
I'm trying to get a trigger to work in general, and then I'll add the more complex AWS logic. But I seem to be missing a step somewhere:
I wrote a file called set_triggers.py
from clearml.automation.trigger import TriggerScheduler
TRIGGER_SCHEDULER = TriggerScheduler()
from pprint import...
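For reference, the shape I'm going for is roughly this -- a sketch based on my reading of TriggerScheduler, where the project name and the notify stub are placeholders:

from clearml.automation.trigger import TriggerScheduler

def on_pipeline_finished(task_id):
    # placeholder for the real AWS logic, e.g. deciding whether the
    # input records go onto a retry queue
    print(f"pipeline task {task_id} finished")

scheduler = TriggerScheduler(pooling_frequency_minutes=1)
scheduler.add_task_trigger(
    schedule_function=on_pipeline_finished,    # called with the triggering task's id
    trigger_project="My Pipelines",            # placeholder: the pipelines' project
    trigger_on_status=["completed", "failed"],
)
scheduler.start()  # blocks and polls; start_remotely() would run it on an agent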
I'm not seeing an extra_docker_shell_script entry in my clearml.conf generated by clearml-agent init, like in this guide.
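If I'm reading the agent reference config right, it can just be added manually under the agent section -- e.g. (the packages here are only examples):

agent {
    # shell commands to run inside the docker before the task starts
    extra_docker_shell_script: ["apt-get update", "apt-get install -y openssh-client"]
}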