
I don't know that you'd have to pre-build credentials into the Docker image. If you could specify a set of credentials as environment variables to the docker run command or something, that would work just fine.
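Something along these lines is what I'm imagining (purely hypothetical: I don't know whether clearml-server actually reads credentials from environment variables, and these variable names are made up):

```bash
# Hypothetical: seed the server with a fixed key pair at startup.
# The -e variable names are a guess, not a documented clearml-server interface.
docker run -d --name clearml-server \
  -e CLEARML_API_ACCESS_KEY=test \
  -e CLEARML_API_SECRET_KEY=test \
  allegroai/clearml:latest
```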
The goal is to be able to run docker-compose up in CI, which starts a clearml-server, and then make several API calls to the started ClearML server to prove that the VS Code extension code is working.
Examples:
- Assert that the extension can auth with ClearML
- Assert that the ext...
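For instance, the auth assertion could be a tiny smoke test like this (a sketch: it assumes the API server is on localhost:8008 and that auth.login accepts the key pair as HTTP Basic auth; the test credentials are placeholders):

```python
import requests

# CI smoke test against the clearml-server started by docker-compose.
# Host/port and credentials below are assumptions, not the real CI values.
API_SERVER = "http://localhost:8008"
ACCESS_KEY = "test"
SECRET_KEY = "test"

def test_can_auth_with_clearml():
    # auth.login exchanges an access/secret key pair for a session token
    resp = requests.post(f"{API_SERVER}/auth.login", auth=(ACCESS_KEY, SECRET_KEY))
    assert resp.status_code == 200
    assert resp.json()["data"]["token"]
```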
I do agree with your earlier observation that the target of that mount seems wrong. I would think that the volume mount should be -v /root/.ssh:/root/.ssh, but instead it's -v /root.ssh:/.ssh
I don't see it as an argument in Task.init or Task.execute_remotely
That's with the key at /root/.ssh/id_rsa
I'll search around some more when I get time. I have no idea, but it feels like ClearML has already done the hard part, which is creating clearml-session in the first place.
This could be a really low-hanging OSS contribution that could make a real impact 🙂
Well wow, I figured it out. You equipped me with a solid debugging tool, AKA running bash commands within the docker container.
I had to pre-add GitHub and Bitbucket to known hosts by adding ssh-keyscan commands:
```yaml
configurations:
  extra_clearml_conf: ""
  extra_trains_conf: ""
  extra_vm_bash_script: |
    echo "fetching github key" && (aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa &...
```
If this works, we might be able to fully replace Metaflow with ClearML!
(Referring to the feature where Metaflow creates Step Functions state machines for you, and then you can use those to trigger event-driven batch jobs in the same way described here)
At the time that I run python aws_autoscaler.py --remote, that clearml-services worker is the only worker on the services queue, so it will be the worker that picks up the autoscaler task.
But the task seems to be failing on startup because CLEARML_API_HOST is not set, even though it is set for the docker container that the agent is running in.
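For what it's worth, this is how I've been checking whether the variable is actually visible inside the running container (the container name is a placeholder):

```bash
# Print the value of CLEARML_API_HOST as seen inside the agent's container;
# empty output means the task process won't see it either
docker exec <agent-container-name> printenv CLEARML_API_HOST
```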
Here's the full autoscaler log where the failure happens if that's helpful.
Actually, dumb question: how do I set the setup script for a task?
Will do!
Totally worked!
Hey, thanks for responding!
Does there happen to be ClearML auto-logging... for MLFlow? That would make it super easy for us to migrate our existing training/batch inference jobs to ClearML 🙂
Here we go. Trying with this
Here's a docker-compose I've been playing with. It doesn't have the same restart problem you're describing, but I did change the volume mounts.
I took a stab at writing an automated trigger to handle this. The goal is: anytime a pipeline succeeds or fails, let AWS know so that the input records can be placed onto a retry queue (or not)
I'm trying to get a trigger to work in general, and then I'll add the more complex AWS logic. But I seem to be missing a step somewhere:
I wrote a file called set_triggers.py:

```python
from clearml.automation.trigger import TriggerScheduler

TRIGGER_SCHEDULER = TriggerScheduler()

from pprint import...
```
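For context, what I'm ultimately going for is roughly this (a sketch; the project name, handler task ID, and queue are placeholders):

```python
from clearml.automation.trigger import TriggerScheduler

# Poll the server for matching events once a minute
scheduler = TriggerScheduler(pooling_frequency_minutes=1)

# Whenever a task in the watched project finishes, enqueue a handler task
# that will do the AWS notification (IDs/names below are placeholders)
scheduler.add_task_trigger(
    name="notify-aws-on-pipeline-finish",
    trigger_project="my-pipelines",
    trigger_on_status=["completed", "failed"],
    schedule_task_id="<handler-task-id>",
    schedule_queue="services",
)

scheduler.start()  # blocks and polls; start_remotely() would run it on an agent
```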
Thanks for replying Martin! (as always)
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production? That'd include monitoring and alerting. I'm afraid that Metaflow will look far more compelling to our teams for that reason.
Since it deploys onto Step Functions, the scheduling is managed for you, and I believe alerts for failing jobs can be set up without adding custom code to every pipeline.
If that's the case, then we'd probably only...
Actually that's wrong: really this is the current volume mount: '-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',
Could changing these values to /root/.ssh work? Do you know what user within the docker image ClearML is using?
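One quick way I know to check that (the image name is a placeholder):

```bash
# Override the entrypoint with `id` to print the default user of the image
docker run --rm --entrypoint id <agent-docker-image>
```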
Here's the repo: I've recorded a few update videos documenting how we learned about authoring VS Code extensions and how we got it to its current state. Linked to those in order in the README.
ChatGPT has made working with TypeScript and the VSCode extension framework really nice!
Yay! Man, I want to do ClearML with "hard mode" (non-enterprise, self-hosted) first, before trying to sell BENlabs (my work) on it. I could see us paying for enterprise to get the Hyper Datasets and Vault features if our scientists/developers fall in love with it--they probably will if we can get them to adopt it since right now we have a homemade system that isn't nearly as nice as ClearML.
@<1523701087100473344:profile|SuccessfulKoala55> how exactly do you configure ClearML to use the cr...
Oh duh, thanks. What about non-standard entrypoints (as opposed to arguments), like accelerate launch train.py?
I could potentially write a selenium script to make a set of keys, but I'd prefer to avoid that 🙂
Thanks for the response @<1523701205467926528:profile|AgitatedDove14> !
What would you consider an event?
I was thinking of the TriggerScheduler's definition of an event. Pretty much, anything the TriggerScheduler allows you to react to, it'd be great to be able to publish those events to a queue external to ClearML, e.g. a tag added to a model (or removed), a state in a task changing, etc. We'd want as much metadata about that event as possible. So if the event is due to a task...
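To make that concrete, here's a sketch of the kind of thing I'm imagining, assuming add_task_trigger's schedule_function is called with the triggering task's ID (the SQS queue URL and project name are placeholders):

```python
import json

import boto3
from clearml import Task
from clearml.automation.trigger import TriggerScheduler

sqs = boto3.client("sqs", region_name="us-west-2")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/clearml-events"  # placeholder

def forward_event(task_id: str):
    # Called by the scheduler with the ID of the task that fired the trigger;
    # grab what metadata we can and publish it to the external queue
    task = Task.get_task(task_id=task_id)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(
            {"task_id": task_id, "name": task.name, "status": str(task.status)}
        ),
    )

scheduler = TriggerScheduler(pooling_frequency_minutes=1)
scheduler.add_task_trigger(
    name="forward-task-events",
    trigger_project="my-pipelines",  # placeholder
    trigger_on_status=["completed", "failed"],
    schedule_function=forward_event,
)
scheduler.start()
```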
Yeah. I'd need to clone this and run it locally to start to understand how it all works. Would be a cool exercise. They advertise that it's really easy to author VS Code extensions. I've seen pretty junior folks do it, which makes me think it can't be too bad 🙂
Oh, there's parallelization as well. You could have step 1 gather the data and then fan out to N parallel steps that all do different things with the data, for example hyperparameter tuning.
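In ClearML terms I'd expect the fan-out to look something like this with PipelineController (a sketch; the functions and names are placeholders):

```python
from clearml.automation import PipelineController

def gather_data():
    # placeholder: pretend to collect a dataset
    return list(range(100))

def train_with_lr(dataset, lr):
    # placeholder: pretend to train with one hyperparameter value
    print(f"training on {len(dataset)} rows with lr={lr}")

pipe = PipelineController(name="fan-out-example", project="examples", version="0.0.1")
pipe.add_function_step(name="gather", function=gather_data, function_return=["dataset"])

# Fan out: each step depends only on "gather", so they can run in parallel
for lr in (0.1, 0.01, 0.001):
    pipe.add_function_step(
        name=f"tune_lr_{lr}",
        parents=["gather"],
        function=train_with_lr,
        function_kwargs=dict(dataset="${gather.dataset}", lr=lr),
    )

pipe.start(queue="services")
```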
Oh wow. If this works, that will be insanely cool. Like, I guess what I'm going for is that if I specify "username: test" and "password: test" in that file, that I can specify "api.access_key: test" and "api.secret_key: test" in the clearml.conf used for CI. I'll give it a try tonight!
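i.e., something like this in the clearml.conf that CI uses (a sketch; the server addresses assume the default docker-compose ports):

```
api {
    api_server: http://localhost:8008
    web_server: http://localhost:8080
    files_server: http://localhost:8081
    credentials {
        access_key: "test"
        secret_key: "test"
    }
}
```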