Hi. Yes, that totally makes sense. It's just that we don't want the logic that does the Jenkins trigger to live in a ClearML handler or task, but rather in a handler that acts as a subscriber in a pub-sub system.
This is because we already use a pub-sub architecture that can handle retries, etc. Also, we will likely want multiple systems to react to notifications in the pub-sub system. We already have a lot of setup for this.
I guess the conclusion is: I realize it’s possible...
Oh interesting. Is the hope that doing that would somehow result in being able to use those credentials to make authenticated API calls?
This thread should be immortalized. Super stoked to try this out!
I did a quick local experiment and observed that credentials created from the UI indeed become invalid if you delete the ClearML volumes.
- starting docker-compose locally
- creating a set of credentials from the UI
- hardcoding those credentials into the docker-compose file (sketch below)
- restarting
- the agent-services container started up and successfully became a registered worker
- I killed the docker-compose and deleted the volume folders
- restarted the docker-compose (with the same hard-coded...
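For reference, the hard-coding step looked roughly like this (a sketch against the stock docker-compose.yml; the env var names are what I believe agent-services reads, so double-check them):
```yaml
# agent-services section of docker-compose.yml, with the UI-generated
# credentials hard-coded (values are placeholders)
agent-services:
  environment:
    CLEARML_API_ACCESS_KEY: "ACCESS_KEY_FROM_UI"
    CLEARML_API_SECRET_KEY: "SECRET_KEY_FROM_UI"
```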
For now, I've written a headless selenium script to generate credentials for the fresh ClearML instance in CI.
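Rough shape of the script, for posterity (the login flow, URLs, and selectors here are hypothetical placeholders, not the actual ClearML UI markup):
```python
# sketch of the headless credential-creation script; selectors are hypothetical
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# log into the fresh instance (placeholder path and field names)
driver.get("http://localhost:8080/login")
driver.find_element(By.NAME, "name").send_keys("ci-user")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# open the settings page and click the create button (placeholder selectors)
driver.get("http://localhost:8080/settings")
driver.find_element(By.XPATH, "//button[contains(., 'Create new credentials')]").click()

# scrape the generated access/secret pair out of the dialog (placeholder selector)
print(driver.find_element(By.CSS_SELECTOR, ".credentials-dialog").text)
driver.quit()
```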
The agent commands are nothing special:
clearml-agent daemon --queue sessions --cpu-only --create-queue true --docker
So the problem came back even with this new URL. I discovered clearing your cookies fixes it.
Totally worked!
Will do!
Thank you! For now, it's kind of nice that it just picks up your credentials from your conf file. No extra setup required beyond the onboarding ClearML has you do 😄
And look! It's working, assuming you start the clearml session up yourself:
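By "start the clearml session up yourself" I just mean running the CLI directly against the queue from my agent command, something like:
```
clearml-session --queue sessions
```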
Well wow, I figured it out. You equipped me with a solid debugging tool, a.k.a. running bash commands within the docker container.
I had to pre-add GitHub and Bitbucket to known hosts by adding keyscan commands (sketched after the config below):
configurations:
  extra_clearml_conf: ""
  extra_trains_conf: ""
  extra_vm_bash_script: |
    echo "fetching github key" && (aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa &...
Oh, I wasn't aware of that. I don't think it'd work for this use case, though. We're trying to test the behavior you can see in this extension: https://share.descript.com/view/g0SLQTN6kAk (basically the examples I mentioned in that earlier message).
^^^ For my own notes: this is the web request made by the frontend to create a set of credentials
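If it's useful later, replaying it outside the browser would look something like this (the endpoint name is from my network-tab notes; the cookie name and port are assumptions to verify):
```python
# sketch: replay the frontend's create-credentials call against the apiserver
import requests

API = "http://localhost:8008"  # apiserver; adjust if you're going through a proxy
# session cookie copied from the browser after logging into the web UI
cookies = {"SESSION_COOKIE_NAME": "value-copied-from-browser"}

resp = requests.post(f"{API}/auth.create_credentials", cookies=cookies, json={})
resp.raise_for_status()
print(resp.json()["data"]["credentials"])  # access_key / secret_key pair
```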
I SOLVED IT, NO NEED TO READ FURTHER 😄
I'm a chump and didn't read the docs.
Oh, I think I got overexcited and didn't look at this closely. So this ACCESS/SECRET key pair is on the agent-services container. I can see that agent-services is simply a container running `clearml-agent daemon --queue ser...`
I do agree with your earlier observation that the target of that mount seems wrong. I would think that the volume mount should be `-v /root/.ssh:/root/.ssh`, but instead it's `-v /root.ssh:/.ssh`.
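In docker-compose terms, the fix I'd expect (assuming this lives under the agent-services service) is:
```yaml
agent-services:
  volumes:
    - /root/.ssh:/root/.ssh   # rather than /root.ssh:/.ssh
```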
The question I'm exploring remains: is it possible to acquire that initial set of ClearML API keys programmatically so that the manual steps of 1-4 above can be avoided for an initial deployment?
Man, I owe you lunch sometime @<1523701205467926528:profile|AgitatedDove14> . Thanks for being so detailed in your answers.
Okay! So the pipeline ID is really just a task ID. So cool!
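Which means the plain Task API works on it, e.g. (quick sketch):
```python
from clearml import Task

# a pipeline run is itself a task, so its ID works anywhere a task ID does
pipeline_task = Task.get_task(task_id="pipeline-id-from-the-ui")
print(pipeline_task.name, pipeline_task.get_status())
```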
Not sure I fully understand what you mean here...
Sorry, I'll try again. Here's an illustrated example with AWS Step Functions (pretend this is a ClearML pipeline). If the pipeline fails, I'd want to have a chance to do some logic to react to that. Maybe in a step called "on_pipeline_failed" or someth...
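To make it concrete, something like this is what I'm picturing (a sketch; on_pipeline_failed is my own hypothetical function, not a ClearML hook):
```python
import time
from clearml import Task

def on_pipeline_failed(task: Task) -> None:
    # hypothetical reaction logic: alert someone, publish an event, etc.
    print(f"pipeline {task.id} failed")

pipeline_task = Task.get_task(task_id="pipeline-id")
# poll until the pipeline's controller task reaches a terminal state
while pipeline_task.get_status() not in ("completed", "failed", "stopped"):
    time.sleep(30)
    pipeline_task.reload()  # refresh the cached status from the server

if pipeline_task.get_status() == "failed":
    on_pipeline_failed(pipeline_task)
```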
Oh duh, thanks. What about non-standard entrypoints (as opposed to arguments), like `accelerate launch train.py`?
Here's a screenshot of a session where I first try to clone as ssm-user, but it fails; then I change to root and it succeeds.
When you run the docker-compose.yml on an EC2 instance, you can configure user login for the ClearML webserver. But the files API is still open to the world, right? (And same with the backend?)
We could solve this by placing the EC2 instance into a VPN.
One disadvantage to that approach is it becomes annoying to reach the model registry from outside the VPN, like if you have a deployment pipeline based in GitHub Actions. Or if you wanted to trigger a ClearML pipeline from a VPC that isn...
It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example?
Absolutely. So, let's say a DS tags a model in ClearML with "release candidate". It'd be great to have that trigger a number of processes, each with their own retry logic (see the sketch after this list):
- A fairness/bias evaluation, potentially as a task in ClearML itself. This would load the model and run some sample datasets through it. The ...
- Pipeline to prepare for deployment. Trigger a GitHub Actions ...
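The ClearML side of that could stay as thin as a trigger that just forwards into our pub-sub system (a sketch using clearml's TriggerScheduler; publish_event is a stand-in for our actual publisher):
```python
from clearml.automation import TriggerScheduler

def publish_event(model_id: str) -> None:
    # stand-in for our pub-sub publisher; subscribers (fairness eval,
    # deployment pipeline, ...) each react with their own retry logic
    print(f"model {model_id} tagged as release candidate")

trigger = TriggerScheduler()
trigger.add_model_trigger(
    name="release-candidate-tag",
    schedule_function=publish_event,  # called with the triggering model's ID
    trigger_on_tags=["release candidate"],
)
trigger.start()  # blocks and polls the server for matching events
```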
Here we go. Trying with this
So here's a snippet from my aws_autoscaler.yaml file.
For these functions, Metaflow offers:
- Triggering: integration with AWS EventBridge. It's really easy to use boto3 and AWS access keys to emit events for Metaflow DAGs (quick sketch after this list). It's nice not to have to worry about networking for this.
- Scheduling: the fact that Metaflow uses Step Functions is reassuring.
- Observability: this lovely flame graph where you can view the logs and duration of each step in the DAG; it's easy to view all the DAG runs, including the ones that have failed. Ideally, we w...
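e.g. emitting a trigger event is just this (sketch; the bus/source/detail names are our own conventions, not Metaflow's):
```python
import json
import boto3

# emit a custom event that an EventBridge rule can route to a Metaflow DAG
events = boto3.client("events", region_name="us-west-2")
events.put_events(
    Entries=[{
        "Source": "ml.models",                    # our naming convention
        "DetailType": "model.release_candidate",  # our naming convention
        "Detail": json.dumps({"model_id": "abc123"}),
        "EventBusName": "default",
    }]
)
```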