For now we've monkey-patched it to our usecase:
LOL, that's a cool hack
That gives us the benefit of creating "local datasets" (confined to the scope of the project, do not appear in the Datasets tab, but appear as normal tasks within the project)
So what would be a "perfect" solution here?
I think I'm missing the point on why it became an issue in the first place.
Notice that in new versions Dataset will be registered on the Tasks that use them (they are already...
Nice! So out of curiosity why didn't it work this time and you had to do it manually?
repeat it until they are all dead :)
The AWS autoscaler will work with IAM roles as long as you have them configured on the machine itself. For SageMaker job scheduling (I'm assuming this is what you are referring to, and not the notebook), you need to select the instance as well (basically the same as EC2). What do you mean by using the k8s glue? Inheriting and implementing the same mechanism, but for SageMaker instead of kubectl?
Even if you had any packages, I'm pretty sure there is nothing for you to worry about, it will just list them, and if they are preinstalled, the preinstalled will be used
Gitlab has support for S3 based cache btw.
This might still be considered "slow" compared to a local-disk/cluster mount
Would adding support for some sort of post task script help? Is something already there?
Interesting, can you expand on the use case? (currently there is only pre-task script, for setup)
OddAlligator72 just so I'm sure I understand your suggestion:
pickle the entire locals() on current machine.
On the remote machine, create a mock Python entry point, restore the "locals()" and execute the function?
BTW:
Making this actually work regardless of the machine is some major magic in motion ... :)
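The suggestion above (pickle the locals on one side, restore them in a mock entry point on the other) could be sketched roughly as follows. This is only an illustration, not how ClearML does it: `snapshot_locals`, `run_from_snapshot` and `scale` are hypothetical names, and the sketch assumes the remote side already has the function's code and only needs the argument values.

```python
import pickle


def snapshot_locals(local_vars):
    """Pickle only the entries of a locals() dict that can be serialized."""
    picklable = {}
    for name, value in local_vars.items():
        try:
            pickle.dumps(value)
        except Exception:
            # Lambdas, open files, sockets, etc. cannot be pickled; skip them.
            continue
        picklable[name] = value
    return pickle.dumps(picklable)


def run_from_snapshot(blob, func, arg_names):
    """Mock entry point: restore the snapshot and call func with values from it."""
    restored = pickle.loads(blob)
    return func(**{name: restored[name] for name in arg_names})


def scale(values, factor):
    # Hypothetical function that exists in the code on both machines.
    return [v * factor for v in values]


def local_side():
    values = [1, 2, 3]
    factor = 10
    callback = lambda x: x  # not picklable, so it is dropped from the snapshot
    return snapshot_locals(locals())


blob = local_side()  # bytes that would be shipped to the remote machine
result = run_from_snapshot(blob, scale, ["values", "factor"])
print(result)
```

The silent skip of unpicklable values is exactly where the "major magic" comes in: anything dropped here (or anything that depends on the local filesystem or environment) would have to be recreated on the remote machine some other way.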
This is odd, Can you send the full Task log? (remove any pass/user/repo that you think is sensitive)
Hi @<1661904968040321024:profile|SpotlessOwl43>
My problem is that when the AWS virtual machine is killed, my Pipelines and Scheduling stop working because of the killed ClearML agent,
are you using the ClearML AWS autoscaler to spin that machine ? or are you spinning it manually ?
Is this some sort of polling ?
yes
End of the day, we are just worried whether this will hog resources compared to a web-hook? Any ideas?
No need to worry, it pulls every 30 sec, and this is negligible (as a comparison any task will at least send a write request every 30 sec, if not more)
Actually webhooks might be more taxing on the server, as you need to always have a webhook up (i.e. wasting a socket ...)
@<1595587997728772096:profile|MuddyRobin9> are you sure it was able to spin the EC2 instance ? which version of the clearml autoscaler are you running ?
MysteriousBee56 not a different port, just not with "localhost" but with your machine's IP
TenseOstrich47 as long as the machine running the agent has credentials to your ECR, the agent will be able to pull any docker container it runs. There is no need to manually change anything. Notice that the Task itself contains the name of the image it will use
What happened in the server configuration that all of a sudden you have zero ports open?
Do you have a link on how to set up a task scheduler to run in service mode in k8s?
basically spin the agent pod and add an argument to the agent itself (this is the --services-mode)
https://clear.ml/docs/latest/docs/clearml_agent#services-mode
FreshParrot56 we could add this capability, but the main caveat is that if your version depends on multiple parent versions, you still need to download and extract all the parent versions, which means that clearing them might hurt later performance. Does that make sense? What is the use-case / scenario for you?
BTW: server-side vault is in progress, hopefully will be available in the upcoming releases :)
5 seconds is the sleep between two consecutive pulls when there are no jobs to process. Why would you increase it to a higher pull freq ?
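To illustrate the point about the sleep only happening between empty pulls, here is a minimal polling-loop sketch. It is not the agent's actual code: `poll_queue`, `idle_sleep_sec` and `max_idle_polls` are hypothetical names, and an in-memory `deque` stands in for the real queue.

```python
import time
from collections import deque


def poll_queue(queue, handle, idle_sleep_sec=5.0, max_idle_polls=3, sleep=time.sleep):
    """Pull jobs until the queue stays empty for max_idle_polls consecutive pulls."""
    idle_polls = 0
    processed = 0
    while idle_polls < max_idle_polls:
        if queue:
            handle(queue.popleft())
            processed += 1
            idle_polls = 0  # work was found, so pull again immediately
        else:
            idle_polls += 1
            sleep(idle_sleep_sec)  # sleep only when there was nothing to do
    return processed


# Demo with a recording stand-in for time.sleep, so nothing actually waits.
jobs = deque(["job-a", "job-b"])
done = []
naps = []
processed = poll_queue(jobs, done.append, sleep=naps.append)
print(processed, done, naps)
```

With this shape, a longer sleep only delays how quickly an idle worker notices a new job; it does not slow down a worker that still has jobs waiting.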
Seems like pip 20.1.1 has the issue, but versions >= 22.2.2 do not.
Notice we changed the value there, it now has two versions, one for Python < 3.10 and one for Python >= 3.10
The main reason is that pip changed their resolving algorithm, and the new one can break its own dependencies (i.e. pip freeze > requirements.txt -> pip install might not actually work)
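As a rough way to check an environment against the versions mentioned above, a small comparison helper could look like the sketch below. The helper names are hypothetical, and the only facts taken from the thread are that 20.1.1 showed the issue and 22.2.2 or newer did not; nothing is claimed about the versions in between.

```python
def parse_version(text):
    """Turn '22.2.2' into (22, 2, 2); non-numeric components are dropped."""
    parts = [int(p) for p in text.split(".")[:3] if p.isdigit()]
    return tuple(parts + [0] * (3 - len(parts)))


def pip_is_known_good(version_text, minimum="22.2.2"):
    """True when the version is at least the one reported as working in this thread."""
    return parse_version(version_text) >= parse_version(minimum)


for candidate in ("20.1.1", "22.2.2", "23.0"):
    print(candidate, pip_is_known_good(candidate))
```

To check the running interpreter, the same helper can be fed `pip.__version__` after `import pip`.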
Could it be the credentials are actually incorrect? because it seems like you can access the server? (I assume you were able to browse to it and generate credentials. right?)
Hi ClumsyElephant70
Any idea how to get the credentials in there?
How about mapping it into the docker with -v? You can set it here:
https://github.com/allegroai/clearml-agent/blob/0e7546f248d7b72f762f981f8d9033c1a60acd28/docs/clearml.conf#L137
extra_docker_arguments: ["-v", "/host/folder/cred.json:/gcs/cred.json"]
JitteryCoyote63 are you suggesting it happens ?
(obviously it should not :) )
So a bit of explanation on how conda is supported. First, conda is not recommended; the reason is that it is very easy to create a setup on conda that is un-reproducible by conda (yes, exactly that). So what trains-agent does is first try to install all the packages it can with conda (not one by one, because that would break conda dependencies); the packages that it failed to install from conda are then installed using pip.
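The conda-first, pip-second order described above can be sketched as a small piece of control flow. This is an illustration only and not trains-agent's implementation: `install_with_fallback` is a hypothetical name, and the two installer callables are assumed stand-ins (the conda one is assumed to take the whole batch and return the names it could not provide).

```python
def install_with_fallback(packages, conda_install, pip_install):
    """Try one batched conda install, then hand whatever conda rejected to pip."""
    packages = list(packages)
    # Hypothetical contract: a single call for the whole batch, returning
    # the package names conda could not install.
    rejected = list(conda_install(packages))
    if rejected:
        pip_install(rejected)
    from_conda = [p for p in packages if p not in rejected]
    return from_conda, rejected


# Demo with fake installers that only record what they were asked to do.
conda_channel = {"numpy", "scipy"}  # made-up availability, for the demo only
conda_calls = []
pip_calls = []


def fake_conda(batch):
    conda_calls.append(list(batch))
    return [p for p in batch if p not in conda_channel]


def fake_pip(batch):
    pip_calls.append(list(batch))


from_conda, from_pip = install_with_fallback(
    ["numpy", "scipy", "some-pip-only-pkg"], fake_conda, fake_pip
)
print(from_conda, from_pip)
```

Passing the full batch to conda in one call mirrors the "not one by one" remark: conda gets to resolve the dependencies together, and pip only sees the leftovers.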
Hmm, in the credentials popup there should be a "secure connect" checkbox, it tells it to use https instead of http. Can you verify?
Hi SubstantialBaldeagle49
Yes, you can back up the entire trains-server (see the GitHub docs on how).
You mean upgrading the server?
Yes, you can change the name or add comments (Info tab / description), and you can add key/value descriptions (under the configuration tab, see user properties)
TartSeal39 please let me know if it works, conda is a strange beast and we do our best to tame it.
Specifically when you execute manually on a conda env we collect (separately) the conda packages & the python packages (so later we can replicate on both conda & pip, or at least do our best)
Are you running both development env and agent with conda ?