Let me check something
RobustRat47 I think you have to use the latest clearml package for that (1.6.0)
Hi @<1523701868901961728:profile|ReassuredTiger98>
Anyone here with any idea why my service tasks get aborted when going to sleep?
I think I understand the issue, clearml==1.4.0
try running with the latest clearml (1.10.x)
It will keep pinging the backend "I'm alive" so the backend does not think the process is dead (which is what I suspect happened: after 2 hours the backend basically set the Task to aborted because it "thought" the process was killed)
Hi @<1658281093108862976:profile|EncouragingPenguin15>
Should work. I'm assuming multiple nodes are running agents? Or are you saying Ray spins up the jobs and clearml logs them?
DefeatedCrab47 no idea, but you are more than welcome to join the thread here, and point it out:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/issues/249
Thanks GrievingTurkey78 !
It seems that under the hood they use argparse
See here:
https://github.com/google/python-fire/blob/c507c093fa6622ab5efee21709ffbf25974e4cf7/fire/parser.py
Which means it might just work?!
What do you think?
Assuming you are using docker-compose, the console output is a good start
Does it work if I launch the clearml-agent on a docker and pip doesn't know the packages to install
Not sure I follow... the "detect_with_pip_freeze" flag (when set) will tell clearml (at runtime) to create the "installed packages" directly from pip freeze (instead of analyzing the code)
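For reference, a minimal sketch of how that flag could look in clearml.conf (the exact section path is my assumption, taken from the default config template):
sdk {
  development {
    # when set, "installed packages" is taken from `pip freeze` at runtime
    # instead of from static analysis of the script's imports
    detect_with_pip_freeze: true
  }
}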
PricklyRaven28 did you set the IAM role support in the conf?
https://github.com/allegroai/clearml/blob/0397f2b41e41325db2a191070e01b218251bc8b2/docs/clearml.conf#L86
CluelessFlamingo93 I would also fix the pip version requirements to:
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"]
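For context, a sketch of where that setting would go on the agent machine (assuming the standard agent.package_manager section of clearml.conf):
agent {
  package_manager {
    # pin pip per python version to avoid resolver regressions
    pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"]
  }
}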
This is good news, that means the k8s glue created a k8s job and pushed the Task into the "k8s_scheduler" queue, for visibility (i.e. it is now the k8s job to launch the pod).
Can you check on the Task Info tab what is the status/message ? (it should reflect the k8s pod status)
Could it be that clone has to be False? (I assume the reasoning is the cloning feature)
Although I didn't understand why you mentioned
torch
in my case?
Just a guess 🙂 other frameworks do multi-process as well,
I would guess it relates to the parallelization of Task execution by the
HyperParameterOptimizer
class?
Yes, that might be it; it's basically a by-product of using python's "Process" class for multiprocessing. We are working on a fix, not a trivial one unfortunately
Now will these 10 experiments be of different names? How will I know these are part of the 'mnist1' HPO case?
Yes (they will have the specific HP name/value combination).
FYI names are not unique so in theory you could have multiple experiments with the same name.
If you look under the Configuration Tab, you will find all the configuration arguments for the experiment. You can also add specific arguments to the experiment table (click the cogwheel at the right top corner, and select...
Hi OutrageousGiraffe8
Does anybody knows why this is happening and is there any workaround, e.g. how to manually report model?
What exactly is the error you are getting? And which clearml version are you using?
Regarding manual Model reporting:
https://clear.ml/docs/latest/docs/fundamentals/artifacts#manual-model-logging
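As a rough sketch of the manual route described on that page (project/task names, the framework string and the weights file are placeholders):
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="manual model logging")

# register an output model on the current task and upload its weights file
output_model = OutputModel(task=task, framework="PyTorch")
output_model.update_weights(weights_filename="model.pt")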
Eg, i'm creating a task using
clearml.Task.create
, often it doesn't get the git diff correctly,
ShakyJellyfish91 Task.create does not store any "git diff" automatically, is there a reason not to use Task.init
?
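For comparison, a minimal sketch of the two (project/task names, repo URL and script are placeholders); Task.init runs in the current process and is the one that captures the uncommitted changes:
from clearml import Task

# Task.init: attaches to the running process, stores repo, commit and local diff
task = Task.init(project_name="examples", task_name="with git diff")

# Task.create: only registers a task from the given repo/script, no local diff is stored
other = Task.create(
    project_name="examples",
    task_name="no local diff",
    repo="https://github.com/your/repo.git",
    script="train.py",
)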
First let's test whether everything works as expected, since a 405 really feels odd to me here. Can I suggest following one of the examples start to end to test the setup, before adding your model?
Assuming Tensorflow (which would be an entire folder):
local_folder_or_files = model.get_weights_package()
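A hedged sketch of how you might reach that call from a previous task (the task id and picking the last "output" model are assumptions on my side):
from clearml import Task

prev_task = Task.get_task(task_id="<your_task_id>")      # placeholder id
model = prev_task.models["output"][-1]                    # last output model of that task
local_folder_or_files = model.get_weights_package()       # downloads the full weights package/folder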
It actually started executing your code, but it did not capture it correctly:
/root/.clearml/venvs-builds/3.10/bin/python -u /root/.clearml/venvs-builds/3.10/code/colab_kernel_launcher.py
Which I assume means the actual Task had bad code.
What do you have under the Task execution tab in the UI (the one you were launching, i.e. enqueueing)?
I want to schedule bulk tasks to run via agents, so I'm running
create
I see, that makes sense.
especially when dealing with submodules,
BTW: submodule diff should always get stored, can you provide some error logs on fail cases?
Before manually modifying the diff:
If you have local commits (i.e. un-pushed) this might fail the diff apply; in that case you can set the following in your clearml.conf:
store_code_diff_from_remote: true
https://github.com/allegroai/clear...
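A sketch of where that setting would sit (the sdk.development section is my assumption, based on where the other development flags live):
sdk {
  development {
    # take the diff against the remote branch instead of the local (possibly un-pushed) HEAD
    store_code_diff_from_remote: true
  }
}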
Hi DashingHedgehong5
Is the text the labels on the histogram bucket?
Notice the xlabels argument, is this what you are looking for?
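A minimal sketch of what I mean, assuming Logger.report_histogram (title, series and values are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="histogram with xlabels")
logger = task.get_logger()

# xlabels sets the per-bucket text shown on the x axis
logger.report_histogram(
    title="value distribution",
    series="buckets",
    values=[3, 7, 2, 9],
    iteration=0,
    xlabels=["a", "b", "c", "d"],   # one label per bucket
)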
Hi @<1566596960691949568:profile|UpsetWalrus59>
you should call it before initializing the Task
Task.ignore_requirements("pywin32")
task = Task.init(...)
I think I found something,
https://github.com/allegroai/clearml/blob/e3547cd89770c6d73f92d9a05696018957c3fd62/clearml/storage/helper.py#L1442
What's the boto version you have installed?
Anyhow, if the StorageManager.upload was fast, upload_artifact is calling that exact function, so I don't think we actually have an issue here. What do you think?
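To make the comparison concrete, a rough sketch of the two calls I'm referring to (bucket path and file name are placeholders, and I'm assuming StorageManager.upload_file is the call you timed):
from clearml import Task, StorageManager

task = Task.init(project_name="examples", task_name="upload timing")

# direct upload through the storage manager
StorageManager.upload_file(local_file="data.bin", remote_url="s3://my-bucket/data.bin")

# artifact upload; under the hood this goes through the same storage helper
task.upload_artifact(name="data", artifact_object="data.bin")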
That said, the arguments are passed inside the executed code (i.e. monkey-patched into the frameworks). This allows it to log and change all the arguments, including the default ones, and allows you to edit them.
Does that make sense ?
PompousHawk82 unfortunately this is kind of binary, either you have full tracking of load/save operations or you do not.
This warning message will disappear in the next version as we will be able to log multiple models under the same Task :)
Can you put the task.connect line here? (btw: I would assume there is no need for an additional connect if using hydra+fire, no?)