I cannot modify an autoscaler that is currently running
Yes, this is a known limitation, and I know they are working on fixing it for the next version
We basically have Flask commands that allow us to trigger specific behaviors. ...
Oh I see now, I suspect the issue is that the flask command is not executed from within the git project?!
Hi AverageBee39
Did you set up an agent to execute the actual Tasks?
Was going crazy for a short amount of time yelling to myself: I just installed clear-agent init!
oh noooooooooooooooooo
I can relate so much, happens to me too often that copy pasting into bash just uses the unicode character instead of the regular ascii one
I'll let the front-end guys know, so we do not make people go crazy 🙂
Okay, what you can do is the following:
Assuming you want to launch task ID aabb12, the actual slurm command will be:
trains-agent execute --full-monitoring --id aabb12
You can test it on your local machine as well.
Make sure the trains.conf is available in the slurm job
(use trains-agent --config-file to point to a globally shared one)
What do you think?
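Putting the steps above together, a minimal sbatch wrapper might look like the sketch below. The job name, resource line, and the shared config path are assumptions; only the trains-agent line comes from the instructions above:

```shell
# Generate a slurm batch script that runs the agent for task aabb12.
# /shared/trains.conf is a hypothetical globally shared config location.
cat > task_aabb12.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=trains-task-aabb12
#SBATCH --ntasks=1
trains-agent --config-file /shared/trains.conf execute --full-monitoring --id aabb12
EOF
```

You would then submit it with `sbatch task_aabb12.sbatch`, and can test the same trains-agent command on a local machine first.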
Can you please tell me if you know whether it is necessary to rewrite the Docker compose file?
not by default, it should basically work out of the box as long as you create the same data folders on the host machine (e.g. /opt/clearml)
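If it helps, a hedged sketch of creating those host folders; the subfolder names follow a typical clearml-server docker-compose and may differ between server versions:

```shell
# Create the host-side data folders mounted by the clearml-server containers.
# Folder names are taken from a typical deployment layout (assumption);
# check your own docker-compose volumes for the exact paths.
create_clearml_dirs() {
  local prefix="${1:-/opt/clearml}"
  mkdir -p "$prefix/data/elastic_7" \
           "$prefix/data/mongo_4/db" \
           "$prefix/data/redis" \
           "$prefix/data/fileserver" \
           "$prefix/logs" \
           "$prefix/config"
}
```

Typically you would run this as root (e.g. `sudo bash -c '...; create_clearml_dirs /opt/clearml'`) since /opt usually isn't user-writable.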
Hi CheekyFox58
If you are running the HPO+training on your own machine, it should work just fine in the Free tier
The HPO with the UI and everything, is designed to run the actual training on remote machines, and I think this makes it a Pro feature.
Hi PunyGoose16 ,
the next release includes it (ETA after this weekend 🙂)
Now in case I needed to do it, can I add new parameters to a cloned experiment, or will these get deleted?
Adding new parameters is supported 🙂
Full markdown edit on the project so you can create your own reports and share them (you can also put links to the experiments themselves inside the markdown). Notice this is not per experiment reporting (we kind of assumed maintaining a per experiment report is not realistic)
JitteryCoyote63 what am I missing?
What are the errors you are getting (with / without the envs)?
What's the trains-server version?
GloriousPanda26 Are you getting multiple Tasks, or is it a single Task?
Hi SpotlessLeopard9
I got many tasks that just hung at the end of the script without ...
I remember this exact issue was fixed with 1.1.5rc0, see here:
https://clearml.slack.com/archives/CTK20V944/p1634910855059900
Can you verify with the latest RC?
pip install clearml==1.1.5rc3
Ohh try to add --full-monitoring to the clearml-agent execute
I think that listing them all would just clutter up the results tab for that pipeline task
Can you share a screenshot so we better understand the clutter?
Also, "1000 components"?! And not using them? Could you expand on how/why?
That means I need to pass a single zip file to the path argument in add_files, right?
Actually the opposite: you pass a folder (of files) to add_files. Then add_files remembers the files' location (and pre-calculates the hash of each file's content). When you call upload, it will actually compress the files that changed into a zip file (or files, depending on the chunk size), and upload the files to the destination (as specified in the upload call...
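To illustrate the idea, here is a toy sketch of the hash-then-zip-only-changed-files behavior described above. This is not the actual clearml implementation, just the general technique:

```python
import hashlib
import os
import zipfile

def file_hash(path):
    """SHA-256 of a file's content, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(folder):
    """Map each relative file path in the folder to its content hash."""
    state = {}
    for root, _, files in os.walk(folder):
        for name in files:
            full = os.path.join(root, name)
            state[os.path.relpath(full, folder)] = file_hash(full)
    return state

def upload_changed(folder, previous_state, zip_path):
    """Zip only the files whose hash differs from the previous snapshot."""
    current = snapshot(folder)
    changed = [p for p, h in current.items() if previous_state.get(p) != h]
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in changed:
            zf.write(os.path.join(folder, rel), rel)
    return changed, current
```

The real SDK adds chunking, remote storage, and dataset lineage on top, but the "remember hashes at add_files time, compress only the diff at upload time" shape is the same.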
Hi @<1625303806923247616:profile|ItchyCow80>
Could you add some prints? Is it working without the Task.init call? The code looks okay, and the "No repository found" message basically says it is logged as a standalone script (which makes sense)
I'm so glad you mentioned the cron job, it would have taken us hours to figure
Hi DepressedChimpanzee34
How do I reproduce the issue ?
What are we expecting to get there ?
Is that a Colab issue or hyper-parameter encoding issue ?
WackyRabbit7 If you have an idea on an interface to shut it down, please feel free to suggest one.
If i point directly to the data.yaml the training starts without any problem
what do you mean? how do you know where the extracted file is?
basically:
data_path = Dataset.get(...).get_local_copy()
then you should be able to open your file with open(data_path + "/data.yaml", "rt")
does that work?
Hi ZealousSeal58
What's the clearml version you are using?
If there was a "debug mode" for viewing the stack trace before the crash that would've been most helpful...
import traceback
traceback.print_stack()
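As a rough "debug mode" substitute, you could also install an excepthook so any uncaught exception prints its full trace to stderr before the process dies. This is a generic Python sketch, not a clearml feature:

```python
import sys
import traceback

def print_trace_and_exit(exc_type, exc, tb):
    # Dump the full stack trace to stderr before the process dies.
    print("Uncaught exception, stack trace follows:", file=sys.stderr)
    traceback.print_exception(exc_type, exc, tb)

# Every uncaught exception now goes through the hook.
sys.excepthook = print_trace_and_exit
```

Put this near the top of the script so the hook is installed before anything can crash.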
Hi SmoggyGoat53
There is a storage limit on the file server (basically a 2GB per-file limit); this is the cause of the error.
You can upload the 10GB to any S3-like solution (or a shared folder). Just set the "output_uri" on the Task (either at Task.init or with Task.output_uri = "s3://bucket")
Hmm so yes that is true, if you are changing the bucket values you will have to manually also adjust it in grafana. I wonder if there is a shortcut here, the data is stored in Prometheus, and I would rather try to avoid deleting old data, Wdyt?
I can't find out how to pass my custom clearml.conf
Hi @<1544491301435609088:profile|TeenyElk27>
The easiest is to map it into the container in your docker-compose
(map a host clearml.conf into /root/clearml.conf inside the container)
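For example, a hypothetical volumes entry for whichever service needs the config; the host path and service name are assumptions, so adjust them to your setup:

```yaml
services:
  agent-services:
    volumes:
      # host clearml.conf -> container's /root/clearml.conf
      - /home/user/clearml.conf:/root/clearml.conf
```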
Hi @<1554275779167129600:profile|ProudCrocodile47>
Do you mean @ clearml.io ?
If so, then this is the same domain (.ml is sometimes flagged as spam, I'm assuming this is why they use it)