
So this can be translated to:
CLEARML__SDK__AZURE__STORAGE__CONTAINERS__0__ACCOUNT_NAME=abcd
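For reference, that environment variable overrides the matching entry in clearml.conf, roughly this section (a sketch, using the same placeholder value):
sdk {
    azure.storage {
        containers: [
            {
                account_name: "abcd"
                account_key: ""
                container_name: ""
            }
        ]
    }
}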
btw, I looked deeper into the log:
File "/tmp/tmpfa8ifmka.py", line 80, in <module>
model.train(data='coco128.yaml',epochs=20)
I'm assuming this all starts here. I think the pipeline is not running the code from the same folder, so it simply cannot find 'coco128.yaml'. Try passing a full path, wdyt?
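Something like this, for example (a sketch, assuming the Ultralytics YOLO API from the log above and that the yaml sits next to your script; the weights file and path building are just illustrative):
import os
from ultralytics import YOLO

# resolve the yaml relative to this script, so it works regardless of the pipeline's working directory
data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'coco128.yaml')

model = YOLO('yolov8n.pt')  # stand-in for however the model is built in your script
model.train(data=data_path, epochs=20)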
Hmm I see your point.
Any chance you can open a github issue with a small code snippet to make sure we can reproduce and fix it?
Okay, now let's try the final lines:
$LOCAL_PYTHON -m virtualenv /root/venv
/root/venv/bin/python3 -m pip install git+
VexedCat68 the remote checkpoints (i.e. Models) represent the local storage, so if you overwrite the files locally, exactly the same will happen in the backend. So the following should work (and keep only the last 5 checkpoints):
epochs += 1
torch.save(model, "model_{}.pt".format(epochs % 5))
Regarding deleting / getting models:
Model.remove(task.models['output'][-1])
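Putting the two together, a minimal sketch (project/task names and the tiny stand-in model are just illustrative; the rotation and removal calls are the ones discussed above):
from clearml import Task, Model
import torch
import torch.nn as nn

task = Task.init(project_name="examples", task_name="rotating checkpoints")  # illustrative names
model = nn.Linear(4, 2)  # stand-in for the real model

for epoch in range(1, 21):
    # ... training step would go here ...
    # write into one of 5 rotating files, so at most 5 checkpoints exist locally and in the backend
    torch.save(model, "model_{}.pt".format(epoch % 5))

# and to drop a registered output model that is no longer needed:
Model.remove(task.models['output'][-1])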
Okay, so you want to take the jupyter notebook (aka colab) and have that experiment show on Trains, then use the Trains UI to launch it remotely on one of the machines running the trains-agent. Is that correct?
PlainSquid19 yes, the link is available in the actual paid product 🙂
I don't think they have the documentation open yet...
My recommendation is to fill out the contact-us form; you'll get a free online tour as well 🙂
For setting up trains-server I would recommend the docker-compose; it is very easy to set up, and you just need a single fixed compute instance. Details: https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md (a rough sketch of the steps is below). With regards to the "low prio clusters", are you asking how they could be connected with the trains-agent, or whether running code that uses trains will work on them?
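For reference, the docker-compose setup boils down to roughly this (a sketch from memory; the exact URL, data paths, and any Elasticsearch sysctl settings are in the linked doc):
# create a folder for the server data and fetch its docker-compose file
sudo mkdir -p /opt/trains/data
sudo curl -L https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose.yml -o /opt/trains/docker-compose.yml
# bring up the server containers in the background
docker-compose -f /opt/trains/docker-compose.yml up -d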
quick video of the search not working
Thank you! This is very helpful, passing it along to the front-end guys 🙂
and ctrl-f (of the browser) doesn't work since the lines below are not loaded (even when you scroll, it removes the lines that are no longer visible, so you can't ctrl-f them)
Yeah, that's because they are added lazily
It is stored on the Task itself
So obviously the straightforward solution is to normalize the step value when reporting to TB, i.e. int(step/batch_size). Which makes sense, as I suppose the batch size is known and is part of the hyper-parameters. Normalization itself could also be done when comparing experiments in the UI, and the backend can do that if given the correct normalization parameter. I think this feature request should actually be posted on GitHub, as it is not as simple as one might think (the UI needs to a...
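For the first option, a minimal sketch of normalizing the step before writing to TensorBoard (assuming torch.utils.tensorboard; the batch size, loop and metric values are just illustrative):
from torch.utils.tensorboard import SummaryWriter

batch_size = 32  # known hyper-parameter (illustrative value)
writer = SummaryWriter()

for step in range(0, 320, 32):  # raw step counter, e.g. counting samples seen
    loss = 1.0 / (step + 1)  # placeholder metric
    # divide the reported step by the batch size so runs with different batch sizes share an x-axis
    writer.add_scalar("train/loss", loss, global_step=int(step / batch_size))
writer.close()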
I guess it's on me to check whether this slowdown is negligible or not
Usually the performance impact is negligible, especially with a GPU
But if you really want the best:
Add --security-opt seccomp=unconfined
to the extra_docker_arguments
See details:
https://betterprogramming.pub/faster-python-in-docker-d1a71a9b9917
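In clearml.conf that would look roughly like this (a sketch; extra_docker_arguments lives under the agent section):
agent {
    # passed verbatim to `docker run` when the agent spins up task containers
    extra_docker_arguments: ["--security-opt", "seccomp=unconfined"]
}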
VexedCat68
So the checkpoints just added up. I've stopped the training for now. I need to delete all of those checkpoints before I start training again.
Are you uploading the checkpoints manually as artifacts, or are they auto-logged & uploaded?
Also, why not reuse and overwrite the older checkpoints?
I don't see any requests
This points to configuration; specifically, maybe it is directed at a different server?!
Hi GrievingTurkey78
I think it is already fixed with 0.17.5, no?
There is almost zero overhead if your docker container already has everything (including the agent) preinstalled and you set CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
It should then basically just run the code.
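A rough sketch of such an image (the base image and package list are just illustrative; the only ClearML-specific part is the environment variable):
FROM python:3.10-slim
# everything the task code needs, plus the agent itself, preinstalled
RUN pip install clearml clearml-agent torch
# tell the agent to skip creating a fresh python environment and reuse this one
ENV CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1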
LovelyHamster1
Also, you can use pip freeze instead of the static code analysis; on your development machines set:
detect_with_pip_freeze: true
https://github.com/allegroai/clearml/blob/e9f8fc949db7f82b6a6f1c1ca64f94347196f4c0/docs/clearml.conf#L169
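i.e. in clearml.conf on the dev machine, roughly (a sketch; the key sits in the development section referenced by the link above):
sdk {
    development {
        # use the full `pip freeze` output instead of static import analysis
        detect_with_pip_freeze: true
    }
}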
Hmm, not a bad idea 🙂
Could you please open a Git Issue, so it will not get forgotten?
(btw: I'm not sure how trivial it is to implement, nonetheless it is obviously possible 🙂)
Any insight will help; if you can provide the log of the Task that got stuck, that would be a good start
Hmm @<1523701083040387072:profile|UnevenDolphin73> I think this is the reason, None
and this means that even without a full lock file poetry can still build an environment
Hi ShallowArcticwolf27
However, the AMI for version 0.16.1 has the following docker-compose file
I think we moved the docker-compose yaml when we upgraded from trains to clearml. Any reason you are installing the old docker-compose?
Hi ShinyPuppy47,
Yes, that is correct. Use Task.init for automagic logging.
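For example (a minimal sketch; the project and task names are placeholders):
from clearml import Task

# creates the experiment on the server and turns on automatic logging
# of frameworks, console output, plots, etc.
task = Task.init(project_name="examples", task_name="my experiment")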
Okay, so I can't figure out why it would "kill" the new experiments, I mean it should run them, but is there any "smart stopping" that causes it to kill the process before it ends?
BTW: can this be reproduced with the clearml hydra example ?
any idea why I cannot select text inside the table?
Ichh, seems like plotly again 🙂 I have to admit it's quite annoying to me as well... I would vote here: None
What do you have in the artifacts of this task id: 4a80b274007d4e969b71dd03c69d504c
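If it helps, you can also check from code; a quick sketch (the task id is the one above):
from clearml import Task

task = Task.get_task(task_id="4a80b274007d4e969b71dd03c69d504c")
# list the registered artifact names (an empty dict means nothing was uploaded)
print(list(task.artifacts.keys()))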
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',
It's my bad; after that, inside the container, it does cp -Rf /.ssh ~/.ssh
The reason is that you cannot know the user home folder before spinning up the container
Anyhow the point is, are you sure that you have ~/.ssh configured on the Host machine?
And if you do, are you saying this is part of your AMI? If not, how did you put it there?
Hmm, what's the OS and Python version?
Is this simple example working for you?
None