PlainSquid19 No worries 🙂
btw: If you could check whether the mangling of the working dir / script path happens with the latest trains, that would be appreciated, because if you were running the script from "stages/" in the first place, then trains should have caught it ...
the only problem with it is that it will start the task even if the task is completed
What are the criteria?
WackyRabbit7
Long story short, yes, only by name (hashing might be too slow on large files)
The easiest solution: if the hash is incorrect, delete the local copy it returns and ask again, and it will re-download it.
I'm not sure if the hashing is exposed, but if it is not, we can add it.
What do you think?
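A minimal sketch of that workaround, assuming StorageManager is what hands you the cached copy (the verification helper itself is hypothetical):
import hashlib
import os

from clearml import StorageManager

def get_verified_copy(remote_url, expected_sha256):
    # get the (possibly cached) local copy
    local_path = StorageManager.get_local_copy(remote_url)
    with open(local_path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        # stale/corrupt cache entry: delete it and ask again to force a re-download
        os.remove(local_path)
        local_path = StorageManager.get_local_copy(remote_url)
    return local_path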
Failing when passing the diff to the git command...
Hi HealthyStarfish45
- is there an advantage in using tensorboard over your reporting?
Not unless your code already uses TB or has some built in TB loggers.
html reporting looks powerful, can one inject some javascript inside?
As long as the JS is self contained in the html script, anything goes :)
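A minimal sketch of that, assuming the usual Logger.report_media path for HTML debug samples (project/task names are placeholders):
from clearml import Task

task = Task.init(project_name='examples', task_name='html report')

# a self-contained html page with inline JS
html = (
    "<html><body><div id='out'></div>"
    "<script>document.getElementById('out').innerText = 'hello from JS';</script>"
    "</body></html>"
)
with open('report.html', 'w') as f:
    f.write(html)

# upload it as a debug sample; the embedded JS runs when the sample is viewed
task.get_logger().report_media('html', 'demo', iteration=0, local_path='report.html')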
Legit, if you have a cached_file (i.e. exists and accessible), you can return it to the caller
CourageousLizard33 Are you using the docker-compose to setup the trains-server?
WickedGoat98 if this is the case, you can check this example. Same idea only "manual":
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
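In spirit, the linked example boils down to something like this (the task id and queue name are placeholders):
from clearml import Task

# clone a "template" task, tweak a parameter, and push it to a queue
template = Task.get_task(task_id='<template_task_id>')
cloned = Task.clone(source_task=template, name='pipeline step')
cloned.set_parameter('Args/foo', 'bar')  # override a parameter for this run
Task.enqueue(task=cloned, queue_name='default')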
NVIDIA_VISIBLE_DEVICES=0,1
Basically it is used "as is" and the Nvidia drivers do the rest
Same goes for "all", or "0-3", etc.
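For context, a hedged sketch of how the variable is typically consumed (requires the nvidia container runtime; the image is just a placeholder):
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1 nvidia/cuda:11.0-base nvidia-smi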
BTW: I think we had a better example, I'll try to look for one
You can always log it manually:
from clearml import InputModel

input_model = InputModel.import_model(weights_url='/tmp/keras_example/weight.6.hdf5')
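and if you also want it attached to the current task (hedged, following the usual clearml pattern, assuming task is your Task.init handle):
task.connect(input_model)  # register the imported model on the task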
Hi SubstantialElk6
Yes, you are correct, the glue only needs to change the yaml and it will work.
When you say "Dev end", what do you mean? I was thinking of adding additional glue for multi-node, and just adding queues, for example add a "4nodes" queue and attach a glue to it, wdyt?
Regarding horovod: horovod spins up its own nodes, so integration with k8s is not trivial (regardless of ClearML). That said, I know they do have support for horovod in the Enterprise edition, but I'm not sure ...
Yes, including this. (There was a fix to an issue with trains-agent and disabling frameworks, it is already part of 0.16.3)
There is a git issue for selecting "pip freeze" / auto-analyze; we could add a "use requirements.txt" option.
wdyt?
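If it helps in the meantime, a hedged sketch of pointing clearml at a requirements file instead of the auto-analysis (available in recent clearml versions; call it before Task.init):
from clearml import Task

# hedged: store the file's content as the task requirements
# instead of the automatic package analysis / pip freeze
Task.force_requirements_env_freeze(force=True, requirements_file='requirements.txt')
task = Task.init(project_name='examples', task_name='reqs from file')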
By default an SSH server is not running in a lot of scenarios (k8s for example, Windows, macOS)...
looks like at the end of the day we removed
proxy_set_header Host $host;
and used the fqdn for the proxy_pass line
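i.e. something along these lines (server name is a placeholder; a sketch of the described change, not the exact config):
location / {
    # Host header no longer overridden (proxy_set_header Host removed)
    proxy_pass http://clearml-server.example.com:8080;  # fqdn used here
}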
And did that solve the issue?
GaudyPig83
I think there is some mismatch between the code creating the pipeline and the actual Task?! Could that somehow be the case? "relaunch_on_instance_failure" is a missing argument somehow
can you try to launch the entire Pipeline with the latest RC?
pip3 install clearml==1.7.3rc0
BroadMole98
I'm still exploring what trains is for.
I guess you can think of Trains as Experiment manager + MLOps tied together.
The idea is to give a quick and easy way to move from coding/running on one machine to scaling it to multiple remote machines, with everything that comes with it.
In some ways it is like snakemake: it sets up your environment and executes the code. Snakemake also allows you to set up data, which in Trains is done via code (StorageManager), pipelines are also...
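For instance, the data side via code looks roughly like this (the bucket URL is a placeholder):
from clearml import StorageManager

# upload a local artifact to remote storage
StorageManager.upload_file('data/train.csv', 's3://my-bucket/datasets/train.csv')

# later, anywhere: download (and cache) the remote file for local use
local_copy = StorageManager.get_local_copy('s3://my-bucket/datasets/train.csv')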
Hi JitteryCoyote63 , let me check, this backwards compatibility might only apply for API version mismatch between the client and server.
ScantWorm7
Tensorboard is automatically captured and sent to the trains server. This is in addition to the local copy of your TB files. Actually in most cases the local copy is redundant
I call Task.init after I import tensorflow (and thus tensorboard?)
That should have worked...
Can you manually add a TB report before calling the opennmt function?
(I want to verify Task.init is indeed catching the TB calls; my theory is that somewhere inside opennmt we lose the TB)
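Something like this minimal check would do (TF2 summary API; project/task names are placeholders):
import tensorflow as tf
from clearml import Task

task = Task.init(project_name='examples', task_name='tb capture check')

# emit a single TB scalar before opennmt is ever called;
# if it appears in the UI, the TB auto-capture itself is working
writer = tf.summary.create_file_writer('/tmp/tb_check')
with writer.as_default():
    tf.summary.scalar('sanity/check', 1.0, step=0)
writer.flush()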
is it possible to perform debugging operations with pycharm integration using remote session?
Sure, use clearml-session, it will open an ssh connection to the remote machine, then you can use pycharm
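For example (queue name and docker image are placeholders):
clearml-session --queue my-gpu-queue --docker nvidia/cuda:11.0-base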
Then running by using the ..., am I right?
yep
I have put the --save-period flag while running Yolov5 and ClearML does not save the weights per epoch that I have trained. Why is this happening?
But do you still see it in the clearml UI? Do you see the models logged in the clearml UI?
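For reference, a hedged usage sketch of the flag in question (YOLOv5's train.py; dataset and weights are placeholders):
python train.py --data coco128.yaml --weights yolov5s.pt --save-period 1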
Could you run your code not from the git repository?
I have a theory, you never actually added the entry point file to the git repo, so the agent never actually installed it, and it just did nothing (it should have reported an error, I'll look into it)
WDYT?
... if we have direct access to the Kubernetes worker when we run K8S glue?
Correct, if you have direct access to the Node (on your k8s cluster) from your laptop (assuming clearml-session is running from the laptop), everything should work
Hi JitteryCoyote63
Is this close?
https://github.com/allegroai/clearml/issues/283