JitteryCoyote63, just making sure, does refreshing fix the issue?
We should probably change it so it is more human readable 🙂
Firstly, thank you for your efforts and your support.
Thanks SmugOx94 !
Are you running trains-agent in docker mode? The aforementioned scripts are executed before the experiment is cloned; they are meant to be part of the docker setup, not a per-experiment script.
You could try to edit the experiment and have:
Working Directory: "."
(that means the root of the repository)
Script Path: "experiments_that_uses_library/train.py"
This will make sure you can do "import l...
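If it helps, roughly the same settings can also be passed programmatically with Task.create (a sketch only; the repo URL and project/task names below are hypothetical, the answer above refers to editing the experiment in the UI):
```python
from clearml import Task

# hypothetical repo URL and names; working_directory / script mirror the UI fields above
task = Task.create(
    project_name="examples",
    task_name="train from repo root",
    repo="https://github.com/my-org/my-repo.git",
    working_directory=".",
    script="experiments_that_uses_library/train.py",
)
```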
I managed to do it by using logger.report_scalar, thanks!
Sure, but for future reference, where (in the ignite callbacks) did you add the report_scalar call?
Could be nice to write some automation
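For anyone else looking for a place to hook it in, here is a minimal sketch (assuming pytorch-ignite; project/task names are illustrative) of calling report_scalar from an ignite event handler:
```python
from clearml import Task
from ignite.engine import Engine, Events

task = Task.init(project_name="examples", task_name="ignite scalars")  # hypothetical names
logger = task.get_logger()

def train_step(engine, batch):
    # run one training step and return the loss value
    return 0.0  # placeholder

trainer = Engine(train_step)

@trainer.on(Events.ITERATION_COMPLETED)
def report_loss(engine):
    # send the current loss to the ClearML scalars tab
    logger.report_scalar(
        title="train", series="loss",
        value=engine.state.output, iteration=engine.state.iteration,
    )
```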
Sure: Dataset.create(..., use_current_task=True)
This will basically attach/make the main Task the Dataset itself (Dataset is a type of a Task, with logic built on top of it)
wdyt ?
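Something along these lines (a sketch; the dataset/project names and the data folder are hypothetical):
```python
from clearml import Task, Dataset

task = Task.init(project_name="datasets", task_name="create dataset")  # hypothetical names

# use_current_task=True attaches the Dataset to the main Task
dataset = Dataset.create(
    dataset_name="my_dataset",
    dataset_project="datasets",
    use_current_task=True,
)
dataset.add_files("data/")  # hypothetical local folder
dataset.upload()
dataset.finalize()
```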
these are being repeated as well for a single task (this is training a t5_model with transformers):
Seems like someone is storing lots of files with torch.save
that ClearML automatically logs.
You can disable the autolog: task = Task.init(..., auto_connect_frameworks={'pytorch': False})
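For example (project/task names are placeholders; everything else stays auto-logged):
```python
from clearml import Task

# disable only the PyTorch (torch.save / torch.load) auto-logging
task = Task.init(
    project_name="examples",
    task_name="t5 training",
    auto_connect_frameworks={'pytorch': False},
)
```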
Hi JitteryCoyote63 , is there a callback for that?
SoreDragonfly16 could you reproduce the issue?
What's your OS? trains versions?
OddShrimp85 you can see the full configuration at the top of the Task log. What do you have there? Also what is the clearml python version?
Hi ReassuredTiger98
However, the clearml-agent also stops working then.
you mean the clearml-agent daemon (the one that spun up the container) is crashing as well?
SubstantialElk6 (2) yes definitely will be fixed
Regarding (1), what do you mean by "via the code"? Do you mean something like a Task docker cmd?
These are both specific cases of the glue, and yes both need to be fixed.
(1) I think is actually a feature, nonetheless we should support it.
FriendlySquid61 could you verify specifically on (2)
Thanks SubstantialElk6 !
I believe an initial fix was pushed 😉 A full one (merging the Task --env with the k8s template) will be added soon
Do we have it on the git issue ?
GiddyTurkey39 I think I need some more details, what exactly is the scenario here?
Specifically for this one, this is the auto-generated docstring from the actual code, so a PR should go here:
https://github.com/allegroai/clearml/blob/e53a76b713910adaf87578c69e86f8154d4ab4c1/clearml/logger.py#L152
Thanks JitteryCoyote63 let me double check if there is a reason for that (there might be one, not sure)
WickedGoat98 if this is the case, you can check this example. Same idea only "manual":
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
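The core of that example is basically clone-and-enqueue; a minimal sketch with the current clearml package (template/queue names are hypothetical):
```python
from clearml import Task

# take an existing "template" experiment, clone it, and send it to an agent queue
template = Task.get_task(project_name="examples", task_name="template experiment")
cloned = Task.clone(source_task=template, name="cloned experiment")
# optionally edit parameters on the clone before enqueueing, e.g. cloned.set_parameters(...)
Task.enqueue(cloned, queue_name="default")
```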
can we somehow choose the pool of ports that clearml-session works with?
Yes, I think you can.
How do you spin the worker nodes? Is it Kubernetes ?
Hi SparklingElephant70
Anyone know how to solve this?
I tried git push before,
Can you send the entire log? Could it be that the requested commit ID does not exist on the remote git (for example force push deleted it) ?
Hi PanickyMoth78
So do not tell anyone, but the next version will have reports built into ClearML, as well as the ability to embed graphs in 3rd party tools (think Notion, GitHub, markdown, etc.)
Until then (ETA mid Dec), the easiest is to download an image or just use the URL (it encodes the full view, so when someone clicks on it they get the exact view you are seeing)
GiganticTurtle0 this is exactly what I did, and ended up with two pipelines, comparing them produced what I expected (different arguments as passed by the script).
What are you getting ?
Hi DrabCockroach54
... and no logs for python script.
what do you mean by "no logs" , is it clearml logs? or k8s pod logs ?
RoundMosquito25 are you using clearml-agent daemon --stop
or are you killing them ?
Killing them basically means you lose them in the UI when they time out; the backend does not see them for 10 min, so it assumes they died. When you call clearml-agent daemon --stop they will unregister themselves and disappear immediately
Is there any progress made on the clearml-serving repo?
Hi JitteryCoyote63
yes, things are progressing slower than expected, I'm expecting actual work will be pushed in early Jan. On the bright side we are trying to work closely with TorchServing team and Nvidia Triton to expand capabilities.
Currently it seems the setup will be a "proxy server container" for pre/post processing, then a serving engine container (Triton/Torch), with a monitoring container as the control plane (i.e. collecting s...
Hi SubstantialElk6
Yes you are correct the glue only needs to change the yaml and it will work.
When you say "Dev end" , what do you mean? I was thinking adding additional glue for multi node and just adding queues , for example add 4nodes queue and attach a glue to it, wdyt?
Regarding horovod, horovod spins up its own nodes, so integration with k8s is not trivial (regardless of ClearML). That said, I know they do have support for horovod in the Enterprise edition, but I'm not sure ...
JitteryCoyote63 hacky but sure 🙂
```python
from trains.config import config_obj
print(config_obj)
```
Hi SmugOx94
Hmm are you creating the environment manually, or is it done by Task.init ?
(Basically Task.init will store the entire environment of conda, and if the agent is working with conda package manager it will use it to restore it)
https://github.com/allegroai/clearml-agent/blob/77d6ff6630e97ec9a322e6d265cd874d0ab00c87/docs/clearml.conf#L50
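The linked section of clearml.conf looks roughly like this (an excerpt from memory; exact comments and defaults vary between agent versions):
```
agent {
    package_manager: {
        # supported values: pip, conda, poetry
        type: conda,
    }
}
```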
Generally speaking, for exactly that reason: if you are passing a list of files or a folder, it will actually zip them and upload the zip file. Specifically for pipelines it should be similar. BTW I think you can change the number of parallel upload threads in StorageManager, but as you mentioned it is faster to zip into one file. Make sense?
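A sketch of the zip-then-upload approach with StorageManager (the local folder and the destination bucket/URL are hypothetical):
```python
import shutil
from clearml import StorageManager

# zip a local folder into a single archive, then upload it in one call
archive = shutil.make_archive("my_data", "zip", root_dir="data/")
remote_url = StorageManager.upload_file(
    local_file=archive,
    remote_url="s3://my-bucket/pipelines/my_data.zip",
)
print(remote_url)  # the final uploaded location
```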