DefeatedCrab47 yes that is correct. I actually meant if you see it on TensorBoard's UI 🙂
Anyhow if it's there, you should find it in the Task's Results > Debug Samples
Can you let me know if I can override the docker image using template.yaml?
No, you cannot.
But you can pass the OS environment variable "CLEARML_DOCKER_IMAGE" to set a different default one
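For example, if you launch the agent/glue from a Python wrapper, something like this should do it (just a sketch, the image name below is only an example):
import os

# default docker image the agent will use when a Task does not specify one
os.environ["CLEARML_DOCKER_IMAGE"] = "nvidia/cuda:11.8.0-runtime-ubuntu22.04"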
connect_configuration seems to take about the same amount of time unfortunately!
I think it is a better solution, that said from your description it sounds like the issue is the upload bandwidth (i.e. json-ing the dict itself), could that be it?
(and even 1000 entries seems like something that would end up as a ~1MB upload, which is not that much)
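BTW this is roughly the pattern I have in mind (project/task names are made up), just to rule out the dict itself:
from clearml import Task

task = Task.init(project_name="examples", task_name="config timing test")  # hypothetical names
# ~1000 entries; the whole dict is json-ed once and uploaded as a single configuration object
config = {"entry_%04d" % i: "value_%d" % i for i in range(1000)}
config = task.connect_configuration(config, name="large_config")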
The reason is that it is logged as an image, not a plot 🙂
@<1577468638728818688:profile|DelightfulArcticwolf22>
How can I tell clearml-agent not to run pip install unless my requirements.txt file was changed?
The agent has a built-in cache, it will reuse the previous venv if nothing changed (the cache is local on the agent's machine).
Make sure this line is not commented:
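If it helps, the relevant section in the agent's clearml.conf looks roughly like this (values are the defaults as far as I remember, double-check your own file); the line to uncomment is the path one:
agent {
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum required free space to allow for a cache entry
        free_space_threshold_gb: 2.0
        # uncomment to enable venv caching
        path: ~/.clearml/venvs-cache
    }
}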
I think they (DevOps) said something about next week, internal roll-out is this week (I think)
Based on your code snippet:
Logger.current_logger().report_confusion_matrix(title='confusion', series='confusion', value=confmat_tensor.cpu().numpy(), iteration=i)
or Task.current_task().get_logger()
which is the same as Logger.current_logger()
Just dropping this here but I've had some funky compressions with very small datasets!
Odd deflate behavior ...?!
HugeArcticwolf77 from the CLI you cannot control it (but we could probably add that), from code you can:
https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L646
pass compression=ZIP_STORED
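Something along these lines (dataset name/project are made up):
from zipfile import ZIP_STORED

from clearml import Dataset

# create a small dataset and upload it without deflate compression
ds = Dataset.create(dataset_name="tiny_dataset", dataset_project="examples")  # hypothetical names
ds.add_files("data/")
ds.upload(compression=ZIP_STORED)
ds.finalize()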
Hi @<1523701523954012160:profile|ShallowCormorant89>
This is generally based on number of agents, or am I missing something ? Also is it based on Task or decorated functions ?
With offline mode,
Later, if you need to, you can actually import the execution (including artifacts etc.); you just need the zip file it creates when you are done.
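Roughly like this (paths/names are placeholders):
from clearml import Task

# switch to offline mode before Task.init, everything is stored locally
Task.set_offline(offline_mode=True)
task = Task.init(project_name="examples", task_name="offline run")  # hypothetical names
# ... training code, reporting, artifacts ...
task.close()

# later, on a machine with server access, import the zip created by the offline session
Task.import_offline_session("/path/to/offline_session.zip")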
Hi @<1634001100262608896:profile|LazyAlligator31>
Is this because the code repo is being recreated in this directory?
Yes this is correct 🙂
Basically the entire code base + venv is installed there, to make sure it does not interfere with the "system" preinstalled environment
(it also allows for caching on the host machine 🙂)
Hi @<1569496075083976704:profile|SweetShells3>
Are you using the standard docker-compose ? Are you using the default elastic container ?
What exactly changed ?
Honestly, this is all related to issue #340.
makes total sense.
But actually this is different from #340. The feature is to store the data on the Task, which means each Task in your "pipeline" will upload a new copy of the data. No?
I'd suggest some task.detach() method for remote execution maybe
That is a good idea, in theory it can also be used in local execution
... but when we try to do a "New Run" from the UI, it tries to follow the DAG of the previous run (the run with all child nodes skipped) and the new run fails too.
This is odd, is this reproducible ? what's the clearml python package version ?
@<1523701523954012160:profile|ShallowCormorant89> can you verify it is reproducible in 1.9.3 ? because if it is I'd like to fix that 🙂
will it be possible for us to configure the "new run" button in a way so that it always clones from a particular pipeline ?
What do you mean by "particular pipeline" ? by default it will clone the last successful one, and by right clicking a specific one you can run a copy of that one. what am I missing ?
when you say use Task.current_task() for logging? which I'm guessing the fastai binding should do, right?
right, this is a fancy way to say: make sure the actual sub-process is initializing ClearML so all the automagic kicks in; since this is not "forked" but a whole new process, calling Task.current_task is the equivalent of calling Task.init with the same arguments (which you can also do, I'm not sure which one is more straightforward, wdyt?)
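i.e. something like this at the top of the spawned process entry point (just a sketch, the scalar report is only there as an example):
from clearml import Task

def worker_entry_point():
    # in a spawned (not forked) sub-process this re-attaches to the main Task,
    # effectively the same as calling Task.init with the same arguments
    task = Task.current_task()
    logger = task.get_logger()
    logger.report_scalar(title="loss", series="worker", value=0.1, iteration=0)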
The problem is not really for the agents to wait (this is easily solved by additional high priority queue) the problem is will you have a "free" agent... you see my point ?
Hi ShinyRabbit94
system_site_packages: true
This is set automatically when running in "docker mode", no need to worry 🙂
What is exactly the error you are getting ?
Could it be the container itself has the python packages installed in a venv not as "system packages" ?
BeefyCow3 if you are trying to optimize a specific metric (i.e. a scalar on a graph), the template Task should report it with the same title/series combination, which should be easy enough to verify in the UI 🙂
You can either report with Tensorboard or with the Trains Logger, either way will work.
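i.e. something like this in the template Task (title/series values are just examples, the optimizer's objective_metric_title/objective_metric_series should point at the same pair):
from clearml import Task

task = Task.init(project_name="examples", task_name="template task")  # hypothetical names
logger = task.get_logger()
# this title/series pair is what the optimizer will look for
logger.report_scalar(title="validation", series="accuracy", value=0.93, iteration=10)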
That sounds like an internal tritonserver error.
https://forums.developer.nvidia.com/t/provided-ptx-was-compiled-with-an-unsupported-toolchain-error-using-cub/168292
For example:
examples/k8s_glue_example.py --queue k8s_gpu --namespace <namespace> --pod-clearml-conf ~/trains.conf --template-yaml example/base.yml
OSError: [Errno 28] No space left on device
Hi PreciousParrot26
I think this says it all 🙂 there is no more storage left to run all those subprocesses
btw:
I am curious about why a ThreadPool of 16 threads is gathered,
This is the maximum number of jobs it will try to launch simultaneously (it will launch more after the launching is done; notice this limits the launching, not the actual execution), but it is just a way to limit it.
@<1699955693882183680:profile|UpsetSeaturtle37> good progress, regarding the error, 0.15.0 is supposed to be out tomorrow, it includes a fix to that one.
BTW: can you run with --debug
This is strange... Could you send the browser console log, maybe there is an exception there
I see, so in theory you could call add_step with a pipeline parameter (i.e. pipe.add_parameter etc.)
But currently the implementation is such that if you are starting the pipeline from the UI
(i.e. rerunning it with a different argument), the pipeline DAG is deserialized from the Pipeline Task (the idea being that one could control the entire DAG externally without changing the code)
I think a good idea would be to actually allow the pipeline class to have an argument saying always create from cod...
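Conceptually it looks something like this today (the names and the parameter are made up):
from clearml.automation import PipelineController

pipe = PipelineController(name="my_pipeline", project="examples", version="1.0.0")  # hypothetical names
pipe.add_parameter(name="dataset_id", default="abc123", description="dataset to process")
pipe.add_step(
    name="process_step",
    base_task_project="examples",
    base_task_name="process data",
    # the pipeline-level parameter is editable from the UI when cloning/rerunning the pipeline
    parameter_override={"Args/dataset_id": "${pipeline.dataset_id}"},
)
pipe.start()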
this
from fastai.callbacks.tensorboard import LearnerTensorboardWriter
doesn't exist anymore in fastai2
Hmm we should definitely update the example to fastai2 API
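If I remember correctly the fastai2 equivalent lives here (not tested on my side):
# fastai v1:
#   from fastai.callbacks.tensorboard import LearnerTensorboardWriter
# fastai v2 moved the TensorBoard integration:
from fastai.callback.tensorboard import TensorBoardCallback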
maybe the fastai bindings in clearml package are outdated
Are you getting any scalars reported to clearml?
they also appear to be relying on the tensorboard callback which seems not to work on distributed training
Yes that is correct, usually the way it works is that all nodes report back to "master...
I think you cannot change it for a running process, do you want me to check for you if this can be done ?