We could follow up on the .env configuration and allow clearml-task to add configuration files from the command line. This will be relatively easy to add. We could expand the Environment support (that somewhat exists), and add the ability to read variables from .env and add them to a "hyperparameter" section named Environment. wdyt?
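As a rough illustration of the idea (not an existing feature), reading a .env file and connecting it as an "Environment" hyperparameter section could look something like the sketch below; it assumes the python-dotenv package, and the section and project/task names are placeholders:

# sketch only: assumes the python-dotenv package; names are illustrative
from dotenv import dotenv_values
from clearml import Task

task = Task.init(project_name='examples', task_name='env demo')  # placeholder names

env_vars = dotenv_values('.env')            # read KEY=VALUE pairs from the .env file
task.connect(env_vars, name='Environment')  # shows up as an "Environment" hyperparameter section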
in the docker-compose file. Still strange...
hmm yes it is... If you have an idea on what went wrong let me know, we would love to fix it
ZanyPig66 is this reproducible? This sounds like a bug, what's the TB version and OS you are using?
Is this example working for you (i.e. do you see debug images)?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_tensorboard.py
it is a pickle issue
"package model doesn't exist"
Sounds like it, why do you think clearml has anything there?
BTW: import_bind.__patched_import3 is just there so that packages clearml auto-connects with are still patched even if they are imported after Task.init was called.
Hi AbruptWorm50
I was wondering if it is possible to specify the 'patience' of the pruning algorithm?
Any of the kwargs passed to **optimizer_kwargs will be directly passed to the optuna object:
https://github.com/allegroai/clearml/blob/2e050cf913e10d4281d0d2e270eea1c7717a19c3/clearml/automation/optimization.py#L1096
It should allow you to control the parameters, no?
Regarding the callback, what exactly do you have in mind to put there?
Is this callback enough?
https://github.com/allegro...
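For context, a minimal sketch of passing optuna-specific arguments through **optimizer_kwargs; the optuna_pruner keyword and the parameter names below are assumptions meant to illustrate the idea, so check them against your clearml version:

import optuna
from clearml.automation import HyperParameterOptimizer, UniformParameterRange
from clearml.automation.optuna import OptimizerOptuna

optimizer = HyperParameterOptimizer(
    base_task_id='<base_task_id>',  # placeholder
    hyper_parameters=[
        UniformParameterRange('General/lr', min_value=1e-4, max_value=1e-1),
    ],
    objective_metric_title='validation',
    objective_metric_series='accuracy',
    objective_metric_sign='max',
    optimizer_class=OptimizerOptuna,
    execution_queue='default',
    # everything below is forwarded as **optimizer_kwargs to the optuna optimizer,
    # e.g. a pre-configured pruner carrying the desired "patience" (assumed kwarg name)
    optuna_pruner=optuna.pruners.MedianPruner(n_warmup_steps=5),
)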
Hi @<1556812486840160256:profile|SuccessfulRaven86>
Every clearml-serving session (you can have multiple different "sessions") is assumed to be homogeneous, meaning it will serve the same models on as many nodes as possible, supporting multiple models per pod.
In your example I think the easiest is to create two serving sessions one with a node selector for the 24GB node and another for the 16GB node, wdyt?
Yeah I think that for some reason the merge of the pbtxt raw file is not working.
Any chance you have an end to end example we could debug? (maybe just add a pbtxt for one of the examples?)
Okay I'll dig into it
GiganticTurtle0 this is exactly what I did, and ended up with two pipelines; comparing them produced what I expected (different arguments as passed by the script).
What are you getting?
GrievingTurkey78 where do you see this message? Can you send the full server log?
but does that mean I have to unpack all the dictionary values as parameters of the pipeline function?
I was just suggesting a hack. The fix itself is transparent (I'm expecting it to be pushed tomorrow); basically it will make sure the sample pipeline works as expected.
Regardless, and out of curiosity: if you only have one dict passed to the pipeline function, why not use named arguments?
Right! I just noticed that! This is odd... and yes, it definitely has something to do with the multi pipeline executed on the agent. I think I know what to look for...
(just making sure (again), running_locally produced exactly what we were expecting, is that correct?)
okay, let me see if I can nail down the issue
Are you running inside a Kubernetes cluster?
Hi RipeGoose2
Are you continuing the Task, i.e. passing Task.init(..., continue_last_task=True)?
When you are running the n+1 epoch, do you get 2*n+1 reported?
RipeGoose2 like twice the gap, i.e. internally it adds an offset of the last iteration... is this easily reproducible?
I assume every fit starts reporting from step 0, so they override one another. Could it be?
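For reference, a minimal sketch of what "continuing" a Task looks like (continue_last_task is a real Task.init argument; the project and task names are placeholders):

from clearml import Task

# re-attach to the previous run instead of creating a new Task,
# so reported iterations continue from the last recorded offset
task = Task.init(
    project_name='examples',
    task_name='training',
    continue_last_task=True,
)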
doing some extra "services"
What do you mean by "services"? (From the system's perspective, any Task that is executed by an agent running in "services-mode" is a service; there is no actual limitation on what it can do.)
is there a way to assign a job to a specific worker? or is it only working on queue level
Only on a queue level, but you can have as many queues as you like and spin an agent on them (notice you can have multiple queues on the same agent, pulling based on priority/order).
- Artifacts and models will be uploaded to the output URI, debug images are uploaded to the default file server. It can be changed via the Logger.
- Hmm is this like a configuration file?
You can do:
local_text_file = task.connect_configuration('filenotingit.txt')
Then open 'local_text_file'; it will create a local copy of the data at runtime, and the content will be stored on the Task itself (see the short sketch below). - This is how the agent installs the python packages, but if the docker already contains th...
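To make the two points above concrete, here is a small sketch; connect_configuration and the output_uri argument are the pieces referenced above, while the file name, project/task names, and the bucket URI are placeholders:

from clearml import Task

# output_uri controls where artifacts/models are uploaded
task = Task.init(
    project_name='examples',
    task_name='config demo',
    output_uri='s3://my-bucket/models',  # placeholder destination
)

# stores the file content on the Task and returns a local copy at runtime
local_text_file = task.connect_configuration('filenotingit.txt')
with open(local_text_file) as f:
    config_text = f.read()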
It runs into the above error when I clone the task or reset it.
from here:
AssertionError: ERROR: --resume checkpoint does not exist
I assume the "internal" code state changed, and now it is looking for a file that does not exist. How would your code state change, in other words why would it be looking for the file only when cloning? Could it be you put the state on the Task, then you clone it (i.e. clone the exact same dict), and now the newly cloned Task "thinks" it is resuming?!
however when I clone or reset said task after completion and then enqueue it again, I get the above error.
This part is somewhat confusing... There is no magic happening behind the scenes; cloning a Task and creating it are basically the same... Do you have a reference to the YOLOv5 code base itself? Maybe I can figure out what the issue is.
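If it helps the comparison, cloning and enqueueing programmatically is equivalent to doing it from the UI; a minimal sketch (the task id, clone name, and queue name are placeholders):

from clearml import Task

original = Task.get_task(task_id='<original_task_id>')               # placeholder id
cloned = Task.clone(source_task=original, name='clone of training')  # placeholder name
Task.enqueue(cloned, queue_name='default')                           # placeholder queue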
I have to admit, I'm not sure...
Let me talk to backend guys, in theory you are correct the "initial secret" can be injected via the helm env var, but I'm not sure how that would work in this specific case
Hi ShaggyHare67,
Yes, the trains.conf created by trains-agent is basically an extension of the trains usage (specifically it adds a section for the agent).
I'm assuming you are running the agent on the same development machine.
I guess the easiest is to rename the trains.conf to trains.conf.old and run trains-agent init
(No need to worry, the trains package supports it, so the new configuration file that will be generated will work just fine)
LazyTurkey38 configuration pushed to github :)
Anyway, in the docs, there is a function called task.register_artifact()
Yes, this is rather deprecated... The idea is that it will monitor an object and auto-sync it (i.e. serialize and upload).
That said, it is just so much easier to do task.upload_artifact
and you can always update/overwrite if you are passing the same name, so I cannot see the actual use case. Does that make sense? What are you using it for?
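For reference, a minimal upload_artifact sketch (artifact names, the file path, and the project/task names are placeholders); re-uploading with the same name simply overwrites the previous artifact:

from clearml import Task

task = Task.init(project_name='examples', task_name='artifact demo')  # placeholder names

# upload once the object is ready; the same name overwrites the previous upload
task.upload_artifact(name='stats', artifact_object={'accuracy': 0.91})
task.upload_artifact(name='predictions', artifact_object='predictions.csv')  # local file path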
I think EmbarrassedSpider34 is correct.
When you pass the requirements to clearml-task, it is actually the agent, depending on how it was configured (conda / pip), that will do the installation.
That said, maybe it is worth adding support to provide the env.yml in the CLI ?
(Notice that adding specific channels needs to be configured on the agent, they are not stored per Task)
AlertCamel57 wdyt?