Thanks JumpyPig73
Yeah this would explain it ... (if hydra is setting something else we can tap into that as well)
Hmm that is odd, but at least we have a workaround
What's the matplotlib backend ?
Yes, hopefully they have a different exception type so we could differentiate ... :) I'll check
could it be the polling on the Task (can't remember what the interval is)? It will update its state once every X minutes/seconds
RoundMosquito25 actually you can:
```
# check the state every minute
while an_optimizer.wait(timeout=1.0):
    running_tasks = an_optimizer.get_active_experiments()
    for task in running_tasks:
        task.get_last_scalar_metrics()
        # do something here
```
baseline reference
https://github.com/allegroai/clearml/blob/f5700728837188d7d6005726c581c9d74fd91164/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py#L127
these are being repeated as well for a single task (this is training a t5_model with transformers):
Seems like someone is storing lots of files with torch.save
that ClearML automatically logs.
You can disable the autolog:
```
task = Task.init(..., auto_connect_frameworks={'pytorch': False})
```
It is the folder that ClearML creates and the folder we create ourselves to store the predictions
I see... If that is the case, the only solution I can think of is manually uploading the files with StorageManager(...), then getting the url and registering it as debug_media or an artifact:
```
logger.report_media("image", "type a", iteration=iteration, url="...")
task.upload_artifact('a link', artifact_object='...')
```
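Something along these lines (just a sketch, the local path and bucket destination are placeholders):
```
from clearml import Task, StorageManager

task = Task.current_task()
logger = task.get_logger()

# upload the local file to your own storage; the returned value is the remote url
remote_url = StorageManager.upload_file(
    local_file="/tmp/predictions/image_0.png",            # placeholder local path
    remote_url="gs://my-bucket/predictions/image_0.png",  # placeholder destination
)

# register the uploaded file on the task
logger.report_media("image", "type a", iteration=0, url=remote_url)
task.upload_artifact("a link", artifact_object=remote_url)
```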
New RC hopefully solves it. HarebrainedOstrich43 could you check if it works for you now?
pip install clearml==1.14.0rc0
Wtf? Can you try with = (notice single =, not double ==)?
```
channels:
  - defaults
  - conda-forge
  - pytorch
dependencies:
  - cudatoolkit=11.1.1
  - pytorch=1.8.0
```
Let me see if I can reproduce something
I see what you mean.
```
an_optimizer = HyperParameterOptimizer(
    base_task_id='39d2c27baa8145929b2e21f686a17046',
    hyper_parameters=[],
    objective_metric_title='epoch_accuracy',
    objective_metric_series='epoch_accuracy',
    objective_metric_sign='max',
    optimizer_class=aSearchStrategy,
    max_iteration_per_job=0,
    total_max_jobs=0,
    auto_connect_task=False,
)
print(an_optimizer.get_top_experiments(top_k=5))
```
ShaggyElk85 nice!
I think that in theory you can run the DBs' arm64 images, no?
It will store everything locally, later you can import it back to the server, if you want.
What's the clearml-server version ?
Hi RoughTiger69
How about using the pipeline decorator as a way to run this logic?
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
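Roughly along these lines (a minimal sketch based on that example; the step names and logic are placeholders):
```
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["processed"], cache=True)
def prepare(source_url: str):
    # each component runs as its own task (placeholder logic)
    processed = source_url.upper()
    return processed

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.1")
def run_logic(source_url: str = "http://example.com/data"):
    processed = prepare(source_url)
    print(processed)

if __name__ == "__main__":
    # run everything locally for debugging; remove to launch on agents
    PipelineDecorator.run_locally()
    run_logic()
```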
I think I'm missing the context of where the code is executed....
btw: you can now set the configuration_objects directly when calling add_step
https://clearml.slack.com/archives/CTK20V944/p1633355990256600?thread_ts=1633344527.224300&cid=CTK20V944
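Something like this (a sketch; I'm assuming a recent clearml where add_step accepts configuration_overrides, and the names/values here are placeholders):
```
from clearml import PipelineController

pipe = PipelineController(name="pipeline with config", project="examples", version="1.0")
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="training base task",
    # override a configuration object on the cloned step's task
    configuration_overrides={"extra_config": "learning_rate: 0.001\nbatch_size: 64"},
)
pipe.start()
```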
it is just a local copy so you can rerun and reconfigure
Hi SkinnyPanda43
Yes, I think you are right, the documentation might be missing it. I'll make sure they know about it
In the meantime: task.update_output_model
https://github.com/allegroai/clearml/blob/d3929033c016476c580557639ff44f900e65904a/clearml/backend_interface/task/task.py#L734
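For example (a sketch; the path and name are placeholders, and the exact keyword names may differ between clearml versions):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="register output model")
# ... training code producing a local weights file ...
# register / update the task's output model with the stored weights
task.update_output_model(model_path="/tmp/model_weights.pt", name="my model")
```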
I cannot modify an autoscaler currently running
Yes this is a known limitation, and I know they are working on fixing it for the next version
We basically have flask commands allowing to trigger specific behaviors. ...
Oh I see now, I suspect the issue is that the flask command is not executed from within the git project?!
clearml-agent deployment file
What do you mean by that? is that the helm of the agent ?
Hi FranticWhale40
Are you positive the Triton container finished syncing ?
Could you provide the docker log (both the serving and the triton)?
What is the clearml-serving version you are using ?
Could you add a print in the "preprocess" function, just to validate you are getting to the correct model version ?
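Something as simple as this inside the endpoint's preprocess.py should do (a sketch; the exact Preprocess signature may differ with your clearml-serving version):
```
# preprocess.py of the serving endpoint
class Preprocess(object):
    def preprocess(self, body, state, collect_custom_statistics_fn=None):
        # temporary debug print to verify which request / model version is actually hit
        print("preprocess called, request body:", body)
        return body
```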
And it works correctly when running on my computer, and if I use colab, then for some reason it has no effect.
I think I'm lost on this one, when running in colab, is this continuing a previous experiment ?
Think multiple hyper-parameter sections that we need to reference
(under the Tasks Configuration Tab, the Hyper parameters can have multiple sections)
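For example, each connect call with a different name becomes its own section (a sketch; the dict contents are placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="multi section hparams")
# each connect() call with a different name creates its own hyper-parameter section
task.connect({"lr": 0.001, "batch_size": 32}, name="Training")
task.connect({"beam_size": 4, "max_len": 128}, name="Inference")
```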
DefeatedOstrich93 can you verify lightning actually only stored once ?
The issue is the 400 returned from the server, let me check with the backend guys
Hey IntriguedRat44 ,
Is this what you are after?
https://github.com/allegroai/trains/issues/181
are you referring to extra_docker_shell_script?
Correct
the thing is that this runs before you create the virtual environment, so then in the new environment those settings are no longer there
Actually that is better, because this is what we need to set up pip before it is used. So instead of passing --trusted-host
just do:
```
extra_docker_shell_script: ["echo \"[global] \n trusted-host = pypi.python.org pypi.org files.pythonhosted.org YOUR_S...
```
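For reference, the full agent section in clearml.conf could look something like this (a sketch; my.internal.mirror is a placeholder for your own host, and I split the echo into two lines to avoid escaping issues):
```
agent {
    # runs inside the docker before the venv is created, so pip picks up the config
    extra_docker_shell_script: [
        "mkdir -p ~/.pip",
        "echo '[global]' > ~/.pip/pip.conf",
        "echo 'trusted-host = pypi.python.org pypi.org files.pythonhosted.org my.internal.mirror' >> ~/.pip/pip.conf"
    ]
}
```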
setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
This seems like a question for GS storage; maybe we should open an issue there, since their backend does the rate limiting
My main concern now is that this may happen within a pipeline leading to unreliable data handling.
I'm assuming the pipeline code will have max_workers, but maybe we could have a configuration value so that we can set it across all workers, wdyt?
If
...