WickedGoat98 the mechanism of cloning and parameter overriding only works when the trains-agent is launching the experiment. Think of it this way:
Manual execution: trains sends data to the server
Automatic (trains-agent) execution: trains pulls data from the server
This applies to both argparse and connect / connect_configuration.
The trains code itself behaves differently when it is executed from the 'trains-agent' context.
Does that help clear things up?
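A minimal sketch of that flow (shown with the current clearml package name; the same applies to trains, and the project/task names here are illustrative):
from clearml import Task

task = Task.init(project_name="examples", task_name="connect demo")

params = {"lr": 0.001, "batch_size": 32}
# Manual run: these local values are sent to the server.
# When the task is cloned and launched by the agent, connect() instead
# pulls the (possibly overridden) values back from the server.
params = task.connect(params)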
Correct
btw: my_dict_with_conf_for_data can be any object, not just a dict. It will list all the properties of the object (as long as they do not start with _).
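For example (a sketch; the class and attribute names are made up):
from clearml import Task

class DataConfig:
    # public attributes are picked up as parameters
    path = "/data/train"
    augment = True
    _cache = None  # leading underscore, so it is ignored

task = Task.init(project_name="examples", task_name="connect object")
task.connect(DataConfig())  # "path" and "augment" show up in the UI, "_cache" does not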
Setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
This seems like a question for GS storage; maybe we should open an issue there, since their backend does the rate limiting.
My main concern now is that this may happen within a pipeline leading to unreliable data handling.
I'm assuming the pipeline code will have max_workers, but maybe we could have a configuration value so that we can set it across all workers, wdyt?
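If the upload in question goes through a Dataset (an assumption on my part, and it needs a clearml version where Dataset.upload accepts max_workers), the worker count can be passed directly:
from clearml import Dataset

ds = Dataset.create(dataset_project="examples", dataset_name="my-data")
ds.add_files("data/")
# max_workers=1 forces sequential uploads, avoiding the GCS rate-limit error
ds.upload(max_workers=1)
ds.finalize()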
If
...
Correct
You can spin it up in two modes, either venv or docker. Notice that even in docker mode it will still clone the code into the docker and install the packages inside the docker, but it also inherits the docker's preinstalled system packages, so the installation process is a lot faster, and you still have the ability to change packages without having to build an entire new docker image.
Hmm, what's the clearml-agent version ?
No, I was pointing out the lack of one
Sounds like a great idea, could you open a github issue (if not already opened) ? just so we do not forget
Set the PyTorch Lightning trainer argument log_every_n_steps to 1 (the default is 50) to prevent the ClearML iteration logger from timing out
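A minimal sketch of that trainer setup:
from pytorch_lightning import Trainer

# log_every_n_steps defaults to 50; lowering it to 1 makes Lightning emit a
# logging call on every training step, so the ClearML iteration reporter never idles
trainer = Trainer(log_every_n_steps=1)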
Hmm, that should not have an effect on the training time; all logs are sent in the background. That said, checkpoints might slow it a bit (i.e.; i...
Hi StaleKangaroo85, which trains version are you using? Also, which trains-server are you using?
you can also set the agent.package_manager.extra_index_url, but since this is dynamic,...
You are correct, since this is dynamic there is no need to set the "extra_index_url" configuration in clearml.conf; the additional bash script will configure pip directly. Make sense?
JitteryCoyote63
somehow the previous iterations, not sure yet if it's coming from my code, ignite or clearml
ClearML will automatically continue reporting from the previous iteration (i.e. if the last iteration before continuing the Task was 100, then the next report with iteration=0 will actually be 101)
task.set_initial_iteration(engine.state.iteration)
Basically it is called automatically by ClearML (obviously only when you continue an aborted Task)
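As a sketch (assuming the Task is resumed with continue_last_task; the project/task names are illustrative):
from clearml import Task

# continue reporting into the previously aborted task
task = Task.init(
    project_name="examples",
    task_name="resume demo",
    continue_last_task=True,
)

# ClearML offsets new reports by the last reported iteration automatically;
# override the offset explicitly if needed:
task.set_initial_iteration(0)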
ModelCheckpoint('best_model', save_best_only=True)
That worked for me now, what's the diff
It is deployed on an on-premise, secured network that has no access to the outside world.
Is it password protected or something of that nature?
Perhaps we could find a different solution or work around, rather than solving a technical issue.
Solving it means allowing the python code to ask the JupyterLab server for the notebook file
However, once working with ClearML and using a venv (and not the default python kernel),
Are you saying on your specific setup (i.e. OpenShif...
Simply record the type of each argument when you store it, and keep it in the database, unbeknownst to the user. What do you say?
This is now supported, but then you still need to flatten the dict.
Maybe we can just support "empty_dict/new_value = 42" if the original was "empty_dict = {}"
WDYT?
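For context, a sketch of the current behaviour as I understand it (assuming "/" is the flattening separator):
from clearml import Task

task = Task.init(project_name="examples", task_name="nested params")

# nested dicts are flattened into "parent/child" keys when connected,
# e.g. "data/batch_size"; an empty dict like "empty_dict" has no children to override
params = {"data": {"batch_size": 32}, "empty_dict": {}}
task.connect(params)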
Hi TenderCoyote78
I'm trying to clearml-agent in my dockerfile,
I'm not sure I'm following. Are you trying to create a docker container with the agent inside? For what purpose?
(notice that the agent can spin up any off-the-shelf container; there is no need to add the agent into the container, it will take care of itself when running it)
Specifically to your docker file:
RUN curl -sSL | sh
No need for this line
COPY clearml.conf ~/clearml.conf
Try the ab...
Hi FranticCormorant35, the Reporter is an internal implementation that the Logger uses. In general you should use the Logger.
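For example, reporting through the public Logger interface (the names are illustrative):
from clearml import Task, Logger

task = Task.init(project_name="examples", task_name="logger demo")

# the Logger delegates to the internal Reporter under the hood
Logger.current_logger().report_scalar(
    title="loss", series="train", value=0.42, iteration=1
)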
Notice that the actual configuration that is used is the https://github.com/allegroai/clearml/blob/b21e93272682af99fffc861224f38d65b42c2354/clearml/backend_config/bucket_config.py#L23
But it is created here:
https://github.com/allegroai/clearml/blob/b21e93272682af99fffc861224f38d65b42c2354/clearml/backend_config/bucket_config.py#L199
I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good
Nice!
My next question, how do I add more queues?
You can create new queues in the UI and spin up a new glue for the queue (basically, think of a queue as an abstraction for a specific type of resource)
Make sense?
GreasyPenguin14 I think the default is reporting on failed tasks only? could that be?
HighOtter69
By default, if you are continuing an experiment it will start from the last iteration of the previous run. You can reset it with: task.set_initial_iteration(0)
No, I just want to register a new model in the storage.
If the model file is already uploaded, you can register it without a Task: InputModel.import_model(...)
https://github.com/allegroai/clearml/blob/b3a2b3425c5098ebfc0598c9dfb3e670d4a87706/clearml/model.py#L521
I need to create a separate task for this right?
If you want the model to be uploaded, then yes you have to create a Task.
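A rough sketch of both paths (the URLs and file names are placeholders):
from clearml import Task, InputModel, OutputModel

# weights already uploaded somewhere: register without creating a Task
model = InputModel.import_model(
    weights_url="s3://my-bucket/models/model.pt",
    name="my-registered-model",
)

# weights still need uploading: an OutputModel attached to a Task handles it
task = Task.init(project_name="examples", task_name="model upload")
OutputModel(task=task, name="my-model").update_weights("model.pt")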
Thanks RipeGoose2 !
clearml logging starts from n+n (that's how it seems) for non-explicit
I have to say it looks like the expected behavior, I think.
Basically matching the TB, no?
You mean to add the extra index url?
you could use:
https://github.com/allegroai/clearml-agent/blob/5f0d51d485629e9dfc2d826622524461e3fcae8a/docs/clearml.conf#L63
and of course: task.set_parameters_as_dict(params)
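For example (a sketch; the section and parameter names are made up):
from clearml import Task

task = Task.init(project_name="examples", task_name="set params demo")

# nested keys are grouped by section, e.g. "General/learning_rate"
task.set_parameters_as_dict({"General": {"learning_rate": 0.01, "epochs": 10}})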
I tested and I have no more warning messages
if self._active_gpus and i not in self._active_gpus: continue
This solved it?
If so, PR pretty please
It looks like the tag being used is hardcoded to 1.24-18. Was this issue identified and fixed in later versions?
BoredHedgehog47 what do you mean by "hardcoded 1.24-18"? A tag to what? I think I lost context here
Hi BattyLion34
The Windows issue seems like it is coming from QT missing on the host machine
Check the pyqt5 version in your "Installed packages"
see here:
https://superuser.com/questions/1433913/qtpy-pythonqterror-no-qt-bindings-could-be-found
Regarding the Linux issue, it seems you are missing the object_detection package; where do you usually install it from?
but we run everything in docker containers. Will it still help?
As long as you are running with clearml-agent (in docker mode), all the cache folders (this one included) are mounted on the host machine for persistence
Well, it should work out of the box as long as you have the full route, i.e. Section/param
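A sketch of addressing a parameter by its full route (the section and parameter names are made up):
from clearml import Task

task = Task.init(project_name="examples", task_name="param route demo")
task.connect({"param": 1}, name="Section")  # stored as "Section/param"

# e.g. when cloning, override the value using the full route
cloned = Task.clone(source_task=task)
cloned.set_parameter("Section/param", 2)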