BTW: we are now adding "datasets chunks for a more efficient large dataset storage"
Hi ColossalReindeer77
Hello! Does anyone know how to do HPO when your parameters are in a Hydra
Basically hydra parameters are overridden with "Hydra/param"
(this is equivalent to the "override" option of hydra in CLI)
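To make the naming concrete, here is a minimal plain-Python sketch of how Hydra override keys would map to the "Hydra/param" names mentioned above. The helper `to_clearml_hydra_names` and the example keys (`optimizer.lr`, `model.depth`) are hypothetical, made up for illustration; they are not part of ClearML or Hydra. The resulting names are presumably what you would hand to the optimizer's parameter definitions.

```python
def to_clearml_hydra_names(overrides: dict) -> dict:
    """Prefix each Hydra override key with 'Hydra/' (hypothetical helper)."""
    return {f"Hydra/{key}": value for key, value in overrides.items()}


# Hypothetical overrides, written the way they would appear on the Hydra CLI
# (e.g. `optimizer.lr=0.01 model.depth=4`).
hydra_overrides = {"optimizer.lr": 0.01, "model.depth": 4}

hpo_names = to_clearml_hydra_names(hydra_overrides)
print(hpo_names)
```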
Could you disable the windows anti-virus firewall and test?
You do not need the cudatoolkit package, this is automatically installed if the agent is using conda as package manager. See your clearml.conf for the exact configuration you are running
https://github.com/allegroai/clearml-agent/blob/a56343ffc717c7ca45774b94f38bd83fe3ce1d1e/docs/clearml.conf#L79
So the original looks good, could it be you tried to clone a Task that was executed with an agent with pip, and then pushed into an agent running conda?
You should manually remove the cudatoolkit from the installed packages section in the UI, then try to send it to the agent and see if it works. The question is how it ended up there in the first place.
maybe I should use explicit reporting instead of Tensorboard
It will do just the same
there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass th...
we have some other parts, and in some cases the initialization time can be about 10 times the experiment time
Before I dive into some agent in agent hacking, I would consider "caching" this preprocessing on an auxiliary Task as an artifact. Basically add another argument for the auxiliary Task, and fetch the data from it (obviously you will need to run it once before the optimizer launches the first experiment).
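A rough sketch of that caching idea, assuming the preprocessed data can be stored as a regular artifact. The function names, the artifact name "preprocessed", and the `get_task` injection argument are all hypothetical; only `Task.get_task`, `task.upload_artifact` and `task.artifacts[...].get()` are actual ClearML calls. The import is lazy so the helpers can be exercised without a ClearML server.

```python
def store_preprocessing(task, data, artifact_name="preprocessed"):
    """Run once on the auxiliary Task: upload the preprocessed data as an artifact."""
    return task.upload_artifact(name=artifact_name, artifact_object=data)


def fetch_cached_preprocessing(aux_task_id, artifact_name="preprocessed", get_task=None):
    """Called by each experiment: load the cached data from the auxiliary Task.

    `get_task` is a hypothetical hook so the lookup can be replaced in tests;
    by default it falls back to clearml's Task.get_task.
    """
    if get_task is None:
        from clearml import Task  # imported lazily, only when really needed
        get_task = Task.get_task
    aux_task = get_task(task_id=aux_task_id)
    return aux_task.artifacts[artifact_name].get()


# Inside the training script, the auxiliary Task id would arrive as one more
# argument (so the optimizer can leave it fixed for every experiment), e.g.:
#   data = fetch_cached_preprocessing(args.aux_task_id)
```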
Now that is out of the way (which really would be the preferred engin...
Hi UnsightlyHorse88
Hmm, try adding to your clearml.conf file:
agent.cpu_only = true
If that does not work, try adding to the OS environment:
export CLEARML_CPU_ONLY=1
How did you define the decorator of "train_image_classifier_component" ?
Did you define:
@PipelineDecorator.component(return_values=['run_model_path', 'run_tb_path'], ...
Notice two return values
I located the issue, I'm assuming the fix will be in the next RC
(probably tomorrow or before the weekend)
Hi PanickyMoth78
So the current implementation of the pipeline parallelization is exactly like python async function calls:
for dataset_conf in dataset_configs:
    dataset = make_dataset_component(dataset_conf)
    for training_conf in training_configs:
        model_path = train_image_classifier_component(training_conf)
        eval_result_path = eval_model_component(model_path)
Specifically here, since you are passing the output of one function to another, imagine what happens is a wait operation, hence it ...
Come to think about it, maybe we should have "parallel_for" as a utility for the pipeline since this is so useful
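Just to illustrate what such a utility could look like, here is a plain-Python analogy using threads. This is not ClearML's pipeline API: `parallel_for` and the three stand-in functions are hypothetical. The point is only that each train-then-eval chain stays sequential (eval needs the model path), while separate chains can run side by side.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-ins for the pipeline components (hypothetical, they just build strings).
def train(training_conf):
    return f"model({training_conf})"


def evaluate(model_path):
    return f"eval({model_path})"


def train_and_eval(training_conf):
    # evaluate() consumes train()'s output, so inside one chain we must wait.
    return evaluate(train(training_conf))


def parallel_for(func, items, max_workers=4):
    """Run func over items concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


results = parallel_for(train_and_eval, ["conf_a", "conf_b"])
print(results)
```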
I found something btw, let me check...
PompousBeetle71 you can also use ModelOutput.update_weights_package to store multiple files at once (they will all be packaged into a single zip, and unpacked when you get them back via ModelInput). Would that help?
These paths are pathlib.Path. Would that be a problem?
No need to worry, it should work (I'm assuming "/src/clearml_evaluation/" actually exists on the remote machine, otherwise useless)
1st: is it possible to make a pipeline component call another pipeline component (as a substep)
Should work as long as they are in the same file, you can however launch and wait for any Task (see pipelines from tasks)
2nd: I am trying to call a function defined in the same script, but unable to import it. I am passing the repo parameter to the component decorator, but no change, it always comes back with "No module named <module>" after my from module import function c...
I assume now it downloads "more" data as this is running in parallel (and yes I assume that before it deleted the files it did not need)
But actually, at least at first glance, I do not think it should download it at all...
Could it be that the "run_model_path" is a "complex" object of a sort, and it needs to test the values inside ?
Yes the "epoch_loss" is the training epoch loss (as expected I assume).
thought that was just the loss reported at the end of the train epoch via tf
It is, isn't that what you are seeing ?
Thanks PanickyMoth78 for pinging, let me check if I can find something in the commit log, I think there was a fix there...
Hi LivelyLion31 I missed your S3 question, apologies. What did you guys end up doing?
BTW you could always upload the entire TB log folder as an artifact, it's simple: task.upload_artifact('tensorboard', './tblogsfolder')
Ohh that cannot be pickled... how would you suggest to store it into a file?
Hi SoggyFrog26
Yes, it is stored at ~/.clearml_data.json
Notice you can always change it by passing --id dataset_id
Basically try with the latest RC
pip install trains==0.15.2rc0
RipeGoose2 models are automatically registered
i.e. added to the models artifactory, but it only points to where the files are stored
Only if you are passing the output_uri argument to the Task.init, they will be actually uploaded.
If you want to disable this behavior you can pass Task.init(..., auto_connect_frameworks={'pytorch': False})
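Putting the two options side by side, a small sketch that only assembles the keyword arguments for Task.init. The helper `task_init_kwargs`, the project/task names and the bucket URL are hypothetical; `project_name`, `task_name`, `output_uri` and `auto_connect_frameworks` are the actual Task.init arguments.

```python
from typing import Optional


def task_init_kwargs(project: str, name: str,
                     upload_to: Optional[str] = None,
                     track_pytorch: bool = True) -> dict:
    """Build the arguments for Task.init (hypothetical helper)."""
    kwargs = {"project_name": project, "task_name": name}
    if upload_to is not None:
        # With output_uri set, model files are actually uploaded there.
        kwargs["output_uri"] = upload_to
    if not track_pytorch:
        # Turn off automatic registration of PyTorch models.
        kwargs["auto_connect_frameworks"] = {"pytorch": False}
    return kwargs


upload_kwargs = task_init_kwargs("examples", "train", upload_to="s3://my-bucket/models")
no_tracking_kwargs = task_init_kwargs("examples", "train", track_pytorch=False)
print(upload_kwargs)
print(no_tracking_kwargs)

# Then, in the real script:
#   from clearml import Task
#   task = Task.init(**upload_kwargs)
```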