Yes.
Some mechanism that would allow for follow-up code execution. Ideally in a way that would not be susceptible to the same things that may cause a task to fail.
BTW:
If I try to find the right model in task.models["output"] (this time there is just one, but in my code there may be several), it appears with the task name as its name (see the other attached screenshot).
What would make sense here? (I have to be honest, I'm not sure.)
If the model was saved with a file name (is that the trigger for auto-upload?), I think it makes sense for the model name to match the file name (not the task name), especially when there may be ...
There may be cases where failure occurs before my code starts to run (and, perhaps, after it completes)
here is the code in text if you feel like giving it a try:
```
import tensorboard_logger as tb_logger
from clearml import Task

task = Task.init(project_name="great project", task_name="test_tb_logging")
task_tb_logger = tb_logger.Logger(logdir='./tb/run1', flush_secs=2)
for i in range(10):
    task_tb_logger.log_value("some_metric", 42, i)
task.close()
```
These paths are pathlib.Path. Would that be a problem?
I have google-cloud-storage==2.6.0 installed
yes. several checkpoints + the one that did best on validation data.
I'm on clearml 1.6.2
The Jupyter notebook service and two clearml-agents (version 1.3.0, one in queue "default" and one in queue "services" with the --cpu-only flag) are all running inside a docker container
That's amazing speed 🚀
Simpler than I had thought, thanks!
actually, re-running pipeline_from_decorator.py a second time (and a third time) from the command line seems to have executed without that ValueError, so maybe that issue was some fluke.
Nevertheless, those runs exit prior to the line print('process completed')
and I would definitely prefer the command executing_pipeline to not kill the process that called it.
For example, maybe, having started the pipeline I'd like my code to also report having started the pipeline to som...
Right. Thanks.
With several models saved by the training process (whose code is not task-aware) I suspect that doing the update call after training completed will only update the last of the uploaded models.
I'm currently looking at a workaround (rough sketch below) where I:
- disable auto-saving via https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk/#automatic-logging
- manually upload the models
- manually register the models with https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345b...
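Roughly something like this sketch (project/task names and checkpoint paths are placeholders, and I'm assuming auto_connect_frameworks=False plus OutputModel.update_weights are the right knobs for steps 1 and 3):
```
# rough sketch of the workaround, not a final implementation
from pathlib import Path
from clearml import Task, OutputModel

# 1. disable framework auto-logging so checkpoints are not auto-uploaded
task = Task.init(
    project_name="great project",
    task_name="manual_model_registration",
    auto_connect_frameworks=False,  # or a dict disabling only the relevant framework
)

# ... training code (not task-aware) writes checkpoints to disk ...

# 2 + 3. manually upload and register each checkpoint as an output model
for ckpt in ["checkpoints/epoch_1.pt", "checkpoints/best_on_val.pt"]:  # placeholder paths
    output_model = OutputModel(task=task, name=Path(ckpt).name)
    output_model.update_weights(weights_filename=ckpt)
```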
I tried playing with those parameters on my laptop to no great effect.
Here is code you can use to reproduce the issue:
```
import os
from pathlib import Path

from tqdm import tqdm
from clearml import Dataset, Task


def dataset_upload_test(project_id: str, bucket_name: str):
    def _random_file(fpath, sizekb):
        fileSizeInBytes = 1024 * sizekb
        with open(fpath, "wb") as fout:
            fout.write(os.urandom(fileSizeInBytes))

    def random_dataset(dataset_path, num_files, file...
```
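The snippet got cut off above; the rest of it proceeds roughly like this (a sketch of the same idea rather than the exact original code, with file counts, sizes and names as placeholders):
```
# not the original continuation (that got truncated) - just a sketch of the same idea:
# generate random files, create a Dataset, and upload it to the GCS bucket
import os
from pathlib import Path

from tqdm import tqdm
from clearml import Dataset


def reproduce(bucket_name: str, dataset_path: str = "./random_dataset"):
    dataset_path = Path(dataset_path)
    dataset_path.mkdir(parents=True, exist_ok=True)

    # a pile of random files to upload (count and size are placeholders)
    for i in tqdm(range(100)):
        (dataset_path / f"file_{i:05d}.bin").write_bytes(os.urandom(1024 * 512))

    # create the dataset, add the files and push everything to the bucket
    dataset = Dataset.create(dataset_name="upload_test", dataset_project="great project")
    dataset.add_files(path=dataset_path)
    dataset.upload(output_url=f"gs://{bucket_name}")
    dataset.finalize()
```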
I'll try and reproduce this in simpler code
To be specific, there is "model name", which is not unique, and there is the model key, which is unique to the Task.
not sure why the two fields don't simply match. I guess there may be situations where the file name (without the full path) may be used several times.
sort of. Though it seems like the rules for model.name can be a bit non-obvious.
I think that the first model saved gets the task name as its name and the following models take f"{task_name} - {file_name}"
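For reference, this is roughly how I've been checking it (a sketch; the task id is a placeholder, and I'm assuming task.models["output"] can be iterated alongside its keys()):
```
# sketch: compare the dict key with the model's own name for each output model
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder id
output_models = task.models["output"]
for key, model in zip(output_models.keys(), output_models):
    print(f"key={key!r}  name={model.name!r}  url={model.url}")
```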
would setting the max_workers to 1 be a (slower) workaround?
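i.e., in the reproduction sketch above, the upload call would become something like this (assuming max_workers is actually accepted by Dataset.upload in the version we have installed):
```
# hypothetical: serialize the upload with a single worker to see if the failures go away
dataset.upload(output_url=f"gs://{bucket_name}", max_workers=1)
```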
We're using a self-hosted clearml server version 1.14.0
Oh, cool. So would this then report the activities of the spawned processes to the same task as that of the spawning process?
Would you expect this fastai callback to work?
(Uses SummaryWriter):
https://github.com/fastai/fastai/blob/d7f4863f1ee3c0fa9f2d9feeb6a05f0625a53696/fastai/callback/tensorboard.py
It seems to have failed as well (but I'd need to check more carefully)
thanks. Switching to SummaryWriter shouldn't be hard for us.
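For reference, the equivalent of the earlier tensorboard_logger test with SummaryWriter would be roughly this (a sketch, assuming torch's SummaryWriter; project/task names and paths as before):
```
# sketch: the earlier tb-logging test, rewritten with torch's SummaryWriter
from torch.utils.tensorboard import SummaryWriter
from clearml import Task

task = Task.init(project_name="great project", task_name="test_tb_logging_summarywriter")
writer = SummaryWriter(log_dir="./tb/run1", flush_secs=2)
for i in range(10):
    writer.add_scalar("some_metric", 42, i)
writer.close()
task.close()
```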
any news on this? I also got a similar issue
For me the problem sort of went away. My code evolved a bit after posting this so that dataset creation and training tasks run in separate python sessions. I did not investigate further.
there may have been some interaction between the training task and a preceding dataset creation task :shrug:
another weird thing:
Before my training task is done,
```
print(task.models['output'].keys())
```
outputs
```
odict_keys(['Output Model #0', 'Output Model #1', 'Output Model #2'])
```
after task.close() I can do:
```
task = Task.get_task(task_id)
for i in range(100):
    print(task.models["output"].keys())
```
which prints
```
odict_keys(['Output Model #0', 'Output Model #1', 'Output Model #2'])
```
in the first iteration and prints the file names in the latter iterations:
` od...
I don't mind assigning to the task the same name that I'd assign to the dataset. I just think that the create function should expect dataset_name to be None in the case of use_current_task=True (or allow the dataset name to differ from the task name).
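i.e., what I'd like to be able to write is roughly this (a sketch; project/task names are placeholders, and whether dataset_name may really be left as None here is exactly the open question):
```
# sketch: let the dataset reuse the calling task (and, ideally, its name)
from clearml import Dataset, Task

task = Task.init(project_name="great project", task_name="create_my_dataset")
dataset = Dataset.create(
    dataset_project="great project",
    dataset_name=None,       # i.e. "just use the task's name"
    use_current_task=True,   # attach the dataset to the calling task instead of a new one
)
```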
I was doing it with the task that I had been using. Mostly for logging arguments that control what the dataset will contain.
I mean that it was uploading console logs, scalar plots and images fine just a while ago, and then it seems to have stopped uploading all scalar plot metrics and figures, but log upload was still fine.
Anyway, it is back to working properly now without any code change (as far as I can tell; I tried commenting out a line or two and then brought them all back).
If I end up with something reproducible I'll post here.