
Simpler than I had thought, thanks!
Right. Thanks.
With several models saved by the training process (whose code is not task-aware), I suspect that doing the update call after training completes will only update the last of the uploaded models.
I'm currently looking at a workaround (see the sketch below) where I:
1. Disable auto-saving via https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk/#automatic-logging
2. Manually upload the models
3. Manually register the models with https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345b...
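Something like this minimal sketch (project, task, and file names are hypothetical):
```python
# Disable auto-logging of framework saves, then upload and register each
# model file explicitly via OutputModel.
from clearml import Task, OutputModel

task = Task.init(
    project_name="my_project",
    task_name="train",
    auto_connect_frameworks=False,  # 1. disable automatic model saving/upload
)

# ... training code writes model files locally ...

for path, name in [("models/best.pt", "best_model"), ("models/last.pt", "last_model")]:
    model = OutputModel(task=task, name=name)    # 3. register with a unique name
    model.update_weights(weights_filename=path)  # 2. upload the weights file
```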
To be specific, there is the "model name", which is not unique, and the model key, which is unique within the task.
I'm not sure why the two fields don't simply match. I guess there may be situations where the same file name (without the full path) is used several times.
erm,
this parallelization has led to the pipeline task issuing a bunch of:
model_path/run_2022_07_20T22_11_15.209_0.zip , err: [Errno 28] No space left on device
and quitting on me.
my train_image_classifier_component is programmed to save model files to a local path, which is returned (and, thanks to clearml, the path's contents are zipped and uploaded to the files service).
I take it that these files are also brought onto the pipeline task's local disk?
Why is that? If that is indeed what...
That is a good point, I'll make sure we mention it somewhere in the docs. Any thoughts on where?
maybe in (all of) these places:
https://clear.ml/docs/latest/docs/faq
https://clear.ml/docs/latest/docs/fundamentals/task
https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk
That job was using clearml 1.8.3, so I take it that setting max_workers to 1 would not make a difference?
Looking at the docs:
https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload
they say that max_workers defaults to the number of cores, but looking at the log it does seem like it's doing one chunk every 5 minutes (a long time for a 500 MB upload from a node running in GCP...)
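For reference, a hedged sketch of pinning the upload to one worker, assuming a clearml release where Dataset.upload() exposes max_workers (per the docs linked above; project, name, and path are hypothetical):
```python
from clearml import Dataset

dataset = Dataset.create(dataset_project="examples", dataset_name="my_dataset")
dataset.add_files(path="./data")
dataset.upload(max_workers=1)  # one worker instead of the default (number of cores)
dataset.finalize()
```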
I tried the first option and it worked 🙂 🙏
Oh sure, use
they will be visible on the Dataset page for the version in question
That sounds simple enough.
Though I imagine I'd need to explicitly report every figure. Correct?
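A speculative sketch, since the exact call suggested above is elided: one way to explicitly report a figure so it appears on the Dataset page is via the dataset's logger (the dataset id and names here are placeholders):
```python
import matplotlib.pyplot as plt
from clearml import Dataset

dataset = Dataset.get(dataset_id="...")  # placeholder id
fig = plt.figure()
plt.plot([1, 2, 3])
# Report the figure against the dataset, so it shows up with that version
dataset.get_logger().report_matplotlib_figure(
    title="preview", series="class balance", figure=fig, iteration=0
)
```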
I don't mind assigning the task the same name that I'd assign to the dataset. I just think that the create function should allow dataset_name to be None when use_current_task=True (or allow the dataset name to differ from the task name).
here is what I do:
```python
try:
    dataset = Dataset.get(
        dataset_project=bucket_name,
        dataset_name=dataset_name,
        dataset_version=dataset_version,
    )
    print(
        f"dataset found {dataset.project}/{dataset.name} v{dataset.version}\n(id: {dataset.id})"
    )
    return dataset
except ValueError:
    pass
task = Task.current_task()
if task is None:
    task = Task.init(
        project_name=bucket_name,...
```
Ooh nice.
I wasn't aware task.models["output"] also acts like a dict.
I can get the one I care about in my code with something like task.models["output"]["best_model"]
However, can you see the inconsistency between the key and the name there:
I imagine that one workaround is to:
1. Disable automatic model uploads
2. Perform manual model upload (with the correct name)
Can you point me to how to do these?
another weird thing:
Before my training task is done:
```python
print(task.models['output'].keys())
```
outputs
```
odict_keys(['Output Model #0', 'Output Model #1', 'Output Model #2'])
```
After task.close() I can do:
```python
task = Task.get_task(task_id)
for i in range(100):
    print(task.models["output"].keys())
```
which prints
```
odict_keys(['Output Model #0', 'Output Model #1', 'Output Model #2'])
```
in the first iteration and prints the file names in the later iterations:
` od...
In fact, all my projects seem empty of tasks.
TimelyPenguin76 , this turned out to be the reason I was having locking issues https://clearml.slack.com/archives/CTK20V944/p1658761943458649 :
SweetBadger76 , CostlyOstrich36 : I've attempted essentially the same thing before https://clearml.slack.com/archives/CTK20V944/p1657124102133519 and I thought it had worked in the past so I'm not sure why it is failing me now.
We're using a self-hosted clearml server version 1.14.0
Oh, cool. So would this then report the activities of the spawned processes to the same task as that of the spawning process?
There may be cases where failure occurs before my code starts to run (and, perhaps, after it completes)
This idea seems to work.
I tested this for a scenario where data is periodically added to a dataset and, to "version" the steps, I create a new dataset with the old as parent:
To do so, I split a set of image files into separate folders (pets_000, pets_001, ... pets_015), each with 500 image files
I then run the code here to make the datasets.
console output shows uploads of 500 files for every new dataset. The lineage is as expected, each additional upload is the same size as the previous ones (~50 MB), and Dataset.get
on the last dataset's ID retrieves all the files from the separate parts into one local folder.
Checking the remote storage location (gs://) shows artifact zip files, each with 500 files
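For reference, a minimal sketch of the versioning loop described above (the bucket URL and project name are hypothetical; the folder names follow the description):
```python
from clearml import Dataset

parent_id = None
for i in range(16):  # pets_000 ... pets_015, 500 images each
    ds = Dataset.create(
        dataset_project="pets",
        dataset_name="pets",
        parent_datasets=[parent_id] if parent_id else None,  # chain to previous version
    )
    ds.add_files(path=f"pets_{i:03d}")               # only the new files are uploaded
    ds.upload(output_url="gs://my-bucket/datasets")  # hypothetical remote target
    ds.finalize()
    parent_id = ds.id
```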
Re: re-running this code produces the same printouts. I guess repeatable behaviour is a great default to have for, well, repeatability 🙂
I'm able to "randomize" my results by adding a seed pipeline argument and calling random.seed(seed) within the pipeline and component. Results then change with a change of seed.
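For anyone curious, a minimal sketch of this seeding approach, assuming the decorator-based pipeline API (names and bodies are illustrative only):
```python
import random
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["value"])
def draw(seed: int):
    import random
    random.seed(seed)  # re-seed inside the component: it runs as its own task
    return random.random()

@PipelineDecorator.pipeline(name="seeded_pipeline", project="examples", version="0.1")
def pipeline(seed: int = 42):
    random.seed(seed)  # seed pipeline-level randomness as well
    print("drew:", draw(seed=seed))

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # run steps locally for quick testing
    pipeline(seed=123)               # results change with a change of seed
```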
I think most veteran ML practitioners are bitten at some point by randomising when they shouldn't and not randomising when they should. It would be nice to have some docu...
I should also mention I use clearml==1.6.3rc0
That's amazing speed 🚀
Thanks,
Just to be clear, you are saying the "random" results are consistent over runs?
yes!
By re-runs I mean re-running this script (not cloning the pipeline)
In case anyone else is interested, we found two alternative solutions:
1. Repeating the first steps but from within a Docker container (docker run -it --rm python:3.9 bash) worked for me.
2. Alternatively, the example tasks (or at least those I've tried) that appear in the ClearML examples within a new workspace have clearml==0.17.5 (an old clearml version) listed in "INSTALLED PACKAGES". Updating the clearml package within the task to 1.5.0 let me run the clearml-agent daemon lo...
Q: is there an equivalent env var for sdk.google.storage.pool_connections/pool_maxsize? My jobs are running remotely and not within a clearml agent at the moment, so they get their clearml config through env vars.
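A hedged sketch of a possible workaround, assuming ClearML's generic config-override convention via CLEARML__-prefixed environment variables (double underscores as path separators) also applies to these keys:
```python
# Assumption: clearml.conf entries can be overridden by env vars named
# CLEARML__<SECTION>__...__<KEY>. Set them before clearml reads its config.
import os

os.environ["CLEARML__SDK__GOOGLE__STORAGE__POOL_CONNECTIONS"] = "32"
os.environ["CLEARML__SDK__GOOGLE__STORAGE__POOL_MAXSIZE"] = "64"

from clearml import Task  # import after the overrides are in place
```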
I ran another version of the above code where output_uri="./random_dataset_local_target"
(i.e. the dataset target is on local disk instead of GCP).
I still see large memory usage.
I also find it worrisome that while generating the random dataset and writing it to disk took under 3 minutes, generating the hash took 9 minutes, and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folde...
You can have parents as one of the @PipelineDecorator.component args. The step will be executed only after all the parents are executed and completed.
Is there an example of using parents someplace? I'm not sure what to pass, and also, how to pass a component from one pipeline that was just kicked off to execute remotely (which I'd like to block on) to a component of the next pipeline's run.
I've also not figured out how to modify the examples above to wait for one pipeline to end before the next begins.
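For what it's worth, a minimal sketch of passing parents (hypothetical step names; this doesn't address the cross-pipeline blocking part): step names are the component function names, given as strings.
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["data"])
def step_a():
    return [1, 2, 3]

@PipelineDecorator.component(return_values=["total"])
def step_b(data):
    return sum(data)

# step_c waits for both parents to complete, even though it only
# consumes step_b's output.
@PipelineDecorator.component(parents=["step_a", "step_b"])
def step_c(total):
    print("total:", total)

@PipelineDecorator.pipeline(name="parents_example", project="examples", version="0.1")
def pipe():
    step_c(step_b(step_a()))
```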