TimelyPenguin76 , this turned out to be the reason I was having locking issues https://clearml.slack.com/archives/CTK20V944/p1658761943458649 :
SweetBadger76 , CostlyOstrich36 : I've attempted essentially the same thing before https://clearml.slack.com/archives/CTK20V944/p1657124102133519 and I thought it had worked in the past, so I'm not sure why it's failing now.
thanks KindChimpanzee37 . Where is that minimal example to be found?
here is what I do:
` from clearml import Dataset, Task

try:
    # return the dataset if it already exists
    dataset = Dataset.get(
        dataset_project=bucket_name,
        dataset_name=dataset_name,
        dataset_version=dataset_version,
    )
    print(
        f"dataset found {dataset.project}/{dataset.name} v{dataset.version}\n(id: {dataset.id})"
    )
    return dataset
except ValueError:
    # Dataset.get raises ValueError when no matching dataset is found
    pass

# no existing dataset - create one under the current task (or a fresh one)
task = Task.current_task()
if task is None:
    task = Task.init(
        project_name=bucket_name,...
I've also not figured out how to modify the examples above to wait for one pipeline to end before the next begins
You can have parents as one of the @PipelineDecorator.component args. The step will be executed only after all the parents are executed and completed.
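For example, a minimal sketch (assuming parents takes a list of step names; all names here are hypothetical):
` from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["data"])
def step_one():
    data = [1, 2, 3]
    return data

# explicit ordering: step_two starts only after step_one completes
@PipelineDecorator.component(return_values=["total"], parents=["step_one"])
def step_two(data):
    total = sum(data)
    return total

@PipelineDecorator.pipeline(name="parents_example", project="examples", version="0.1")
def parents_example_pipeline():
    data = step_one()
    print(step_two(data)) `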
Is there an example of using parents someplace? I'm not sure what to pass, and also how to pass a component from one pipeline that was just kicked off to execute remotely (which I'd like to block on) to a component of the next pipeline's run
yes
here is the actual "my_pipeline" declaration:
` from typing import List

from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.pipeline(
    name="fastai_image_classification_pipeline",
    project="lavi-testing",
    target_project="lavi-testing",
    version="0.2",
    multi_instance_support="",
    add_pipeline_tags=True,
    abort_on_failure=True,
)
def fastai_image_classification_pipeline(
    run_tags: List[str],
    i_dataset: int,
    backbone_names: List[str],
    image_resizes: List[int],
    batch_sizes: List[int],
    num_train_epochs: i...
Re "re-running this code produces the same printouts": I guess repeatable behaviour is a great default to have for, well, repeatability 🙂
I'm able to "randomize" my results by adding a seed pipeline argument and calling random.seed(seed) within the pipeline and component. Results then change with the seed.
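Roughly like this (a sketch; names hypothetical):
` import random

from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["sample"])
def draw_sample(seed: int):
    random.seed(seed)  # reseed inside the component as well
    sample = random.random()
    return sample

@PipelineDecorator.pipeline(name="seeded_pipeline", project="lavi-testing", version="0.1")
def seeded_pipeline(seed: int = 42):
    random.seed(seed)  # results now change with the seed argument
    print(draw_sample(seed)) `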
I think most veteran ML practitioners are bitten at some point by randomising when they shouldn't and not randomising when they should. It would be nice to have some docu...
oops, should it have been multi_instance_support=True ?
Oh sure, use
they will be visible on the Dataset page on the version in question
That sounds simple enough.
Though I imagine I'd need to explicitly report every figure. Correct?
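Something like this, I take it (a sketch, assuming dataset.get_logger() is the entry point; names hypothetical):
` import matplotlib.pyplot as plt

from clearml import Dataset

dataset = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")

fig, ax = plt.subplots()
ax.hist([0, 1, 1, 2, 2, 2])
# each figure has to be reported explicitly to show up on the Dataset page
dataset.get_logger().report_matplotlib_figure(
    title="label histogram", series="train", figure=fig, iteration=0
) `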
Yeah. I was only using the task for the process of creating the dataset.
My code does start out with a step that checks for the existence of the dataset, returning it if it exists (search by project name/dataset name/version) rather than recreating it.
I noticed the name mismatch when that check kept failing me...
I think that init-ing the encompassing task with the relevant dataset name still allows me to search for the dataset by dataset_name=task_name / project_name (shared by both datas...
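i.e. something like (a sketch of what I mean; names and paths hypothetical):
` from clearml import Dataset, Task

# name the encompassing task after the dataset, so both share one name
task = Task.init(project_name="my_project", task_name="my_dataset")
dataset = Dataset.create(
    dataset_name="my_dataset",  # matching the task name, per the mismatch above
    dataset_project="my_project",
    use_current_task=True,
)
dataset.add_files("/data/my_files")
dataset.upload()
dataset.finalize()

# later, the dataset is findable under the shared name
found = Dataset.get(dataset_project="my_project", dataset_name="my_dataset") `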
Thanks AgitatedDove14
setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
My main concern now is that this may happen within a pipeline, leading to unreliable data handling.
If Dataset.upload() does not crash or return a success value that I can check, and if Dataset.get_local_copy() also does not complain as it retrieves partial data - how will I ever know that I lost part of my dataset?
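For reference, this is roughly the upload call I mean (a sketch; project and path names hypothetical, max_workers as above):
` from clearml import Dataset

dataset = Dataset.create(dataset_project="my_project", dataset_name="my_dataset")
dataset.add_files("/data/my_files")
# serializing the upload avoids the SSL error, presumably at the cost of speed
dataset.upload(max_workers=1)
dataset.finalize() `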
I'm connecting to the hosted clear.ml
packages in use are:
` # Python 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
clearml == 1.6.2
fastai == 2.7.5 `
in case it matters, I'm running this code in a jupyter notebook within a docker container (to keep things well isolated). The /data path is volume-mapped to my local filesystem (and, in fact, already contains the dataset files, so the fastai call to untar_data should see the data there and return immediately)
That same make_data fu...
TimelyPenguin76 , CostlyOstrich36 thanks again for trying to work through this.
How about we change the approach to make things easier?
Can you give me instructions on how to start a GCP Autoscaler of your choice that would work with the clearml pipeline example such as the one I shared earlier https://clearml.slack.com/files/U03JT5JNS9M/F03PX2FSTK2/pipe_script.py ?
At this point, I just want to see an autoscaler that actually works (I'd need resources for the two queues, default and ...
BTW: if I try to find the right model in the task.models["output"] (this time there is just one, but in my code there may be several), it appears with the task name (see other attached screenshot).
What would make sense here? (I have to be honest, I'm not sure.)
If the model was saved with a file name (is that the trigger for auto-upload?), I think it makes sense for the model name to match the file name (not the task name), especially when there may be ...
Just updating here that I got the AWS autoscaler working with CostlyOstrich36 ’s generous help 🎉
I thought I'd share here some details in case others experience similar difficulties
With regards to permissions, this is the list of actions that the autoscaler uses, which your AWS account would need to permit:
` GetConsoleOutput
RequestSpotInstances
DescribeSpotInstanceRequests
RunInstances
DescribeInstances
TerminateInstances `
the instance image ` ami-04c0416d6bd8e...
I have google-cloud-storage==2.6.0 installed
I should also mention I use clearml==1.6.3rc0
nice, so a pipeline of pipelines is sort of possible. I guess that whole script can be run as a (remote) task?
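i.e. something like this (a sketch, assuming multi_instance_support=True makes each call block until its pipeline instance completes; argument values hypothetical):
` from clearml.automation.controller import PipelineDecorator

# hypothetical driver script, itself runnable as a (remote) ClearML task;
# fastai_image_classification_pipeline is the declaration shared earlier
if __name__ == "__main__":
    PipelineDecorator.set_default_execution_queue("default")
    for i_dataset in range(3):
        fastai_image_classification_pipeline(
            run_tags=["lavi"],
            i_dataset=i_dataset,
            backbone_names=["resnet18"],
            image_resizes=[224],
            batch_sizes=[32],
            num_train_epochs=2,
        ) `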
I don't mind assigning to the task the same name that I'd assign to the dataset. I just think that the create function should accept dataset_name being None when use_current_task=True (or allow the dataset name to differ from the task name)
In case anyone else is interested, we found two alternative solutions:
1. Repeating the first steps but from within a Docker container ( docker run -it --rm python:3.9 bash ) worked for me.
2. Alternatively, the example tasks (or at least those I've tried) that appear in the ClearML examples within a new workspace have clearml==0.17.5 (an old clearml version) listed in "INSTALLED PACKAGES". Updating the clearml package within the task to 1.5.0 let me run the clearml-agent daemon lo...
Here are screenshots of a VM I started with a GPU and one started by the autoscaler with the settings above but whose GPU is missing (both in the same GCP zone, us-central1-f ). I may have misconfigured something, or perhaps the autoscaler is failing to specify the GPU requirement correctly. :shrug:
thanks for explaining it. Makes sense 👍 I'll give it a try
I noticed that the base docker image does not appear in the autoscaler task's configuration_object
which is:
` [{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": "", "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gp...
TimelyPenguin76 , Could the problem be related to an error in the log of the previous step (which completed successfully)?
` 2022-07-26 04:25:56,923 - clearml.Task - INFO - Waiting to finish uploads
2022-07-26 04:26:01,447 - clearml.storage - ERROR - Failed uploading: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/clearml-evaluation/o?uploadType=multipart (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_M...
Hi Martin. See that ValueError https://clearml.slack.com/archives/CTK20V944/p1657583310364049?thread_ts=1657582739.354619&cid=CTK20V944 Perhaps something else is going on?
Thanks TimelyPenguin76 .
From your reply I understand that I have control over the destination, but that all files generated in a task get transferred regardless of the return_values decorator argument. Is that correct? Can I disable auto-save of artifacts?
Ideally, I'd like to have better control over what gets auto-saved. E.g. I'm happy for tensorboard events to be captured and shown in clearml and for matplotlib figures to be uploaded (perhaps to gcs) but I'd like to avoid ...
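e.g. I'm wondering whether auto_connect_frameworks on Task.init is the right knob here (a sketch; unsure whether it applies inside pipeline components, task name hypothetical):
` from clearml import Task

task = Task.init(
    project_name="lavi-testing",
    task_name="selective-capture-example",
    auto_connect_frameworks={
        "tensorboard": True,  # keep capturing tensorboard events
        "matplotlib": True,   # keep uploading figures
        "pytorch": False,     # skip auto-uploading saved model files
    },
) `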