Oh, cool. So would this then report the activities of the spawned processes to the same task as that of the spawning process?
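For what it's worth, a quick way to check this (a minimal sketch; the project/task names are placeholders, and it assumes the spawned process inherits the parent's task):
```
from multiprocessing import Process

from clearml import Task


def child_work():
    # If reporting from spawned processes goes to the spawning process's task,
    # Task.current_task() here should return the task initialized by the parent.
    task = Task.current_task()
    task.get_logger().report_scalar(
        title="child", series="value", value=1.0, iteration=0
    )


if __name__ == "__main__":
    # Placeholder project/task names
    Task.init(project_name="sanity-checks", task_name="subprocess-reporting-check")
    p = Process(target=child_work)
    p.start()
    p.join()
```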
nice, so a pipeline of pipelines is sort of possible. I guess that whole script can be run as a (remote) task?
BTW:
If I try to find the right model in the task.models["output"] (this time there is just one, but in my code there may be several), it appears with the task name as its name (see the other attached screenshot).
What would make sense here? (I have to be honest, I'm not sure.)
If the model was saved with a file name (is that the trigger for auto-upload?), I think it makes sense for the model name to match the file name (not the task name), especially when there may be ...
sort of. Though it seems like the rules for model.name can be a bit non-obvious.
I think that the first model saved gets the task name as its name and the following models take f"{task_name} - {file_name}"
To be specific, there is "model name", which is not unique, and there is the model-key, which is unique to the Task.
not sure why the two fields don't simply match. I guess that there may be situations where the file name (without the full path) may be used several times.
anyhow - looks like the keys are simple enough to use (so I can just ignore the model names)
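e.g. something like this as a sketch (the task id and the file-name suffix are placeholders):
```
from clearml import Task

# Placeholder task id
task = Task.get_task(task_id="<task_id>")

# task.models["output"] holds the output models logged for the task
for model in task.models["output"]:
    # model.name follows the naming described above (task name, or "task name - file name");
    # model.id is unique, so it is the safer handle to keep around
    print(model.name, model.id, model.url)

# Pick a model by (non-unique) name, or fall back to the last one logged
wanted = next(
    (m for m in task.models["output"] if m.name.endswith("model.pkl")),  # placeholder suffix
    task.models["output"][-1],
)
local_weights_path = wanted.get_local_copy()
```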
Q: is there an equivalent env var for sdk.google.storage.pool_connections / pool_maxsize? My jobs are running remotely and not within a clearml agent at the moment, so they get their clearml config through env vars.
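(I'm not sure this is actually supported, but if the installed clearml version honors the CLEARML__ double-underscore config-path override convention, something along these lines might work; the variable names below are my assumption, not a documented API:)
```
import os

# Assumption: config entries can be overridden via CLEARML__<path-with-__-separators>
# environment variables; verify against your clearml version before relying on it.
os.environ["CLEARML__SDK__GOOGLE__STORAGE__POOL_CONNECTIONS"] = "32"
os.environ["CLEARML__SDK__GOOGLE__STORAGE__POOL_MAXSIZE"] = "32"

from clearml import Task  # imported after the overrides so they get picked up
```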
sys.path.insert(0, "/src/clearml_evaluation/")
is actually left-over code from when I was making things run locally (perhaps prior to connecting to the github repo), but I think that adding a non-existent path to the system path would be benign.
thanks. Seems like I was on the right path. Do datasets specified as parents need to be finalized (https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk/#finalizing-a-dataset)?
uploads are a bit slow though (~4 minutes for 50mb)
This idea seems to work.
I tested this for a scenario where data is periodically added to a dataset and, to "version" the steps, I create a new dataset with the old as parent:
To do so, I split a set of image files into separate folders (pets_000, pets_001, ... pets_015), each with 500 image files.
I then run the code to make the datasets (roughly sketched below).
The console output shows uploads of 500 files for every new dataset. The lineage is as expected: each additional upload is the same size as the previous ones (~50mb), and Dataset.get on the last dataset's ID retrieves all the files from the separate parts into one local folder.
Checking the remote storage location (gs://) shows artifact zip files, each with 500 files.
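Roughly, the loop looks like this (a sketch, not the exact code; the folder layout, dataset/project names and the gs:// destination are placeholders):
```
from pathlib import Path

from clearml import Dataset

parent_id = None
for part in sorted(Path("./pets_parts").glob("pets_*")):  # pets_000 ... pets_015 (placeholder path)
    ds = Dataset.create(
        dataset_name=f"pets-{part.name}",           # placeholder naming scheme
        dataset_project="dataset-versioning-test",  # placeholder project
        parent_datasets=[parent_id] if parent_id else None,
    )
    ds.add_files(str(part))  # only the ~500 new files in this folder
    ds.upload(output_url="gs://clearml-evaluation/datasets")  # placeholder bucket path
    ds.finalize()  # finalize so the next version can use this one as a parent
    parent_id = ds.id

# Getting the last version should materialize all the parts into one local folder
local_copy = Dataset.get(dataset_id=parent_id).get_local_copy()
```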
oops, I deleted two messages here because I had a bug in a test I've done.
I'm retesting now
I have tried this several times now. Sometimes one runs and the other fails, and sometimes both fail with this same error.
the component is called twice in the pipeline, using a ThreadPoolExecutor to parallelize the training steps
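The calls are made roughly like this inside the pipeline function (a sketch; the pipeline name/project and the argument values are placeholders):
```
from concurrent.futures import ThreadPoolExecutor

from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.pipeline(
    name="training-pipeline", project="clearml-evaluation", version="0.0.1"
)
def run_pipeline(dataset_id: str):
    # train_image_classifier_component is the @PipelineDecorator.component shown further below.
    # Submit it twice in parallel; each call runs as its own task.
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [
            executor.submit(
                train_image_classifier_component,
                clearml_dataset=dataset_id,
                backbone_name=backbone,
                image_resize=224,
                batch_size=32,
                run_model_uri=None,
                run_tb_uri=None,
                local_data_path="/data",
                num_epochs=2,
            )
            for backbone in ("resnet18", "resnet34")  # placeholder backbones
        ]
        results = [f.result() for f in futures]
    return results
```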
Hey Alon, see https://clearml.slack.com/archives/CTK20V944/p1658892624753219
I was able to isolate this as a bug in clearml 1.6.3rc1 that can be reproduced outside of a task / app simply by doing get_local_copy() on a dataset with parents.
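i.e. something as small as this triggers it (with the id of any dataset that has parents):
```
from clearml import Dataset

# Placeholder id of a dataset that was created with parent_datasets=[...]
ds = Dataset.get(dataset_id="<dataset_id>")
local_path = ds.get_local_copy()  # this is where it fails on clearml==1.6.3rc1
print(local_path)
```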
TimelyPenguin76, could the problem be related to an error in the log of the previous step (which completed successfully)?
```
2022-07-26 04:25:56,923 - clearml.Task - INFO - Waiting to finish uploads
2022-07-26 04:26:01,447 - clearml.storage - ERROR - Failed uploading: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/clearml-evaluation/o?uploadType=multipart (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_M...
```
Unfortunately, waiting a while did not make this go away 🙂
Restarting the autoscaler, the instances, and a single running pipeline - I still get the same error:
```
clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable
```
the same occurs when I run a single training component instead of two
switching back to version 1.6.2 cleared this issue (but re-introduced others for which I had been using the release candidate)
I get the same error with those added lines
now trying with added lines as Alon suggested:
```
@PipelineDecorator.component(
return_values=["run_model_path", "run_info"],
cache=True,
task_type=TaskTypes.training,
repo="git@github.com:shpigi/clearml_evaluation.git",
repo_branch="main",
packages="./requirements.txt",
)
def train_image_classifier_component(
clearml_dataset,
backbone_name,
image_resize: int,
batch_size: int,
run_model_uri,
run_tb_uri,
local_data_path,
num_epochs: int,
)...
```
Another issue, which may or may not be related:
While running another pipeline (to see if I could reproduce the issue with simple code), it looks like the autoscaler spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All Experiments" view show the component as running.
The component's console shows that the last command was a docker run command.
(I'm going to stop the autoscaler, terminate all the instances and clone the autoscaler and retry it all from the beginning)
here is the log from the failing component:
```
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/utilities/locks/portalocker.py", line 140, in lock
    fcntl.flock(file_.fileno(), flags)
BlockingIOError: [Errno 11] Resource temporarily unavailable
```
also - some issue on the autoscaler side: it spun up an instance to start a training task but, from what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function
```
@PipelineDecorator.component(
return_values=["run_model_path", "run_info"],
cache=True,
task_type=TaskTypes.training,
repo="git@github.com:shpigi/clearml_evaluation.git",
repo_branch="main",
packages="./requirements.txt",
)
def train_image_classifier_component(
...
```
also weirdly, the failed pipeline task is sometimes marked as failed and at other times it is marked as completed.