So running this train script sometimes works and sometimes fails with the error from the original post.
Also, do you have a code snippet that reproduces this?
Hi @SoggyElk61, is it possible you have multiple environments?
Is this sufficient information or can I get help elsewhere?
Thanks for responding. Here you can find some references:
The runner is an Ubuntu machine with a dedicated user. That user has a venv from which we run:
clearml-agent daemon --queue gpu_12gb --detached --gpus 1
clearml-agent daemon --queue gpu_24gb --detached --gpus 0
clearml-agent daemon --queue no_gpu --detached
This user has a clearml.conf file in its home directory. When I run clearml-data commands as this user from the venv, everything works as expected.
I also have a second machine, a VM whose sole purpose is to act as a runner. It is started from a service using clearml-agent daemon --queue gpu_24gb --gpus 0, and there I get the same issues.
The code used to run is:
task = Task.init(
    project_name=config.project,
    task_name=config.task_name,
    output_uri=config.output_uri,
)
task_id = task.task_id
task.get_logger().set_default_upload_destination(uri=config.output_uri)
task.connect(yml)
if config.clearml_queue != "local":
    print(f"Running on ClearML queue: {config.clearml_queue}")
    task.execute_remotely(queue_name=config.clearml_queue)
else:
    print("Running locally")
...
# Fetch the dataset from ClearML
print(f"Downloading {dataset_id} for {split_type}")
clearml_ds = Dataset.get(dataset_id=dataset_id)
# Then set the alias to the dataset name
ds.alias = f"{clearml_ds.project}/{clearml_ds.name}"
# Refetch but set the alias
clearml_ds = Dataset.get(dataset_id=dataset_id, alias=ds.alias)
ds_path = clearml_ds.get_local_copy()
print(f"Downloaded {dataset_id} for {split_type}")
We added the second fetch because dataset aliases were not being set. However, that doesn’t matter for this issue, since the crash happens on the first Dataset.get call.
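To help answer the question above about multiple environments, a small diagnostic like the one below could be run at the top of the train script, both locally and on each agent. It is only a sketch using the Python standard library: the function name describe_environment is made up for illustration, and it assumes the config is found either through the CLEARML_CONFIG_FILE environment variable or at the default ~/clearml.conf location. It does not call the ClearML API itself.

```python
import os
import sys
from importlib import metadata
from pathlib import Path


def describe_environment():
    """Collect details showing which Python environment and ClearML config are in use.

    Hypothetical helper for debugging; not part of ClearML.
    """
    try:
        clearml_version = metadata.version("clearml")
    except metadata.PackageNotFoundError:
        clearml_version = None  # clearml is not installed in this interpreter

    # Assumption: CLEARML_CONFIG_FILE, when set, points at the config file to use;
    # otherwise the default location is clearml.conf in the user's home directory.
    config_override = os.environ.get("CLEARML_CONFIG_FILE")
    default_config = Path.home() / "clearml.conf"

    return {
        "python_executable": sys.executable,
        "python_prefix": sys.prefix,
        "clearml_version": clearml_version,
        "config_override": config_override,
        "default_config": str(default_config),
        "default_config_exists": default_config.exists(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the printed interpreter path, clearml version, or config location differs between a run that works and a run that fails, that would point towards more than one environment being picked up by the agents.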