Hi @<1795626098352984064:profile|SoggyElk61> , is it possible you have multiple environments?
Is this sufficient information or can I get help elsewhere?
Also, do you have a code snippet that reproduces this?
Thanks for responding. Here you can find some references:
The runner is an Ubuntu machine on which a dedicated user was created. This user has a venv from which we run:
clearml-agent daemon --queue gpu_12gb --detached --gpus 1
clearml-agent daemon --queue gpu_24gb --detached --gpus 0
clearml-agent daemon --queue no_gpu --detached
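To rule out the multiple-environments question, a quick sanity check is to confirm which interpreter/agent binary the user actually resolves and which config file ClearML will read (ClearML honors the CLEARML_CONFIG_FILE environment variable and otherwise falls back to ~/clearml.conf); a minimal sketch:

```shell
# Which python / clearml-agent does this user actually run from the venv?
command -v python3 clearml-agent || true
# ClearML reads ~/clearml.conf unless CLEARML_CONFIG_FILE overrides it
echo "config: ${CLEARML_CONFIG_FILE:-$HOME/clearml.conf}"
```

Running this both as the agent user and inside the daemon's environment would show whether the two machines are picking up different configs.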
This user has a clearml.conf file in its home directory. When I run clearml-data commands as this user from the venv, everything works as expected.
I also have a second machine, in this case a VM whose sole purpose is to be a runner. It is started as a service using clearml-agent daemon --queue gpu_24gb --gpus 0, and I get the same issues there.
The code used to run is:
task = Task.init(
    project_name=config.project,
    task_name=config.task_name,
    output_uri=config.output_uri,
)
task_id = task.task_id
task.get_logger().set_default_upload_destination(uri=config.output_uri)
task.connect(yml)
if config.clearml_queue != "local":
    print(f"Running on ClearML queue: {config.clearml_queue}")
    task.execute_remotely(queue_name=config.clearml_queue)
else:
    print("Running locally")
...
# Fetch the dataset from ClearML
print(f"Downloading {dataset_id} for {split_type}")
clearml_ds = Dataset.get(dataset_id=dataset_id)
# Then set the alias to the dataset name
ds.alias = f"{clearml_ds.project}/{clearml_ds.name}"
# Refetch but set the alias
clearml_ds = Dataset.get(dataset_id=dataset_id, alias=ds.alias)
ds_path = clearml_ds.get_local_copy()
print(f"Downloaded {dataset_id} for {split_type}")
We added the second fetch because we were getting errors about dataset aliases not being set. However, that doesn't matter for this issue, since it crashes on the first Dataset.get.
So running this training script will sometimes work and sometimes fail with the error from the original post.
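Since the failure is intermittent, one hedged workaround (a mitigation, not a fix) is to retry the first Dataset.get a few times before giving up. Sketched here with the fetch injected as a callable so the retry logic itself is self-contained; `retry` and its parameters are illustrative names, not part of the ClearML API:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry(fn: Callable[[], T], attempts: int = 3, delay: float = 2.0) -> T:
    """Call fn, retrying on any exception up to `attempts` times.

    Broad except on purpose: the failure here is transient and
    we re-raise the last exception once attempts are exhausted.
    """
    last_exc = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if i < attempts - 1:
                time.sleep(delay)
    raise last_exc
```

Usage would then be something like `clearml_ds = retry(lambda: Dataset.get(dataset_id=dataset_id))`, which at least tells you whether the error is a one-off per call or sticks for the whole process.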