Hi @<1795626098352984064:profile|SoggyElk61> , is it possible you have multiple environments?
Is this sufficient information or can I get help elsewhere?
Also, do you have a code snippet that reproduces this?
Thanks for responding. Here you can find some references:
The runner is an Ubuntu machine on which a dedicated user was created. This user has a venv from which we run:
clearml-agent daemon --queue gpu_12gb --detached --gpus 1
clearml-agent daemon --queue gpu_24gb --detached --gpus 0
clearml-agent daemon --queue no_gpu --detached
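To rule out the multiple-environments question, a quick sanity check is to confirm which interpreter/agent binary the user actually resolves and which config file ClearML will read (ClearML honors the CLEARML_CONFIG_FILE environment variable and otherwise falls back to ~/clearml.conf); a minimal sketch:

```shell
# Which python / clearml-agent does this user actually run from the venv?
command -v python3 clearml-agent || true
# ClearML reads ~/clearml.conf unless CLEARML_CONFIG_FILE overrides it
echo "config: ${CLEARML_CONFIG_FILE:-$HOME/clearml.conf}"
```

Running this both as the agent user and inside the daemon's environment would show whether the two machines are picking up different configs.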
This user has a clearml.conf file in its home directory. When I run clearml-data commands as this user from the venv, everything works as expected.
I also have a second machine, in this case a VM whose sole purpose is to be a runner. It is started as a service using clearml-agent daemon --queue gpu_24gb --gpus 0, and I get the same issues there.
The code used to run is:
task = Task.init(
    project_name=config.project,
    task_name=config.task_name,
    output_uri=config.output_uri,
)
task_id = task.task_id
task.get_logger().set_default_upload_destination(uri=config.output_uri)
task.connect(yml)
if config.clearml_queue != "local":
    print(f"Running on ClearML queue: {config.clearml_queue}")
    task.execute_remotely(queue_name=config.clearml_queue)
else:
    print("Running locally")
...
# Fetch the dataset from ClearML
print(f"Downloading {dataset_id} for {split_type}")
clearml_ds = Dataset.get(dataset_id=dataset_id)
# Then set the alias to the dataset name
ds.alias = f"{clearml_ds.project}/{clearml_ds.name}"
# Refetch but set the alias
clearml_ds = Dataset.get(dataset_id=dataset_id, alias=ds.alias)
ds_path = clearml_ds.get_local_copy()
print(f"Downloaded {dataset_id} for {split_type}")
We added the second fetch because we were getting errors about dataset aliases not being set. However, that doesn't matter for this issue, since it crashes on the first Dataset.get.
So running this training script will sometimes work and sometimes fail with the error from the original post.
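Since the failure is intermittent, one hedged workaround (a mitigation, not a fix) is to retry the first Dataset.get a few times before giving up. Sketched here with the fetch injected as a callable so the retry logic itself is self-contained; `retry` and its parameters are illustrative names, not part of the ClearML API:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry(fn: Callable[[], T], attempts: int = 3, delay: float = 2.0) -> T:
    """Call fn, retrying on any exception up to `attempts` times.

    Broad except on purpose: the failure here is transient and
    we re-raise the last exception once attempts are exhausted.
    """
    last_exc = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if i < attempts - 1:
                time.sleep(delay)
    raise last_exc
```

Usage would then be something like `clearml_ds = retry(lambda: Dataset.get(dataset_id=dataset_id))`, which at least tells you whether the error is a one-off per call or sticks for the whole process.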