Answered
Hi, I’m trying to create a dataset on ClearML server from my AWS S3 bucket via:

dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
dataset.add_external_files(
  source_url="s3:...", 
  #dataset_path=""
)

When I run this snippet, I get the following error:

raise NotImplementedError("Datasets are not supported with your current ClearML server version. Please update your server.")
NotImplementedError: Datasets are not supported with your current ClearML server version. Please update your server.

Does it mean I need to upgrade to pro tier to use this feature?

  
  
Posted one year ago

Answers 21


Let's say I don't have the data on my local machine but only an S3 bucket.

You can still register it, but make sure you do not delete it from the S3 bucket, because the dataset will keep a link to it
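For reference, registering remote objects by link (rather than uploading them) might look like the sketch below; the bucket name and paths are placeholders, not values from this thread:

```python
from clearml import Dataset

# Create a new dataset version (names are illustrative)
dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")

# Register the S3 objects by link only -- nothing is copied,
# so the objects must stay in the bucket for the dataset to remain valid
dataset.add_external_files(source_url="s3://my-bucket/data/")

dataset.upload()    # uploads only the dataset metadata (file list / state)
dataset.finalize()
```

Since add_external_files only records the links, deleting the objects from the bucket later breaks the dataset.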

Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /

What did you put in output_uri?

  
  
Posted one year ago

This is what I'm running:

from clearml import Dataset

 
dataset = Dataset.create(dataset_name="mydataset", dataset_project="test_project")
dataset.add_external_files(
  source_url="s3://???/", 
  dataset_path="/mydataset/"
)

dataset.upload()
dataset.finalize()
  
  
Posted one year ago

I'm new to ClearML and am trying to see how it works with S3 (external buckets)

  
  
Posted one year ago

Okay, now I'm lost. Is this reproducible? Are you saying Dataset with remote links to S3 does not work?
Did you provide credentials to your S3 (in your clearml.conf)?

  
  
Posted one year ago

So this feature is not available for ClearML-hosted server?

  
  
Posted one year ago

Let's say I don't have the data on my local machine but only an S3 bucket. So to see the data in the ClearML dashboard, I need to first download it from S3 to my local machine, then add the files and upload them to the ClearML data server, which is visible under this tab:
image

  
  
Posted one year ago

It is available of course, but I think you have to have clearml-server 1.9+
Which version are you running?

  
  
Posted one year ago

Is that correct?

  
  
Posted one year ago

default is clearml data server

Yes, the default is the clearml files server. What did you configure it to? (e.g. it should be something like None )

  
  
Posted one year ago

Thanks Martin, so does it mean I won’t be able to see the data hosted on S3 bucket in ClearMl dashboard under datasets tab after registering it?

Sure you can. Let's assume you have everything in your local /mnt/my/data: you can just add this folder with add_files, then upload to your S3 bucket with upload(output_uri=" None ", ...)
Make sense?
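A minimal sketch of that flow; the dataset name, local path, and bucket are placeholders:

```python
from clearml import Dataset

dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")

# Add the local copy of the data
dataset.add_files(path="/mnt/my/data")

# Upload the actual files to your own S3 bucket
# instead of the default files server
dataset.upload(output_uri="s3://my-bucket/datasets/")
dataset.finalize()
```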

  
  
Posted one year ago

I didn't change anything in my clearml.conf. Is there something in sdk.development that I need to change:

    development {
        # Development-mode options

        # dev task reuse window
        task_reuse_time_window_in_hours: 72.0

        # Run VCS repository detection asynchronously
        vcs_repo_detect_async: true

        # Store uncommitted git/hg source code diff in experiment manifest when training in development mode
        # This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
        store_uncommitted_code_diff: true

        # Support stopping an experiment in case it was externally stopped, status was changed or task was reset
        support_stopping: true

        # Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
        default_output_uri: ""

        # Default auto generated requirements optimize for smaller requirements
        # If True, analyze the entire repository regardless of the entry point.
        # If False, first analyze the entry point script; if it does not contain other imports to local files,
        # do not analyze the entire repository.
        force_analyze_entire_repo: false

        # If set to true, *clearml* update message will not be printed to the console
        # this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
        suppress_update_message: false

        # If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
        detect_with_pip_freeze: false

        # Log specific environment variables. OS environments are listed in the "Environment" section
        # of the Hyper-Parameters.
        # multiple selected variables are supported including the suffix '*'.
        # For example: "AWS_*" will log any OS environment variable starting with 'AWS_'.
        # This value can be overwritten with os environment variable CLEARML_LOG_ENVIRONMENT="[AWS_*, CUDA_VERSION]"
        # Example: log_os_environments: ["AWS_*", "CUDA_VERSION"]
        log_os_environments: []

        # Development mode worker
        worker {
            # Status report period in seconds
            report_period_sec: 2

            # The number of events to report
            report_event_flush_threshold: 100

            # ping to the server - check connectivity
            ping_period_sec: 30

            # Log all stdout & stderr
            log_stdout: true

            # Carriage return (\r) support. If zero (0) \r treated as \n and flushed to backend
            # Carriage return flush support in seconds, flush consecutive line feeds (\r) every X (default: 10) seconds
            console_cr_flush_period: 10

            # compatibility feature, report memory usage for the entire machine
            # default (false), report only on the running process and its sub-processes
            report_global_mem_used: false
        }
    }
  
  
Posted one year ago

By the way, when I run the upload command I get the following error :

Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fd72e900130>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /
  
  
Posted one year ago

Hi @<1562610699555835904:profile|VirtuousHedgehong97>
I think you need to upgrade your self-hosted clearml-server, could that be the case?

  
  
Posted one year ago

I installed ClearML 1.9 and the error doesn't show anymore. When I run the code, it creates the dataset instance on the dashboard, but it doesn't upload the files to the ClearML data server from my S3 bucket. Am I doing something wrong?
image

  
  
Posted one year ago

suppose I have an S3 bucket where my data is stored and I wish to transfer it to ClearML file server.

Then you first have to download the entire bucket locally, then register the local copy.
Basically:

from clearml import StorageManager

StorageManager.download_folder("", "/target/folder")
# now register the local "/target/folder" with Dataset.add_files
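Putting that together, the whole download-then-register flow might look like this sketch; the bucket URL and target folder are placeholders:

```python
from clearml import Dataset, StorageManager

# 1. Download the bucket contents to a local folder
local_copy = StorageManager.download_folder("s3://my-bucket/data/", "/target/folder")

# 2. Register the local copy as a dataset
dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
dataset.add_files(path="/target/folder")

# 3. Upload the files themselves (to the files server,
#    or pass output_uri="s3://..." to keep them in your own bucket)
dataset.upload()
dataset.finalize()
```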
  
  
Posted one year ago

also I have:

api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: 

    web_server: 

    files_server: 

    # Credentials are generated using the webapp, 

    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "***", "secret_key": "***"}
}
  
  
Posted one year ago

I didn't pass anything for output_uri, as I assumed the default is the clearml data server

  
  
Posted one year ago

Thanks Martin, so does it mean I won’t be able to see the data hosted on S3 bucket in ClearMl dashboard under datasets tab after registering it?

  
  
Posted one year ago

@<1562610699555835904:profile|VirtuousHedgehong97>

source_url="s3:...", 

This means your data is already in an S3 bucket; it will not "upload" it, it will just register it.
If you want to upload files, they should be local; then when you call upload you can specify the target S3 bucket, and the data will be stored in a unique folder in the bucket.
Does that make sense?
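Since the two modes (registering links vs. uploading local files) are easy to mix up, a tiny sanity check on the URL before calling add_external_files can help. is_s3_url below is not part of the ClearML API, just an illustration:

```python
from urllib.parse import urlparse

def is_s3_url(url: str) -> bool:
    """Return True if url looks like a usable s3://bucket/key link."""
    parsed = urlparse(url)
    # add_external_files registers links, so the scheme must be "s3"
    # and a bucket name (netloc) must be present
    return parsed.scheme == "s3" and bool(parsed.netloc)

print(is_s3_url("s3://my-bucket/data/file.csv"))  # True -> register with add_external_files
print(is_s3_url("/local/path/file.csv"))          # False -> use add_files + upload instead
```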

  
  
Posted one year ago

BTW, when I run dataset = Dataset.create(dataset_name="mydataset", dataset_project="test_project"), it creates the dataset instance on the dashboard. The problem is the upload, which doesn't happen; this error shows up:

Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7febe270c340>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /
  
  
Posted one year ago

To expand on this, suppose I have an S3 bucket where my data is stored and I wish to transfer it to the ClearML file server. I execute the following Python script:

from clearml import Dataset

dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")

dataset.add_external_files(
  source_url="",
  dataset_path="/my_dataset/"
)
dataset.upload()
dataset.finalize()

and this is the aws part of my clearml.conf:

aws {
        s3 {
            # S3 credentials, used for read/write access by various SDK elements

            # The following settings will be used for any bucket not specified below in the "credentials" section
            # ---------------------------------------------------------------------------------------------------
            region: ""
            # Specify explicit keys
            key: "AKI***I5"
            secret: "2+1yd***2H6y"
            # Or enable credentials chain to let Boto3 pick the right credentials. 
            # This includes picking credentials from environment variables, 
            # credential file and IAM role using metadata service. 
            # Refer to the latest Boto3 docs
            use_credentials_chain: false
            # Additional ExtraArgs passed to boto3 when uploading files. Can also be set per-bucket under "credentials".
            extra_args: {}
            # ---------------------------------------------------------------------------------------------------


            credentials: [
                # specifies key/secret credentials to use when handling s3 urls (read or write)
                 {
                     bucket: "my_bucket"
                     key: "AKI***I5"
                     secret: "2+1yd***2H6y"
                 },
                # {
                #     # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
                #     host: "my-minio-host:9000"
                #     key: "12345678"
                #     secret: "12345678"
                #     multipart: false
                #     secure: false
                # }
            ]
        }
}

I noticed that while a dataset instance is generated on the ClearML dashboard, the data itself is not uploaded to the ClearML file server. I had assumed that this would be a straightforward process; apparently it's not!

  
  
Posted one year ago