Let’s say I don’t have the data on my local machine, only in an S3 bucket.
You can still register it, but make sure you do not delete it from the S3 bucket, because the dataset only keeps a link to it.
What did you put in output_uri?
This is what I’m running:
from clearml import Dataset
dataset = Dataset.create(dataset_name="mydataset", dataset_project="test_project")
dataset.add_external_files(
    source_url="s3://???/",
    dataset_path="/mydataset/"
)
dataset.upload()
dataset.finalize()
I’m new to ClearML and trying to see how it works with S3 (external buckets).
Okay, now I'm lost. Is this reproducible? Are you saying a Dataset with remote links to S3 does not work?
Did you provide credentials for your S3 bucket (in your clearml.conf)?
So this feature is not available for the ClearML-hosted server?
Let’s say I don’t have the data on my local machine, only in an S3 bucket. So to see the data in the ClearML dashboard, I need to first download it from S3 to my local machine, then add the files and upload them to the ClearML data server, which is visible under this tab:
It is available of course, but I think you need clearml-server 1.9+.
Which version are you running?
The default is the ClearML data server.
Yes, the default is the ClearML files server. What did you configure it to? (e.g. it should be something like None)
Thanks Martin, so does it mean I won’t be able to see the data hosted on the S3 bucket in the ClearML dashboard under the datasets tab after registering it?
Sure you can. Let's assume you have everything in your local /mnt/my/data
you can just add this folder with add_files
then upload to your S3 bucket with upload(output_uri="None", ...)
Make sense?
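The three steps above (add the local folder, upload to the bucket, finalize) can be sketched roughly as follows. This is a minimal sketch, not the definitive recipe: the helper name upload_local_folder and the bucket URL "s3://my-bucket/datasets" are placeholders that do not come from this thread. Also note that, to my knowledge, the keyword on Dataset.upload is spelled output_url (output_uri is the name used by Dataset.create and Task.init), so check that against the installed SDK version.

```python
def upload_local_folder(dataset, local_folder, bucket_url):
    """Add a local folder to a dataset and store its files in a bucket.

    `dataset` is expected to behave like a clearml.Dataset instance.
    """
    # add_files only registers the local files; nothing is transferred yet
    dataset.add_files(path=local_folder)
    # upload() sends the registered files; output_url selects the target storage
    dataset.upload(output_url=bucket_url)
    # finalize() closes this dataset version so it can be consumed
    dataset.finalize()
    return dataset


def main():
    # Imported lazily so the helper above can be tried without a ClearML server.
    from clearml import Dataset

    dataset = Dataset.create(dataset_name="mydataset", dataset_project="test_project")
    # Placeholder bucket URL (assumption): replace with your own bucket and prefix.
    upload_local_folder(dataset, "/mnt/my/data", "s3://my-bucket/datasets")
```

Calling main() on a machine with a working clearml.conf (including the S3 credentials) should store the files under the given bucket prefix while the dataset entry itself shows up in the dashboard.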
I didn’t change anything in my clearml.conf. Is there something in sdk.development that I need to change?
development {
    # Development-mode options
    # dev task reuse window
    task_reuse_time_window_in_hours: 72.0
    # Run VCS repository detection asynchronously
    vcs_repo_detect_async: true
    # Store uncommitted git/hg source code diff in experiment manifest when training in development mode
    # This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
    store_uncommitted_code_diff: true
    # Support stopping an experiment in case it was externally stopped, status was changed or task was reset
    support_stopping: true
    # Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
    default_output_uri: ""
    # Default auto generated requirements optimize for smaller requirements
    # If True, analyze the entire repository regardless of the entry point.
    # If False, first analyze the entry point script, if it does not contain other to local files,
    # do not analyze the entire repository.
    force_analyze_entire_repo: false
    # If set to true, *clearml* update message will not be printed to the console
    # this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
    suppress_update_message: false
    # If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
    detect_with_pip_freeze: false
    # Log specific environment variables. OS environments are listed in the "Environment" section
    # of the Hyper-Parameters.
    # multiple selected variables are supported including the suffix '*'.
    # For example: "AWS_*" will log any OS environment variable starting with 'AWS_'.
    # This value can be overwritten with os environment variable CLEARML_LOG_ENVIRONMENT="[AWS_*, CUDA_VERSION]"
    # Example: log_os_environments: ["AWS_*", "CUDA_VERSION"]
    log_os_environments: []
    # Development mode worker
    worker {
        # Status report period in seconds
        report_period_sec: 2
        # The number of events to report
        report_event_flush_threshold: 100
        # ping to the server - check connectivity
        ping_period_sec: 30
        # Log all stdout & stderr
        log_stdout: true
        # Carriage return (\r) support. If zero (0) \r treated as \n and flushed to backend
        # Carriage return flush support in seconds, flush consecutive line feeds (\r) every X (default: 10) seconds
        console_cr_flush_period: 10
        # compatibility feature, report memory usage for the entire machine
        # default (false), report only on the running process and its sub-processes
        report_global_mem_used: false
    }
}
By the way, when I run the upload command I get the following error:
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fd72e900130>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /
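The "[Errno 8] nodename nor servname provided, or not known" part of this message is what a failed hostname lookup typically looks like on macOS, so one thing worth checking is whether the hosts the SDK is configured to talk to (api_server, web_server, files_server, or the S3 endpoint) can be resolved at all. Below is a minimal sketch of such a check using only the standard library; the helper name host_resolves is made up for illustration and is not part of clearml.

```python
import socket
from urllib.parse import urlparse


def host_resolves(url):
    """Return True if the hostname inside `url` can be resolved."""
    host = urlparse(url).hostname
    if not host:
        # Empty or malformed URL: there is nothing to resolve
        return False
    try:
        # getaddrinfo raises socket.gaierror when the name lookup fails
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

Passing each server URL from clearml.conf to host_resolves and seeing False for one of them would point at an empty, mistyped, or unreachable host name rather than at the dataset code itself.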
Hi @<1562610699555835904:profile|VirtuousHedgehong97>
I think you need to upgrade your self-hosted clearml-server, could that be the case?
I installed ClearML 1.9 and the error doesn’t show anymore. When I run the code, it creates the dataset instance on the dashboard, but it doesn’t upload the files from my S3 bucket to the ClearML data server. Am I doing something wrong?
suppose I have an S3 bucket where my data is stored and I wish to transfer it to ClearML file server.
Then you first have to download the entire bucket locally, then register the local copy.
Basically:
StorageManager.download_folder("", "/target/folder")
# now register the local "/target/folder" with Dataset.add_files
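Spelled out a little further, the download-then-register flow might look like the sketch below. It is a rough sketch under assumptions: the helper name mirror_bucket_to_dataset and the bucket URL "s3://my-bucket/data" are placeholders that do not appear in this thread, and upload() is called without a target so the files go to the default storage (the files server, unless configured otherwise).

```python
def mirror_bucket_to_dataset(storage, dataset, bucket_url, local_folder):
    """Copy a bucket to local disk, then register and upload the local copy.

    `storage` is expected to behave like clearml.StorageManager and
    `dataset` like a clearml.Dataset instance.
    """
    # Step 1: download everything under bucket_url into local_folder
    storage.download_folder(bucket_url, local_folder)
    # Step 2: register the local copy (these are real files, not links)
    dataset.add_files(path=local_folder)
    # Step 3: upload to the default storage target and close the version
    dataset.upload()
    dataset.finalize()
    return dataset


def main():
    # Imported lazily so the helper above can be tried without a ClearML server.
    from clearml import Dataset, StorageManager

    dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
    # Placeholder bucket URL (assumption): replace with the real bucket path.
    mirror_bucket_to_dataset(StorageManager, dataset, "s3://my-bucket/data", "/target/folder")
```

The cost of this approach is a full local copy of the bucket, so it only makes sense when the data really has to live on the ClearML file server rather than stay in S3.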
Also, I have:
api {
# Notice: 'host' is the api server (default port 8008), not the web server.
api_server:
web_server:
files_server:
# Credentials are generated using the webapp,
# Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
credentials {"access_key": "***", "secret_key": "***"}
}
I didn’t pass anything for output_uri, as I assumed the default is the ClearML data server.
@<1562610699555835904:profile|VirtuousHedgehong97>
source_url="s3:...",
This means your data is already in the S3 bucket, so it will not "upload" it; it will just register it.
If you want to upload files, they should be local. Then, when you call upload, you can specify the target S3 bucket, and the data will be stored in a unique folder in the bucket.
Does that make sense?
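The distinction between registering links and uploading local files can be illustrated with a small dispatcher. This is only a sketch: the helper names is_remote and add_source, the set of URL schemes, and the "linked"/"copied" labels are all assumptions introduced here for illustration, not part of the clearml API.

```python
from urllib.parse import urlparse

# Schemes treated as "already remote" in this sketch (an assumption)
REMOTE_SCHEMES = {"s3", "gs", "azure", "http", "https"}


def is_remote(source):
    """Return True if `source` looks like a remote URL rather than a local path."""
    return urlparse(source).scheme in REMOTE_SCHEMES


def add_source(dataset, source, dataset_path=None):
    """Register `source` with a clearml.Dataset-like object.

    Remote URLs are added as external links (no data is copied);
    local paths are added as files that a later upload() will transfer.
    """
    if is_remote(source):
        dataset.add_external_files(source_url=source, dataset_path=dataset_path)
        return "linked"
    dataset.add_files(path=source, dataset_path=dataset_path)
    return "copied"
```

With a helper like this, an s3:// source ends up as a link inside the dataset, while a local folder ends up as files that upload() sends to the chosen storage.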
BTW, when I run dataset = Dataset.create(dataset_name="mydataset", dataset_project="test_project"), it creates the dataset instance on the dashboard. The problem is the upload, which doesn’t happen, and this error shows up:
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7febe270c340>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /
To expand on this, suppose I have an S3 bucket where my data is stored and I wish to transfer it to the ClearML file server. I execute the following Python script:
from clearml import Dataset
dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
dataset.add_external_files(
    source_url="",
    dataset_path="/my_dataset/"
)
dataset.upload()
dataset.finalize()
And this is the aws part of my clearml.conf:
aws {
    s3 {
        # S3 credentials, used for read/write access by various SDK elements
        # The following settings will be used for any bucket not specified below in the "credentials" section
        # ---------------------------------------------------------------------------------------------------
        region: ""
        # Specify explicit keys
        key: "AKI***I5"
        secret: "2+1yd***2H6y"
        # Or enable credentials chain to let Boto3 pick the right credentials.
        # This includes picking credentials from environment variables,
        # credential file and IAM role using metadata service.
        # Refer to the latest Boto3 docs
        use_credentials_chain: false
        # Additional ExtraArgs passed to boto3 when uploading files. Can also be set per-bucket under "credentials".
        extra_args: {}
        # ---------------------------------------------------------------------------------------------------
        credentials: [
            # specifies key/secret credentials to use when handling s3 urls (read or write)
            {
                bucket: "my_bucket"
                key: "AKI***I5"
                secret: "2+1yd***2H6y"
            },
            # {
            #     # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
            #     host: "my-minio-host:9000"
            #     key: "12345678"
            #     secret: "12345678"
            #     multipart: false
            #     secure: false
            # }
        ]
    }
}
I noticed that while a dataset instance is generated on the ClearML dashboard, the data itself is not uploaded to the ClearML file server. I had assumed that this would be a straightforward process, but apparently it’s not!