So in the k8s glue agent deployment, the clearml.conf is just: `sdk {} agent { package_manager: { extra_index_url: ["host"] } }`
(the API keys are exposed through environment variables)
but I set up only the apiserver, fileserver and webserver hosts, and the access keys... the rest is what is produced by clearml-init
But shouldn't the path of the artifacts be a setting of the file server, and not of the agent?
And this is the list of variables defined in the K8SGlue pod:CLEARML_REDIS_MASTER_PORT_6379_TCP_PROTO CLEARML_REDIS_MASTER_SERVICE_HOST CLEARML_REDIS_MASTER_PORT CLEARML_MONGODB_PORT_27017_TCP CLEARML_ELASTIC_MASTER_PORT_9300_TCP_PROTO CLEARML_WEBSERVER_SERVICE_HOST K8S_GLUE_EXTRA_ARGS CLEARML_ELASTIC_MASTER_PORT_9300_TCP_PORT CLEARML_FILESERVER_PORT_8081_TCP_PROTO HOSTNAME CLEARML_MONGODB_PORT_27017_TCP_PORT CLEARML_MONGODB_PORT CLEARML_ELASTIC_MASTER_SERVICE_PORT CLEARML_FILESERVER_PORT_8081_TCP_PORT CLEARML_REDIS_MASTER_PORT_6379_TCP_PORT CLEARML_ELASTIC_MASTER_PORT_9200_TCP CLEARML_MONGODB_SERVICE_HOST CLEARML_WEBSERVER_PORT K8S_DEFAULT_NAMESPACE CLEARML_APISERVER_PORT_8008_TCP CLEARML_ELASTIC_MASTER_PORT_9200_TCP_PROTO CLEARML_MONGODB_SERVICE_PORT_MONGO_SERVICE CLEARML_MONGODB_PORT_27017_TCP_ADDR CLEARML_AGENT_UPDATE_REPO CLEARML_API_HOST CLEARML_APISERVER_PORT_8008_TCP_ADDR CLEARML_ELASTIC_MASTER_PORT_9300_TCP_ADDR FORCE_CLEARML_AGENT_REPO CLEARML_APISERVER_SERVICE_PORT CLEARML_REDIS_MASTER_PORT_6379_TCP CLEARML_FILES_HOST CLEARML_FILESERVER_PORT_8081_TCP_ADDR CLEARML_MONGODB_SERVICE_PORT CLEARML_DOCKER_IMAGE CLEARML_API_ACCESS_KEY CLEARML_API_SECRET_KEY CLEARML_REDIS_MASTER_PORT_6379_TCP_ADDR CLEARML_ELASTIC_MASTER_PORT_9200_TCP_PORT CLEARML_ELASTIC_MASTER_PORT CLEARML_WEB_HOST CLEARML_REDIS_MASTER_SERVICE_PORT_REDIS CLEARML_FILESERVER_SERVICE_HOST CLEARML_MONGODB_PORT_27017_TCP_PROTO CLEARML_APISERVER_SERVICE_HOST CLEARML_FILESERVER_SERVICE_PORT CLEARML_WEBSERVER_PORT_80_TCP_PORT CLEARML_ELASTIC_MASTER_SERVICE_PORT_TRANSPORT CLEARML_ELASTIC_MASTER_PORT_9300_TCP CLEARML_APISERVER_PORT_8008_TCP_PORT CLEARML_WEBSERVER_PORT_80_TCP_ADDR CLEARML_WORKER_ID CLEARML_FILESERVER_PORT_8081_TCP CLEARML_WEBSERVER_PORT_80_TCP_PROTO CLEARML_FILESERVER_PORT CLEARML_APISERVER_PORT_8008_TCP_PROTO CLEARML_APISERVER_PORT CLEARML_ELASTIC_MASTER_SERVICE_PORT_HTTP CLEARML_REDIS_MASTER_SERVICE_PORT CLEARML_ELASTIC_MASTER_SERVICE_HOST CLEARML_ELASTIC_MASTER_PORT_9200_TCP_ADDR 
CLEARML_WEBSERVER_SERVICE_PORT CLEARML_WEBSERVER_PORT_80_TCP
Absolute sense! Thanks a lot Martin, I thought it was being done by the backend!
If you need to know the value of some of them, let me know, CostlyOstrich36. I wanted to avoid leaking access keys etc., so I removed the values.
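One way to share the values without leaking credentials is to mask anything that looks sensitive before pasting. The following is only a sketch: the function name `masked_clearml_env` and the list of markers treated as sensitive are assumptions made for illustration, not part of ClearML.

```python
import os

# Substrings that mark a variable as sensitive (an assumption for this sketch).
SENSITIVE_MARKERS = ("KEY", "SECRET", "PASS", "TOKEN")


def masked_clearml_env(environ):
    """Return CLEARML_* variables with sensitive values replaced by '***'."""
    result = {}
    for name in sorted(environ):
        if not name.startswith("CLEARML_"):
            continue
        sensitive = any(marker in name for marker in SENSITIVE_MARKERS)
        result[name] = "***" if sensitive else environ[name]
    return result


if __name__ == "__main__":
    # Run inside the pod to print a shareable dump of the CLEARML_* variables.
    for name, value in masked_clearml_env(os.environ).items():
        print(f"{name}={value}")
```

Run inside the pod, this would print host variables such as `CLEARML_FILES_HOST` in full while hiding `CLEARML_API_ACCESS_KEY`, `CLEARML_API_SECRET_KEY` and `CLEARML_AGENT_GIT_PASS`.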
OK, it wasn't the clearml.conf settings...
In the deployment I was referring to the fileserver, apiserver, etc. with the internal Kubernetes DNS names.
I changed them to the ones exposed to the users (the same I have in my local clearml.conf) and things work.
But I can't really figure out why that would be the case...
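A quick sanity check for this kind of setup is to test whether the configured host URLs look like cluster-internal names. This is a heuristic sketch only: `looks_cluster_internal` is a hypothetical helper, the suffix list is an assumption, and the example hostnames are made up.

```python
from urllib.parse import urlparse

# Heuristic suffixes for Kubernetes in-cluster service DNS names.
INTERNAL_SUFFIXES = (".svc", ".svc.cluster.local")


def looks_cluster_internal(url):
    """Guess whether a URL points at a host only resolvable inside the cluster."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"no host found in {url!r}")
    # A bare name such as "clearml-fileserver" has no dot and is resolved
    # through the pod's DNS search path, so treat it as internal as well.
    return "." not in host or host.endswith(INTERNAL_SUFFIXES)


if __name__ == "__main__":
    for candidate in (
        "http://clearml-fileserver:8081",        # hypothetical service name
        "https://files.clearml.example.com",     # hypothetical public name
    ):
        print(candidate, looks_cluster_internal(candidate))
```

The check does not catch private IP addresses or custom cluster domains, so a negative result is not proof that a URL is reachable from a user's browser.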
(the API keys are exposed through environment variables)
Where are the env variables pointing? I'm interested in all CLEARML related env vars if you could add them here 🙂
Hi Josh, the agents are running on top of K8s (I used the helm chart to deploy them, it uses K8s glue).
I'll add a sleep so that I have time to enter the pod, and get the clearml.conf and will send you the diff in a few minutes
My local clearml.conf is:
`
# ClearML SDK configuration file
api {
# Notice: 'host' is the api server (default port 8008), not the web server.
api_server: host
web_server: host
files_server: host
# Credentials are generated using the webapp,
# Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
credentials {"access_key": "access_key", "secret_key": "secret_key"}
}
sdk {
# ClearML - default SDK configuration
storage {
cache {
# Defaults to system temp folder / cache
default_base_dir: "~/.clearml/cache"
# default_cache_manager_size: 100
}
direct_access: [
# Objects matching are considered to be available for direct access, i.e. they will not be downloaded
# or cached, and any download request will return a direct reference.
# Objects are specified in glob format, available for url and content_type.
{ url: "file://*" } # file-urls are always directly referenced
]
}
metrics {
# History size for debug files per metric/variant. For each metric/variant combination with an attached file
# (e.g. debug image event), file names for the uploaded files will be recycled in such a way that no more than
# X files are stored in the upload destination for each metric/variant combination.
file_history_size: 100
# Max history size for matplotlib imshow files per plot title.
# File names for the uploaded images will be recycled in such a way that no more than
# X images are stored in the upload destination for each matplotlib plot title.
matplotlib_untitled_history_size: 100
# Limit the number of digits after the dot in plot reporting (reducing plot report size)
# plot_max_num_digits: 5
# Settings for generated debug images
images {
format: JPEG
quality: 87
subsampling: 0
}
# Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to true, each series should have its own graph)
tensorboard_single_series_per_graph: false
}
network {
metrics {
# Number of threads allocated to uploading files (typically debug images) when transmitting metrics for
# a specific iteration
file_upload_threads: 4
# Warn about upload starvation if no uploads were made in specified period while file-bearing events keep
# being sent for upload
file_upload_starvation_warning_sec: 120
}
iteration {
# Max number of retries when getting frames if the server returned an error (http code 500)
max_retries_on_server_error: 5
# Backoff factory for consecutive retry attempts.
# SDK will wait for {backoff factor} * (2 ^ ({number of total retries} - 1)) between retries.
retry_backoff_factor_sec: 10
}
}
aws {
s3 {
# S3 credentials, used for read/write access by various SDK elements
# Default, used for any bucket not specified below
region: ""
# Specify explicit keys
key: ""
secret: ""
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
credentials: [
# specifies key/secret credentials to use when handling s3 urls (read or write)
# {
# bucket: "my-bucket-name"
# key: "my-access-key"
# secret: "my-secret-key"
# },
# {
# # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
# host: "my-minio-host:9000"
# key: "12345678"
# secret: "12345678"
# multipart: false
# secure: false
# }
]
}
boto3 {
pool_connections: 512
max_multipart_concurrency: 16
}
}
google.storage {
# # Default project and credentials file
# # Will be used when no bucket configuration is found
# project: "clearml"
# credentials_json: "/path/to/credentials.json"
# pool_connections: 512
# pool_maxsize: 1024
# # Specific credentials per bucket and sub directory
# credentials = [
# {
# bucket: "my-bucket"
# subdir: "path/in/bucket" # Not required
# project: "clearml"
# credentials_json: "/path/to/credentials.json"
# },
# ]
}
azure.storage {
# containers: [
# {
# account_name: "clearml"
# account_key: "secret"
# # container_name:
# }
# ]
}
log {
# debugging feature: set this to true to make null log propagate messages to root logger (so they appear in stdout)
null_log_propagate: false
task_log_buffer_capacity: 66
# disable urllib info and lower levels
disable_urllib3_info: true
}
development {
# Development-mode options
# dev task reuse window
task_reuse_time_window_in_hours: 72.0
# Run VCS repository detection asynchronously
vcs_repo_detect_async: true
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff: true
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset
support_stopping: true
# Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
default_output_uri: ""
# Default auto generated requirements optimize for smaller requirements
# If True, analyze the entire repository regardless of the entry point.
# If False, first analyze the entry point script, if it does not contain other to local files,
# do not analyze the entire repository.
force_analyze_entire_repo: false
# If set to true, *clearml* update message will not be printed to the console
# this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
suppress_update_message: false
# If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
detect_with_pip_freeze: false
# Log specific environment variables. OS environments are listed in the "Environment" section
# of the Hyper-Parameters.
# multiple selected variables are supported including the suffix '*'.
# For example: "AWS_*" will log any OS environment variable starting with 'AWS_'.
# This value can be overwritten with os environment variable CLEARML_LOG_ENVIRONMENT="[AWS_*, CUDA_VERSION]"
# Example: log_os_environments: ["AWS_*", "CUDA_VERSION"]
log_os_environments: []
# Development mode worker
worker {
# Status report period in seconds
report_period_sec: 2
# ping to the server - check connectivity
ping_period_sec: 30
# Log all stdout & stderr
log_stdout: true
# Carriage return (\r) support. If zero (0) \r treated as \n and flushed to backend
# Carriage return flush support in seconds, flush consecutive line feeds (\r) every X (default: 10) seconds
console_cr_flush_period: 10
# compatibility feature, report memory usage for the entire machine
# default (false), report only on the running process and its sub-processes
report_global_mem_used: false
}
}
# Apply top-level environment section from configuration into os.environ
apply_environment: false
# Top-level environment section is in the form of:
# environment {
# key: value
# ...
# }
# and is applied to the OS environment as `key=value` for each key/value pair
# Apply top-level files section from configuration into local file system
apply_files: false
# Top-level files section allows auto-generating files at designated paths with a predefined contents
# and target format. Options include:
# contents: the target file's content, typically a string (or any base type int/float/list/dict etc.)
# format: a custom format for the contents. Currently supported value is `base64` to automatically decode a
# base64-encoded contents string, otherwise ignored
# path: the target file's path, may include ~ and inplace env vars
# target_format: format used to encode contents before writing into the target file. Supported values are json,
# yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
# overwrite: overwrite the target file in case it exists. Default is true.
#
# Example:
# files {
# myfile1 {
# contents: "The quick brown fox jumped over the lazy dog"
# path: "/tmp/fox.txt"
# }
# myjsonfile {
# contents: {
# some {
# nested {
# value: [1, 2, 3, 4]
# }
# }
# }
# path: "/tmp/test.json"
# target_format: json
# }
# }
} `
OK. In the pod spawned by the K8s Glue Agent, clearml.conf is the same as in the K8s Glue Agent itself.
I changed them to the ones exposed to the users (the same I have in my local clearml.conf) and things work.
Nice!
But I can't really figure out why that would be the case...
So the thing is, the links to the files are generated by the clients, which means the code itself generated an internal link to the file server (i.e. a link that only works inside the k8s cluster). When you wanted to see the image/plot, you were accessing it from outside the cluster, and the link simply did not work (I assume that if you opened the web dev panel in the browser you would see the request failing).
Does that make sense?
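To illustrate the point, here is a minimal sketch of how a client-side link ends up carrying whatever files host the client was configured with. The function `build_debug_image_url`, the path layout, and both hostnames are hypothetical and only stand in for what the SDK does internally.

```python
from urllib.parse import quote


def build_debug_image_url(files_host, project, task_id, filename):
    """Compose the absolute link a client would record for an uploaded file.

    Hypothetical path layout; the point is only that the host comes from the
    client's own files_server setting, not from the server.
    """
    path = "/".join(quote(part) for part in (project, task_id, filename))
    return f"{files_host.rstrip('/')}/{path}"


# Same upload, two different client configurations (both hosts are made up).
internal = build_debug_image_url(
    "http://clearml-fileserver:8081", "demo", "abc123", "plot.png")
external = build_debug_image_url(
    "https://files.clearml.example.com", "demo", "abc123", "plot.png")

print(internal)  # only resolvable from inside the cluster
print(external)  # resolvable from a user's browser
```

Because the stored link is absolute, the web UI can only hand the browser the URL the client recorded, which is why pointing the agent at the externally exposed address fixes the plots.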
Hi SarcasticSquirrel56 ,
How are the agents running? On top of K8s or bare metal?
Also, can you do a diff between the ~/clearml.conf of your local machine and the one on the agent?
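If a plain `diff` is not handy inside the pod, the comparison can be sketched with Python's standard `difflib`. The file names below are assumptions: the agent's copy is presumed to have been saved locally as `agent-clearml.conf`.

```python
import difflib
from pathlib import Path


def diff_configs(local_text, agent_text):
    """Return a unified diff between two clearml.conf contents as a string."""
    lines = difflib.unified_diff(
        local_text.splitlines(),
        agent_text.splitlines(),
        fromfile="local/clearml.conf",
        tofile="agent/clearml.conf",
        lineterm="",
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical paths: the agent copy is assumed to have been copied
    # out of the pod and saved next to this script.
    local = Path("~/clearml.conf").expanduser()
    agent = Path("agent-clearml.conf")
    if local.exists() and agent.exists():
        print(diff_configs(local.read_text(), agent.read_text()))
```

Lines prefixed with `-` exist only in the local file and lines prefixed with `+` only in the agent's file, so a differing `files_server` entry shows up as one of each.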
Not really 🙂
The files are clearly different, but if I understand correctly, is it enough to add
` storage {
cache {
# Defaults to system temp folder / cache
default_base_dir: "~/.clearml/cache"
# default_cache_manager_size: 100
}
direct_access: [
# Objects matching are considered to be available for direct access, i.e. they will not be downloaded
# or cached, and any download request will return a direct reference.
# Objects are specified in glob format, available for url and content_type.
{ url: "file://*" } # file-urls are always directly referenced
]
}
metrics {
# History size for debug files per metric/variant. For each metric/variant combination with an attached file
# (e.g. debug image event), file names for the uploaded files will be recycled in such a way that no more than
# X files are stored in the upload destination for each metric/variant combination.
file_history_size: 100
# Max history size for matplotlib imshow files per plot title.
# File names for the uploaded images will be recycled in such a way that no more than
# X images are stored in the upload destination for each matplotlib plot title.
matplotlib_untitled_history_size: 100
# Limit the number of digits after the dot in plot reporting (reducing plot report size)
# plot_max_num_digits: 5
# Settings for generated debug images
images {
format: JPEG
quality: 87
subsampling: 0
}
# Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to true, each series should have its own graph)
tensorboard_single_series_per_graph: false
}
network {
metrics {
# Number of threads allocated to uploading files (typically debug images) when transmitting metrics for
# a specific iteration
file_upload_threads: 4
# Warn about upload starvation if no uploads were made in specified period while file-bearing events keep
# being sent for upload
file_upload_starvation_warning_sec: 120
}
iteration {
# Max number of retries when getting frames if the server returned an error (http code 500)
max_retries_on_server_error: 5
# Backoff factory for consecutive retry attempts.
# SDK will wait for {backoff factor} * (2 ^ ({number of total retries} - 1)) between retries.
retry_backoff_factor_sec: 10
}
} ` in the sdk section of the clearml.conf file on the agent?
Also, please go into the UI - go to the experiment that was executed remotely. Open developer tools (F12) and see what is returned when you navigate to the plots page in the UI
This is the list of all the environment variables (starting with CLEARML) available in the Pod spawned by the K8s Glue Agent:CLEARML_MONGODB_PORT_27017_TCP_PORT CLEARML_FILESERVER_PORT_8081_TCP_ADDR CLEARML_ELASTIC_MASTER_PORT_9200_TCP CLEARML_APISERVER_PORT_8008_TCP_PROTO CLEARML_FILESERVER_PORT_8081_TCP_PORT CLEARML_ELASTIC_MASTER_SERVICE_PORT_TRANSPORT CLEARML_WEBSERVER_PORT_80_TCP CLEARML_ELASTIC_MASTER_SERVICE_PORT CLEARML_MONGODB_PORT_27017_TCP_ADDR CLEARML_FILESERVER_PORT_8081_TCP_PROTO CLEARML_FILESERVER_SERVICE_HOST CLEARML_APISERVER_PORT_8008_TCP_ADDR CLEARML_FILESERVER_PORT CLEARML_AGENT_GIT_PASS CLEARML_REDIS_MASTER_PORT_6379_TCP_ADDR CLEARML_MONGODB_SERVICE_PORT CLEARML_AGENT_GIT_USER CLEARML_REDIS_MASTER_PORT CLEARML_API_HOST CLEARML_REDIS_MASTER_SERVICE_HOST CLEARML_REDIS_MASTER_PORT_6379_TCP_PORT CLEARML_MONGODB_SERVICE_HOST CLEARML_WEBSERVER_SERVICE_HOST CLEARML_WEBSERVER_PORT CLEARML_ELASTIC_MASTER_PORT_9200_TCP_ADDR CLEARML_MONGODB_PORT_27017_TCP CLEARML_ELASTIC_MASTER_PORT CLEARML_REDIS_MASTER_SERVICE_PORT_REDIS CLEARML_ELASTIC_MASTER_SERVICE_HOST CLEARML_ELASTIC_MASTER_PORT_9300_TCP_PROTO CLEARML_APISERVER_SERVICE_HOST CLEARML_WEBSERVER_PORT_80_TCP_PORT CLEARML_APISERVER_PORT_8008_TCP_PORT CLEARML_MONGODB_PORT_27017_TCP_PROTO CLEARML_REDIS_MASTER_SERVICE_PORT CLEARML_APISERVER_PORT CLEARML_FILES_HOST CLEARML_WEB_HOST CLEARML_FILESERVER_SERVICE_PORT CLEARML_ELASTIC_MASTER_PORT_9200_TCP_PORT CLEARML_ELASTIC_MASTER_PORT_9300_TCP_PORT CLEARML_FILESERVER_PORT_8081_TCP CLEARML_REDIS_MASTER_PORT_6379_TCP CLEARML_MONGODB_SERVICE_PORT_MONGO_SERVICE CLEARML_API_SECRET_KEY CLEARML_API_ACCESS_KEY CLEARML_WEBSERVER_SERVICE_PORT CLEARML_MONGODB_PORT CLEARML_APISERVER_PORT_8008_TCP CLEARML_APISERVER_SERVICE_PORT CLEARML_WEBSERVER_PORT_80_TCP_ADDR CLEARML_WEBSERVER_PORT_80_TCP_PROTO CLEARML_ELASTIC_MASTER_SERVICE_PORT_HTTP CLEARML_ELASTIC_MASTER_PORT_9300_TCP CLEARML_ELASTIC_MASTER_PORT_9300_TCP_ADDR CLEARML_REDIS_MASTER_PORT_6379_TCP_PROTO 
CLEARML_ELASTIC_MASTER_PORT_9200_TCP_PROTO
Hi SarcasticSquirrel56
But if I then clone the task, and execute it by sending it to a queue, the experiment succeeds,
I'm assuming that on the remote machine the "files_server" is not configured the same way as for the local execution. For example, it points to an S3 bucket and the credentials for the bucket are missing.
(In your specific example I'm assuming that the plot is non-interactive, which means it is actually a PNG stored somewhere, usually wherever the file-server configuration points.) Does that make sense?