Thanks AgitatedDove14, unfortunately it didn't take effect.
I'm also noticing a lot of this while the k8s glue is running: `Ex: Expecting value: line 1 column 1 (char 0) K8S Glue pods monitor: Failed parsing kubectl output:`
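For what it's worth, `Expecting value: line 1 column 1 (char 0)` is the message Python's `json` module gives when it is asked to parse an empty (or non-JSON) string, so it probably means a `kubectl` call returned nothing parseable at that moment. A minimal sketch of a tolerant parse, assuming the monitor reads JSON from kubectl; `parse_kubectl_json` is a hypothetical helper, not the clearml-agent implementation:

```python
import json


def parse_kubectl_json(output):
    """Return the parsed JSON document, or None if the output is not valid JSON.

    Hypothetical helper for illustration; not part of clearml-agent.
    """
    try:
        return json.loads(output)
    except json.JSONDecodeError as ex:
        # For an empty string, str(ex) is "Expecting value: line 1 column 1 (char 0)"
        print("Failed parsing kubectl output: {}".format(ex))
        return None
```

If the warning only shows up occasionally, skipping that polling round (returning `None` here) may be all that is needed.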
SubstantialElk6 this is odd, how are they passed? What's the exact setup?
Hi, this is what I got. No mention of the env variables.
` Current configuration (clearml_agent v0.17.2, location: /home/jax/clearml.conf):
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.web_server =
api.api_server =
api.files_server =
api.credentials.access_key = UKATBKH60Z73SJZIXOIW
api.host =
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = true
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
agent.worker_id =
agent.worker_name = master-node
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version =
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = true
agent.package_manager.conda_channels.0 = defaults
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = pytorch
agent.package_manager.torch_nightly = false
agent.package_manager.force_repo_requirements_txt = true
agent.package_manager.priority_packages.0 = cython
agent.package_manager.priority_packages.1 = numpy
agent.package_manager.priority_packages.2 = setuptools
agent.venvs_dir = /home/jax/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/jax/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/jax/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = true
agent.docker_pip_cache = /home/jax/.clearml/pip-cache
agent.docker_apt_cache = /home/jax/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.1-runtime-ubuntu18.04
agent.default_docker.arguments.0 = --env GIT_SSL_NO_VERIFY=true
agent.enable_task_env = false
agent.git_user =
agent.default_python = 3.7
agent.cuda_version = 110
agent.cudnn_version = 0
Worker "master-node:0" - Listening to queues:
+----------------------------------+------+-------+
| id                               | name | tags  |
+----------------------------------+------+-------+
| c6f22020435d4fa680e805f530d0078c | gpu  |       |
+----------------------------------+------+-------+
No tasks in queue c6f22020435d4fa680e805f530d0078c
No tasks in Queues, sleeping for 5.0 seconds `
Hmm yes this is exactly what should not happen 🙂
Let me check it
The --template-yaml allows you to use a full k8s YAML template (the overrides is just overrides, which do not include most of the configuration options). We should probably deprecate it.
What's the diff between --template-yaml and --overrides-yaml? I used the latter to ensure the gpu is passed in.
Hmm, I think you should use --template-yaml
Can you see that the environment is actually being passed?
Do you use docker? If yes, then you may want to try modifying extra_docker_shell_script in the agent config file.
I did another test by running `kubectl exec pod-name -- echo $PIP_INDEX_URL`
and it returned nothing. So the env variables are not passed to the container at all.
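One caveat with that test: in `kubectl exec pod-name -- echo $PIP_INDEX_URL` the local shell expands `$PIP_INDEX_URL` before `kubectl` ever runs, so an empty result only shows that the variable is unset on the machine running the command. A sketch that reads the variable inside the container instead, assuming `printenv` exists in the image; `pod_env_command` and `read_pod_env` are hypothetical helper names:

```python
import subprocess


def pod_env_command(pod_name, var_name, namespace="default"):
    """Build a kubectl command that prints var_name as seen inside the pod."""
    # Passing an argument list (no shell) means nothing is expanded locally.
    return ["kubectl", "exec", "-n", namespace, pod_name, "--", "printenv", var_name]


def read_pod_env(pod_name, var_name, namespace="default"):
    """Return the value inside the pod, or None if the command fails
    (for example because the variable is not set)."""
    result = subprocess.run(
        pod_env_command(pod_name, var_name, namespace),
        capture_output=True,
        text=True,
    )
    # printenv exits non-zero when the variable is not defined.
    return result.stdout.strip() if result.returncode == 0 else None
```

From a shell, the equivalent is `kubectl exec pod-name -- printenv PIP_INDEX_URL`.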
Hey SubstantialElk6 ,
Can you show us the top output you get when using the template-yaml instead of overrides-yaml?
So these (PIP_INDEX_URL) weren't used when clearml starts running pip.
See here:
https://pip.pypa.io/en/stable/user_guide/#environment-variables
Pass these environment variables as part of the YAML template you are using with the k8s.
Should work for both 🙂
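As a rough sketch of that advice: pip maps each long option to an environment variable named `PIP_<UPPER_LONG_NAME>` with dashes replaced by underscores, so the same settings can be generated as `env` entries for the pod template. The helper names and the example host below are placeholders, not anything from ClearML:

```python
def pip_env_name(option):
    """Map a pip long option such as '--index-url' to its environment variable."""
    return "PIP_" + option.lstrip("-").replace("-", "_").upper()


def pod_env_entries(pip_options):
    """Turn {'--index-url': 'https://...'} into Kubernetes-style env entries."""
    return [
        {"name": pip_env_name(option), "value": value}
        for option, value in pip_options.items()
    ]


# Placeholder values for illustration only.
entries = pod_env_entries({
    "--index-url": "https://pypi.example.internal/simple",
    "--trusted-host": "pypi.example.internal",
})
```

Each entry can then be written under `env:` of the container in the template.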
I passed it through the yaml as follows.
` apiVersion: v1
kind: Pod
spec:
  containers:
  - image: clearml-agent:latest"
    env:
    - name: PIP_INDEX_URL
      value: ""
    - name: PIP_TRUSTED_HOST
      value: "192.168.56.253"
    - name: PIP_FIND_LINKS
      value: ""
    - name: GIT_SSL_NO_VERIFY
      value: true
    resources:
      requests:
        cpu: "2"
        memory: "2Gi"
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Always `
This is the top output of `python3 k8s_glue_example.py --queue gpu --overrides-yaml custom.yml --namespace default`:
` Found pod container requests=['nvidia.com/gpu=1'] limits=['memory=2Gi', 'cpu=2']
Removing containers section: [{'image': 'clearml-agent:latest"', 'env': [{'name': 'PIP_INDEX_URL', 'value': ''}, {'name': 'PIP_TRUSTED_HOST', 'value': '192.168.56.253'}, {'name': 'PIP_FIND_LINKS', 'value': ''}, {'name': 'GIT_SSL_NO_VERIFY', 'value': True}], 'resources': {'requests': {'cpu': '2', 'memory': '2Gi'}, 'limits': {'nvidia.com/gpu': 1}}}]
Current configuration (clearml_agent v0.17.2, location: /home/jax/clearml.conf):
---------------------- `
Do you mean this? `Removing containers section: [{'image': 'clearml-agent:latest"', 'env': [{'name': 'PIP_INDEX_URL', 'value': ''},`
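One thing that might be worth ruling out from that output: `GIT_SSL_NO_VERIFY` was parsed as the boolean `True`, whereas Kubernetes expects every `env` value to be a string (so `"true"`, quoted, in the YAML). A small check over the parsed containers section, with the URL values left blank as in the log; `non_string_env` is a hypothetical helper and this is only a guess at a contributing factor:

```python
def non_string_env(containers):
    """Return names of env entries that have a 'value' which is not a string."""
    return [
        entry["name"]
        for container in containers
        for entry in container.get("env", [])
        if "value" in entry and not isinstance(entry["value"], str)
    ]


# Containers section as printed by the glue (URL values are blank in the log).
containers = [{
    "image": 'clearml-agent:latest"',
    "env": [
        {"name": "PIP_INDEX_URL", "value": ""},
        {"name": "PIP_TRUSTED_HOST", "value": "192.168.56.253"},
        {"name": "PIP_FIND_LINKS", "value": ""},
        {"name": "GIT_SSL_NO_VERIFY", "value": True},
    ],
}]

print(non_string_env(containers))
```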
SubstantialElk6 is this the pip to install the agent, or the pip the agent is using to install the packages for the specific experiment?
Hi, I changed it, but it still points to https://files.pythonhosted.org/packages.