
One more follow-up still; we're trying to run a non-GPU scaler, and I've finally sorted out the subnet and security group issues, only to run into this:
Executing: ['docker', 'run', '-t', '--gpus', 'all', '-l', 'clearml-worker-id=dynamic_worker:aws_simple:t3a.xlarge:i-002265121f9cc9aec', ... ...
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
We're trying to use the latest ubuntu docker image, and I couldn't find where that docker run command is executed in order to control that flag.

EDIT: I see a hidden feature "cpu_only" in the resource configuration that could play into that, trying...
EDIT2: Nope, still the same crash

  
  
Posted 2 years ago

Answers 30


Let me verify a hypothesis...

  
  
Posted 2 years ago

But it should work out of the box ...

Yes it should ....

The user and personal access token are used as is and it propagates down to submodules, since those are simply another git repository.

Can you successfully run this manually:
git clone --recursive https://user:token@github.com/company/repo_with_submodules

  
  
Posted 2 years ago

Hi UnevenDolphin73,

Which agent version are you using? Did you set up the env variable on the agent's machine too?

  • Can you set env var CLEARML_DOCKER_SKIP_GPUS_FLAG to true?
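A minimal sketch of applying this on an autoscaler instance. It assumes the autoscaler's extra_vm_bash_script runs in the same shell that later launches the clearml-agent daemon, so an exported variable is inherited by the agent; that is an assumption, not something confirmed in this thread. Only the variable name and value come from the reply above.

```shell
# Hypothetical extra_vm_bash_script content for a CPU-only autoscaler instance.
# Assumption: the agent daemon is started later from this same shell and inherits
# the variable, which (per its name) should stop it from adding the '--gpus' flag.
export CLEARML_DOCKER_SKIP_GPUS_FLAG=true
echo "CLEARML_DOCKER_SKIP_GPUS_FLAG=${CLEARML_DOCKER_SKIP_GPUS_FLAG}"
```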

Regarding this - https://clearml.slack.com/archives/CTK20V944/p1657525402861009?thread_ts=1657291641.224139&cid=CTK20V944 - can you add some more info? Maybe the log?

  
  
Posted 2 years ago

UnevenDolphin73
fatal: could not read Username for ' ': terminal prompts disabled ..
fatal: clone of ' ' into submodule path '/root/.clearml/vcs-cache/xxx.60db3666b11ac2df511a851e269817ef/xxx/xxx' failed
It seems it tries to clone a submodule and fails due to missing keys for the submodule.
https://stackoverflow.com/questions/7714326/git-submodule-url-not-including-username
wdyt?

  
  
Posted 2 years ago

TimelyPenguin76 I added pip install --upgrade clearml-agent to the extra_vm_bash_script for the autoscaler; that should at least guarantee the latest clearml-agent is used on the instance, right?

  
  
Posted 2 years ago

Follow-up; any ideas how to avoid PEP 517 with the auto scaler?

Takes a long time to build the wheels

Enable venv caching?
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L116

  
  
Posted 2 years ago

Odd; switching to a virtual environment results in
fatal: could not read Username for ' ': terminal prompts disabled
even though it does show earlier that:
agent.git_user = xxx

  
  
Posted 2 years ago

That's enabled; I was asking whether there are flags to add to the pip install CLI, such as --no-use-pep517

  
  
Posted 2 years ago

I can indeed.

  
  
Posted 2 years ago

I'm using some old agent I fear, since our infra person decided to use chart 3.3.0

That could be the issue. Can you update to the latest version so we can check?

I'll try with the env var too. Do you personally recommend docker over the simple AMI + virtual environment?

Depends. With docker you know what you'll get and you can control many aspects of the environment; a venv should be quicker, and you can set it up in advance.

More complete log does not add much information -

Can you send the configuration? Can you also try it with the latest agent?

  
  
Posted 2 years ago

I just set the git credentials in the clearml.conf and it works out of the box

git has issues with passing the user/token from the main repo to the submodules, hence my surprise that it is working out of the box.
Note that if you are using an SSH key, this is a non-issue.

Nope, no .netrc defined anywhere, ...

If this is the case, can you try adding the following to your "extra_vm_bash_script":
echo machine example.com > ~/.netrc && echo login MY_USERNAME >> ~/.netrc && echo password MY_PASSWORD >> ~/.netrc
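The three echo calls can also be written as a single heredoc. This is a minimal sketch that keeps the placeholders from the suggestion (example.com, MY_USERNAME, MY_PASSWORD); the hypothetical NETRC_PATH override (for dry runs) and the chmod 600 are additions, not part of the original suggestion.

```shell
# Placeholders from the suggestion above; replace with the git host and the
# read-only user / personal access token.
GIT_HOST="example.com"
GIT_USER="MY_USERNAME"
GIT_TOKEN="MY_PASSWORD"

# Target file: ~/.netrc unless the hypothetical NETRC_PATH override is set.
NETRC_PATH="${NETRC_PATH:-$HOME/.netrc}"

# Write all three entries at once; like the first echo above, this overwrites the file.
cat > "$NETRC_PATH" <<EOF
machine ${GIT_HOST}
login ${GIT_USER}
password ${GIT_TOKEN}
EOF

# Restrict access to the owner, since the file holds a token in plain text.
chmod 600 "$NETRC_PATH"
```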

  
  
Posted 2 years ago

I'm trying, let's see; our infra person is away on holidays :X Thanks! Uh, which configuration exactly would you like to see? We're running using the helm charts on K8s, so I don't think I can directly access the agent configuration or update it separately?

  
  
Posted 2 years ago

Nope, no .netrc defined anywhere, really (+I've abandoned the use of docker for the autoscaler as it complicates things, at least for now)

  
  
Posted 2 years ago

Sounds like a nice idea
Follow-up; any ideas how to avoid PEP 517 with the auto scaler? 🤔 Takes a long time to build the wheels

  
  
Posted 2 years ago

That was a good idea, unfortunately it did not help too much, but I think I may have found a workaround, thanks!

  
  
Posted 2 years ago

Now I'm curious, what's the workaround?

  
  
Posted 2 years ago

I'm using some old agent I fear, since our infra person decided to use chart 3.3.0 😕
I'll try with the env var too. Do you personally recommend docker over the simple AMI + virtual environment?

More complete log does not add much information -
Cloning into '/root/.clearml/venvs-builds/3.10/task_repository/xxx/xxx'...
fatal: could not read Username for ' ': terminal prompts disabled
fatal: clone of ' ' into submodule path '/root/.clearml/venvs-builds/3.10/task_repository/xxx/xxx' failed
Failed to clone 'xxx'. Retry scheduled
Cloning into '/root/.clearml/venvs-builds/3.10/task_repository/xxx/xxx'...
fatal: could not read Username for ' ': terminal prompts disabled
fatal: clone of ' ' into submodule path '/root/.clearml/venvs-builds/3.10/task_repository/xxx/xxx' failed
Failed to clone 'xxx' a second time, aborting

  
  
Posted 2 years ago

Then the username and password would be visible in the autoscaler task 😕
But it should work out of the box, as it does work like that out of the box regardless of ClearML too. The user and personal access token are used as is, and they propagate down to submodules, since those are simply other git repositories.
I've checked further on a different machine and it works there as well 🤔

  
  
Posted 2 years ago

We have a read-only user with a personal access token for these things; it works seamlessly throughout, including on our current on-premise servers... So perhaps something is missing in the autoscaler definitions?

  
  
Posted 2 years ago

Hurrah! Added
git config --system credential.helper 'store --file /root/.git-credentials'
to the extra_vm_bash_script and now it works
(the helper stores the given git credentials in the store file, from which they are then picked up immediately by the recursive submodule clones)
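Why this works: after the authenticated top-level clone succeeds, git passes the credentials it used to the configured helper, and the store helper writes them to the file; the submodule clones, whose URLs carry no user, then read them back from that file. The sketch below reproduces that round trip locally with git credential approve and git credential fill, in a throwaway HOME and with --global instead of --system so no real configuration is touched. The host, user (reader) and token (tok123) are placeholders.

```shell
set -euo pipefail

# Sandbox: throwaway HOME, ignore this machine's system/global git config,
# and disable interactive prompts (as in the "terminal prompts disabled" logs).
HOME="$(mktemp -d)"
export HOME
unset GIT_CONFIG_GLOBAL XDG_CONFIG_HOME
export GIT_CONFIG_NOSYSTEM=1
export GIT_TERMINAL_PROMPT=0

STORE_FILE="$HOME/.git-credentials"

# Same helper as above, but --global (inside the sandbox) instead of --system.
git config --global credential.helper "store --file $STORE_FILE"

# Simulate a successful authenticated clone: git approves the credentials
# and the store helper persists them.
printf 'protocol=https\nhost=github.com\nusername=reader\npassword=tok123\n\n' \
  | git credential approve

# The helper saved one URL-style line for the host.
cat "$STORE_FILE"

# A later request for the same host (e.g. a submodule clone without a user
# in its URL) is answered from the store instead of prompting.
printf 'protocol=https\nhost=github.com\n\n' | git credential fill
```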

  
  
Posted 2 years ago

Different AMI image/installing older Python instances that don't enforce this...
For future reference though, the environment variable should be PIP_USE_PEP517=false
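A minimal sketch of the environment-variable route on the autoscaler, assuming the extra_vm_bash_script runs in the same shell that later starts the agent, so pip processes launched by the agent inherit the variable. PIP_USE_PEP517=false is the name and value the thread reports as working; pip maps PIP_<OPTION> variables onto its command-line options, so this should behave like passing --no-use-pep517.

```shell
# Hypothetical extra_vm_bash_script line; the name/value are the ones reported
# to work in this thread. Later pip invocations in this shell inherit it.
export PIP_USE_PEP517=false
```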

  
  
Posted 2 years ago

git config --system credential.helper 'store --file /root/.git-credentials'

Maybe we should use this hack for cloning with user/token in general ...

  
  
Posted 2 years ago

Hmm, this might help:
https://pip.pypa.io/en/stable/topics/configuration/#environment-variables
Basically, you might be able to define:
PIP_NO_USE_PEP517=1

  
  
Posted 2 years ago

I just set the git credentials in the clearml.conf and it works out of the box

  
  
Posted 2 years ago

TimelyPenguin76 here's the full log (took a moment to anonymize completely):

`
Using environment access key CLEARML_API_ACCESS_KEY=xxx
Using environment secret key CLEARML_API_SECRET_KEY=********
Current configuration (clearml_agent v1.3.0, location: /tmp/.clearml_agent.zs4e7egs.cfg):

sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server =
api.files_server =
api.web_server =
api.credentials.access_key = xxx
api.host =
agent.worker_id = dynamic_worker:aws_simple:t3a.xlarge:xxx
agent.worker_name = xxx
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = <20.2
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.venvs_dir = /root/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /root/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /root/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /root/.clearml/pip-cache
agent.docker_apt_cache = /root/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.docker_internal_mounts.sdk_cache = /clearml_agent_cache
agent.docker_internal_mounts.apt_cache = /var/cache/apt/archives
agent.docker_internal_mounts.ssh_folder = /root/.ssh
agent.docker_internal_mounts.pip_cache = /root/.cache/pip
agent.docker_internal_mounts.poetry_cache = /root/.cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /root/.clearml/vcs-cache
agent.docker_internal_mounts.venv_build = /root/.clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /root/.clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script =
agent.git_user = xxx
agent.default_python = 3.10
agent.cuda_version = 0
agent.cudnn_version = 0
Executing task id [3a666f3f2b9e46bcb4cd0d59b6a0e3af]:
repository =
branch = main
version_num = 96fd6c76446926333d445cfb4b176e0d29ab8aeb
tag =
docker_cmd =
entry_point = train.py
working_dir = .
Python executable with version '3.7' requested by the Task, not found in path, using '/clearml_agent_venv/bin/python3' (v3.10.4) instead
created virtual environment CPython3.10.4.final.0-64 in 690ms
creator CPython3Posix(dest=/root/.clearml/venvs-builds/3.10, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
added seed packages: pip==22.1.2, setuptools==62.6.0, wheel==0.37.1
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
cloning:
fatal: could not read Username for ' ': terminal prompts disabled
fatal: clone of ' ' into submodule path '/root/.clearml/vcs-cache/xxx.60db3666b11ac2df511a851e269817ef/xxx/xxx' failed
Failed to clone 'xxx'. Retry scheduled
fatal: could not read Username for ' ': terminal prompts disabled
fatal: clone of ' ' into submodule path '/root/.clearml/vcs-cache/xxx.60db3666b11ac2df511a851e269817ef/xxx/xxx' failed
Failed to clone 'xxx' a second time, aborting
Repository cloning failed: Command '['clone', ' ', '/root/.clearml/vcs-cache/xxx.60db3666b11ac2df511a851e269817ef/xxx', '--quiet', '--recursive']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository.

  1. Make sure you pushed the requested commit:
    (repository=' ', branch='main', commit_id='96fd6c76446926333d445cfb4b176e0d29ab8aeb', tag='', docker_cmd=None, entry_point='train.py', working_dir='.')
  2. Check if remote-worker has valid credentials [see worker configuration file] `
  
  
Posted 2 years ago

AgitatedDove14 The keys are there, and there is no specifically defined user in .gitmodules:
[submodule "xxx"]
    path = xxx
    url =
I believe this has to do with how ClearML sets up the git credentials, perhaps?
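To check this on a given repository, the submodule URLs can be listed straight from .gitmodules with git config --file. The sketch runs against a hypothetical .gitmodules in a scratch directory (submodule name and URL are made up); in a real checkout only the last command is needed, run from the repository root. A URL without a user@ part relies entirely on a credential helper or .netrc when cloning non-interactively.

```shell
set -euo pipefail

# Scratch directory with a hypothetical .gitmodules shaped like the one above.
cd "$(mktemp -d)"
cat > .gitmodules <<'EOF'
[submodule "lib"]
    path = lib
    url = https://github.com/company/lib.git
EOF

# Print "<key> <url>" for every submodule URL defined in the file.
git config --file .gitmodules --get-regexp '^submodule\..*\.url$'
```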

  
  
Posted 2 years ago

I think this may be about the credential.helper being used

  
  
Posted 2 years ago

works seamlessly throughout and in our current on premise servers...

I'm assuming via something close to what I suggested above with .netrc?

  
  
Posted 2 years ago

Can you verify by adding the following to your extra_docker_shell_script:
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L152
extra_docker_shell_script: ["echo machine example.com > ~/.netrc", "echo login MY_USERNAME >> ~/.netrc", "echo password MY_PASSWORD >> ~/.netrc"]

  
  
Posted 2 years ago

NICE!

  
  
Posted 2 years ago
1K Views
2 years ago
one year ago