Answered

Hi everyone, I'm running into a weird error when trying to clone and run a task that has completed successfully. I have a test task that loads a dummy dataset and trains a toy model with PyTorch. When running remotely, I use my own docker image that has all of the python packages already installed in a virtual environment. I set this environment variable for the task: `-e CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/opt/virtualenvs/clearml-venv/bin/python`, and I also include my repo as a python package to install with `task.set_packages(["-e ."])`. I am able to use my local VSCode session to run the task remotely on an agent with `task.execute_remotely()`. Everything runs just fine and completes. Then when I try to clone the task and run it again with the agent, I get this error during the requirements parsing:

clearml_agent: Warning: could not resolve python wheel replacement for torch==2.0.1
clearml_agent: ERROR: Could not install task requirements!
invalid python version '' defined in configuration file, key 'agent.default_python': must have both major and minor parts of the version (for example: '3.7')
2024-04-10 11:44:31
Process failed, exit code 1 

The weird thing is that the log also shows that the `agent.default_python` value is set: `agent.default_python = 3.10`. Maybe that value needs to be a string instead? But I don't understand why I run into this problem only when cloning. Shouldn't the cloned version run exactly the same as the initial remote run? Thoughts are appreciated!
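
For context, here is a minimal sketch of the setup described above. The project/task names, queue name, and docker image are illustrative assumptions; only the venv path and the SDK calls come from the description:

```python
from clearml import Task

# Create the test task (project/task names are placeholders).
task = Task.init(project_name="debug", task_name="toy-pytorch-test")

# Custom docker image that already has the python packages installed in a
# virtual environment; point the agent at that interpreter so it skips
# creating its own venv (image name is an assumption).
task.set_base_docker(
    docker_image="my-registry/torch-image:latest",
    docker_arguments="-e CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/opt/virtualenvs/clearml-venv/bin/python",
)

# Install the repo itself as an editable package on the agent.
task.set_packages(["-e ."])

# Stop local execution and enqueue the task to run on an agent
# (queue name is a placeholder).
task.execute_remotely(queue_name="default")
```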

  
  
Posted 8 months ago

Answers 8


@<1533620191232004096:profile|NuttyLobster9> it's a bit hard to say, and the full log would be very helpful - can you perhaps remove all secrets and send it in a DM so it will not be public in the channel? I assume local paths etc. are less sensitive in a DM.

  
  
Posted 8 months ago

Hi @<1523701205467926528:profile|AgitatedDove14> , sure. I just need to scrape them for any sensitive info, then I'll post to this thread. Thanks for your reply.

  
  
Posted 8 months ago

Sure thing. Anyhow, we will fix this bug, so in the next version there will be no need for a workaround (but the workaround will still hold, so you won't need to change anything).

  
  
Posted 8 months ago

@<1533620191232004096:profile|NuttyLobster9> I think we found the issue: when you pass a direct link to the python venv, the agent fails to detect the python version, and since the python version is required for fetching the correct torch wheel, it fails to install it. This is why passing `CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none` works: it skips resolving the torch / cuda version (which requires parsing the python version).
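
As an illustration, here is a sketch of how the two variables could be passed together when cloning and enqueuing the task via the SDK. The task ID, image name, and queue name are placeholders, and you could equally set the same container arguments in the UI or on the agent:

```python
from clearml import Task

# Clone the task that previously completed successfully (ID is a placeholder).
template = Task.get_task(task_id="<template_task_id>")
cloned = Task.clone(source_task=template, name="toy-pytorch-test (clone)")

# Keep skipping the agent-managed venv, and also skip the torch/cuda wheel
# resolution that needs the python version the agent failed to detect.
cloned.set_base_docker(
    docker_image="my-registry/torch-image:latest",
    docker_arguments=(
        "-e CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/opt/virtualenvs/clearml-venv/bin/python "
        "-e CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none"
    ),
)

# Send the clone to an agent queue (queue name is a placeholder).
Task.enqueue(cloned, queue_name="default")
```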

  
  
Posted 8 months ago

Okay, so I discovered that setting `-e CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none` solves the issue.

That said, if someone could explain to me why this error was occurring and why it only happens in the case of cloning, I'd love to understand. Thanks!

  
  
Posted 8 months ago

Unfortunately, it's turning out to be quite time-consuming to manually remove all of the private info in here. Is there a particular section of the log that would be useful to see? I can try to focus on just sharing that part.

  
  
Posted 8 months ago

Hi @<1533620191232004096:profile|NuttyLobster9>
First, nice workaround!
Second, could you send the full log? When the venv is skipped, pytorch resolving should be skipped as well, and no error should be raised...
And lastly, could you also send the log of the task that executed correctly (the one you cloned), because you are correct, it should have been the same.

  
  
Posted 8 months ago

Hi Martin, I see. That makes sense, though I would have expected the behavior to be the same when running remotely the first time as well. In any case, this solved the issue for me. Thanks for looking at it.

  
  
Posted 8 months ago