Do you start the clearml agents on the server with the same user that has the credentials saved?
I've recently run into this error myself. Did you find any resolution?
@<1523701435869433856:profile|SmugDolphin23> Yes. I'll try it in about 14 hours when I'm back at work and let you know how it goes. 😂
The git credentials are stored in the agent config and they work when I tested them on another project (not for setting up the environment but for downloading the repo of the task itself.)
@<1523701070390366208:profile|CostlyOstrich36> ClearML: 1.10.1, I'm not self-hosting the server so whatever the current version is. Unless you mean the operating system?
@<1523701435869433856:profile|SmugDolphin23> Good to know.
Will this return a list of datasets?
You might want to start with the first steps guide then:
Oh, duh. I'll test that out. But I did already have agent.force_git_ssh_protocol: true set.
Oh, I get what's happening. That segment of the code is rerun when the task is enqueued remotely, so it's deleting itself. This also explains why it works fine locally. It's an ouroboros: the task is deleting itself.
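I guess one way around it is to guard that segment so it only runs on the local run. Something like this is what I have in mind (just a sketch; cleanup_old_tasks is a hypothetical stand-in for my deletion code, and the project/queue names are placeholders):

from clearml import Task

def cleanup_old_tasks():
    # hypothetical stand-in for the deletion code that was biting me
    pass

task = Task.init(project_name='examples', task_name='scheduled job')

# Only run the clean-up on the local run: when the agent re-executes the
# script remotely, Task.running_locally() is False, this block is skipped,
# and the task no longer deletes itself.
if Task.running_locally():
    cleanup_old_tasks()

task.execute_remotely(queue_name='default')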
Interesting approach. I'll give that a try. Thanks for the reply!
Actually, this is not how it works: pip will install packages in whatever order it sees fit, and it is not consistent between versions (it has to do with dependency resolution).
Oh I see. What a pain. 🤣
You can configure the agent to first install specific packages, and only then others, just add the package names here:
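For example, something like this in clearml.conf (a sketch of the agent.package_manager.priority_packages list; the package names below are just examples, and they are installed in the order listed before the rest of the requirements):

agent {
  package_manager {
    # install these first, in this order, before resolving the remaining requirements
    priority_packages: ["cython", "numpy", "setuptools"]
  }
}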
That's an interesting solution. I'll keep that in mind as I work more with ClearML.
Thanks for your help Martin!
I figured as much. This is basically what I was planning to do otherwise. I have questions around that.
- It appears that the 'extra' config is displayed in plain text on the web app and is downloadable as JSON. I was just curious if this is best practice.
- I noticed that in the AWS instance that's spun up when starting the autoscaler there are 3 settings in the config:
use_credentials_chain: false, use_iam_instance_profile: false, use_owner_token: false. Are these strictly for the credentials t...
1707128614082 bigbrother:gpu0 INFO task 59d23c5919b04fd6947c1e463fa8c78c pulled from 9890a035b8f84872ab18d7ff207c26c6 by worker bigbrother:gpu0
Current configuration (clearml_agent v1.7.0, location: /tmp/.clearml_agent.vo_oc47r.cfg):
----------------------
agent.worker_id = bigbrother:gpu0
agent.worker_name = bigbrother
agent.force_git_ssh_protocol = true
agent.python_binary = /home/natephysics/anaconda3/bin/python
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = ...
@<1523701435869433856:profile|SmugDolphin23> I spoke too soon. It does resolve the error I posted, but it introduces a new error. While this error does seem to be related to VS Code, the strange thing is that it doesn't occur if I run it with earlier versions of clearml.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/natephysics/.vscode-server/extensions/ms-python.python-2023.22.1/pythonFiles/lib/python/debugpy/_vendo...
Thanks for the reply @<1523701070390366208:profile|CostlyOstrich36> !
It says in the documentation that:
Add a folder into the current dataset. calculate file hash, and compare against parent, mark files to be uploaded
It seems to recognize the dataset as another version of the data but doesn't seem to be validating the hashes on a per-file basis. Also, if you look at the photo, some of the data does get recognized as the same as the prior data. It seems like it's the correct...
I made a video of the Scheduler config error. You can see that the same code works when run locally but doesn't when run remotely. (I just uploaded the video, so the quality might suffer until YT finishes processing the higher-resolution versions.)
The original file sizes are the same but the compressed sizes seem to be different.
@<1523701435869433856:profile|SmugDolphin23> Yeah, I just wanted to validate it was worth spending the time. Since there is already a parameter that takes a callable (i.e. schedule_function), it might make sense to reuse that parameter. If it returns a str, we validate that it's a task ID, and if it is, we run that task as if it had originally been passed as the task_id in .add_task(). This would only be a breaking change if the callable that was passed happened to return a task_id ...
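Something like this is what I'm picturing (just a sketch of the proposed behaviour, not how the scheduler works today; the project/queue names, the schedule, and the clone logic are all placeholders):

from clearml import Task
from clearml.automation import TaskScheduler

def pick_task_to_run() -> str:
    # Hypothetical: decide at schedule time which task should run,
    # e.g. clone a template task and return its ID.
    template = Task.get_task(project_name='examples', task_name='template task')
    return Task.clone(source_task=template).id

scheduler = TaskScheduler()
# Proposed behaviour: if schedule_function returns a str, validate that it is a
# task ID and enqueue that task, exactly as if it had been passed as the
# task_id in .add_task(). A callable that does its own work keeps behaving as before.
scheduler.add_task(
    schedule_function=pick_task_to_run,
    queue='default',
    hour=1,
)
scheduler.start_remotely(queue='services')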
That behavior seems strange. In the pipeline page in ClearML, if you click on one of the steps and select Full details (see attached), you can see the commit ID and the branch. Can you validate that the branch is correct but the commit ID is incorrect?
Results:
I first tried uncommenting enable_git_ask_pass: false, but it didn't resolve the issue.
I then cleared the cache in the vcs-cache folder, and that did fix the issue. This is the second time the cache seems to have been the root cause of the problem. At some point I did move from token-based auth to SSH keys. Would this require clearing the cache for any project that was cached prior to the auth change?
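For reference, these are the agent settings and paths I was poking at, as I understand them (a sketch from memory, so treat the exact keys and defaults as my assumptions):

agent {
  # my understanding: when true, the agent passes git credentials via GIT_ASKPASS
  # instead of embedding them in the clone URL
  enable_git_ask_pass: false
  vcs_cache {
    enabled: true
    path: ~/.clearml/vcs-cache   # the folder I wiped to clear the stale cached repo
  }
}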
This turns out to be a layer-8 error. task.execute_remotely does work, but there was a bug in my code and I wasn't correctly setting the reuse_task flag when running. Sorry to bother you both with my mistake.
The plot thickens. It seems like there's something odd going on with the interaction between [LTV] and additional text. If I search just [LTV] it works, and if I search just Dataset Test it works, but if I put them together it breaks the search. Now that I think about it, there are other oddities in the web interface that might be explained by bugs around using brackets in names.
I have manually verified that the line-by-line content of the csv files is identical using hashlib.sha256(). Why is it that the file content is the same and they are generated by the same process (literally just rerunning the same code twice), yet ClearML treats them differently?
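(For reference, this is roughly how I checked them; the file paths are placeholders:)

import hashlib
from pathlib import Path

def sha256_of(path: str) -> str:
    # hash the raw bytes of the file
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# the two csv files produced by the two identical runs (placeholder paths)
print(sha256_of('run1/data.csv') == sha256_of('run2/data.csv'))  # -> True, the contents match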
Yes, it indeed appears to be a regex issue. If I run:
import re

from clearml import Dataset

Dataset.list_datasets(
    dataset_project=self.task.get_project_name(),
    # escape the brackets so they are matched literally rather than as a regex character class
    partial_name=re.escape('[LTV] Dataset Test'),
    only_completed=True,
)
It works as expected. I'm not sure how raw you want to leave the partial_name feature. I could create a PR to fix this, but would you want me to re.escape at the list_datasets() level? Or go deeper and do it at `Task._query_task...
@<1523701087100473344:profile|SuccessfulKoala55> You wouldn't happen to know what's going on here? :D