Oh, I get what's happening. That segment of the code is rerun when the task is enqueued remotely, so it's deleting itself. This also explains why it works fine locally. It's an ouroboros: the task is deleting itself.
Interesting approach. I'll give that a try. Thanks for the reply!
Actually, that's not how it works: pip will install in whatever order it sees fit, and the order is not consistent between versions (it has to do with dependency resolution)
Oh I see. What a pain. 🤣
You can configure the agent to install specific packages first, and only then the others; just add the package names here:
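If it helps, the relevant bit of clearml.conf looks roughly like this (the key name and the example package list are from memory, so treat it as a sketch and double-check against your own config):
agent {
    package_manager {
        # packages listed here get installed before the rest of the requirements
        priority_packages: ["cython", "numpy", "setuptools"]
    }
}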
That's an interesting solution. I'll keep that in mind as I work more with ClearML.
Thanks for your help Martin!
I figured as much. This is basically what I was planning to do otherwise. I have a few questions around that:
- It appears that the 'extra' config is displayed in plain text in the web app and is downloadable as JSON. I was just curious whether this is best practice.
- I noticed that in the AWS instance that's spun up when starting the autoscaler there are 3 settings in the config:
use_credentials_chain: false, use_iam_instance_profile: false, use_owner_token: false. Are these strictly for the credentials t...
1707128614082 bigbrother:gpu0 INFO task 59d23c5919b04fd6947c1e463fa8c78c pulled from 9890a035b8f84872ab18d7ff207c26c6 by worker bigbrother:gpu0
Current configuration (clearml_agent v1.7.0, location: /tmp/.clearml_agent.vo_oc47r.cfg):
----------------------
agent.worker_id = bigbrother:gpu0
agent.worker_name = bigbrother
agent.force_git_ssh_protocol = true
agent.python_binary = /home/natephysics/anaconda3/bin/python
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = ...
@<1523701435869433856:profile|SmugDolphin23> I spoke too soon. It does resolve the error I posted, but it introduces a new error. While this error does seem to be related to VS Code, the strange thing is that it doesn't occur if I run it with earlier versions of clearml.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/natephysics/.vscode-server/extensions/ms-python.python-2023.22.1/pythonFiles/lib/python/debugpy/_vendo...
Thanks for the reply @<1523701070390366208:profile|CostlyOstrich36> !
It says in the documentation that:
Add a folder into the current dataset. calculate file hash, and compare against parent, mark files to be uploaded
It seems to recognize the dataset as another version of the data but doesn't seem to be validating the hashes on a per-file basis. Also, if you look at the photo, it seems like some of the data does get recognized as the same as the prior data. It seems like it's the correct...
I made a video of the Scheduler config error. You can see that the same code works when run locally and doesn't when run remotely. (I just uploaded the video, so the quality might suffer until YT finishes processing the higher resolution versions.)
The original file sizes are the same but the compressed sizes seem to be different.
@<1523701435869433856:profile|SmugDolphin23> Yeah, I just wanted to validate it was worth spending the time. Since there is already a parameter that takes a callable (i.e. schedule_function), it might make sense to reuse that parameter. If it returns a str we validate that it's a task_id and, if it is, we run that task as if it had originally been passed as the task_id in .add_task(). This would only be a breaking change if the callable that was passed happened to return a task_id ...
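Roughly what I had in mind, as a sketch (the lookup helper is made up and the scheduling arguments are only illustrative, not a claim about the current API):
from clearml.automation import TaskScheduler

def lookup_latest_baseline_task_id() -> str:
    # placeholder: in practice this would query ClearML for whichever task should run next
    return "aabbccddeeff00112233445566778899"

def pick_task() -> str:
    # the callable passed as schedule_function; under the proposal, returning a str
    # would be treated as "run this task_id", just like schedule_task_id in .add_task()
    return lookup_latest_baseline_task_id()

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_function=pick_task,
    queue="services",
    minute=30,  # placeholder schedule
)
scheduler.start()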
That behavior seems strange. In the pipeline page in ClearML, if you click on one of the steps and select Full Details (see attached), you can see the commit ID and the branch. Can you confirm that the branch is correct but the commit ID is incorrect?
Results:
I first tried uncommenting enable_git_ask_pass: false but it didn't resolve the issue.
I then cleared the cache in the vcs-cache folder, and that did fix the issue. This is the second time the cache seemed to have been the root cause of the problem. At some point I did move from token-based auth to ssh keys. Would this require clearing the cache for any project that was cached prior to the auth change?
This turns out to be a layer-8 error. task.execute_remotely does work, but there was a bug in my code and I wasn't correctly setting the reuse_task flag at runtime. Sorry to bother you both with my mistake.
The plot thickens. It seems like there's something odd going on with the interaction between [LTV] and additional text. If I just search [LTV] it works, and if I just search Dataset Test it works, but if I put them together it breaks the search. Now that I think about it, there are other oddities that seem to happen in the web interface that might be explained by some bugs around using brackets in names.
I have manually verified that the line-by-line content of the csv files is identical using hashlib.sha256(). Why would it be that the file content is the same, and the files are generated by the same process (literally just rerunning the same code twice), yet ClearML treats them differently?
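For reference, the check was roughly this (file paths are placeholders for the two runs' outputs):
import hashlib

def sha256_of(path: str) -> str:
    # hash the raw bytes of the file; any content difference would change the digest
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of("run_1/data.csv") == sha256_of("run_2/data.csv"))  # True in my case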
Yes, it indeed appears to be a regex issue. If I run:
Dataset.list_datasets(
    dataset_project=self.task.get_project_name(),
    partial_name=re.escape('[LTV] Dataset Test'),
    only_completed=True,
)
It works as expected. I'm not sure how raw you want to leave the partial_name feature. I could create a PR to fix this, but would you want me to re.escape at the list_datasets() level? Or go deeper and do it at `Task._query_task...
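Something like this is what I mean by escaping at the list_datasets() level (purely illustrative, not the actual internals):
import re

def escape_partial_name(partial_name):
    # treat the user-supplied fragment literally, so characters like '[' and ']'
    # aren't interpreted as a regex character class
    return re.escape(partial_name) if partial_name else partial_name

print(escape_partial_name("[LTV] Dataset Test"))  # '\[LTV\] Dataset Test' on Python 3.7+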
That makes sense. I was confused about what the source was.
Stability is for wimps. I live on the edge of bringing down production at any moment, like a real developer. But thanks for the update! 🙃
@<1539780284646428672:profile|PoisedElephant79> Sorry for not getting back with this sooner. Dataset.get() doesn't work like you suggested. In the documentation it's clear:
Get a specific Dataset. If multiple datasets are found, the dataset with the highest semantic version is returned. If no semantic version is found, the most recently updated dataset is returned. This function raises an Exception in case no dataset can be found and the auto_create=True ...
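In other words, the call shape is something like this (project and dataset names here are just examples):
from clearml import Dataset

# returns a single Dataset object (highest semantic version, or most recently updated),
# and raises if nothing matches, rather than returning a list of candidates
ds = Dataset.get(
    dataset_project="LTV",
    dataset_name="[LTV] Dataset Test",
)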
After some digging we found it was actually caused by the router's IPS protection. I thought it would be strange for GitHub to be throttling things at this scale.
I see. Thanks for the insight. That seems to be the case. I'm struggling a bit with datasets. For example, if I wanted to trace the genealogy of a dataset that's used by traditional tasks and pipelines. I'll try and write something up about the challenges around that when I get the chance. But your comment revealed another issue:
It appears that the partial name matching isn't working as expected. I'm unclear why this wouldn't be matching. In the attached photo you can see the input for `partial_nam...
@<1523701205467926528:profile|AgitatedDove14>
And the Task is still running? What's the clearml python version and webui version?
No, the task stops (it's running remotely; I haven't tested it running locally).
Thanks, Eugen, for the quick reply. If I can add a suggestion/comment from my perspective: why is schedule_function included in the .add_task() method? As far as I can tell, if you use schedule_function it changes the very nature of the method: it's no longer adding a task but adding a function. It seems like it would make more sense if this were broken out into something like an .add_function() method. Also, if you call schedule_function, many of the other parameters in `.add...
In this case it's the ID of the "output" model from the first task.
This does appear to resolve the issue. I'll keep you updated if I find any other issues. Thanks @<1523701435869433856:profile|SmugDolphin23>
Sure. I'm in Europe but we can also test things async.
In the debugger I can see that before starting the scheduler the test task is added:
ScheduleJob(name='Snitch-TaskScheduler', base_task_id='', base_function=<function main.<locals>.scheduler_function.<locals>.<lambda> at 0x7f05e1ab3600>, queue='services', target_project='DevOps', single_instance=False, task_parameters=None, task_overrides=None, clone_task=True, _executed_instances=None, execution_limit_hours=None, recurring=True, starting_time=datetime.datetime(2024, 1, 17, 10, 50, 28,...
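For reference, the scheduler setup is roughly the following (simplified, with the schedule itself and the test function replaced by placeholders):
from clearml.automation import TaskScheduler

def scheduler_function():
    # stand-in for the lambda shown as base_function in the ScheduleJob above
    print("scheduled run fired")

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_function=scheduler_function,
    queue="services",
    target_project="DevOps",
    recurring=True,
    minute=30,  # placeholder; the real schedule is omitted
)
scheduler.start()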