Actually, this is not how it works: pip will install packages in any order it sees fit, and the order is not consistent between versions (it has to do with dependency resolution)
Oh I see. What a pain. 🤣
You can configure the agent to install specific packages first, and only then the others; just add the package names here:
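Something like this in the agent's clearml.conf (the priority_packages key name and the defaults below are from memory, so double-check against your own config file):

agent {
    package_manager {
        # Installed first, in the listed order, before the rest of the requirements.
        priority_packages: ["cython", "numpy", "setuptools"]
    }
}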
That's an interesting solution. I'll keep that in mind as I work more with ClearML.
Thanks for your help Martin!
I figured as much. This is basically what I was planning to do otherwise. I have a couple of questions around that:
- It appears that the 'extra' config is displayed in plain text on the web app and downloadable as JSON. I was just curious if this is best practice.
- I noticed in the AWS instance that's spun up when starting the autoscaler there are 3 settings in the config:
use_credentials_chain: false, use_iam_instance_profile: false, use_owner_token: false. Are these strictly for the credentials t...
1707128614082 bigbrother:gpu0 INFO task 59d23c5919b04fd6947c1e463fa8c78c pulled from 9890a035b8f84872ab18d7ff207c26c6 by worker bigbrother:gpu0
Current configuration (clearml_agent v1.7.0, location: /tmp/.clearml_agent.vo_oc47r.cfg):
----------------------
agent.worker_id = bigbrother:gpu0
agent.worker_name = bigbrother
agent.force_git_ssh_protocol = true
agent.python_binary = /home/natephysics/anaconda3/bin/python
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = ...
@<1523701435869433856:profile|SmugDolphin23> I spoke too soon. It does resolve the error I posted, but it introduces a new error. While this error does seem to be related to VS Code, the strange thing is it doesn't occur if I run it with earlier versions of clearml.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/natephysics/.vscode-server/extensions/ms-python.python-2023.22.1/pythonFiles/lib/python/debugpy/_vendo...
I made a video of the Scheduler config error. You can see that the same code works when run locally and fails on remote. (I just uploaded the video, so the quality might suffer until YT finishes processing the higher-resolution versions.)
The original file sizes are the same but the compressed sizes seem to be different.
@<1523701435869433856:profile|SmugDolphin23> Yeah, I just wanted to validate it was worth spending the time. Since there is already a parameter that takes a callable (i.e. schedule_function), it might make sense to reuse that parameter. If it returns a str, we validate that it's a task ID, and if it is, we run the task as if it had originally been passed as the task_id in .add_task(). This would only be a breaking change if the callable that was passed happened to return a task_id ...
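To make the proposal concrete, a hypothetical sketch (pick_task and its body are made up; the str-as-task_id behavior is the proposal, not what TaskScheduler does today):

from clearml import Task
from clearml.automation import TaskScheduler

def pick_task() -> str:
    # Hypothetical: choose whichever task should run next and return its ID.
    # Under the proposal, a str return value would be treated like a task_id
    # passed to .add_task().
    return Task.get_task(project_name='DevOps', task_name='retrain').id

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_function=pick_task,  # today this schedules the function itself
    queue='services',
    minute=30,
)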
That behavior seems strange. In the pipeline on the ClearML page, if you click on one of the steps and select full details (see attached), you can see the commit ID and the branch. Can you validate that the branch is correct but the commit ID is incorrect?
Results:
I first tried uncommenting enable_git_ask_pass: false, but it didn't resolve the issue.
I then cleared the cache in the vcs-cache folder, and that did fix the issue. This is the second time the cache seems to have been the root cause of a problem. At some point I moved from token-based auth to SSH keys. Would this require clearing the cache for any project that was cached prior to the auth change?
This turns out to be a layer-8 error. task.execute_remotely does work, but there was a bug in my code and I wasn't correctly setting the reuse_task flag when running it. Sorry to bother the both of you with my mistake.
The plot thickens. It seems like there's something odd going on with the interaction between [LTV] and additional text. If I search just [LTV] it works, and if I search just Dataset Test it works, but if I put them together it breaks the search. Now that I think about it, there are other oddities in the web interface that might be explained by bugs around using brackets in names.
I have manually verified that the line-by-line content of the csv files is identical using hashlib.sha256(). Why would it be that the file content is the same and the files are generated by the same process (literally just rerunning the same code twice), yet ClearML treats them differently?
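For context, this is roughly how I verified it (a minimal sketch; the file paths are placeholders):

import hashlib

def line_hashes(path):
    # Hash each line separately so any differing line would be easy to spot.
    with open(path, 'rb') as f:
        return [hashlib.sha256(line).hexdigest() for line in f]

# Hypothetical paths standing in for the two generated csvs:
assert line_hashes('run1/data.csv') == line_hashes('run2/data.csv')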
Yes, it indeed appears to be a regex issue. If I run:
import re
from clearml import Dataset

datasets = Dataset.list_datasets(
    dataset_project=self.task.get_project_name(),
    partial_name=re.escape('[LTV] Dataset Test'),
    only_completed=True,
)
It works as expected. I'm not sure how raw you want to leave the partial_name feature. I could create a PR to fix this, but would you want me to re.escape at the list_datasets() level? Or go deeper and do it at `Task._query_task...
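As a quick illustration of why the brackets break the match (plain re, outside ClearML):

import re

name = '[LTV] Dataset Test'
# Unescaped, '[LTV]' is parsed as a character class (any one of L, T, V),
# so the pattern never matches the literal name:
print(re.search(name, name))             # -> None
print(re.search(re.escape(name), name))  # -> a match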
That makes sense. I was confused about what the source was.
Stability is for wimps. I live on the edge of bringing down production at any moment, like a real developer. But thanks for the update! 🙃
@<1539780284646428672:profile|PoisedElephant79> Sorry for not getting back to you on this sooner. Dataset.get() doesn't work like you suggested. The documentation is clear:
Get a specific Dataset. If multiple datasets are found, the dataset with the highest semantic version is returned. If no semantic version is found, the most recently updated dataset is returned. This function raises an Exception in case no dataset can be found and the auto_create=True ...
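In other words, something like this returns exactly one dataset rather than letting me enumerate matches (a minimal sketch; the project and dataset names are placeholders):

from clearml import Dataset

# Returns a single Dataset: the highest semantic version, or the most
# recently updated one if no semantic version exists (per the docs above).
ds = Dataset.get(
    dataset_project='LTV',              # placeholder project
    dataset_name='[LTV] Dataset Test',  # placeholder name
    only_completed=True,
)
print(ds.id, ds.name)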
After some digging, we found it was actually caused by the router's IPS protection. I thought it would be strange for GitHub to be throttling things at this scale.
I see. Thanks for the insight. That seems to be the case. I'm struggling a bit with datasets; for example, tracing the genealogy of a dataset that's used by traditional tasks and pipelines. I'll try to write something up about the challenges around that when I get the chance. But your comment revealed another issue:
It appears that the partial name matching isn't going well. I'm unclear why this wouldn't be matching. In the attached photo you can see the input for `partial_nam...
@<1523701205467926528:profile|AgitatedDove14>
And the Task is still running? What's the clearml python version and webui version?
No, the task stops (it's running remotely; I haven't tested it running locally).
Thanks, Eugen, for the quick reply. If I can add a suggestion/comment from my perspective: why is schedule_function included in the .add_task() method? As far as I can tell, if you use schedule_function it changes the very nature of the method: it's no longer adding a task but adding a function. It seems like it would make more sense if this were broken out into something like an .add_function() method. Also, if you call schedule_function, many of the other parameters in `.add...
In this case it's the ID of the "output" model from the first task.
This does appear to resolve the issue. I'll keep you updated if I find any other issues. Thanks @<1523701435869433856:profile|SmugDolphin23>
Sure. I'm in Europe but we can also test things async.
In the debugger I can see that before starting the scheduler the test task is added:
ScheduleJob(name='Snitch-TaskScheduler', base_task_id='', base_function=<function main.<locals>.scheduler_function.<locals>.<lambda> at 0x7f05e1ab3600>, queue='services', target_project='DevOps', single_instance=False, task_parameters=None, task_overrides=None, clone_task=True, _executed_instances=None, execution_limit_hours=None, recurring=True, starting_time=datetime.datetime(2024, 1, 17, 10, 50, 28,...
Strange, the code seems to work perfectly when I run it locally. To make it more confusing, the queue I enqueue it to when running remotely is served by agents on the same machine I'm running it locally from.
Are you self hosting a ClearML server?
Since this could happen with a lot of services, maybe it would be worth adding a retry option? Especially if it's part of a pipeline.
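The kind of knob I mean, sketched for the pipeline case (hypothetical: the step names are placeholders, and retry_on_failure here is the proposed option, so verify whether your clearml version already supports it on add_step):

from clearml import PipelineController

pipe = PipelineController(name='etl-pipeline', project='DevOps', version='1.0.0')
pipe.add_step(
    name='fetch_data',
    base_task_project='DevOps',   # placeholder
    base_task_name='fetch-data',  # placeholder
    retry_on_failure=3,           # re-run the step up to 3 times before failing the pipeline
)
pipe.start(queue='services')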
Ah, I think I see the issue. In my head I was crossing ID with URL.
Project 2:
2024-01-22 17:21:56
task 6518c3cd13394aa4abbc8f0dc34eb763 pulled from 8a69a982f5824762aeac7b000fbf2161 by worker bigbrother:10
2024-01-22 17:22:03
Current configuration (clearml_agent v1.7.0, location: /tmp/.clearml_agent.bojpliyx.cfg):
----------------------
agent.worker_id = bigbrother:10
agent.worker_name = bigbrother
agent.force_git_ssh_protocol = true
agent.python_binary = /home/natephysics/anaconda3/bin/python
agent.package_manager.type = pip
agent.package_manager.pip_v...