
This turns out to be a layer-8 error. task.execute_remotely does work, but there was a bug in my code and I wasn't correctly setting the reuse_task flag when run. Sorry to bother the both of you with my mistake.
Maybe the sleep between scheduler.mark_completed() and scheduler.delete() is too short? But I don't get why deleting the old scheduler task would break the new scheduler. I'm going to try testing by running the scheduler locally.
Strange, the code seems to work perfectly when I run it locally. To make it more confusing, the queue that I enqueue it to when I run it remotely is using agents from the same server that I'm running it locally from.
Is there currently a way to bind the same GPU to multiple queues? I believe the agent complained the last time I tried (which was a while ago).
That's great! I look forward to trying this out.
@<1523701205467926528:profile|AgitatedDove14>
And the Task is still running? What's the clearml python version and webui version?
No, the task stops (it's running remote, I haven't tested it running local).
It seems that the error is related to this part of the code block. However, when I comment this out I get the error I had 2 days ago with the missing configuration object.
I think the PR is a good idea. I read the contribution guidelines. It talks about referencing an issue. Did you want me to duplicate this issue on the repo or is it enough to link to this thread?
No error. Just a new task each time.
Interesting approach. I'll give that a try. Thanks for the reply!
Hi again @<1523701435869433856:profile|SmugDolphin23> ,
The approach you suggested seems to be working albeit with one issue. It does correctly identify the different versions of the dataset when new data is added, but I get an error when I try and finalize the dataset:
Code:
if self.task:
    # get the parent dataset from the project
    parent = self.clearml_dataset = Dataset.get(
        dataset_name="[LTV] Dataset",
        dataset_project=...
This is odd, the ordering of the files is different and there appears to be some missing from the preview. But as far as I can tell the files aren't different. What am I missing here?
That behavior seems strange. In the pipeline on the ClearML page, if you click on one of the steps and select full details (see attached), you can see the commit ID and the branch. Can you validate that the branch is correct but the commit ID is incorrect?
I have manually verified that the line-by-line content of the CSV files is identical using hashlib.sha256(). The file content is the same and the files are generated by the same process (literally just rerunning the same code twice), so why does ClearML treat them differently?
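For reference, a minimal sketch of the kind of check described above, assuming two local files on disk. The helper names `sha256_of_file` and `files_identical` are hypothetical and not part of ClearML; only `hashlib.sha256()` comes from the post.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """True when both files have byte-for-byte identical content."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Note that this compares file bytes only. It says nothing about file names, ordering, or timestamps, so matching digests rule out a content difference but not a metadata one.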
There are no issues when I run the "raw" script. Also, since it's based on tasks, the code must have run without fault for it to be pulled as a task in the pipeline.
As for when it fails, looking at the log here it looks like it's on the first task or maybe as the first task is launching. But I'd have to go back to be sure. I rolled back to 1.13.1 and that's working fine. But, if you want I can help explore this bug in detail because it would be nice to find the root of the issue. LmK what y...
Thanks for your reply @<1523701070390366208:profile|CostlyOstrich36>. Is there an example where a pipeline is built from existing tasks? I'd like to experiment with it and I don't see any examples of what you describe with my (clearly lacking) google-fu. What happens if you wrap a function containing a Task.init() call with a pipeline decorator, or is that the process you're speaking of?
I'm using Pro. Sorry for the delay; I didn't notice I never sent the response.
@<1523701070390366208:profile|CostlyOstrich36> ClearML: 1.10.1, I'm not self-hosting the server so whatever the current version is. Unless you mean the operating system?
@<1523701435869433856:profile|SmugDolphin23> Good to know.
@<1523701435869433856:profile|SmugDolphin23> I spoke too soon. It does resolve the error I posted but it introduces a new error. While this error does seem to be related to VS Code, the strange thing is it doesn't occur if I run it with earlier versions of clearml.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/natephysics/.vscode-server/extensions/ms-python.python-2023.22.1/pythonFiles/lib/python/debugpy/_vendo...
I found I was having this issue as well. I don't have an alias defined in the pipeline but in a task and I get the same error. I'm not hosting my own server but using the free web service at the moment.
Let me give that a try. Thanks for all the help.
This doesn't really make a lot of sense. ClearML is better suited to tracking which version of the code you used for a corresponding task, and you'd use something like GitHub or GitLab to track and host your code. You could use ClearML to help you reconstruct the environment and code from a task, given that the code is tracked by git and hosted somewhere you can access.
I'm aware of that but it doesn't help this situation.
After some digging we found it was actually caused by the router's IPS protection. I thought it would be strange for GitHub to be throttling things at this scale.
It's a corporate one. We are also looking into options on GitHub's end.
1707128614082 bigbrother:gpu0 INFO task 59d23c5919b04fd6947c1e463fa8c78c pulled from 9890a035b8f84872ab18d7ff207c26c6 by worker bigbrother:gpu0
Current configuration (clearml_agent v1.7.0, location: /tmp/.clearml_agent.vo_oc47r.cfg):
----------------------
agent.worker_id = bigbrother:gpu0
agent.worker_name = bigbrother
agent.force_git_ssh_protocol = true
agent.python_binary = /home/natephysics/anaconda3/bin/python
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = ...
It did update but I'm beginning to see the issue now. It seems like the metrics on thousands of experiments amounted to a few MB, whereas deleting one of the hyperparameter experiments freed up over a gig. I'm having difficulty seeing why one of these experiments occupies so much metric space. From the hyperparameter optimization dashboard, the graphs and the tables might have a few thousand points.
Can you help me:
- Better understand why it's occupying so much metric storage space
- Is the...