Project 2:
2024-01-22 17:21:56
task 6518c3cd13394aa4abbc8f0dc34eb763 pulled from 8a69a982f5824762aeac7b000fbf2161 by worker bigbrother:10
2024-01-22 17:22:03
Current configuration (clearml_agent v1.7.0, location: /tmp/.clearml_agent.bojpliyx.cfg):
----------------------
agent.worker_id = bigbrother:10
agent.worker_name = bigbrother
agent.force_git_ssh_protocol = true
agent.python_binary = /home/natephysics/anaconda3/bin/python
agent.package_manager.type = pip
agent.package_manager.pip_v...
It's a corporate one. We are also looking into options on GitHub's end.
Since this could happen with a lot of services, maybe it would be worth a retry option? Especially if it's part of a pipeline.
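To illustrate the kind of thing I mean (just a sketch of a generic retry helper, not something ClearML ships today):
import time

def with_retries(fn, attempts=3, base_delay=2.0):
    # Retry a flaky network call with exponential backoff between attempts.
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)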
Thanks Eugen for the quick reply. If I can add a suggestion/comment from my perspective: why is schedule_function included in the .add_task() method? As far as I can tell, if you use schedule_function it changes the very nature of the method; it's no longer adding a task but adding a function. It seems like it would make more sense if this was broken into something like an .add_function() method. Also, if you call schedule_function, many of the other parameters in `.add...
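Something like this is what I had in mind (add_function() is purely hypothetical; my_callable and some_task_id are stand-ins; I'm just sketching the split):
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()

# Today: one method, two very different behaviors depending on the arguments
scheduler.add_task(schedule_function=my_callable, minute=30)

# Suggested (hypothetical): one method per behavior
scheduler.add_task(schedule_task_id=some_task_id, queue='default', minute=30)
scheduler.add_function(func=my_callable, minute=30)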
After some digging we found it was actually caused by the router's IPS protection. I thought it would be strange for GitHub to be throttling things at this scale.
Provide a bit more detail. What framework are you using?
Let me give that a try. Thanks for all the help.
Interesting approach. I'll give that a try. Thanks for the reply!
Hi again @<1523701435869433856:profile|SmugDolphin23>,
The approach you suggested seems to be working, albeit with one issue. It correctly identifies the different versions of the dataset when new data is added, but I get an error when I try to finalize the dataset:
Code:
if self.task:
    # get the parent dataset from the project
    parent = self.clearml_dataset = Dataset.get(
        dataset_name="[LTV] Dataset",
        dataset_project=...
It hooks into the calls made by the code. If you never save the model to disk, add it to a tool like MLflow/TensorBoard, or manually add the artifact to ClearML, afaik it won't save the artifact.
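For example (a minimal sketch; the project/task names and train_model() are stand-ins):
import joblib
from clearml import Task

task = Task.init(project_name='demo', task_name='artifact example')
model = train_model()  # stand-in for whatever produces the model object

# Saving to disk is enough for the framework hooks (joblib here) to pick
# the model up automatically...
joblib.dump(model, 'model.pkl')

# ...or register the object explicitly if nothing ever touches the disk.
task.upload_artifact(name='model', artifact_object=model)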
I'm not sure why the logs were incomplete. I think part of the reason it wasn't pulling from the repo was that it was pulling from cache. I cleared the clearml cache for that project and reran it. This should be the full log.
Actually, clearing the cache on the other project might have fixed it. I just tested it out and it seems to be working.
Awesome! Did you manage to solve the Tailscale issue with ClearML sessions? Sorry I wasn't active with that; I don't use sessions often and I found a suitable alternative in the meantime. Any hope of the changes making their way into a PR for the official release?
I'd like to provide the credentials to any ec2 instances that are spun up.
I'm using Pro. Sorry for the delay; I didn't notice I never sent the response.
@<1523701070390366208:profile|CostlyOstrich36> Just pinging you 😄
I figured as much. This is basically what I was planning to do otherwise. I have questions around that.
- It appears that the 'extra' config is displayed in plain text on the web app and is downloadable as JSON. I was just curious whether this is best practice.
- I noticed in the AWS instance that's spun up when starting the autoscaler there are 3 settings in the config:
use_credentials_chain: false, use_iam_instance_profile: false, use_owner_token: false
are these strictly for the credentials t...
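For context, this is my mental model of the credentials chain (boto3's default resolution order; my assumption, not anything ClearML-specific):
import boto3

# boto3 walks its default chain: explicit kwargs, env vars, the shared
# credentials file, and finally the EC2 instance metadata service (the
# IAM instance profile), so an attached profile needs no keys in the config.
creds = boto3.Session().get_credentials()
if creds is not None:
    print('resolved via:', creds.method)  # e.g. 'iam-role' on EC2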
Hi Jake 👍,
Maybe the content is cached? The repo isn't big. I didn't realize the log was missing content. I believe I copied everything but I'll double check in a moment.
Yes, it indeed appears to be a regex issue: the square brackets in '[LTV]' get treated as a regex character class rather than literal text, so the name never matches. If I run:
import re

Dataset.list_datasets(
    dataset_project=self.task.get_project_name(),
    partial_name=re.escape('[LTV] Dataset Test'),
    only_completed=True,
)
it works as expected. I'm not sure how raw you want to leave the partial_name feature. I could create a PR to fix this, but would you want me to re.escape at the list_datasets() level? Or go deeper and do it at `Task._query_task...
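A quick demonstration of the underlying problem with plain re, outside ClearML:
import re

name = '[LTV] Dataset Test'
print(re.match(r'[LTV] Dataset', name))            # None: [LTV] is a character class
print(re.match(re.escape('[LTV] Dataset'), name))  # matches the literal brackets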
I figured you'd say that so I went ahead with that PR. I got it working but I'm going to test it a bit further.
I'm aware of that but it doesn't help this situation.
@<1523701435869433856:profile|SmugDolphin23> Yeah, I just wanted to validate it was worth spending the time. Since there is already a parameter that takes a callable (i.e. schedule_function), it might make sense to reuse that parameter. If it returns a str, we validate that it's a task, and if it is, we run the task as if it had originally been passed as the task_id in .add_task(). This would only be a breaking change if a callable that was passed happened to return a task_id
...
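Roughly the dispatch I have in mind (a hypothetical helper, not current TaskScheduler code):
def resolve_schedule(schedule_function=None, task_id=None):
    # Hypothetical: if the callable returns a str, validate/treat it as a
    # task ID and fall back to the normal clone-and-enqueue path.
    if schedule_function is not None:
        result = schedule_function()
        if isinstance(result, str):
            return ('enqueue_task', result)  # behaves like task_id=result
        return ('ran_function', result)      # current behavior
    return ('enqueue_task', task_id)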
Sure. I'm in Europe but we can also test things async.
I actually ran into the exact same problem. The agents aren't hosted on AWS though, just an in-house server.
The git credentials are stored in the agent config, and they worked when I tested them on another project (not for setting up the environment, but for downloading the repo of the task itself).
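For reference, I mean the standard fields in the agent section of clearml.conf (values here are placeholders):
agent.git_user = my-username
agent.git_pass = my-personal-access-token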
Why? That's not how I authenticate. Also, if it were simply an authentication issue, wouldn't there be some error message in the log?
@<1523701087100473344:profile|SuccessfulKoala55> You wouldn't happen to know what's going on here? :D
It's verbatim from requirements as I pass that into ClearML.
Oh, I get what's happening. That segment of the code is rerun when the task is enqueued remotely, so it's deleting itself. This also explains why it works fine locally. It's an ouroboros: the task is deleting itself.
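In case it helps anyone else, a guard like this (cleanup_old_tasks is a stand-in for the segment doing the deleting) keeps the remote run from re-executing it:
from clearml import Task

task = Task.current_task()

# running_locally() returns False when an agent re-runs the script
# remotely, so the deletion only happens on the original local run.
if task is not None and task.running_locally():
    cleanup_old_tasks()  # stand-in for the deletion logic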