I see. I was just wondering what the general approach is. I think PyTorch used to ship the pip package without CUDA packaged into it. So with conda it was nice to only install CUDA in the environment and not the host. But with pip, you had to use the host version as far as I know.
Yes, I am also talking about agents on different machines. I had two agents on the server machine, which also seem to have been killed. The ones on different machines kept working until 1 or 2 minutes after the clearml-server restarted.
Also, clearml-agent at version 1.5 does not look for nightly builds at the correct indexes, even if torch_nightly is set to true in clearml.conf:
Looking in indexes: https://pypi.org/simple , https://download.pytorch.org/whl/cu117/
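For comparison, here is a small sketch of how the stable and nightly index URLs differ. The helper name `torch_index_url` is made up for illustration and is not clearml-agent code; it only assumes the URL layout PyTorch publishes, where nightly wheels live under `/whl/nightly/<cuda tag>`.

```python
def torch_index_url(cuda_tag: str, nightly: bool = False) -> str:
    """Build the extra pip index URL for PyTorch wheels.

    Hypothetical helper, not part of clearml-agent.
    cuda_tag is a tag such as "cu117" or "cpu".
    """
    base = "https://download.pytorch.org/whl"
    if nightly:
        # Nightly wheels are published under a separate "nightly" prefix.
        return f"{base}/nightly/{cuda_tag}"
    return f"{base}/{cuda_tag}"


stable = torch_index_url("cu117")
nightly = torch_index_url("cu117", nightly=True)
print(stable)
print(nightly)
```

With torch_nightly enabled, I would expect something like the second URL to show up in the "Looking in indexes" line rather than the first.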
Yea, correct! No problem. Uploading such large artifacts as I am doing seems to be an absolute edge case 🙂
Shows some logs, but nothing of relevance I think. Only info messages and warnings about deprecated stuff that is still used ;D ...
Here is some code that shows exactly what goes wrong. I do local execution only. It seems not to be related to remote execution as I thought, but more related to clearml.Task:
` args = parser.parse_args()
print(args) # FIRST OUTPUT
command = args.command
enqueue = args.enqueue
track_remote = args.track_remote
preset_name = args.preset
type_name = args.type
environment_name = args.environment
nvidia_docker = args.nvidia_docker
# Initialize ClearML Tas...
btw: I am pretty sure this used to work, but then stopped working some time ago.
Then I could also do this:
# My custom very special use case
task = Task()
task = task.load_statedict(await Task.load_or_create(task_name))
await task.synchronize()
await run_code_analysis()
task.add_requirement("myreq")
await task.synchronize()
I don't know actually. But the PyTorch documentation says it can make a difference: https://pytorch.org/docs/stable/distributions.html#torch.distributions.distribution.Distribution.set_default_validate_args
Yea, the clearml-data is immutable, but not the underlying data if I just store a pointer to some location.
Yea, but before in my original setup the config file was filled. I just added some lines to the config and now the error is back.
Maybe there is something wrong with my setup. Conda confuses me sometimes.
Sounds good. I think it is obvious that immutability has to be managed by the user then, but this is not different from not using clearml-data, so not a disadvantage in my opinion.
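One way the user-side immutability check could look, as a minimal sketch: record a content hash per file when the pointer is registered and compare later. This is not a clearml-data feature; the function names `file_sha256`, `build_manifest`, and `changed_files` are hypothetical, and only the standard library is used.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its content hash."""
    return {
        str(p.relative_to(root)): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files modified, added, or removed since the manifest was built."""
    current = build_manifest(root)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )


# Demo on a temporary directory standing in for the external data location.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("hello")
    manifest = build_manifest(root)  # would be stored next to the pointer
    unchanged = changed_files(root, manifest)
    (root / "a.txt").write_text("changed")
    modified = changed_files(root, manifest)
```

The manifest could be stored alongside the pointer, so anyone using the dataset can at least detect that the underlying data moved or changed.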
Hey, thank you for answering.
I know this issue and I have it sometimes, but my current issue is a direct result of me trying to make SSL work. So I am not asking for help in solving my problem, but only for help with how to debug it: finding out which step leads to the artifact not being deleted (e.g. the fileserver cannot be reached from wherever the delete request is sent).
Nono, I got to thank you for this awesome tool!
I got the error again. Seems to happen only when I try to delete "large" experiments.
Very nice!
Maybe for the long-term future you could look into how to make better use of vertical space. Currently, there are 7 (5 in fullscreen mode) different sections between the top of the page and the content. Maybe a compact mode would be nice, or less space for content headlines.
Now I get:
Collecting package metadata (repodata.json): done
Solving environment: -
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
...
I installed my local conda environment from an environment.yml without issues, so maybe clearml makes some changes that lead to conflicts, which finally lead to the CPU-version install.
Thank you very much, didn't know about that 🙂
@SuccessfulKoala55 I just did the following (everything locally, not with clearml-agent):
- Set my credentials and S3 endpoint to A
- Run a task with Task.init() and save a debug sample to S3
- Abort the task
- Change my credentials and S3 endpoint to B
- Restart the task
The result is lingering files in A that seem not to be associated with the task. I would expect ClearML to instead error the task or to track the lingering files somewhere, so they can ma...
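To picture the kind of tracking meant by the steps above, here is a purely hypothetical sketch and nothing ClearML provides: a small ledger that remembers which endpoint each file was uploaded to, so files left on a previous endpoint can be listed. The class `UploadLedger`, its methods, and the example endpoints are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UploadLedger:
    """Hypothetical record of where a task's files were uploaded."""

    entries: list[tuple[str, str]] = field(default_factory=list)

    def record(self, endpoint: str, key: str) -> None:
        """Remember that the object `key` was written to `endpoint`."""
        self.entries.append((endpoint, key))

    def lingering(self, current_endpoint: str) -> list[tuple[str, str]]:
        """Files stored on an endpoint other than the one now configured."""
        return [(e, k) for e, k in self.entries if e != current_endpoint]


ledger = UploadLedger()
# First run: debug sample saved while endpoint A is configured.
ledger.record("https://s3-a.example", "debug/sample_0.png")
# After the restart: same sample saved while endpoint B is configured.
ledger.record("https://s3-b.example", "debug/sample_0.png")
leftover = ledger.lingering("https://s3-b.example")
print(leftover)
```

With something like this, the restart on B could either refuse to continue or at least report the files still sitting in A.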
I think in the paid version there is this configuration vault, so that the user can pass their own credentials securely to the agent.
I use fixed users!
Perfect, just what I always wanted. Looking forward to the MinIO version. Thank you :)