Reputation
Badges 1
29 × Eureka!task.get_parameters
and task.get_parameters_as_dict
have the keyword argument cast
which attempts to convert values back to their original type, but interestingly doesn't seem to work for properties:
` task = Task.init()
task.set_user_properties(x=5)
task.connect({"a":5})
task.get_parameters_as_dict(cast=True)
{'General': {'a': 5}, 'properties': {'x': '5'}} Hopefully would be a relatively easy extension of
get_user_properties ` !
it might be an issue in the UI due to this unconventional address or network settings
I think this is related to an https://github.com/allegroai/clearml-server/issues/112#issue-1149080358 that seems to be a reoccurring issue across many different setups
I ran into something similar, for me I'd actually cloned the repository using the address without the git@
(something made it work). ClearML read it from the remote repository URL and used it. When I updated the URL of the remote repository in my git client it then worked.
Hi CostlyOstrich36 thanks for the response and makes sense.
What sort of problems could happen, would it just be the corruption of the data that is being written or could it be more breaking?
For context, I’m currently backing up the server (spinning it down) every night but now need to run tasks over night and don’t want to have any missed logs/artifacts when the server is shutdown.
Ok, thanks Jake!
I think a note about the fileserver should be added to the https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_security page!
Is that deletion when deleting a task in the GUI?
From my limited understanding of it, I think it's the client that does the saving and communicating to the fileserver not the server, whereas deletion is done by the GUI/server which I guess could have different permissions somehow?
It seems to be an issue that a few people are having problems with: https://github.com/allegroai/clearml-server/issues/112
Ah apologies for getting the wrong end of the stick a bit!
Not sure if it helps you or not, but when the link to an artifact didn't work for me it was because the URL being used was internal to the server (I had an agent that had access to internal endpoints). In my case setting the agent fileserver url to the public domain solved my issue.
That said, maybe the connect dict is not the best solution for thousand key dictionary
Seems like it isn't haha!
What is the difference with connect_configuration
? The nice thing about it not being an artifact is that we can use the gui to see which hashes have changed (which admittedly when there are a few thousand is tricky anyway)
Yes please that would be great 👍
I realise I made a mistake and hadn't actually used connect_configuration
!
I think the issue is the bandwidth yeah, for example when I doubled the number of CPUs (which doubles the allowed egress) the time taken to upload halved. It is puzzling because as you say it's not that much to upload.
For now I've whittled down the number of entries to a more select but useful few and that has solved the issue. If it crops up again I will try connect_configuration
properly.
Thanks for ...
Yep GCP. I wonder if it's something to do with Container-Opimized OS, which is how I'm running the agents
connect_configuration
seems to take about the same amount of time unfortunately!
Thanks @<1523701087100473344:profile|SuccessfulKoala55> , I’ve taken a look and is this force merging you’re referring to? Do you know how often ES is configured to merge in clearml server?
Shards that I can see are using a lot of disk space are
events-training_stats_scalar
events-log
- And then various
worker_stats_*
I've tracked down our messages when this occurred and I think we had a different error to you, sorry.
In case it helps our problem was when the below command was run in the repository:$ git remote -v
Returned the https
address rather than the ssh
address.
Then clearml tried to convert this to the ssh
address, which looked like<org>/<repo>/
rather than:<org>/<repo>.git
(Which is possible a separate bug?)
Could well be the same as https://github.com/allegroai/clearml-server/issues/112 which is also discussed https://clearml.slack.com/archives/CTK20V944/p1648547056095859 🙂
Ah right, nice! I didn’t think it was as I couldn’t see it in the Task
reference , should it be there too?
OK that's great, thanks for the info SuccessfulKoala55 👍
Maybe it was the load on the server? meaning dealing with multiple requests at the same time delayed the requests?!
Possibly but I think the server was fine as I could run the same task locally and it took a few seconds (rather than 75) to upload. The egress limit on the agent was 32 Gbps which seems much larger than what I though I was sending but I don't have a good idea of what that limit actually means in practice!
CumbersomeCormorant74 just to confirm in my case the file's aren't actually deleted - I have to manually delete them from the fileserver via a terminal
Thanks CumbersomeCormorant74