As to why: this is part of the pipelining I described in a previous message: task B requires an artifact from task A, so I pass the name of the artifact as a parameter of task B, so that B knows which artifact from A it should retrieve
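For reference, a minimal sketch of that pattern (the names and values are illustrative, not my actual pipeline code):
```python
from clearml import Task

# Task B: the id of task A and the artifact name come in as parameters
task_b = Task.init(project_name="examples", task_name="task_b")
params = {"source_task_id": "", "artifact_name": "dataset"}
params = task_b.connect(params)  # values can be overridden when enqueuing

# Retrieve the artifact that task A produced
task_a = Task.get_task(task_id=params["source_task_id"])
artifact_path = task_a.artifacts[params["artifact_name"]].get_local_copy()
```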
AgitatedDove14 Is it fixed with trains-server 0.15.1?
AgitatedDove14 I see at https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21 that a key pair is hardcoded in the repo. Is it used to SSH into the instance?
Something like that?
```
curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } },
        { "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" } }
      ]
    }
  },
  "aggs": {
    "series": {
      "terms": { "field": "iter" }
    }
  }
}'
```
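(Side note: the `terms` aggregation only returns the top 10 buckets by default, so if the goal is one bucket per iteration it would need an explicit size, e.g. `"terms": { "field": "iter", "size": 1000 }`.)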
Did you try with another availability zone?
Ok, so the problem was indeed the way Docker was installed (via snap)
Ok, I guess I’ll just delete the whole loss series. Thanks!
Yea, that's what I thought; I do have trains-server 0.15
Alright, I had a look at the /tmp/.trains_agent_daemon_outabcdef.txt logs; not many insights there. For the moment I simply started a new trains-agent daemon in services mode, and I will wait to see what happens.
With what I shared above, I now get: `docker: Error response from daemon: network 'host' not found.`
I now have a different question: when installing torch from wheel files, am I guaranteed to get the corresponding CUDA libraries and cuDNN bundled with it?
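If it helps, here is a quick sanity check I could run (just a sketch, to be run inside the environment the agent built):
```python
import torch

# CUDA-enabled torch wheels ship their own CUDA runtime and cuDNN,
# so these should report versions even without a system-wide CUDA install
# (the NVIDIA driver itself is still required on the host).
print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)        # None for CPU-only wheels
print("cuDNN:", torch.backends.cudnn.version())   # None if built without cuDNN
print("GPU visible:", torch.cuda.is_available())
```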
might be worth documenting 😄
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%; otherwise it hangs trying to upload all the figures)
But we can easily extend, right?
CostlyOstrich36 I updated both agents to 1.1.2 and still got the same problem, unfortunately. Since I can download the full log file from the Web UI, I guess the agents are reporting correctly?
Could it be that Elasticsearch does not return all the requested logs when it is queried by the WebUI to display them in the console?
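One way I could check this is to count the log events directly in Elasticsearch and compare with the console; a rough sketch (the `events-log-*` index pattern is my assumption, the actual indices can be listed with `curl "localhost:9200/_cat/indices?v"`):
```python
import requests

# Count the console log events stored for one task (task id from above)
TASK_ID = "8f88e4b8cff84f23bde74ed4b7213ec6"

resp = requests.get(
    "http://localhost:9200/events-log-*/_count",
    json={"query": {"match": {"task": TASK_ID}}},
)
print(resp.json())  # compare "count" with the number of lines in the UI console
```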
Now that I think about it, I remember that the changelog of clearml-server 1.2.0 lists the following:
`Fix UI Workers & Queues and Experiment Table pages ...`
`trains==0.16.4`
Oh, and also use the colors of the series; that would be a killer feature. Then I would simply need to match the color of a series to its name to check the tags
I am now trying with `agent.extra_docker_arguments: ["--network='host'", ]` instead of what I shared above
Add carriage return flush support using the `sdk.development.worker.console_cr_flush_period` configuration setting (GitHub trains Issue 181)
nvm, the bug might be on my side. I will open an issue if I find an easy reproducible example
Yes, but I am not certain how: I just deleted the `/data` folder and restarted the server
Yes that’s correct - the weird thing is that the error shows the right detected region
Never mind, the `nvidia-smi` command fails in that instance; the problem lies somewhere else
If I remove `security_group_ids` and keep only `subnet_id` in the configuration, it is not taken into account (the instances are created in the default subnet)
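For what it's worth, at the EC2 API level `SubnetId` can be passed with or without `SecurityGroupIds`; a boto3 sketch of the call the autoscaler would ultimately need to make (all ids and the region are placeholders):
```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Launch an instance in an explicit subnet; SecurityGroupIds is optional
# at the API level (the VPC default security group is used if omitted).
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
```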
CostlyOstrich36 Were you able to reproduce it? That’s rather annoying 😅
