
Hi SuccessfulKoala55, AgitatedDove14,
I updated to 1.4.0 (Web UI shows: WebApp: 1.5.0-186 • Server: 1.5.0-186 • API: 2.18)
Unfortunately the bug is still there.
I don't see errors in the console anymore though!
I had another look and modified an events.get_task_logs request with a super old timestamp to try to retrieve all logs; this returned only the few logs already displayed in the console. So I think the problem doesn't come from the WebUI, but from the...
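Roughly, the request looked like this (a minimal sketch, not the verbatim call: the payload fields and the auth header here are assumptions/placeholders):

import requests

# Hypothetical sketch: ask the API server for a task's console log events,
# passing a very old timestamp so that (in theory) everything is returned.
API_SERVER = "http://localhost:8008"   # placeholder api-server address
TASK_ID = "<task-id>"                  # placeholder task id

resp = requests.post(
    f"{API_SERVER}/events.get_task_logs",
    json={"task": TASK_ID, "from_timestamp": 0},   # the "super old" timestamp
    headers={"Authorization": "Bearer <token>"},   # placeholder auth
)
print(resp.json())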
Probably 6. I think that for some reason it did not go back to the main trains-agent. I am not sure though, because a second task could have started. It could also be that the second task was aborted for some reason while installing the task requirements (not the system requirements, i.e. during the trains-agent setup inside the docker container), and therefore again it couldn't go back to the main trains-agent. But ps -aux
shows that the trains-agent is stuck running the first experiment, not the second...
The jump in the loss when resuming at iteration 31 is probably a separate issue -> for now I can conclude that (see the sketch below):
I need to set sdk.development.report_use_subprocess = false
I need to call task.set_initial_iteration(0)
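Concretely, something like this (a minimal sketch of the two changes; project/task names are placeholders):

from clearml import Task

# 1) In clearml.conf, disable the subprocess reporter:
#      sdk.development.report_use_subprocess = false

# 2) In the training script, reset the initial iteration so the resumed
#    run is not offset:
task = Task.init(project_name="my_project", task_name="my_experiment")  # placeholder names
task.set_initial_iteration(0)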
SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2
(instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it's not possible to change this value after the index creation, is that true?
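If so, I guess the workaround would be to create a new index with two shards and reindex into it, roughly like this (a rough sketch against the ES REST API; index names are placeholders):

import requests

ES = "http://localhost:9200"
OLD_INDEX = "events-training_stats_scalar-xxxx"  # placeholder: the existing large index
NEW_INDEX = OLD_INDEX + "-resharded"             # placeholder target index

# number_of_shards can only be set at index creation time,
# so create a new index with 2 shards...
requests.put(ES + "/" + NEW_INDEX,
             json={"settings": {"index": {"number_of_shards": 2}}})

# ...then copy the documents over with the _reindex API
requests.post(ES + "/_reindex",
              json={"source": {"index": OLD_INDEX}, "dest": {"index": NEW_INDEX}})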
I guess I can have a workaround by passing the pipeline controller task id to the last step, so that the last step can download all the artifacts from the controller task.
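Something along these lines (a minimal sketch; the artifact handling and parameter wiring are placeholders, not my actual code):

from clearml import Task

def last_step(controller_task_id: str):
    # The controller task id is passed to the step as a parameter (placeholder wiring);
    # the step then pulls the artifacts straight from the controller task.
    controller_task = Task.get_task(task_id=controller_task_id)
    for name, artifact in controller_task.artifacts.items():
        local_path = artifact.get_local_copy()
        print(f"downloaded artifact {name!r} to {local_path}")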
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached --gpus 1 > ~/trains-agent.startup.log 2>&1
Oops, I spoke too fast, the JSON is actually not saved in S3
yes, so it does exit the local process (at least, the command returns), but another process is still running in the background and logging things from time to time, such as: ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
For the moment this is what I would be inclined to believe
Ok thanks! And for this?
Would it be possible to support such use case? (have the clearml-agent setting-up a different python version when a task needs it?)
This is new, right? It detects the local package, uninstalls it and reinstalls it?
Something like that?
curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } },
        { "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" } }
      ]
    }
  },
  "aggs": {
    "series": {
      "terms": { "field": "iter" }
    }
  }
}'
I could delete the files manually with sudo rm
(sudo is required, otherwise I get Permission Denied
)
More context:
trains, trains-agent and trains-server are all 0.16; Session.api_version -> 2.9
(both when executed in trains-agent and in a local script)
I am still confused though - on the Get Started page of the PyTorch website, when choosing "conda" the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (the CUDA runtime)?
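In case it helps, this is the kind of quick check I run to see what a given wheel actually ships (just a sanity-check snippet):

import torch

# A CUDA-enabled pip wheel bundles the CUDA runtime it was built against;
# a CPU-only wheel reports None for torch.version.cuda.
print(torch.__version__)          # version string usually marks the CUDA build
print(torch.version.cuda)         # CUDA runtime version bundled with the wheel, or None
print(torch.cuda.is_available())  # also requires a working driver on the machine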
I will let the team answer you on that one.
and the agent says agent.cudnn_version = 0
self.clearml_task.get_initial_iteration()
also gives me the correct number
I would let the trains team answer this in detail, but as a user moving from MLflow to trains, I can share the following insights:
MLflow and trains overlap when it comes to having a system with a nice web UI to compare/log experiments/models/metrics. But MLflow lacks a crucial feature IMO, which is ML/DevOps: using MLflow, you will have to take care of the whole maintenance of your machines, design the interactions between them, etc. This is where trains shines, it provides these features out-of-t...
What is the latest rc of clearml-agent? 1.5.2rc0?
ok, but will it install the engine and its dependencies as expected?
Thanks a lot AgitatedDove14 !
AgitatedDove14 This looks awesome! Unfortunately this would require a lot of changes in my current code; for that project I found a workaround. But I will surely use it for the next pipelines I build!
Yes, thanks! In my case, I was actually using TrainsSaver from pytorch-ignite with a local path; then I understood, looking at the code, that under the hood it actually changed the output_uri of the current task. That's why my previous_task.output_uri = "s3://my_bucket" had no effect (it was placed BEFORE the training)
and everything was saved locally, which is why the second task, not executed on the same machine, cannot access the file
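For reference, setting the destination on the task that actually runs the training would look roughly like this (a minimal sketch; project/bucket names are placeholders):

from clearml import Task

# Set output_uri on the running task itself, before training starts,
# so checkpoints/artifacts get uploaded to S3 instead of staying local.
task = Task.init(
    project_name="my_project",   # placeholder
    task_name="my_experiment",   # placeholder
    output_uri="s3://my_bucket",
)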
nvm, the bug might be on my side. I will open an issue if I find an easily reproducible example
thanks for your help!