But you might want to double check
This is the mapping of the faulty index:
```
{
  "events-plot-d1bd92a3b039400cbafc60a7a5b1e52b_new" : {
    "mappings" : {
      "dynamic" : "strict",
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "iter" : {
          "type" : "long"
        },
        "metric" : {
          "type" : "keyword"
        },
        "plot_data" : {
          "type" : "binary"
        },
        "plot_len" : {
          "type" : "long"
        },
        "plot_str" : {
          ...
```
I see what I described in https://allegroai-trains.slack.com/archives/CTK20V944/p1598522409118300?thread_ts=1598521225.117200&cid=CTK20V944 :
randomly, one of the two experiments is shown for that agent
Awesome! (Broken link in migration guide, step 3: https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/ )
MagnificentSeaurchin79 You could also just fork the tensorflow repo, make changes in a specific branch and specify your forked repo with your custom branch in the install_requires of your setup.py
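For example, a minimal setup.py sketch using a PEP 508 direct reference (the fork URL, branch and package names are placeholders, not something from your project):
```python
from setuptools import setup, find_packages

setup(
    name="my_package",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # Placeholder: pip will install tensorflow from your fork's custom branch
        "tensorflow @ git+https://github.com/your-user/tensorflow.git@custom-branch",
    ],
)
```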
Both ^^, I already adapted the code for GCP and I was planning to adapt to Azure now
Ok thanks! And for this?
Would it be possible to support such a use case? (i.e. have the clearml-agent set up a different Python version when a task needs it?)
On /data or /opt/clearml? These are two different disks
And I can verify that ~/trains.conf exists in the su home folder
Not really: I just need to find the one that is compatible with torch==1.3.1
Also, maybe we are not on the same page - by "clean up", I mean killing a detached subprocess on the machine executing the agent
Hi CostlyOstrich36! No, I am running in venv mode
trains==0.16.4
I want to make sure that an agent has finished uploading its artifacts before marking itself as complete, so that the controller does not try to access these artifacts while they are not yet available
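For reference, a rough sketch of the kind of blocking I mean, assuming a clearml version where upload_artifact exposes wait_on_upload (the artifact name and file are placeholders):
```python
from clearml import Task

task = Task.current_task()

# Block until the artifact is actually uploaded, instead of returning
# while the upload keeps running in the background (placeholder name/file).
task.upload_artifact("results", artifact_object="results.csv", wait_on_upload=True)

# Alternatively, wait for all pending uploads before the task completes.
task.flush(wait_for_uploads=True)
```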
no, one worker (trains-agent-1) "forgets" the current experiment it is running from time to time and picks up another experiment on top of the one it is currently running
It worked like a charm 😱 Awesome, thanks AgitatedDove14!
I will probably just use an absolute path everywhere to be robust against different machine user accounts: /home/user/trains.conf
AgitatedDove14 Unfortunately no, I already had the problem before using the function; I added it hoping it would fix the issue, but it didn't
trains-elastic container fails with the following error:
I assume you’re using a self-hosted server?
Yes
Would adding an ILM (index lifecycle management) policy be an appropriate solution?
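If it helps, a rough sketch of what I have in mind, using the standard Elasticsearch 7.x ILM API from Python (host, policy name, index pattern and the 30-day retention are all placeholders):
```python
import requests

ES = "http://localhost:9200"  # placeholder Elasticsearch host

# ILM policy that deletes indices once they are older than 30 days
policy = {
    "policy": {
        "phases": {
            "delete": {"min_age": "30d", "actions": {"delete": {}}}
        }
    }
}
requests.put(f"{ES}/_ilm/policy/events-cleanup", json=policy)

# Attach the policy to new event indices via a (legacy) index template
template = {
    "index_patterns": ["events-*"],
    "settings": {"index.lifecycle.name": "events-cleanup"},
}
requests.put(f"{ES}/_template/events-cleanup-template", json=template)
```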
Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
Never mind, the nvidia-smi command fails on that instance; the problem lies somewhere else
This works well when I run the agent in virtualenv mode (removing --docker)
I think waiting for the apt locks to be released with something like this would work:
```python
startup_bash_script = [
    "#!/bin/bash",
    "while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done",
    "sudo apt-get update",
    ...
```
Weirdly this throws an error in the autoscaler:
```
Spinning new instance type=v100_spot
Error: Failed to start new instance, unexpected '{' in field...
```
I guess I can have a workaround by passing the pipeline controller task id to the last step, so that the last step can download all the artifacts from the controller task.
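Something like this minimal sketch, assuming the controller task id is passed to the last step (function and parameter names are illustrative):
```python
from clearml import Task

def last_step(controller_task_id: str):
    # Fetch the pipeline controller task by id and pull its artifacts locally
    controller = Task.get_task(task_id=controller_task_id)
    for name, artifact in controller.artifacts.items():
        local_path = artifact.get_local_copy()
        print(f"Downloaded artifact '{name}' to {local_path}")
```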
AgitatedDove14, my "uncommitted changes" ends with:
```python
if __name__ == "__main__":
    task = clearml.Task.get_task(clearml.config.get_remote_task_id())
    task.connect(config)
    run()
from clearml import Task
Task.init()
```
So in my minimal reproducible example, it does work 🤣 Very frustrating; I will continue searching for that nasty bug