did i have to configure the environment first maybe? i assumed it just uses the environment where it was called
thanks, that's exactly what i was looking for!!
yes i can communicate with the server, i managed to put tasks in the queue and retrieve them as well as running tasks with metrics reporting
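For reference, that queue round-trip can be sketched like this (project/queue names are placeholders, and it assumes a configured ClearML installation):

```python
try:
    from clearml import Task
except ImportError:  # clearml not installed; sketch only
    Task = None

def enqueue_clone(project: str, name: str, queue: str):
    """Clone an existing task and push the clone onto an execution queue."""
    base = Task.get_task(project_name=project, task_name=name)
    clone = Task.clone(source_task=base)
    Task.enqueue(clone, queue_name=queue)
    return clone
```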
` sdk.development.store_uncommitted_code_diff: false
api.verify_certificate: false
api {
  web_server: https://<...>.com:8080
  api_server: https://<...>.com:8008
  files_server: https://<...>.com:8081
  credentials {
    "access_key" = "OMF..."
    "secret_key" = "oox..."
  }
} `
clearml 1.1.6 clearml-agent 1.1.2
no output at all, so nothing to paste
AgitatedDove14 it's the same file system, so it would be better to just use the original code files and the same conda env, if possible…
correct. just verified again now.
AgitatedDove14 that worked! but i had to add:
os.environ['CLEARML_PROC_MASTER_ID'] = ''
os.environ['TRAINS_PROC_MASTER_ID'] = ''
or else it thought it was the parent optimizer task i was trying to run.
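A minimal sketch of that fix, assuming it runs before the child task's `Task.init` (project/task names are examples):

```python
import os

# Reset the master-process markers so a fresh Task.init in this process
# does not attach itself to the parent optimizer task (empty string, as above).
for key in ("CLEARML_PROC_MASTER_ID", "TRAINS_PROC_MASTER_ID"):
    os.environ[key] = ""

# Afterwards (requires clearml and a configured server):
# from clearml import Task
# task = Task.init(project_name="HPO", task_name="trial")
```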
but now im facing new issue, the details are empty:
where <ADDRESS> is our server address starting with https://.. etc
` sdk.development.store_uncommitted_code_diff: false
api.verify_certificate: false
api {
  web_server: <ADDRESS>:8080
  api_server: <ADDRESS>:8008
  files_server: <ADDRESS>:8081
  credentials {
    "access_key" = "OMF..."
    "secret_key" = "oox..."
  }
} `
I think i solved it by deleting the project and running the base_task once before the hyperparameter optimization
Nice catch! (I’m assuming you also called Task.init somewhere before, otherwise I do not think this was necessary)
i was calling Task.init and it still somehow thought it was the parent task, until i fixed it as i said.
and yes, everything is working now! i'm running hyperparameter optimization on an LSF cluster where every task is an LSF job running without clearml-agent
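For anyone reading along, a rough sketch of what such an optimizer setup can look like, assuming a configured ClearML installation (the parameter name, metric names, and ranges are made-up examples, not the poster's actual values):

```python
try:
    from clearml.automation import HyperParameterOptimizer, UniformParameterRange
except ImportError:  # clearml not installed; sketch only
    HyperParameterOptimizer = UniformParameterRange = None

def build_optimizer(base_task_id: str, queue: str):
    """Build an optimizer around an existing base task."""
    return HyperParameterOptimizer(
        base_task_id=base_task_id,           # the task cloned for every trial
        hyper_parameters=[
            UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1),
        ],
        objective_metric_title="validation",
        objective_metric_series="loss",
        objective_metric_sign="min",
        execution_queue=queue,               # trials are pulled from this queue
    )
```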
"os": "Linux-4.18.0-348.2.1.el8_5.x86_64-x86_64-with-glibc2.28", "python": "3.9.7"
i don't have an agent configuration file, if this might be the problem
all the machines share the same file system so i managed to do all the things i mentioned from different machines on the system
i can create tasks and retrieve them from the queues
I'm not sure what exactly you're asking, someone else configured the server, i'm just using it
the already-reported table would be best, otherwise any other table i can log new lines to
can you get the agent to execute the task in the current conda env without setting up a new environment? or is there any other way to get a task from the queue running locally in the current conda env?
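If memory serves, clearml-agent has an escape hatch for this: an environment variable that tells it to skip building its own venv and reuse the interpreter it was launched with. The variable name below is from memory, so verify it against your clearml-agent version; `<task_id>` is a placeholder.

```shell
# Run a queued task in the already-activated conda env,
# skipping the agent's own virtualenv/pip setup.
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1   # reuse the current python env
clearml-agent execute --id <task_id>
```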
now i noticed `clearml-agent list` gets stuck as well
I'm still trying to figure out the best way to execute tasks on the LSF cluster. the easiest way would be if i could just somehow run the task and let LSF manage the environment; on the same filesystem it is very easy to use a shared conda env etc
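Under the shared-filesystem assumption, one way to wire this up is to submit each trial as an LSF job that activates the shared conda env and then runs the task via `clearml-agent execute`. A sketch (queue name "normal", env name "shared-env", and `$TASK_ID` are all placeholders):

```shell
# One ClearML task per LSF job; the shared filesystem lets every
# job activate the same conda env before executing the task.
bsub -q normal -n 4 \
  "source activate shared-env && clearml-agent execute --id ${TASK_ID}"
```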