thanks, that's exactly what I looked for!!
clearml 1.1.6 clearml-agent 1.1.2
no output at all, so nothing to paste
"os": "Linux-4.18.0-348.2.1.el8_5.x86_64-x86_64-with-glibc2.28", "python": "3.9.7"
I can create tasks and retrieve them from the queues
can you get the agent to execute the task in the current conda env, without setting up a new environment? or is there any other way to get a task from the queue running locally in the current conda env?
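a sketch of the pattern I ended up with (treat both the env var and the helper below as assumptions — `CLEARML_AGENT_SKIP_PIP_VENV_INSTALL` should be checked against your clearml-agent version's docs):

```python
import os

def agent_execute_cmd(task_id: str) -> list[str]:
    # Hypothetical helper (not part of the clearml SDK): builds the CLI call
    # "clearml-agent execute --id <task_id>", which runs a single task in the
    # foreground instead of spinning up a daemon.
    return ["clearml-agent", "execute", "--id", task_id]

# Assumption: this env var tells the agent to skip creating a fresh
# virtualenv and reuse the interpreter/conda env it was launched from.
os.environ["CLEARML_AGENT_SKIP_PIP_VENV_INSTALL"] = "1"

cmd = agent_execute_cmd("<task_id>")
```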
AgitatedDove14 that worked! but I had to add: `os.environ['CLEARML_PROC_MASTER_ID'] = ''` and `os.environ['TRAINS_PROC_MASTER_ID'] = ''`
or else it thought it was the parent optimizer task I was trying to run.
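a minimal sketch of that fix (env-var names are from the message above; clearing them before `Task.init` detaches the process from the parent optimizer task):

```python
import os

def detach_from_parent_task() -> None:
    # The optimizer's child processes inherit these "master process" markers;
    # clearing them makes Task.init start a fresh task instead of attaching
    # itself to the parent optimizer task.
    os.environ['CLEARML_PROC_MASTER_ID'] = ''
    os.environ['TRAINS_PROC_MASTER_ID'] = ''

detach_from_parent_task()
# ...then call clearml.Task.init() as usual
```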
but now I'm facing a new issue: the details are empty:
I think I solved it by deleting the project and running the base_task once before the hyperparameter optimization
Nice catch! (I’m assuming you also called Task.init somewhere before, otherwise I do not think this was necessary)
I was calling Task.init and it still somehow thought it was the parent task, until I fixed it as I said.
and yes, everything is working now! I'm running hyperparameter optimization on an LSF cluster, where every task is an LSF job running without clearml-agent
now I noticed `clearml-agent list` gets stuck as well
an already reported table would be best; otherwise, any other table I can log new lines to
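as far as I know `Logger.report_table` replaces the whole table on each call (there is no append API), so one way to get an "appendable" table is to keep the rows locally and re-report everything — a sketch, where `logger` stands for `clearml.Logger.current_logger()` and the title/series names are made up:

```python
rows = [["epoch", "loss"]]  # header row + accumulated data rows

def append_and_report(logger, epoch, loss, iteration):
    # Keep every row locally, then re-send the full table so the
    # reported plot shows the new line together with the old ones.
    rows.append([epoch, loss])
    logger.report_table(
        title="training progress",   # illustrative names
        series="loss table",
        iteration=iteration,
        table_plot=rows,
    )
```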
` sdk.development.store_uncommitted_code_diff: false
api.verify_certificate: false
api {
    web_server: https://<...>.com:8080
    api_server: https://<...>.com:8008
    files_server: https://<...>.com:8081
    credentials {
        "access_key" = "OMF..."
        "secret_key" = "oox..."
    }
} `
I'm not sure what exactly you're asking; someone else configured the server, I'm just using it
correct. just verified again now.
` sdk.development.store_uncommitted_code_diff: false
api.verify_certificate: false
api {
    web_server: <ADDRESS>:8080
    api_server: <ADDRESS>:8008
    files_server: <ADDRESS>:8081
    credentials {
        "access_key" = "OMF..."
        "secret_key" = "oox..."
    }
} `
yes, I can communicate with the server; I managed to put tasks in the queue and retrieve them, as well as run tasks with metrics reporting
where <ADDRESS> is our server address starting with https://.. etc
did I have to configure the environment first, maybe? I assumed it just uses the environment it was called from
I don't have an agent configuration file, if that might be the problem
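for reference, clearml-agent reads the same `~/clearml.conf` as the SDK, and `clearml-agent init` generates one; the agent-specific settings live in a separate `agent { ... }` section — a minimal sketch (verify the exact keys against your agent version's docs):

```
# ~/clearml.conf -- generated by `clearml-agent init`
agent {
    # use conda instead of pip when building task environments
    # (key name per clearml-agent docs; illustrative, not a verified default)
    package_manager {
        type: conda
    }
}
```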
all the machines share the same file system, so I managed to do everything I mentioned from different machines on the system
I'm still trying to figure out the best way to execute a task on the LSF cluster. The easiest way would be if I could just somehow run the task and let LSF manage the environment; on the same filesystem it is very easy to use a shared conda env, etc.
AgitatedDove14 it's the same file system, so it would be better to just use the original code files and the same conda env, if possible…
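a sketch of that setup (script name and the minimal `bsub` invocation are illustrative): each trial is submitted as a plain LSF job that runs the training script directly in the shared conda env, and the script itself calls `Task.init`, so no clearml-agent is involved:

```python
import shlex

def bsub_command(script: str, args: list[str]) -> str:
    # Build a plain "bsub python <script> <args>" submission line; the
    # script calls clearml.Task.init() itself, so metrics still reach the
    # server without a clearml-agent managing the environment.
    tokens = ["bsub", "python", script, *args]
    return " ".join(shlex.quote(t) for t in tokens)

# e.g. one HPO trial as an LSF job
cmd = bsub_command("train.py", ["--lr", "0.01"])
```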