WickedGoat98 did you set up a machine with trains-agent pulling from the "default" queue?
Is this a config file on your side, or something I could change if we had the enterprise version?
Yes, this is one of the things you can configure
Hi SubstantialBaldeagle49
2. Sure, follow the backup procedure and restore on the new server
3. Yes
task=Task.get_task(task_id='aa')
task.get_logger().report_scalar()
ContemplativeGoat37 I think there was an issue just like you described and it was solved in later versions. Upgrade to the latest clearml package version and you should be fine
Hi @<1549202366266347520:profile|GorgeousMonkey78>
how do I integrate SageMaker with clearml?
you mean to launch an experiment, or just to log it?
Hi VastShells9
2022-12-20 12:48:02,560 - clearml.automation.optimization - WARNING - Could not find requested hyper-parameters ['duration'] on base task a6262a151f3b454cba9e22a77f4861e3
Basically it is telling you it is setting a parameter it never found on the original Task you want to run the HPO on.
The parameter name should be (based on the screenshot) "Args/duration" (you have to add the section name to the HPO params). Make sense?
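To make the "add the section name" point concrete, here is a minimal sketch. The helper `qualify_param` is hypothetical (it is not part of clearml), and the default section "Args" is taken from the message above. The resulting names are what you would hand to the HPO parameter ranges.

```python
def qualify_param(name, section="Args"):
    """Prefix a bare hyper-parameter name with its section, e.g. 'Args/duration'.

    Hypothetical helper for illustration only, not a clearml API.
    """
    # Names that already carry a section prefix are left untouched
    if "/" in name:
        return name
    return f"{section}/{name}"


# 'duration' is the parameter from the warning; 'General/lr' is a made-up example
hpo_param_names = [qualify_param(n) for n in ["duration", "General/lr"]]
print(hpo_param_names)  # ['Args/duration', 'General/lr']
```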
I can't think of any actual difference in flow ...
Can you try the following?
task._setup_reporter()
task.set_initial_iteration(0)
but is there any other way to get env vars / any value or secret from the host to the docker of a task?
If this is docker, passing -e/--env as an argument would do the same:
-e VAR=somevalue
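A rough sketch of building that `-e VAR=somevalue` argument string from values on the host. The helper name `docker_env_args` is made up for illustration, and how the string is then passed along (agent config, task docker arguments, etc.) depends on your setup and is not shown here.

```python
import os
import shlex


def docker_env_args(names, environ=None):
    """Build a '-e NAME=value ...' docker argument string for the given variable names.

    Hypothetical helper: variables missing from the environment are skipped.
    """
    environ = os.environ if environ is None else environ
    parts = []
    for name in names:
        if name in environ:
            parts += ["-e", f"{name}={environ[name]}"]
    # shlex.join quotes values that contain spaces or shell metacharacters
    return shlex.join(parts)


print(docker_env_args(["VAR"], {"VAR": "somevalue"}))
```

Keep in mind that anything passed this way is visible in the docker command line, so it may not be the right channel for real secrets.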
Hi SarcasticSparrow10
You will need to have multiple trains-agents, but they will be sharing the same queue (i.e. pulling jobs from the same queue the HPO process is pushing to)
Make sense ?
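If it helps, here is a toy analogy of that setup using only the Python standard library. This is not trains-agent code: the threads stand in for agents and the `queue.Queue` stands in for the shared execution queue that the HPO process pushes jobs to. All names are illustrative.

```python
import queue
import threading


def run_workers(jobs, num_workers=3):
    """Toy model: several workers (stand-ins for agents) pull from one shared queue."""
    shared_queue = queue.Queue()
    for job in jobs:
        shared_queue.put(job)  # the "HPO process" pushing jobs

    done = []  # (worker_id, job) pairs, in completion order
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                job = shared_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to pull
            with lock:
                done.append((worker_id, job))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


results = run_workers([f"hpo-trial-{i}" for i in range(6)])
print(sorted(job for _, job in results))
```

Each job is picked up by exactly one worker, which is the property you want from several agents listening on the same queue.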
Task.init should be called before pytorch distribution is called, then on each instance you need to call Task.current_task() to get the instance (and make sure the logs are tracked).
Hi MelancholyElk85
So the way datasets now work is that they are actually an entity (folder) inside a project, all under TFW hidden .datasets sub project
This is so all data and tasks are both on the same project, but at the same time will not intersect with subprojects by the same name. Does that make sense?
Hmm, #790 should be solved in 1.7.2
Yes, I always see the "model uploaded completed" for such stuck tasks
Any chance this is reproducible?
How many processes do you see running (i.e. ps -Af | grep python) ?
What is the training framework? Is it multiprocess? How are you launching the process itself? Is it Linux OS? Is it running inside a specific container?
AttractiveCockroach17
Can you print the configuration to console when you start the run (you will get a local print and then later the remote print)? Are they the same? Are the 3 runs the same (local / remote print)?
Hi RipeGoose2
I just tested the hydra example, seems to work when you add the offline right after the import:
` from clearml import Task
Task.set_offline(True) `
That makes total sense.
So right now you can probably use clearml-session to spin a session in any container, add the jupyterhub to the requirements like so:
clearml-session --packages jupyterhub
Then ssh + run jupyterhub + tunnel port?
ssh root@IP -p 10022 -L 6666:localhost:6666
$ jupyterhub
Would that work?
Maybe it is better to add an option to use jupyterhub instead of jupyterlab ?
wdyt?
PanickyMoth78
LockException: [Errno 11] Resource temporarily unavailable
I'm not sure I understand how you got to this error (obviously creating datasets and getting them back works), what is unique in the setup/flow itself ?
Seems like settings on the clearml-server disappeared (specifically default queue tag?!)
PompousParrot44 Enterprise license pricing is usually custom tailored to the size of the company and based on usage. If you are interested, feel free to leave details in the "contact us" form on the website, and someone from sales will contact you shortly after.
ahh, because task_id is the "real" id of a task
Yes the ID is a global system wide unique ID (regardless of the project etc.)
Maybe we will call tasks as slug_yyyymmdd
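A small sketch of what a slug_yyyymmdd naming helper could look like. The slug rule (lowercase, runs of other characters collapsed to "-") is an assumption of mine, and the example title is made up.

```python
import re
from datetime import date


def task_name(title, day):
    """Return '<slug>_<yyyymmdd>' for a free-text title and a date.

    The slug rule here is an assumption for illustration only.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}_{day:%Y%m%d}"


print(task_name("ResNet50 baseline", date(2022, 12, 20)))  # resnet50-baseline_20221220
```

The name stays human readable, while the Task ID remains the actual unique key.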
Notice that you can just copy-paste the link in the address bar, it will bring you to the exact same view, meaning it is easily shared among users
You can, but I would actually use the Task ID. This also means that programmatically you can do: task = Task.get_task(task_id_here)
and interact and query a...
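If all you have is a shared link, a sketch like the one below could pull the Task ID out of it. Big assumption: it expects the ID to be the path segment right after "experiments", which may not match your server's URL layout. The host and project segment in the example are invented; the ID is reused from the log line earlier in this chat.

```python
from urllib.parse import urlparse


def task_id_from_link(link):
    """Return the path segment following 'experiments' in a web UI link, or None.

    Assumes a '.../experiments/<task_id>/...' layout (not verified here).
    """
    parts = [p for p in urlparse(link).path.split("/") if p]
    if "experiments" in parts:
        idx = parts.index("experiments")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None


# Hypothetical link, for illustration only
link = "https://clearml.example.com/projects/abc123/experiments/a6262a151f3b454cba9e22a77f4861e3/output/log"
print(task_id_from_link(link))
# The ID can then be used as in the message above: Task.get_task(task_id=...)
```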
This would work to load the local modules, but I'm also using poetry and the pyproject.toml is in the subdirectory, so the agent won't install any dependency if I don't set the work_dir
hmmm true, in terms of requirements, you can list them in the decorator (see packages argument)
How can I track in clearML that this and that row was part of experiment x because it belonged to test/training data set y?
Hi @<1543766544847212544:profile|SorePelican79>
The experiments themselves will have a link to the Dataset they were using. From a dataset perspective, the idea is not to limit you, so essentially it will package all your files and retrieve them when you fetch the dataset. In terms of specifying a row / sample, my suggestion is to mark those rows when training a...
I would recommend reading this blog post, it should give you a glimpse of what can be built
https://medium.com/pytorch/how-trigo-built-a-scalable-ai-development-deployment-pipeline-for-frictionless-retail-b583d25d0dd
Hi EmbarrassedSpider34
Long story (see below) short, yes you can ignore this warning :)
Specifically, torch is spinning processes and killing them, and every process will have a reference to the parent semaphore (for internal clearml bookkeeping). Now python is not very good with this kind of thing (and it is getting better on newer python versions). Bottom line, python "thinks" someone lost a semaphore, but the reality is that the subprocess never created it in the first place. Does that make sen...
Oh found it: temp.linux-aarch64-cpython-39
this is Arm?!
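A quick way to double check from inside the environment, using only the standard library. The helper `is_arm` is hypothetical and the prefixes it matches ("arm", "aarch64") are my assumption about common machine strings.

```python
import platform


def is_arm(machine):
    """Return True if a machine string (e.g. from platform.machine()) looks like Arm."""
    m = machine.lower()
    return m.startswith(("arm", "aarch64"))


# Prints the machine type of the interpreter's host and whether it looks like Arm
print(platform.machine(), is_arm(platform.machine()))
```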