Hi @<1687643893996195840:profile|RoundCat60>
Are you running on AWS?
LovelyHamster1 from the top, we have two steps:
1. We run the code "manually" (i.e. without the agent). This step creates the experiment (Task) and automatically fills in the "Installed packages" section (which is in the same format as a regular requirements.txt).
2. An agent runs a cloned copy of the experiment (Task). The agent creates a new venv on the agent's machine, then uses the "Installed packages" section as a replacement for the regular "requirements.txt" and installs everything fro...
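A minimal sketch of that flow (project/queue names are just placeholders):

from clearml import Task

# step 1: run "manually" - Task.init creates the experiment and records the installed packages
task = Task.init(project_name="examples", task_name="step 1 - create the Task")

# step 2 equivalent (usually done from the UI): clone the Task and enqueue it for an agent
cloned = Task.clone(source_task=task, name="cloned copy")
Task.enqueue(cloned, queue_name="default")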
VexedCat68 both are valid. In case the step was cached (i.e. already executed) the node.job will be None, so it is probably safer to get the Task based on the "executed" field which stores the Task ID used.
(Not sure it actually has that information)
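Something along these lines should work (a sketch, assuming node is the pipeline node object you already have):

from clearml import Task

# if the step was cached, node.job is None, but node.executed still stores the ID of the Task that was used
step_task = Task.get_task(task_id=node.executed)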
It is way too much to pass as an env variable
Wait, why aren't you just calling Popen (or os.system)? I'm not sure how it relates to the torch multiprocess example. What am I missing?
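For reference, a minimal sketch of launching a subprocess (the script name is just a placeholder):

import subprocess

# launch the script as a separate process and wait for it to finish
proc = subprocess.Popen(["python", "my_script.py"])
proc.wait()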
So I wonder - why should an agent be related to a specific user's credentials? Is the right way to go about this to create a "fake user" for the sake of the agent?
Very true, you have to have credentials for the trains-agent so it can "report" to the trains-server. That said, the creator of the Task (i.e. the person who cloned it) will be registered as the "user" in the UI.
I would recommend creating an "agent" user and putting its credentials on the trains-agent machine (the same way...
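Roughly (the exact section name depends on your version, so treat this as an assumption): create the "agent" user, generate an access/secret key pair for it in the UI, and put those keys under the api.credentials section of the clearml.conf (trains.conf on older versions) on the agent machine.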
Hi @<1709015393701466112:profile|ScatteredPeacock14>
I get 3 tasks created in total. Any ideas?
Could it be an old instance of the same Task?
Notice the for loop starts from 1 so it does include the master node:
the only problem with it is that it will start the task even if the task is completed
What are the criteria?
Hmm, if this is the case, you can add some prints in here:
the service/action will tell you what you are sending
wdyt?
Yes... I think that this might be a bit too much automagic, even for clearml
Is it possible in Clearml to somehow allocate resources so that maybe after running a number of Alice's tasks, Bob's tasks get processed (like maybe in a round-robin fashion)?
Hi DeliciousBluewhale87
A few options here:
1. Set the agent with high / low priority queues. Make sure Alice pushes into low priority (aka HPO), then Bob can push into high priority when he needs to. This makes a lot of sense when you have automation processes spinning many experiments (see the note below on multi-queue agents).
2. Expanding (1), you could set differe...
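For reference, a single agent can listen to multiple queues in priority order, e.g. clearml-agent daemon --queue high_priority low_priority (queue names are just placeholders); it will always pull from the first queue that still has pending Tasks before moving to the next one.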
SmallBluewhale13 in your code, what are you getting when you print the version:
from clearml import __version__
print(__version__)
This will set more time before the timeout right?
Correct.
task.freeze_monitor()
download()
task.defrost_monitor()
Currently there isn't, but that's a good idea.
What would be the argument of using it vs increasing the timeout ?
btw: setting the resource timeout to 99999 will basically mean that it will wait until the first reported iteration, not that it will just sleep for 99999 sec
So I think this is a good example of pipelines and data:
Basically Task A generates data stored using clearml-data
(See Dataset class). The output of that is an ID of the Dataset. Then Task B uses that ID to retrieve the Dataset created by Task A.
documentation
https://github.com/allegroai/clearml/blob/master/docs/datasets.md
Example:
Step A creating Dataset:
https://github.com/alguchg/clearml-demo/blob/main/process_dataset.py
Step B training model using the Dataset created in ...
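A minimal sketch of the same idea (dataset/project names are placeholders):

from clearml import Dataset

# Task A: create and upload a Dataset, then pass its ID along (e.g. as a parameter of Task B)
ds = Dataset.create(dataset_name="raw_data", dataset_project="demo")
ds.add_files("data/")
ds.upload()
ds.finalize()
dataset_id = ds.id

# Task B: retrieve the Dataset created by Task A using that ID
local_folder = Dataset.get(dataset_id=dataset_id).get_local_copy()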
Any chance you can share the Log?
(feel free to DM it so it will not end up public)
Hi JealousParrot68
no need for decorators, you can just pass the function to schedule_function=<function goes here>
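For example, a rough sketch (the function name and schedule are made up):

from clearml.automation import TaskScheduler

def my_cleanup_job():
    print("running the scheduled function")

scheduler = TaskScheduler()
# schedule_function is called on the given schedule instead of launching a Task
scheduler.add_task(schedule_function=my_cleanup_job, minute=30)
scheduler.start()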
See scheduler here
https://github.com/allegroai/clearml/blob/8708967a5ef4d8529a1a5ea417672e3ebbb258d7/clearml/automation/scheduler.py#L485
And triggers here:
https://github.com/allegroai/clearml/blob/8708967a5ef4d8529a1a5ea417672e3ebbb258d7/clearml/automation/trigger.py#L193
https://github.com/allegroai/clearml/blob/8708967a5ef4d8529a1a5ea417672e3ebbb258d7/clea...
Hi NastyFox63 yes I think the problem was found (actually backend side).
It will be solved in the upcoming release (due after this weekend)
ldconfig from /etc/profile which is put there by the interactive_session_task
LackadaisicalOtter14 are you sure? Maybe this is done as part of the installation the interactive session runs?
Could that be the issue?
apt-get update && apt-get install -y openssh-server
First, that is awesome to hear PanickyFish98!
Can you send the full exception? You might be on to something...
2. Actually we thought of it, but could not find a use case, can you expand?
3. I'm not sure I follow, do you mean you expect the first execution to happen immediately?
my question is how to recover, must I recreate the agents or is there another way?
Yes you have to recreate the Task (I assume they failed, no?!)
Was I right to put the credentials in clearml.conf on the machine I am starting the agent on?
AdventurousButterfly15 Yes exactly!
you should be able to see that in the log of the Task (at the top of the log there will be the entire configuration), can you see the git user there?
It actually started executing your code, but it did not capture it correctly:
/root/.clearml/venvs-builds/3.10/bin/python -u /root/.clearml/venvs-builds/3.10/code/colab_kernel_launcher.py
Which I assume means the actual Task had bad code.
What do you have under the Task's execution tab in the UI (the one you were launching, i.e. enqueueing)?
Hi DilapidatedDucks58 ,
I'm not aware of anything of this nature, but I'd like to get a bit more information so we could check it.
Could you send the web-server logs? Either from the docker or the browser itself.
The difference is whether you are only supplying a "minutes" or you are also passing hour/day etc.
See the examples:
Every 15 minutes:
add_task(task_id='1235', queue='default', minute=15)
Every hour on minute 20 of the hour (i.e. 00:20, 01:20 ...):
add_task(task_id='1235', queue='default', hour=1, minute=20)
Hi IntriguedRat44
You can make it log offline (i.e. into a local folder/zip) by calling:
Task.set_offline(True)
You can also set the environment variable:
TRAINS_OFFLINE_MODE=1
You could also just skip the Trains.init call
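For example, a minimal sketch (project/task names and the session path are illustrative, assuming a recent clearml version):

from clearml import Task

Task.set_offline(True)  # everything is written to a local zip instead of being sent to the server
task = Task.init(project_name="demo", task_name="offline run")
# ... your training code ...
task.close()
# later, from a machine with server access, the offline session can be imported back:
# Task.import_offline_session("/path/to/offline_session.zip")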
Does that help?
Hi RoundSeahorse20
Try the following, let me know if it worked:
import logging
clear_logger = logging.getLogger('clearml.metrics')
clear_logger.setLevel(logging.ERROR)
That's the theory, I still see it is not there
... grab the model artifacts for each, put them into the parent HPO model as its artifacts, and then go through and archive everything.
Nice. Wouldn't it make more sense to "store" a link to the "winning" experiment, so you know how to reproduce it and the set of HPs that were chosen?
Not that the model is bad, but how would I know how to reproduce it, or retrain when I have more data, etc.
Hi SubstantialElk6
ClearML-Data doesn't actually "load" the data, it brings it locally and returns a folder with all your data files. From that point onward, it's up to your code to load it into the framework. Make sense?
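A tiny sketch of what that looks like (dataset/project names and the file are placeholders):

import os
import pandas as pd
from clearml import Dataset

# this only fetches/caches the files locally and returns the folder path
folder = Dataset.get(dataset_name="my_dataset", dataset_project="demo").get_local_copy()

# loading the files into your framework is up to your code, e.g.:
df = pd.read_csv(os.path.join(folder, "train.csv"))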