
Hi ElegantCoyote26
sometimes the agents load an earlier version of one of my libraries.
I'm assuming this is an internal package installed from a wheel file, not a direct git repo+commit link?
GentleSwallow91 notice this part:
Hi Martin. Sorry - missed your reply.
Yep, I am aware that docker_internal_mounts is inside the agent section.
'-v', '/tmp/ssh-XXXXXXnfYTo5/agent.8946:/tmp/ssh-XXXXXXnfYTo5/agent.8946', '-e', 'SSH_AUTH_SOCK=/tmp/ssh-XXXXXXnfYTo5/agent.8946',
It is creating a copy of the ssh agent socket and setting the SSH_AUTH_SOCK env variable to point to it. You can instead map the entire ssh folder automatically by un-setting SSH_AUTH_SOCK before running the agent:
SSH_AUTH_SOCK= clearml-agent ...
So net-net, does this mean it's behaving as expected?
It is as expected.
If no "Installed Packages" are listed, it cannot pull a cached venv (because a requirements.txt is not a full environment, and it was never analyzed).
It does, however, create a venv cache entry based on it (after installing it).
A clone of this Task (i.e. right-click in the UI, clone the experiment, enqueue it) will use the cached copy, because the full package list is now stored in the "Installed Packages" section of the Task.
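For reference, a minimal sketch of forcing a full package freeze up front (project/task names are placeholders; force_requirements_env_freeze exists in recent clearml versions and must be called before Task.init):
from clearml import Task

# store a full pip freeze in "Installed Packages" instead of a partial requirements list,
# so a cloned Task can immediately match a cached venv on the agent side
Task.force_requirements_env_freeze()
task = Task.init(project_name="examples", task_name="training")  # placeholder names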
Makes sense...
so if I plot an image with matplotlib, it would not upload? I need to use the logger?
Correct, if you have no "main" task, no automagic 🙂
so how can I make it run with the "automagic"?
Automagic logs a single instance... unless those are subprocesses, in which case the main task takes care of "copying" itself to the subprocess.
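A minimal sketch of both options (project/task names here are placeholders): once Task.init has created a main task, matplotlib figures shown via plt.show() are captured automagically; you can also report a figure explicitly through the logger.
from clearml import Task
import matplotlib.pyplot as plt

task = Task.init(project_name="examples", task_name="plots")  # placeholder names
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()  # captured automatically now that a main task exists

# or report explicitly:
task.get_logger().report_matplotlib_figure(
    title="my plot", series="series A", figure=plt.gcf(), iteration=0)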
Again what is the use case for multiple machines?
Yes, that makes sense. If the overhead of the additional packages is not huge, I do not think it is worth the maintenance 🙂
BTW clearml-agent has full venv caching that you can turn on, so when running remotely you are not "paying" for the additional packages being installed:
Un-comment this line 🙂
https://github.com/allegroai/clearml-agent/blob/51eb0a713cc78bd35ca15ed9440ddc92ffe7f37c/docs/clearml.conf#L116
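For reference, the relevant section of that file looks roughly like this (un-commenting the path line is what enables the venv cache):
agent {
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # unmark to enable virtual environment caching
        # path: ~/.clearml/venvs-cache
    }
}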
I want to optimize hyperparameters with trains.automation but: ...
Yes, you are correct. In the case of the example code it should be "General/..."; if you have ArgParser, it should be "Args/...". And yes, it looks like the metric is wrong; it should be "epoch_accuracy" & "epoch_accuracy" (i.e. both the metric title and the series).
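A minimal sketch of what that would look like (the base task ID, queue, and parameter names are placeholders, assuming the trains.automation API of that era):
from trains.automation import HyperParameterOptimizer, DiscreteParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id='<base-task-id>',  # placeholder: the template experiment to clone
    hyper_parameters=[
        # 'General/...' for example code; 'Args/...' when using ArgParser
        DiscreteParameterRange('General/batch_size', values=[32, 64, 128]),
    ],
    objective_metric_title='epoch_accuracy',
    objective_metric_series='epoch_accuracy',
    objective_metric_sign='max',
    execution_queue='default',
)
optimizer.start()
optimizer.wait()
optimizer.stop()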
From the top:
- trains-agent pulls a service Task; the Task is marked as running
- the trains-agent worker points to the Task
- a docker is spun up
- the environment is installed inside the docker (results are shown in the service Task log)
- trains-agent inside the docker is launched, and a new node appears in the system as <host_agent_name>:service:<task_id>, with the service Task listed as running on it
- the main trains-agent is back to idle, and its worker now has no experiment listed as running
Where do you think it breaks?
So for example, if there was an idle GPU and Q3 takes it, and then a task comes into Q2 for which we specified 3 GPUs, but Q3 has already taken some of those GPUs, what will happen?
This is a standard "race": the first one to come will "grab" the GPU, and the other will wait for it.
I'm pretty sure the enterprise edition has preemption support, but this is not currently part of the open source version (BTW: dynamic GPU allocation is also, I think, part of the enterprise tier; in the open source ...
BTW: 0.14.3 solved the issue you are referring to, so you can import trains before parsing the args without an issue. Regarding passing project/name as parameters, a few thoughts: (1) you can always rename / move projects from the UI; (2) if you are running it with trains-agent,
there is no meaning to these arguments, as by definition the Task was already created... Maybe we should give an option to exclude a few arguments from argparser; I think this topic came up a few times... What do you think?
Can you share the log?
Thanks ShaggySwan64 !!
Passing to the backend guys to take a look
That sounds like an issue with the "working dir"; check the "Working Directory" field under the "Execution" section.
'.' means the root of the git repository
'subfolder' means run the script from that subfolder, etc. Also make sure the script path is adjusted accordingly.
BTW: Trains should have filled in all the correct paths... If you have time, get the latest trains (0.14.3) and run again to see if the problem persists; we should probably fix that bug 🙂
Yea I know, I reported this
LOL, apologies; these days it's a miracle I still remember my login passwords 🙂
VictoriousPenguin97 I'm assuming the exact same server version?
Yeah. Curious - are a lot of clearml use cases not geared for notebooks?
That is somewhat correct; notebooks are not actually used in a lot of deep-learning projects, as those require an entire repository to support them.
I guess generally speaking the workflow is: "test your code" (i.e. small scale with limited data), then clone and enqueue for remote execution.
That said, I think it will be great to expand the support.
TrickySheep9 I like the idea of context for Tasks, can you expand on how...
BTW: if you only need the git diff you can just copy it from the UI into a txt file and do:
git apply <copied-diff.txt>
VictoriousPenguin97 I'm not sure there is an easy solution; basically you have to edit both MongoDB (artifacts) and Elastic (think debug samples) 🙂
StickyLizard47 apologies for https://github.com/allegroai/clearml-server/issues/140 not being followed up (it probably slipped through the cracks for the backend guys; I can see the 1.5 release happened in parallel). Let me make sure it is followed up.
SarcasticSquirrel56 specifically, did you also spin up a clearml-k8s glue? Or are the agents statically allocated in the helm chart?
... Would not work for huge LLM-style models.
yes I agree... but if the model is small enough you can just keep it in memory ...
Could you give an example of such configurations ?
(e.g. what would be diff from one to another)
BoredHedgehog47 you need to make sure "<path here>/train.py" also calls Task.init (again no need to worry about calling it twice with different project/name)
The Task.init call will make sure the auto-connect works.
BTW: if you do os.fork, then there is no need for the Task.init; the main difference is that Popen starts a whole new process, and we need to make sure the newly created process is auto-connected as well (i.e. by calling Task.init).
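A minimal sketch of that pattern (file names and project/task names are placeholders):
# launcher.py
import subprocess
import sys
from clearml import Task

task = Task.init(project_name="examples", task_name="launcher")  # placeholder names
# Popen spawns a brand-new process, so train.py must call Task.init itself
subprocess.Popen([sys.executable, "train.py"]).wait()

# train.py
from clearml import Task
# attaches to the already-created main task; project/name are effectively ignored here
task = Task.init(project_name="examples", task_name="train")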
Hi @<1694157594333024256:profile|DisturbedParrot38>
You mean how to tell the agent to pull only some submodules of your git?
If this is the case you can actually remove them on your git branch; a submodule is just a file with a soft link. Wdyt?
Hope you don't mind linking to that repo
LOL 🙂
The issue only arises upon sending Images. (Both numpy, mpl and PIL)
BTW: they should appear under the "Debug Samples" tab in the Results section.
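For reference, a minimal sketch of reporting an image explicitly (project/task names are placeholders; report_image accepts a numpy array, a PIL image, or a local file path):
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="image reporting")  # placeholder names
img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
# shows up under the "Debug Samples" tab of the task results
task.get_logger().report_image(title="debug", series="random", iteration=0, image=img)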
Are they expanded in the "api_server"? (I verified on a Linux machine: same error, the env in the api_server is not being resolved.)