MysteriousBee56 , the agent is not running on the "server", it's running on its own machine.
The server just reflects the fact that the agent is up.
To actually take it down you need to SSH (or connect to that machine) and stop the actual trains-agent process.
What is exactly the scenario you had in mind?
Well that depends on how you think about the automation. If you are running your experiments manually (i.e. you specifically call/execute them), then at the beginning of each experiment (or function) call Task.init
and when you are done call Task.close
. This can be done in parallel if you are running them from separate processes.
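For example, a minimal sketch of that manual flow (project/task names here are just placeholders):
from clearml import Task  # on older installations this would be: from trains import Task

def run_experiment(params):
    # each call creates (and later closes) its own Task
    task = Task.init(project_name='examples', task_name='my experiment')
    task.connect(params)  # log this run's parameters
    # ... training / evaluation code ...
    task.close()  # close before starting the next experiment in the same process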
If you want to automate the process, you can start using the trains-agent
which could help you spin those experiments on as many machines as you l...
RobustRat47
What exactly is the error you are getting ? (I remember only the latest Triton solved some issue there)
Another (minor) issue is that all the packages that are installed using git+https are cloned and installed twice, immediately one after the other
Yes this is so that we can better log the installed package name, not a major issue, but we just fixed a bug with derivative packages from git packages.
https://github.com/allegroai/trains/issues/196
UnevenDolphin73 go to the profile page, I think at the bottom right corner you should see it
(Also ctrl-F5 to reload the web application, if you upgraded the server 🙂 )
python version to be used and conda will install it
clearml does that automatically (albeit it is not shown in the UI, which should be fixed)
K8s can schedule pod with different priorities.
I'm not sure I agree here, could you refer me to the docs on this ability in k8s ?
So maybe "no real scheduling" means there is no ClearML scheduling once the pod is applied to k8s.
That is correct 🙂
Will it be implemented in the future?
Yes, this is an enterprise feature; in the community version you can specify a --max-pods limit (which will cause it to never pull a job if it hits the max-pod limit)
I see, let me check the code and get back to you, this seems indeed like an issue with the Triton configuration in the model monitoring scenario.
Is it possible to do something so that changing the server address is supported, and the pictures are pulled up on the new server from the new server?
The link itself (full link) is stored inside the server. Can I assume the access is IP based not host based (i.e. dns) ?
So I think this is a good example of pipelines and data:
Basically Task A generates data stored using clearml-data
(See Dataset class). The output of that is an ID of the Dataset. Then Task B uses that ID to retrieve the Dataset created by Task A.
documentation
https://github.com/allegroai/clearml/blob/master/docs/datasets.md
Example:
Step A creating Dataset:
https://github.com/alguchg/clearml-demo/blob/main/process_dataset.py
Step B training model using the Dataset created in ...
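Roughly, the two steps look like this (dataset/project names and paths are illustrative, see the linked examples for the real code):
from clearml import Dataset

# Step A - create and upload a Dataset, then hand its ID to the next step
ds = Dataset.create(dataset_name='my_dataset', dataset_project='my_project')
ds.add_files('./data')  # local folder with the generated data
ds.upload()
ds.finalize()
dataset_id = ds.id

# Step B - retrieve the Dataset created by Step A and train on it
local_path = Dataset.get(dataset_id=dataset_id).get_local_copy()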
JitteryCoyote63 what am I missing?
What are the errors you are getting (with / without the envs)
Hi @<1578555761724755968:profile|GrievingKoala83>
mount s3 as a cache folder
I'm not sure that would be fast enough for cache ...
How to override /root/.cache/pip path?
in your clearml.conf file:
None
then set it to your PV
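Something along these lines should do it (assuming the agent.docker_pip_cache setting, which is the host folder mounted as the pip cache inside the container; the PV path is a placeholder):
agent {
    # host folder mounted as the pip cache inside the docker - point it at your PV
    docker_pip_cache: "/mnt/my-pv/pip-cache"
}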
the latter is an ec2 instance
and the agent fails to install on the ec2 machine ?
Maybe before everything else, can you share some background on the rationale of starting a new subprocess?
Hey GiganticTurtle0 ,
So basically the issue is that the pipeline function ( prediction_service ) is getting a dict as input, and it is expecting to get basic types... if you were to do the following, it would have worked as expected:
prediction_service(**default_config)
I will make sure we flatten any dictionary so that we end up with config/start, instead of a serialized version of the dict.
wdyt?
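i.e. something like this (the values are just an illustration):
default_config = {'config': {'start': 0, 'end': 10}}

# passing the dict itself ends up as one serialized argument:
#   prediction_service(default_config)
# unpacking it passes every key as its own argument, so it can be logged as config/start etc.:
prediction_service(**default_config)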
packages are updated, and I don't know which python version I get, + changing the python version of the OS is not really recommended
Wait I'm confused, this is inside a container, no?
and the python version running my code should not depend of the python version running the clearml-agent (especially for experiments running in containers)
Generally speaking you are correct, but some packages will not have the same version for all python versions
Specifically in this case I think...
Okay, I think I understand, but I'm missing something. It seems you call get_parameters from the old API, is your code actually calling get_parameters? The trains-agent runs the code externally, whatever happens inside the agent should have no effect on the code. So who exactly is calling task.get_parameters, and well, why? :)
some dependencies will sometimes require different pip versions.
none 🙂 maybe setuptools, but not pip version
(pip is just a utility to install packages, it will not be a dependency of one)
Create one experiment (I guess in the scheduler)
task = Task.init('test', 'one big experiment')
Then make sure that the scheduler creates the "main" process as a subprocess (basically the default behavior).
Then the sub process can call Task.init and it will get the scheduler Task (i.e. it will not create a new task). Just make sure they all call Task.init with the same task name and the same project name.
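Roughly (using the same project/task name in both processes is what makes the subprocess reuse the scheduler's Task):
from multiprocessing import Process
from clearml import Task

def run_one():
    # returns the scheduler's Task instead of creating a new one
    task = Task.init('test', 'one big experiment')
    # ... experiment code ...

if __name__ == '__main__':
    task = Task.init('test', 'one big experiment')  # the scheduler's "main" Task
    p = Process(target=run_one)
    p.start()
    p.join()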
These both point to nvidia docker runtime installation issue.
I'm assuming that in both cases you cannot run the docker manually as well, which is essentially what the agent will have to do ...
From the top:
1. trains-agent pulls a service Task
2. The Task is marked as running, and the trains-agent worker points to the Task
3. Docker is spun up
4. The environment is installed inside the docker (results are shown in the service Task log)
5. trains-agent inside the docker is launched, a new node appears in the system as <host_agent_name>:service:<task_id>, and the Task service is listed as running on it
6. The main trains-agent is back to idle, and its worker now has no experiment listed as running
Where do you think it breaks?
It will not create another 100 tasks, they will all use the main Task. Think of it as if they "inherit" it from the main process. If the main process never created a task (i.e. no call to Task.init) then they will create their own tasks (i.e. each one will create its own task and you will end up with 100 tasks)
@<1687643893996195840:profile|RoundCat60> can you access the web UI over https ?
MelancholyBeetle72 thanks! I'll see if we could release an RC with a fix soon, for you to test :)
Ok no, it only helps as long as I don't log the figure.
you mean if you create the matplotlib figure without the automagic connect you still see the mem leak ?
MysteriousBee56 what do you mean "save Scalars on the machine"? All metrics are sent to the trains server. You can later retrieve them from code, if you need.
Hi FranticCormorant35
So Tasks have a parent field, that would link one to another.
Unfortunately there is no visual representation for it.
What we did with the hyper-parameter for example, was also to add a tag with the ID of the "parent" Task. This would make sense if you have multiple tasks all generated from the same "parent", like in hyper-parameter optimization.
What's your use case ? Is it a single evaluation Task per training, or multiple, or a cron-job alike ?
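A minimal sketch of the parent-tag idea above (the parent ID here is a placeholder):
from clearml import Task

task = Task.init(project_name='my_project', task_name='evaluation')
task.add_tags(['parent: <parent_task_id>'])  # tag the child with the ID of the "parent" Task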
RobustRat47 are you saying updating the nvidia drivers solved the issue ?