Manually I was installing the leap package through python -m pip install . when building the docker container.
NaughtyFish36 what happens if you add /opt/keras-hannd to your "installed packages"? This should translate to "pip install /opt/keras-hannd", which seems like exactly what you want, no ?
without the ClearML Server in-between.
You mean the upload/download is slow? What is the reasoning behind removing the ClearML server ?
ClearML Agent per step
You can use the ClearML agent to build a docker image per Task, so all you need is just to run the docker. Will that help ?
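For example, a hedged sketch (the Task ID and target image name are placeholders):
```
clearml-agent build --id <task-id> --docker --target my-task-image
```
The built image should contain the Task's full environment, so it can be run directly.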
seems like I'm passing in my own docker image which is then used at run time?
You are passing the Default docker image, if the Task does not list a specific docker image it will use the one you passed.
Yes this is "docker mode" (in venv mode no dockers are used, it just creates a new venv per experiment and installs everything inside the venv)
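For reference, this is roughly how the two modes are started (queue name and default image here are just examples):
```
# venv mode: a new virtualenv is created per experiment
clearml-agent daemon --queue default

# docker mode: each Task runs inside a container; the image here is the default/fallback
clearml-agent daemon --queue default --docker nvidia/cuda:11.8.0-runtime-ubuntu22.04
```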
ldconfig from /etc/profile which is put there by the interactive_session_task
LackadaisicalOtter14 are you sure ? maybe this is done as part of the installation the interactive session runs ?
Could that be the issue ?
```
apt-get update && apt-get install -y openssh-server
```
On my to do list, but will have to wait for later this week (feel free to ping on this thread to remind me).
Regarding the issue at hand, let me check the requirements it is using.
BTW:
```
str('\.')
Out[4]: '\\.'
str(('\.', ))
Out[5]: "('\\\\.',)"
```
This is just python str casting
Hi DrabCockroach54
I think the Kubernetes integration (k8s glue) is not part of the open-source features, and is only available as an enterprise feature 😞
BoredHedgehog47 this is basically a wizard explaining the steps, see the 3 tabs 🙂
BTW, you can launch an experiment directly from CLI with clearml-task
https://clear.ml/docs/latest/docs/apps/clearml_task
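For example (the repo URL, script name, and queue are placeholders):
```
clearml-task --project examples --name remote-run \
  --repo https://github.com/<user>/<repo>.git --script train.py \
  --queue default
```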
EnviousStarfish54 you can also run the docker-compose on one of the machines on your local LAN, but then you will not be able to access it from home 🙂
off the top of my head, the self-hosted version is missing the autoscalers (there is an AWS CLI, but no UI or others), and is also missing the HPO UI feature,
but you should just check the detailed table here: None
Yes, actually ensuring pip is there cannot be skipped (I think in the past it caused too many issues, hence the version limit etc.)
Are you saying it takes a lot of time when running? How long is the actual process that the Task is running (just to normalize times here)?
2 and 3 - I want to manage access control over the RestAPI
Long story short, put a load-balancer in front of the entire thing (see the k8s setup), and have the load-balancer verify JWT token as authentication (this is usually the easiest)
1- Exactly, custom code
Yes, we need to add a custom example there (somehow forgotten)
Could you open an Issue for that?
in the meantime:
```
# Preprocess class Must be named "Preprocess"
# No need to inherit or to implement all methods
class P...
```
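For context, a minimal sketch of what such a class usually looks like in clearml-serving (the method bodies here are assumptions, check the clearml-serving examples for the exact interface):
```
class Preprocess(object):  # must be named "Preprocess"
    def __init__(self):
        # called once when the endpoint is loaded, no arguments
        pass

    def preprocess(self, body, state, collect_custom_statistics_fn=None):
        # turn the raw request body into the model input
        return body["data"]  # assumption: the request carries a "data" field

    def postprocess(self, data, state, collect_custom_statistics_fn=None):
        # turn the model output into the response payload
        return {"result": data}
```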
SmarmySeaurchin8
```
updated_tags = task.tags
updated_tags.remove(tag)
task.tags = updated_tags
```
Hi PanickyMoth78
can receive access to a GCP project and use GKE to spin clusters up and workers or would that be on the customer to manage.
It does, and also supports AWS.
That said, only the AWS one is part of the open-source, but both are part of the paid tier (I think Azure is in testing)
Hi HelpfulHare30
I mean situations when training is long and its parts can be parallelized in some way like in Spark or Dask
Yes, that makes sense. In both cases the function we are parallelizing is usually bottlenecked on both data & CPU, and both frameworks try to split & stream the data.
ClearML does not do data split & stream, but what you can do is launch multiple Tasks from a single "controller" and collect the results. I think that one of the main differences is that a ClearML Task is ...
Can I get gpu usage over a time frame via API also?
task.get_reported_scalars()
But this will get you all the scalars. I think the next version of the server supports asking for a specific one as well.
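A minimal sketch (the task ID is a placeholder, and the exact machine-stats title/series names are an assumption, so check what your server actually reports):
```
from clearml import Task

task = Task.get_task(task_id="<task-id>")
scalars = task.get_reported_scalars()
# scalars is a dict: {title: {series: {"x": [...], "y": [...]}}}
# GPU monitoring is usually reported under a ":monitor:gpu" title
gpu = scalars.get(":monitor:gpu", {})
print(list(gpu.keys()))  # e.g. "gpu_0_mem_usage", "gpu_0_mem_used_gb"
```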
How are you implementing the alert monitoring?
Is it a stateless process starting every X min, or is it a stateful process running and monitoring ?
Check on which queue the HPO puts the Tasks, and if the agent is listening to these queues
TroubledHedgehog16 if you have a preinstalled conda env then why would you need to reinstall it from the yml file? Also if this is the default python env, clearml-agent will inherit from it and use it (no real overhead there)
Notice the reason for "inheriting system" python environments is so that the agent could cache the individual Task requirements, meaning next time it will not need to reinstall anything
wdyt?
DeliciousBluewhale87 fyi, the new version of the pipeline (hopefully pushed towards the end of this week) will allow you to more easily write steps as functions (not only as Tasks, as in the current implementation)
Also check the new Trigger and Scheduler both intended to trigger these pipelines:
https://github.com/allegroai/clearml/blob/fe3c481c37e70881c44d67c1cf9bbce00a84747e/clearml/automation/scheduler.py#L457
https://github.com/allegroai/clearml/blob/fe3c481c37e70881c44d67c1cf9bbce00a8...
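For example, a hedged sketch of the scheduler (the task ID, queue, and schedule are assumptions):
```
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
# clone & enqueue an existing (pipeline) Task every day at 06:00
scheduler.add_task(
    schedule_task_id="<pipeline-task-id>",
    queue="services",
    hour=6,
    minute=0,
)
scheduler.start_remotely(queue="services")
```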
WackyRabbit7 the auto detection will only list packages you directly import (so that we do not end up with bloated venvs)
It seems that the transformers library does not have it as a requirement, otherwise it would have pulled it...
In your code you can always do either:
import torch
or
Task.add_requirements('torch')
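One detail: as far as I recall, add_requirements has to be called before Task.init, e.g. (project/task names here are placeholders):
```
from clearml import Task

# call add_requirements before Task.init so it is picked up
Task.add_requirements("torch")  # optionally pin a version: Task.add_requirements("torch", "2.1.0")
task = Task.init(project_name="examples", task_name="requirements demo")
```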
A few implementation / design details:
When you run code with Trains (and call init) it will record your environment (python packages, git code, uncommitted changes etc). Everything is stored on the Task object in the trains-server. When you clone a task you literally create a copy of the Task object (i.e. a second experiment). On the cloned experiment, you can edit everything (parameters, git, base docker image etc). When you enqueue a Task you add its ID to the execution queue list, a trains-a...
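In code, the clone/edit/enqueue flow looks roughly like this (project/task names, the parameter, and the queue are placeholders; this uses the current clearml package names):
```
from clearml import Task

template = Task.get_task(project_name="examples", task_name="my experiment")
cloned = Task.clone(source_task=template, name="cloned experiment")
cloned.set_parameter("General/learning_rate", 0.001)  # edit anything before enqueueing
Task.enqueue(task=cloned, queue_name="default")       # an agent will pull it from the queue
```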
NaughtyFish36
No module named 'leap.learn.data_tools.merge_data.merge_data'
This seems to be the error but I cannot see leap in the installed packages
Notice that if the Task has an "Installed Packages" section then the agent will use that, Not the "requirements.txt". Only if this section is empty will it revert to the "requirements.txt" in the repo.
How did you create the Task in the first place?
I see that you added "leap" into the initial bashscript, actually you should add i...
Okay, this is more complicated but possible.
The idea is to write a glue layer (service) that pulls from the (i.e. UI) queue,
sets up the slurm job,
and puts it in a pending queue (so you know the job is waiting in the slurm scheduler).
There is a template here:
https://github.com/allegroai/trains-agent/blob/master/trains_agent/glue/k8s.py
I would love to help and set up a slurm glue in a similar manner
what do you think?
That is correct.
Obviously once it is in the system, you can just clone/edit/enqueue it.
Running it once is a means to populate the trains-server.
Make sense ?
What if I register the artifact manually?
task.upload_artifact('local folder', artifact_object='...')
This one should be quite quick, it's updating the experiment
while I'm looking to upload local weights
Oh, so this is not "importing an uploaded (existing) model" but manually creating a Model.
The easiest way to do that is actually to create a Task for the Model uploading, because the model itself will be uploaded to a unique destination path, and this is built on top of the Task.
Does that make sense ?
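A minimal sketch of that approach (project/task names, framework, and the weights filename are assumptions):
```
from clearml import Task, OutputModel

# a dedicated Task whose only job is registering the local weights
task = Task.init(project_name="models", task_name="register local weights")
output_model = OutputModel(task=task, framework="PyTorch")
output_model.update_weights(weights_filename="model.pt")  # uploads to the Task's output destination
```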
Hi DrabCockroach54
Do we know if gpu_0_mem_usage and gpu_0_mem_used_gb, both shows current GPU usage?
the first is the percentage used (memory % used at any specific moment) and the second is the memory used in GiB, both for the video memory
How to know from this how much GPU is reserved for the task if this task is in progress?
What do you mean by how much is reserved ? Are you running with an agent?
Which means you currently save the argument after resolving and I'm looking to save them explicitly so the user will not forget to change some dependencies.
That is correct
I'm looking to save them explicitly so the user will not forget to change some dependencies.
Hmm interesting point. What's the use case for storing the values before the resolving ?
Do we want to store both ?
The main reason for storing the post-resolve values is that you have full visibility into the actual...
BTW: CloudyHamster42 I think this issue was discussed on GitHub, and the final "verdict" was we should have an option to split/combine graphs on the UI side (i.e. similar to the "smoothing" or wall-time axis etc.)