
I see now.
Let's assume you know which snapshot that was:
` from clearml import Task
prev_task = Task.get_task(task_id='the_first_training_task_id')
# get the second from last checkpoint
prev_checkpoint_url = prev_task.models['output'][-2].url
prev_scalars = prev_task.get_reported_scalars()
new_task = Task.init('example', 'new task')
logger = new_task.get_logger()
# do some for loop and report the prev_scalars with logger.report_scalar
new_task.flush(wait_for_uploads=True)
new_task.set_initial_iteration(22000)
# start the training `
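The "for loop" step above could look roughly like the sketch below. It assumes `get_reported_scalars()` returns a nested dict of the form `{title: {series: {"x": [...], "y": [...]}}}`; that shape is an assumption here, so check it on your own task first. `replay_scalars` is a hypothetical helper name, not part of the SDK.

```python
from typing import Callable, Dict, List

# Assumed shape of Task.get_reported_scalars():
# {title: {series: {"x": [iterations...], "y": [values...]}}}
ScalarDict = Dict[str, Dict[str, Dict[str, List[float]]]]


def replay_scalars(prev_scalars: ScalarDict, report: Callable[..., None]) -> int:
    """Re-report every stored point through `report`; return the number of points sent."""
    count = 0
    for title, series_map in prev_scalars.items():
        for series, points in series_map.items():
            # "x" holds the iteration numbers, "y" the reported values
            for iteration, value in zip(points["x"], points["y"]):
                report(title=title, series=series, value=value, iteration=int(iteration))
                count += 1
    return count


# Intended use (not executed here):
#   replay_scalars(prev_scalars, logger.report_scalar)
```

Passing the reporting function in as a callable keeps the loop easy to try out without a server.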
The package is just subdir by the way. So it should not be in installed packages anyways, right?
Correct. Also, when the agent spins up the code, it automatically adds the root of the git repository to the PYTHONPATH, so you should be able to load the package.
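To illustrate why no install is needed: a package that lives in a plain sub-directory becomes importable as soon as its parent folder is on the Python path. The sketch below reproduces that with a throwaway folder; `my_subdir_pkg` and `import_from_repo_root` are hypothetical names, and this only mimics the effect, not the agent's actual code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_repo_root(repo_root: str, package: str):
    """Import `package` from `repo_root` by putting the root on sys.path."""
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    importlib.invalidate_caches()
    return importlib.import_module(package)


# Build a throwaway "repository" with a package living in a plain sub-directory.
repo_root = tempfile.mkdtemp()
package_dir = Path(repo_root) / "my_subdir_pkg"  # hypothetical package name
package_dir.mkdir()
(package_dir / "__init__.py").write_text("ANSWER = 42\n")

module = import_from_repo_root(repo_root, "my_subdir_pkg")
print(module.ANSWER)
```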
Can I add user properties to a scheduler configuration?
Please expand: what do you mean by user property, and how would one use it?
So essentially, the server helm chart creates a randomly generated secret pair and deploys it as a shared k8s secret that pods can access.
This is the tricky part: for the helm chart to be able to create it, it has to be able to log in to the server, which means there is a secret embedded in the helm chart that lets you access the default server. You see my point?
ElegantKangaroo44 my bad, I missed the nuance in the description
There seems to be an issue in the web UI -> viewing plots in "view in experiment table" doesn't respect the "scalars to display" one sets when viewing in "view in fullscreen".
Yes, the info-panel does not respect the full view selection. It's on the to-do list to add this ability, but it is still not implemented...
2 and 3 - I want to manage access control over the RestAPI
Long story short: put a load-balancer in front of the entire thing (see the k8s setup), and have the load-balancer verify a JWT token as authentication (this is usually the easiest)
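For reference, this is roughly what "verify a JWT token" means on the load-balancer side. It is a minimal stdlib sketch that assumes HS256 tokens signed with a shared secret; real load-balancers typically do this via their own built-in JWT support, and the function names here are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    # JWT segments drop the base64 padding, so restore it before decoding
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Build a compact HS256 token (only here to produce input for the check)."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(parts)
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url_encode(signature)


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the signature and expiry check out, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError:
        return None
    if not isinstance(header, dict) or not isinstance(payload, dict):
        return None
    if header.get("alg") != "HS256":
        return None
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    exp = payload.get("exp")
    current = time.time() if now is None else now
    if exp is not None and (not isinstance(exp, (int, float)) or current >= exp):
        return None
    return payload
```

Requests whose token fails the check would be rejected before they ever reach the API server.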
1- Exactly, custom code
Yes, we need to add a custom example there (somehow forgotten)
Could you open an Issue for that?
in the meantime:
` # Preprocess class Must be named "Preprocess"
# No need to inherit or to implement all methods
lass P...
ElegantKangaroo44 it seems to work here?!
https://demoapp.trains.allegro.ai/projects/0e152d03acf94ae4bb1f3787e293a9f5/experiments/48907bb6e870479f8b230e6b564cd52e/output/metrics/plots
What's the error you are getting ?
Awesome! Any chance you feel like contributing it? I'm sure ppl would be thrilled
Yes, because when a container is executed, the agent creates a new venv and inherits from the system wide installed packages, but it cannot inherit or "understand" there is an existing venv, and where it is.
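As a point of comparison, the stdlib `venv` module has the same notion of "inherit from the system-wide packages". The sketch below is only an illustration of that flag, not necessarily how the agent builds its environment; `create_inheriting_venv` is a hypothetical helper.

```python
import tempfile
import venv
from pathlib import Path


def create_inheriting_venv(target_dir: str) -> dict:
    """Create a venv that can see system-wide packages; return its parsed pyvenv.cfg."""
    # with_pip=False keeps the example fast; system_site_packages is the inheritance switch
    venv.create(target_dir, system_site_packages=True, with_pip=False)
    config = {}
    for line in (Path(target_dir) / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            config[key.strip()] = value.strip()
    return config


config = create_inheriting_venv(tempfile.mkdtemp())
print(config["include-system-site-packages"])
```

Note that the flag only covers the interpreter's system site-packages; nothing in it points at another, pre-existing venv, which matches the limitation described above.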
The problem is that even when I mount the SSH key into the root home directory (e.g., /root/.ssh/id_rsa with the correct permissions set to 400) I still encounter the same error.
The agent automatically mounts the .ssh folder from the host into the container, making sure all the permissions are set,
How can I run pip install -e . ?
In general the agent will add the "working" dir into the PYTHONPATH, so you should not have to manually do "-e ."
Tha...
I call Task.init after I import tensorflow (and thus tensorboard?)
That should have worked...
Can you manually add a TB report before calling the opennmt function?
(I want to verify that Task.init is indeed catching the TB calls; my theory is that somewhere inside opennmt we lose the TB)
Oh no, I wonder if this is connected to:
Any chance the logger is running (or you have) from a subprocess ?
SweetGiraffe8 Works when I'm using plotly...
Can you please copy paste the code with the plotly, it's probably something I'm missing
for a TPU with more than 16GB GRAM and less than 40GB, so sometimes we need to provision an A100 to get the training speed we want, but we don't use all the GRAM
Oh that makes sense...
Just saw this one, this might help?
https://www.globenewswire.com/news-release/2022/10/24/2539924/0/en/ClearML-and-Genesis-Cloud-Announce-New-MLOps-Partnership-Delivering-100-Green-Energy-Compute-Solution-for-Machine-Learning.html
Hi ScaryJellyfish75
These hyperparameters are now in the "Args" section of my ClearML task
Sure that would probably mean
UniformParameterRange(
"Args/training/optimizer/lr",
min_value=0.00025,
max_value=0.01,
step_size=0.00025,
),
assuming your Task has training/optimizer/lr in its Args section (under the configuration tab). Make sense?
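To get a feel for the size of that search space, here is a plain-Python sketch of the grid those three numbers describe. It assumes the range is sampled on evenly spaced steps and that the maximum is included; `candidate_values` is a hypothetical helper, not the optimizer's own code.

```python
from typing import List


def candidate_values(min_value: float, max_value: float, step_size: float,
                     include_max: bool = True) -> List[float]:
    """List the evenly spaced values between min_value and max_value."""
    steps = round((max_value - min_value) / step_size)
    last = steps if include_max else steps - 1
    # round() trims floating point noise such as 0.0007500000000000001
    return [round(min_value + i * step_size, 10) for i in range(last + 1)]


values = candidate_values(0.00025, 0.01, 0.00025)
print(len(values), values[0], values[-1])
```

Under those assumptions the range above covers 40 candidate learning rates, from 0.00025 up to 0.01.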
DrabCockroach54 that is quite cool!
Basically here is what I would do
Query Tasks that are both Running and do not have the system tag "development" (that means they are running on agents) + filter only tasks that started, say, 10 min ago. Go over the list and see if (1) they have a GPU scalar reported (meaning the GPU is accessible) and (2) the min/max/val of GPU utilization is under 5%. wdyt?
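The per-task check in steps (1) and (2) could be sketched like this. It assumes the scalars come back as `{metric: {variant: {"x": [...], "y": [...]}}}` and that GPU readings live under a `:monitor:gpu` metric with variants ending in `_utilization`; both are assumptions to verify on a real task, and `gpu_is_idle` is a hypothetical helper.

```python
from typing import Dict, List, Optional

ScalarDict = Dict[str, Dict[str, Dict[str, List[float]]]]


def gpu_is_idle(scalars: ScalarDict, threshold: float = 5.0,
                metric: str = ":monitor:gpu",
                variant_suffix: str = "_utilization") -> Optional[bool]:
    """Return None if no GPU utilization was reported, else whether it stayed under threshold."""
    gpu_metric = scalars.get(metric)
    if not gpu_metric:
        return None  # step (1) failed: no GPU scalar, so the GPU is probably not accessible
    readings: List[float] = []
    for variant, points in gpu_metric.items():
        if variant.endswith(variant_suffix):
            readings.extend(points.get("y", []))
    if not readings:
        return None
    # step (2): every reading, across all GPUs, stayed below the threshold
    return max(readings) < threshold
```

Returning `None` separately from `False` keeps "no GPU visible at all" apart from "GPU visible and busy".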
I have a question regarding running the code on the remote machine. Each time I run the code, I see the console in the ClearML server start downloading all the libraries I used in the code, and when I run another code the same thing happens. Why does it have to download all the libraries again each time?
I'm assuming you are referring to the installation, the downloaded python packages are cached.
You can turn on full caching by uncommenting the following line:
https://github.com/alleg...
Query tasks that are both Running --> You mean status=["in_progress"]?
Yes!
How do I figure out which other values I can use with the status parameter?
https://clear.ml/docs/latest/docs/references/api/tasks#post-tasksget_all
https://clear.ml/docs/latest/docs/references/api/definitions#taskstask
Filter only tasks that start say 10 min ago. Is there any parameter for it also?
last_update or created then use...
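Putting the pieces of this thread together, the query filter might be assembled along these lines. Treat it as a sketch: the `"-development"` exclusion prefix and the `"<timestamp"` comparison form are assumptions about the server's query syntax, and `build_task_filter` is a hypothetical helper, so check the API reference linked above before relying on it.

```python
from datetime import datetime, timedelta


def build_task_filter(now: datetime, min_age_minutes: int = 10,
                      date_field: str = "created") -> dict:
    """Build a filter for running, agent-executed tasks older than min_age_minutes."""
    cutoff = now - timedelta(minutes=min_age_minutes)
    return {
        "status": ["in_progress"],
        # assumed syntax: a leading "-" excludes tasks carrying this system tag
        "system_tags": ["-development"],
        # assumed syntax: "<timestamp" keeps tasks whose date field is before the cutoff
        date_field: ["<{}".format(cutoff)],
    }


task_filter = build_task_filter(datetime(2024, 1, 1, 12, 0, 0))
print(task_filter)

# Intended use (assumption, not executed here):
#   from clearml import Task
#   tasks = Task.get_tasks(task_filter=task_filter)
```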
This would be a good example?
https://github.com/allegroai/clearml/blob/master/examples/services/monitoring/slack_alerts.py
Could it be someone deleted the file? this is inside the temp venv folder but it should not get there
You should have metric :monitor:gpu, variant gpu_0_utilization
Since I see you have none of those, that points to no GPU driver...
Could that be?
Yeah the docstring is always the most updated
BTW: the new documentation should contain a full search over the docstring
CrookedWalrus33
Force SSH git authentication, it will auto mount the .ssh from the host to the docker
https://github.com/allegroai/clearml-agent/blob/6c5087e425bcc9911c78751e2a6ae3e1c0640180/docs/clearml.conf#L25
I tested and I have no more warning messages
if self._active_gpus and i not in self._active_gpus: continue
This solved it?
If so, PR pretty please
Hi IrritableJellyfish76
https://clear.ml/docs/latest/docs/references/sdk/task#taskget_tasks
task_name (str) - The full name or partial name of the Tasks to match within the specified project_name (or all projects if project_name is None). This method supports regular expressions for name matching. (Optional)
You are right, this is a bit confusing, I will make sure that we add in the docstring an examp...
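In the meantime, here is a local sketch of the difference between partial and exact name matching. It assumes the server-side matching behaves like Python's `re.search`, which is an assumption; `match_task_names` and the task names are made up for illustration.

```python
import re
from typing import List


def match_task_names(names: List[str], pattern: str) -> List[str]:
    """Return the names in which `pattern` matches anywhere (partial match)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


names = ["train resnet", "train resnet v2", "eval resnet"]  # made-up task names

partial = match_task_names(names, "train resnet")
# Anchoring and escaping the name turns the partial match into an exact one
exact = match_task_names(names, "^" + re.escape("train resnet") + "$")
print(partial, exact)
```

So passing a plain name would also pick up longer names that contain it, while the anchored pattern only returns the one task.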
Still, it is a ChatGPT interface, correct?
Actually, no. And we will change the wording on the website so it is more intuitive to understand.
The idea is you actually train your own model (not chatgpt/openai) and use that model internally, which means everything is done inside your organisation, from data through training and ending with deployment. Does that make sense ?