What's the error you are getting ?
(open the browser's web developer tools and see if you get anything in the console log)
Hi @<1523701240951738368:profile|RoundMosquito25>
Sure you can 🙂
task = Task.get_task("task_id_here")
metrics = task.get_last_scalar_metrics()
print(metrics[":monitor:gpu"])
shows that the trains-agent is stuck running the first experiment, not the
trains_agent execute --full-monitoring --id a445e40b53c5417da1a6489aad616fee
which is the second trains-agent instance running inside the docker; if the task is aborted, this process should have quit...
Any suggestions on how I can reproduce it?
PlainSquid19 Trains will analyze the entire repository if the code is in a git repo, and a single script file if no repository is found.
It will not analyze an entire folder that is not in a git repository, because it would not be able to recreate that folder anyhow. Makes sense?
Hi @<1524560082761682944:profile|MammothParrot39>
By default you have the last 100 iterations there (not sure why you are only seeing the last 3), but this is configurable.
Hi HandsomeCrow5.
Remember the debug images are events with links to the actual images, so you first have to get the events, and then you can download the images with https://allegro.ai/docs/examples/examples_storagehelper/#storagemanager (which by definition has the credentials, because it was able to upload them 🙂).
To get the events:
from trains.backend_api.session.client import APIClient
client = APIClient()
client.events.debug_images(task='aabbcc')
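If it helps, something along these lines should work. It is just a sketch: StorageManager.get_local_copy is the documented download call, but the nesting of the response (metrics -> iterations -> events -> url) is my assumption, so print the response to check the exact structure you get back:
from trains.backend_api.session.client import APIClient
from trains.storage import StorageManager

client = APIClient()
res = client.events.debug_images(task='aabbcc')

# walk the returned debug-image events and pull a local copy of every referenced image
# (field names below are illustrative, adjust to the structure you actually get back)
for metric in res.metrics:
    for iteration in metric.iterations:
        for event in iteration.events:
            local_path = StorageManager.get_local_copy(remote_url=event['url'])
            print(local_path)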
CloudyHamster42 FYI the warning will not be shown in the next Trains version, the issue is now fixed, thank you 🙂
Regarding the double axes, see if adding plt.clf() helps. It seems the axes are leftover from the previous figure and somehow are still there...
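Something like this is what I mean (just a sketch of where the clf call goes):
import matplotlib.pyplot as plt

# first figure, reported as usual
plt.figure()
plt.plot([1, 2, 3])
plt.show()

# clear the current figure so its axes do not leak into the next one
plt.clf()

plt.plot([3, 2, 1])
plt.show()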
Yep 🙂
Also maybe worth changing the entry point of the agent docker to always create a queue if it is missing?
Hmm, what does your preprocessing code look like?
No worries, you should probably change it to pipe.start(queue='queue') and not start it locally.
Is it working when you are calling it with start locally?
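Roughly like this (a sketch; the project, name and queue values are placeholders):
from clearml import PipelineController

pipe = PipelineController(name='my pipeline', project='examples', version='1.0.0')
# ... pipe.add_step(...) calls go here ...

# enqueue the controller so an agent picks it up, instead of running it locally
pipe.start(queue='queue')

# for local debugging you would do something like:
# pipe.start_locally(run_pipeline_steps_locally=True)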
Whoa, are you saying there's an autoscaler that doesn't use EC2 instances?...
Just to be clear, the ClearML Autoscaler (AWS) will spin instances up/down based on the jobs in the queue it is listening to (the type of EC2 instances and their configuration is fully configurable)
(also, could you make sure all posts regarding the same question are put in the thread of the first post to the channel?)
Hi LudicrousParrot69
A bit of background:
A Task is a job executed in the system (sometimes it is an experiment training, sometimes a controller like the pipeline). Basically every process can be a Task.
Specifically, the pipeline controller itself (i.e. the process running the Bayesian optimization) is a Task in the system (i.e. a job running). What it does (using the HyperParameterOptimizer) is clone previously executed Tasks (e.g. training experiments), change their parameters and moni...
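To make it a bit more concrete, here is a rough sketch of such a controller Task (the parameter name, metric names and the Optuna backend are just placeholders I picked for the example):
from clearml import Task
from clearml.automation import DiscreteParameterRange, HyperParameterOptimizer
from clearml.automation.optuna import OptimizerOptuna

# the controller itself is a Task (i.e. a job running in the system)
task = Task.init(project_name='examples', task_name='HPO controller',
                 task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id='<template_training_task_id>',  # the experiment to clone
    hyper_parameters=[DiscreteParameterRange('General/batch_size', values=[32, 64, 128])],
    objective_metric_title='validation',
    objective_metric_series='accuracy',
    objective_metric_sign='max',
    optimizer_class=OptimizerOptuna,
    max_number_of_concurrent_tasks=2,
)
optimizer.start()  # clones the base Task, changes its parameters and monitors the results
optimizer.wait()
optimizer.stop()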
An upload of 11GB took around 20 hours which cannot be right.
That is very, very slow, roughly 152 KB/s ...
DistressedGoat23 you are correct; since at the end this becomes a plotly object, the extra_layout is for general-purpose layout, but this specific entry is next to the data. Bottom line, can you open a GitHub issue so we do not forget to fix it? In the meantime you can use the general plotly reporting, as SweetBadger76 suggested.
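For reference, the general plotly reporting route looks roughly like this (the figure contents are just an example):
import plotly.graph_objects as go
from clearml import Task

task = Task.init(project_name='examples', task_name='manual plotly report')

# build the figure yourself, so you have full control over the layout
fig = go.Figure(data=go.Bar(x=['a', 'b', 'c'], y=[1, 3, 2]))
fig.update_layout(xaxis_title='category', yaxis_title='count')

task.get_logger().report_plotly(title='my plot', series='bars', figure=fig, iteration=0)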
StraightDog31 can you elaborate? where are the parameters stored? who is trying to access them, and maybe for what purpose ?
Done HandsomeCrow5 +1 added 🙂
btw: if you feel you can share what your reports look like (a screenshot is great), this will greatly help in supporting this feature, thanks
Thank you! 😊
StraightDog31 how did you get these ?
It seems like it is coming from matplotlib, no?
Hmm BitterStarfish58 what's the error you are getting ?
Any chance you are over the free tier quota ?
The driver script (the one that initializes models and a training sequence) was not in a git repo, and besides that one, everything is.
Yes, there is an issue when you have both a git repo and a totally uncommitted file: since ClearML can store either a standalone script or a git repository, the mix of the two is not actually supported. Does that make sense?
Ohh I see, so basically the ASG should check if the agent is idle, rather than whether the Task is running?
yes 🙂
But I think that when you get the internal_task_representation.execution.script you are basically already getting the API object (obviously with the correct version), so you can edit it in place and pass it too.
So as you say, it seems hydra kills these
Hmm let me check in the code, maybe we can somehow hook into it
Btw it seems the docker runs in network=host
Yes, this is so that if you have multiple agents running on the same machine, they can each find a new open port 🙂
I can telnet the port from my mac:
Okay this seems like it is working
It only happens in the clearml environment; it works fine locally.
Hi BoredHedgehog47
what do you mean by "in the clearml environment" ?
Hi @<1690896098534625280:profile|NarrowWoodpecker99>
Once a model is loaded into GPU memory for the first time, does it stay loaded across subsequent requests,
yes it does.
Are there configuration options available that allow us to control this behavior?
I'm assuming you're thinking of dynamically loading/unloading models from memory based on requests?
I wish Triton added that 🙂 this is not trivial, and in reality, to be fast enough, the model has to live in RAM and then be moved to the GPU (...
from the notebook run !ls ~/clearml.conf