I'm using the default operation mode which uses kubectl run. Should I use templates and specify a service in there to be able to connect to the pods?
Ohh, the default "kubectl run" does not support the "ports-mode" 😞
There's a static number of pods for which services are created…
You got it! 🙂
we can add non-clearml code as a step in the pipeline controller.
Yes 🙂, btw you can kind of already do that with pre/post function callbacks (notice they run in the same scope as the actual pipeline controller).
What exactly did you have in mind to put there?
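The pre/post callback idea can be sketched in plain Python. This is only an illustration of the hook pattern (hooks called from the controller's own scope around each step); the function names and signatures below are made up, not ClearML's actual API:

```python
# Minimal sketch of the pre/post callback pattern: a controller runs each
# step and calls user-supplied hooks before and after, from its own scope.
# All names here (run_step, pre_hook, post_hook) are illustrative only.

def pre_hook(step_name, params):
    # Runs before the step; may inspect or tweak the step parameters.
    params["checked"] = True
    return params

def post_hook(step_name, result):
    # Runs after the step completes; may inspect or transform the result.
    return {"step": step_name, "result": result}

def run_step(name, params, pre=None, post=None):
    if pre:
        params = pre(name, params)
    result = sum(params.get("values", []))  # stand-in for the real step work
    return post(name, result) if post else result

out = run_step("train", {"values": [1, 2, 3]}, pre=pre_hook, post=post_hook)
print(out)
```

The point is simply that the hooks run inline with the controller, so they can see and modify what the controller is about to do.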
but actually that path doesn't exist and it is giving me an error
So you are saying you only uploaded the "meta-data" i.e. a text file with links to the files, and this is why it is missing?
Is there a way to change the path inside the .txt file to clearml cache, because my images are stored in clearml cache only
I think a good solution would be to store the paths in the txt file as relative paths, i.e. instead of /Users/adityachaudhry/data/folder... use ./data/folder
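Rewriting such a listing is a one-liner with the standard library. A minimal sketch (the file name and base prefix below are hypothetical, mirroring the absolute-path example above):

```python
import os

def make_relative(lines, base):
    # Rewrite absolute paths that start with `base` as ./relative paths;
    # leave any other lines untouched.
    out = []
    for line in lines:
        path = line.strip()
        if path.startswith(base):
            out.append("./" + os.path.relpath(path, base))
        else:
            out.append(path)
    return out

# Hypothetical listing, one image path per line
lines = ["/Users/adityachaudhry/data/folder/img_001.jpg"]
print(make_relative(lines, "/Users/adityachaudhry"))
```

With relative paths stored, the same txt file resolves correctly wherever the dataset root happens to be cached.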
Hi @<1791277437087125504:profile|BrightDog7>
Seems like mostly a proportion change: the data is the same, but the layout on the web is wider, hence the difference (btw, you can download the data from the web UI as JSON to double-check)
Notice that it tries to convert it to "interactive" data points for easier zooming etc, and that's probably the cause of the proportion change.
You can force an image (like what you get directly from matplotlib):
logger.report_matplotlib_figure(
title='NLLs...
So if I pass a function that pulls the most recent version of a Task, it'll grab the most recent version every time it's scheduled?
Basically your function will be called, that's it.
What I'm assuming is that you would want that function to find the latest Task (i.e. query & filter based on project/name/tag etc.), clone the selected Task, and enqueue it,
is that correct?
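The query-and-pick-latest logic itself is simple. Sketched here in plain Python over hypothetical task records; the real lookup would go through the ClearML API, and these dicts are just stand-ins:

```python
def latest_task(tasks, project=None, tag=None):
    # Filter by project/tag, then pick the most recently updated record.
    candidates = [
        t for t in tasks
        if (project is None or t["project"] == project)
        and (tag is None or tag in t.get("tags", []))
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t["last_update"])

# Hypothetical records with a last_update timestamp
tasks = [
    {"id": "a", "project": "demo", "tags": ["prod"], "last_update": 100},
    {"id": "b", "project": "demo", "tags": ["prod"], "last_update": 250},
    {"id": "c", "project": "other", "tags": [], "last_update": 300},
]
print(latest_task(tasks, project="demo", tag="prod")["id"])
```

Your scheduled function would run this kind of selection, then clone and enqueue whichever Task it picked, so each trigger always acts on the newest match.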
BoredHedgehog47 can you provide some logs, this is odd..
Yes, I think you are correct, verified on Firefox & Chrome. I'll make sure to pass it along.
Thanks SteadyFox10 !
Hi LovelyHamster1
As you noted, passing overrides in Args/overrides, for example ['training.max_epochs=1000'],
should work when running with the agent.
Could you verify with the latest RC? There was a fix to support the latest Hydra version: pip install clearml==0.17.5rc5
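For intuition, an override like training.max_epochs=1000 just sets a dotted key path inside the composed config. A simplified plain-Python stand-in for that behavior (this is not Hydra's implementation, only an illustration of the override semantics):

```python
def apply_override(config, override):
    # Parse "a.b.c=value" and set it inside a nested dict,
    # creating intermediate levels as needed. Values that look like
    # integers are converted; everything else stays a string.
    key_path, _, raw = override.partition("=")
    value = int(raw) if raw.isdigit() else raw
    node = config
    keys = key_path.split(".")
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config

cfg = {"training": {"max_epochs": 10, "lr": 0.001}}
apply_override(cfg, "training.max_epochs=1000")
print(cfg["training"]["max_epochs"])
```

So the list in Args/overrides is just a set of such assignments applied on top of the config before the Task runs.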
The reasoning is that simultaneous processes will most likely fail on the GPU due to the memory limit
Ok, but when nvcc is not available, the agent uses the output from nvidia-smi, right? On one of my machines, nvcc is not installed, and in the experiment logs of the agent running there, agent.cuda = is the version shown with nvidia-smi
Already added to the next agent version 🙂
Okay that might explain the issue...
MysteriousBee56 so what you are saying is python3 -m trains-agent --help does NOT work,
but trains-agent --help does work?
Hi @<1641611252780240896:profile|SilkyFlamingo57>
"It is not taking a new pull from the Git repository."
When you say it's not trying to get the latest, do you mean that on a new run of the pipeline the component being pulled is not pulling the latest from the branch? Is that the issue?
When you click on the component Task details (i.e. right hand side panel "Full details"), what's the commit ID you have?
Lastly, is the component running on the same machine as the prev...
Hmm is this similar to this one https://allegroai-trains.slack.com/archives/CTK20V944/p1597845996171600?thread_ts=1597845996.171600&cid=CTK20V944
Hmm I suspect the 'set_initial_iteration' does not change/store the state on the Task, so when it is launched, the value is not overwritten. Could you maybe open a GitHub issue on it?
The warning just lets you know the current process stopped and it is being launched on a remote machine.
What am I missing? Is the agent failing to run the job that you create manually?
(notice that when creating a job manually, there is no "execute_remotely", you just enqueue it, as it is not actually "running locally")
Make sense?
ShallowGoldfish8 how did you get this error? self.Node(**eager_node_def) TypeError: __init__() got an unexpected keyword argument 'job_id'
"Updates a few seconds ago"
That just means that the process is not dead.
Yes, that seemed to be stuck 🙂
Any chance you can verify with the RC version?
I'll try to dig into the commits, maybe I can come up with an explanation ...
Hi @<1523715429694967808:profile|ThickCrow29> , thank you for pinging!
We fixed the issue (hopefully) can you verify with the latest RC? 1.14.0rc0 ?
Hmm no, the publish is an internal "publish" state, it will not just "open" your server to the world (I guess that would be weird 🙂)
SmallDeer34 so maybe rerun it on the "free hosted" server and share it there?
(I'm assuming you do not intend to just open access to your own server)
if I encounter the need for that, I will adapt and open a PR
Great!
Yes you have to spin the server in order to generate the access/secret key...
BTW: you still can get race/starvation cases... But at least no crash
CheerfulGorilla72
upd: I see NAN in the tensorboard, and 0 in Clearml.
I have to admit, since NaNs are actually skipped in the graph, should we even log them?
Hi CooperativeFox72 trains 0.16 is out, did it solve this issue? (btw: you can upgrade trains to 0.16 without upgrading the trains-server)
Yes, I do have a GOOGLE_APPLICATION_CREDENTIALS environment variable set, but nowhere do we save anything to GCS. The only usage is in the code which reads from BigQuery
Are you certain you have no artifacts on GS?
Are you saying that if GOOGLE_APPLICATION_CREDENTIALS is set and clearml.conf contains no "project" section, it crashes when starting?
Exporter would be nice, I agree; not sure it is on the roadmap at the moment 🙂
Should not be very complicated to implement if you want to take a stab at it.
The class documentation itself is also there under "References" -> "Trains Python Package"
Notice that due to a bug in the documentation (we are working on a fix) the reference part is not searchable in the main search bar
It will also allow you to pass them to Hydra (either as overrides, or by directly editing the entire Hydra config)