After it finishes the 1st optimization task, what's the next job that will be pulled?
The one in the highest queue (if you have multiple queues)
If you use fairness, it will pull round-robin from all queues (obviously, inside every queue the order is based on the order of the jobs).
FYI, you can reorder the jobs inside the queue from the UI.
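A toy model of the two pulling policies described above. This is not the clearml-agent code: it assumes a static snapshot of the queues (no new jobs arriving) and that the queue list is ordered from highest to lowest priority. `pull_order` is a hypothetical helper.

```python
from collections import deque


def pull_order(queues, fairness=False):
    """Toy model of the order in which an agent pulls jobs.

    `queues` is a list of (queue_name, [job, ...]) pairs ordered from the
    highest-priority queue to the lowest. Inside each queue, jobs are FIFO.
    Assumes no new jobs arrive while pulling (a simplification).
    """
    # Copy each job list into a deque so the caller's lists are not mutated.
    pending = [(name, deque(jobs)) for name, jobs in queues]
    order = []
    if not fairness:
        # Strict priority: drain the highest non-empty queue before the next one.
        for _, jobs in pending:
            while jobs:
                order.append(jobs.popleft())
    else:
        # Fairness: take one job from each non-empty queue in turn (round-robin).
        while any(jobs for _, jobs in pending):
            for _, jobs in pending:
                if jobs:
                    order.append(jobs.popleft())
    return order


queues = [("high", ["h1", "h2"]), ("low", ["l1", "l2", "l3"])]
print(pull_order(queues))                 # ['h1', 'h2', 'l1', 'l2', 'l3']
print(pull_order(queues, fairness=True))  # ['h1', 'l1', 'h2', 'l2', 'l3']
```

In a live setup the high queue is re-checked before every pull, so a job pushed there later would still jump ahead under strict priority.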
DeliciousBluewhale87 wdyt?
requirements specified with git repo
You mean the requirements.txt is inside the git repo? Or do you mean a link to the git repo as part of the requirements?
Can you also provide an example of the content? I think I have an idea.
Hi @<1671689437261598720:profile|FranticWhale40>
You mean the download just fails on the remote serving node because it takes too long to download the model?
(Basically not a serving issue per se, but a download issue.)
I'm assuming some package imports absl (the TF define package), and that's the reason you see the TF defines. Does that make sense?
Okay, progress.
What are you getting when running the following from the git repo folder:
git ls-remote --get-url origin
BattyLion34, if everything is installed and used to work, what's the difference from the previous run that worked?
(You can compare the working vs. non-working runs in the UI and check the installed packages. It will highlight the diff; maybe the answer is there.)
but the requirement was already satisfied.
I'm assuming it is satisfied in the host Python environment. Do notice that the agent creates a new, clean venv for each experiment. If you are not running in docker-mode, then you ca...
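To check whether a requirement is satisfied in the environment that actually runs the code (rather than in the host Python), a small diagnostic like this could be printed at the top of the script. `describe_environment` is a hypothetical helper, the package name is only an example, and the venv check assumes a standard `venv` (or a recent virtualenv), where `sys.prefix` differs from `sys.base_prefix`.

```python
import sys
from importlib import metadata


def describe_environment(package):
    """Report which interpreter is running and whether `package` is installed in it."""
    # Inside a venv, sys.prefix points at the venv, while sys.base_prefix
    # points at the interpreter the venv was created from.
    in_venv = sys.prefix != sys.base_prefix
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # not installed in *this* environment
    return {
        "executable": sys.executable,
        "in_venv": in_venv,
        "package": package,
        "version": version,
    }


if __name__ == "__main__":
    print(describe_environment("numpy"))
```

If the host shows a version but the experiment's log shows `None`, the package is missing from the venv the agent built, not from the machine.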
Hi ElegantCoyote26
If there is, it will have to use docker-mode, but I do not think this is actually possible, because this is not a feature of docker. It is possible to do on k8s, but that's a different level of integration.
EDIT:
FYI, we do support k8s integration.
Hmm, can you share the log of the Task (the Task created by clearml-session)?
Could it be the Args section of the task it clones does not have the "input_train_data" argument ?
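One way to check this is to look at the template task's parameters. The sketch below assumes the flattened dict returned by clearml's `Task.get_parameters()`, with keys of the form `"Section/name"` (e.g. `"Args/input_train_data"`). `missing_args` is a hypothetical helper, and the clearml calls are shown only as comments.

```python
def missing_args(parameters, required, section="Args"):
    """Return the required argument names absent from a flattened parameter dict.

    `parameters` is assumed to look like the dict returned by clearml's
    Task.get_parameters(), with keys of the form "Section/name".
    """
    prefix = section + "/"
    present = {key[len(prefix):] for key in parameters if key.startswith(prefix)}
    return [name for name in required if name not in present]


# Assumed usage against a real template task (requires a ClearML server):
# from clearml import Task
# template = Task.get_task(task_id="<template task id>")
# print(missing_args(template.get_parameters(), ["input_train_data"]))

print(missing_args({"Args/batch_size": "64"}, ["input_train_data"]))
```

If `input_train_data` shows up as missing, cloning that task cannot override it, because the argument was never registered in the Args section of the original run.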
A few epochs is just fine
Should I update nodejs in the centos image?
I think so, it might have been forgotten
AstonishingRabbit13
https://github.com/googleapis/google-cloud-python/issues/4941#issuecomment-369472576
Check the OpenSSL version and the system date; this looks like a low-level SSL error (it happens even before authentication).
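A quick way to collect both values from the Python that runs the failing code. `ssl_diagnostics` is a hypothetical helper. The idea is to compare the printed UTC time against a trusted clock: a skewed clock can make valid certificates look expired or not yet valid, and a very old OpenSSL build may not support the server's TLS settings.

```python
import ssl
from datetime import datetime, timezone


def ssl_diagnostics():
    """Collect the OpenSSL build Python is linked against and the current UTC time."""
    return {
        # e.g. "OpenSSL 3.0.2 ..." (may read "LibreSSL ..." on some macOS builds)
        "openssl": ssl.OPENSSL_VERSION,
        # Compare this against a trusted clock to spot clock skew.
        "utc_now": datetime.now(timezone.utc).isoformat(),
    }


print(ssl_diagnostics())
```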
Hmm, it seems like everything is working. Can you check in the UI whether you see the serving session ID in the DevOps project? Maybe there are two, and you configured one and the docker-compose is running the other?
overrides -> "kubectl run --overrides"
template -> "kubectl apply -f template.yaml"
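To make the difference concrete, this sketch only builds the two kubectl command lines; it does not run them. With `--overrides`, an inline JSON patch is applied to the Pod that `kubectl run` generates. With a template, the full manifest lives in a file that `kubectl apply -f` reads. The pod name, image, and `gpu` node-selector label are hypothetical.

```python
import json


def kubectl_run_with_overrides(name, image, overrides):
    """Build a `kubectl run` command that patches the generated Pod via inline JSON."""
    return [
        "kubectl", "run", name,
        f"--image={image}",
        f"--overrides={json.dumps(overrides)}",
    ]


def kubectl_apply_template(path):
    """Build a `kubectl apply` command for a full manifest stored in a file."""
    return ["kubectl", "apply", "-f", path]


# Hypothetical override: schedule the pod on nodes labeled gpu=true.
overrides = {"apiVersion": "v1", "spec": {"nodeSelector": {"gpu": "true"}}}
print(kubectl_run_with_overrides("demo", "python:3.10", overrides))
print(kubectl_apply_template("template.yaml"))
```

Keeping each command as an argument list (e.g. for `subprocess.run`) avoids shell quoting problems with the JSON string.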
The bug was fixed.
Can you test with the credentials also set in the global section?
key: "************"
secret: "********************"
Also, what's the clearml Python package version?
Hi RipeGoose2
You could also report them with report_table. What do you think?
https://github.com/allegroai/clearml/blob/master/examples/reporting/pandas_reporting.py
https://github.com/allegroai/clearml/blob/9ff52a8699266fec1cca486b239efa5ff1f681bc/clearml/logger.py#L277
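A sketch of sending many entries as a single table through the `report_table` method linked above, instead of as separate values. `report_rows_as_table` is a hypothetical helper. It assumes the installed clearml version accepts a list of rows for `table_plot`, with the first row used as the header. With older versions, pass a pandas DataFrame instead, as in pandas_reporting.py.

```python
def report_rows_as_table(logger, title, series, header, rows, iteration=0):
    """Send a list of rows to ClearML as one table.

    `logger` is expected to be a clearml Logger (e.g. Logger.current_logger()).
    Assumes table_plot accepts a list of rows whose first row is the header.
    """
    table = [list(header)] + [list(row) for row in rows]
    logger.report_table(
        title=title,
        series=series,
        iteration=iteration,
        table_plot=table,
    )
    return table


# Assumed usage inside a running task:
# from clearml import Logger
# report_rows_as_table(Logger.current_logger(), "summary", "final",
#                      ["name", "value"], [("lr", 0.001), ("epochs", 10)])
```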
the time taken to upload halved. It is puzzling because as you say it's not that much to upload.
Maybe it was the load on the server? That is, handling multiple requests at the same time delayed them?
For now I've whittled down the number of entries to a more select but useful few, and that has solved the issue. If it crops up again, I will try connect_configuration properly.
Thanks for your help!
My pleasure.
Done.
PompousBeetle71, quick question: will you ever want to pass an empty string? The reason I'm asking is that it is either one or the other; there is no way for Trains to actually differentiate (from the web UI perspective, this is just an empty string field...)
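The ambiguity can be illustrated with a pair of hypothetical helpers (not Trains code). Once both `None` and `""` are shown as the same empty text field, reading the field back has to pick one meaning, so the choice is a convention rather than something recoverable from the UI.

```python
def to_ui_value(value):
    """Serialize a parameter for a text field: None and "" both become ""."""
    return "" if value is None else str(value)


def from_ui_value(text, empty_means_none=True):
    """Parse a text field back; an empty field maps to a single chosen meaning."""
    if text == "":
        return None if empty_means_none else ""
    return text


# Both inputs collapse to the same field value, so the round trip loses information.
print(to_ui_value(None) == to_ui_value(""))  # True
```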
SmallDeer34, I have to admit this reference is relatively old; maybe we should update it to http://clearml.ml (would that make sense?)
We just don't want to pollute the server when debugging.
Why not?
You can always remove it later (with Task.delete).
That should have worked. Do you want to send the log?
restart_period_sec
I'm assuming development.worker.report_period_sec, correct?
The configuration does not seem to have any effect, scalars appear in the web UI in close to real time.
Let me see if we can reproduce this behavior and fix it quickly.
I'm with on this one. It's better to make a company-wide decision on these things and not allow too much flexibility (just two options to choose from should be enough, I think).
Hi JollyChimpanzee19
What are the versions (clearml, TF, PT)? Also, could you add one more line from the stack trace (i.e., which call triggered the exception)?
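The installed versions can be collected from the same environment that raised the exception. This sketch uses `importlib.metadata` and assumes the usual distribution names (`clearml`, `tensorflow`, `torch`). Variants such as `tensorflow-cpu` would need to be added to the list. `package_versions` is a hypothetical helper.

```python
from importlib import metadata


def package_versions(names=("clearml", "tensorflow", "torch")):
    """Return {distribution name: installed version, or None if missing}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in package_versions().items():
        print(f"{name}: {version or 'not installed'}")
```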