CloudyHamster42 what's the trains-server version ?
(I am not an expert on UI to be honest)
Same here 🙂 lol
we can implement this externally
What do you mean by that?
I'll make sure we have conda ignore git:// packages, and pass them to the second pip stage.
Hi CluelessFlamingo93
I think the latest clearml-agent 1.5.1 fixed that issue (this is basically pip trying to "protect" you from mismatched packages)
can you upgrade your clearml-agent and test?
pip3 install clearml-agent==1.5.1
The package detection is done when running the code on your laptop, and this is when it first logs the packages and versions. Following that, what do you have on your laptop? OS / Conda / Python?
Hi CloudySwallow27
This error occurs randomly during training (in other words training does successfully start).
What's the clearml-agent version you are using, and the clearml version?
SillyPuppy19 yes you are correct, actually I can promise you the callback will be called from a different thread (basically the monitoring thread), so it's on the user to make sure the callback can handle it.
How about we move this discussion to GitHub?
upload_artifact
will actually do two things:
upload the file to the trains-server
register it as an artifact on the experiment
What did you mean by "register the artifact manually"? You still need to upload the file to the trains-server (so it is later accessible).
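For reference, a minimal usage sketch (the project, task and file names below are just placeholders):
from clearml import Task
task = Task.init(project_name='examples', task_name='artifact demo')
# uploads the local file to the server and registers it as an artifact on this experiment
task.upload_artifact(name='dataset', artifact_object='/tmp/local_dataset.csv')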
PunySquid88 do you want to test a fix?
Anyhow, if the StorageManager.upload was fast, upload_artifact calls that exact function, so I don't think we actually have an issue here. What do you think?
Do you have a specific numpy version you are installing? And why is it trying to build the wheel from source?
maybe I should use explicit reporting instead of Tensorboard
It will do just the same 🙂
there is no method for setting last iteration, which is used for reporting when continuing the same task. Maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass th...
DrabCockroach54 that is quite cool!
Basically here is what I would do
Query Tasks that are both Running and do not have the system tag "development" (which means they are running on agents), and filter only tasks that started, say, 10 min ago.
Go over the list and check whether (1) they have a GPU scalar reported (meaning the GPU is accessible) and (2) the min/max/last value of GPU utilization is under 5%.
wdyt? (rough sketch below)
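A rough sketch of the idea, the filter keys and scalar titles here are from memory so please double check them against your server:
from clearml import Task

# running Tasks without the "development" system tag, i.e. Tasks executed by agents
# (filtering on the start time is omitted here for brevity)
tasks = Task.get_tasks(
    task_filter={
        'status': ['in_progress'],
        'system_tags': ['-development'],
    }
)
for t in tasks:
    scalars = t.get_reported_scalars()
    gpu = scalars.get(':monitor:gpu', {})
    if not gpu:
        print(t.id, 'no GPU scalars reported - GPU probably not accessible')
        continue
    for series, values in gpu.items():
        if 'utilization' in series and values['y'] and max(values['y']) < 5:
            print(t.id, series, 'GPU utilization under 5%')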
None of them is problematic, this is what I'm trying to say 🙂
I think the minio browser gets confused.
if you want to test the upload time on the client you can try:
from time import time
task.flush(wait_for_uploads=True)
tic = time()
task.upload_artifact('test', '/tmp/localfile')
task.flush(wait_for_uploads=True)
print(time() - tic)
GrievingTurkey78 Actually it is in progress, see the GitHub issue for details:
https://github.com/allegroai/trains/issues/219
Thanks SparklingHedgehong28
So I think I'm missing information on what you call "Instance protection"?
You mean like re-spinning spot instances? Or is it a way to review the performance of the AWS ASG (i.e. like a watchdog of sorts)?
Oh no, you are absolutely correct, it is broken (I mean I have no idea why it lists Hydra, or how it got there). I will let the guys know and fix it.
Bottom line, after you clone it, please edit the installed packages and remove the "Hydra" line and replace with just "hydra-core" (no need for version).
The format is the same as "requirements.txt" and will affect the venv created by the agent.
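For example, an "installed packages" entry like the one listed in this case:
hydra==2.5
should simply become:
hydra-core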
With pleasure, I'll make sure we officially release RC1 soon :)
I think I found something, let me dig deeper 🙂
I see the problem now: conda is failing to install the package from git, then it reverts to pip install, and pip just fails... " //github.com/ajliu/pytorch_baselines "
Wait, it shows "hydra==2.5" not "hydra-core==x.y" ?
Hi JumpyPig73 , I think it was synced to GitHub. You can already test with: pip install git+https://github.com/allegroai/clearml.git
It seems something is wrong with the server itself...
Please send the full log, I just tested it here, and it seems to be working
I'm getting:
hydra_core == 1.1.1
What's the setup you have? python version, OS, Conda yes/no?
It just seems frozen at the place where it should be spinning up the tasks within the pipeline
And is there an agent for those? Usually there is one agent for running logic tasks (like pipelines), running with --services-mode, which means multiple Tasks can be executed by the same agent, and other agents for compute Tasks, which are a single Task per agent (but you can run multiple agents on the same machine).
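Something along these lines, the queue names are just placeholders and the exact flags depend on your setup:
# one agent dedicated to logic/pipeline tasks, running many of them concurrently
clearml-agent daemon --queue services --services-mode --detached
# a regular agent for compute tasks, one Task at a time
clearml-agent daemon --queue default --gpus 0 --detached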
Unfortunately not, the queues tab shows only the number of tasks in the queue, but not the resources used.
Oh, yes, that makes sense to add, I like that 🙂
(the main question is what data is there in the backend DBs, let me know what I can get)
SteadySeagull18 btw: in the post-callback, node.job will be completed, because it is called after the Task is completed.
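i.e. something along these lines, where pipe is assumed to be an existing PipelineController and the callback signature should be verified against the PipelineController docs:
def post_step_callback(pipeline, node):
    # by the time this runs the step's Task has finished,
    # so node.job refers to an already-completed job
    print('step', node.name, 'done')

pipe.add_step(
    name='train',
    base_task_project='examples',
    base_task_name='train model',
    post_execute_callback=post_step_callback,
)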