curl seems okay, but this is odd https://<IP>:8010
it should be http://<IP>:8008
Could you change and test?
(meaning change the trains.conf and run trains-agent list)
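For reference, a rough sketch of the relevant trains.conf section (the IP is a placeholder, the ports are the server defaults):

api {
    # trains-server API service (the one trains-agent list talks to)
    api_server: http://<IP>:8008
    # trains-server web UI
    web_server: http://<IP>:8080
    # trains-server file storage
    files_server: http://<IP>:8081
}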
SuperiorDucks36 from code? or from the UI?
(You can always clone an experiment and change the entire thing, the question is how will you get the data to fill in the experiment, i.e. repo / arguments / configuration etc)
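If you go the code route, a rough sketch of how that could look (the task id, parameter name and queue name are placeholders):

from trains import Task

# clone an existing experiment to use as a template
cloned = Task.clone(source_task='<template_task_id>', name='cloned experiment')
# fill in whatever needs to change, e.g. hyper-parameters
cloned.set_parameters({'batch_size': 64})
# then send it to an agent queue for execution
Task.enqueue(cloned, queue_name='default')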
There is a discussion here, I would love to hear another angle.
https://github.com/allegroai/trains/issues/230
Hey SarcasticSparrow10, see here:
https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_linux_mac.html#upgrading
So obviously that is the problem
Correct.
ShaggyHare67 how come the "installed packages" are now empty?
They should be automatically filled when executing locally?!
Any chance someone mistakenly deleted them?
Regarding the python environment: trains-agent creates a new clean venv for every experiment. If you need, you can set in your trains.conf:
agent.package_manager.system_site_packages: true
https://github.com/allegroai/trains-agent/blob/de332b9e6b66a2e7c67...
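For clarity, this is roughly how it looks inside trains.conf (section layout assumed from the agent's default config):

agent {
    package_manager {
        # let the per-experiment venv inherit the system site-packages
        system_site_packages: true
    }
}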
ShaggyHare67
Now the trains-agent is running my code but it is unable to import trains ...
What you are saying is that you spin the trains-agent inside a docker, but in venv mode?
On the server I have both python (2.7) and python3,
Hmm, make sure that you run the agent with python3 trains-agent; this way it will use python3 for the experiments.
ShaggyHare67 could you send the console log trains-agent
outputs when you run it?
Now the trains-agent is running my code but it is unable to import trains
Do you have the package "trains" listed under "installed packages" in your experiment?
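If it is missing, one way to make sure it gets listed is to add it explicitly before Task.init, e.g. (a sketch; project/task names are placeholders):

from trains import Task

# explicitly add the package to the experiment's "installed packages"
Task.add_requirements('trains')
task = Task.init(project_name='examples', task_name='my experiment')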
CrookedWalrus33 I'm testing with the latest RC on a local minio and this is what I'm getting:
clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_3by281j8.tmp => 10.99.0.188:9000/bucket/debug/PyTorch MNIST train.8b6edc440cde4469b82e6da17e74c952/models/mnist_cnn.tar
clearml.Task - INFO - Waiting to finish uploads
clearml.Task - INFO - Completed model upload to
MNIST train.8b6edc440cde4469b82e6da17e74c952/models/mnist_cnn.tar
clearml.Task - INFO - Finished uploading
e...
The easiest is export_task / update_task:
https://allegro.ai/docs/task.html#trains.task.Task.export_task
https://allegro.ai/docs/task.html#trains.task.Task.update_task
Check the structure returned by export_task, you'll find the entire configuration text there;
then you can use that to update the Task back.
BTW:
Partial update is also supported...
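A quick sketch of the flow (the task id and the edited field are just placeholders):

from trains import Task

task = Task.get_task(task_id='<task_id>')
# full task structure as a dict (execution, script, configuration, etc.)
task_data = task.export_task()
# edit whatever you need, then push it back
task_data['comment'] = 'updated via export/update'
task.update_task(task_data)
# partial update also works:
task.update_task({'comment': 'partial update'})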
I see, is this what you are looking for?
https://allegro.ai/docs/task.html#trains.task.Task.init
continue_last_task='task_id'
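i.e. something along these lines (project/task names and the task id are placeholders):

from trains import Task

# reuse a previously executed task and keep reporting into it
task = Task.init(project_name='examples', task_name='my task',
                 continue_last_task='<task_id>')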
They all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work
You should not have this entry in the conf file; the "worker_id" should be unique (and is based on the "worker_name" as a prefix). You can control it via environment variables: CLEARML_WORKER_ID
EnviousStarfish54 you can use Task.set_credentials
Notice that OS environment or trains.conf will override the programmatic credentials
https://allegro.ai/docs/task.html#trains.task.Task.set_credentials
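Roughly like this (hosts and keys are placeholders; it has to run before Task.init):

from trains import Task

Task.set_credentials(
    api_host='http://<server>:8008',
    web_host='http://<server>:8080',
    files_host='http://<server>:8081',
    key='<access_key>',
    secret='<secret_key>',
)
task = Task.init(project_name='examples', task_name='credentials example')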
WickedGoat98 no need to open any ports on the agent's machine, the agent is polling the clearml-server, so as long as it can reach it, we are good.
EnviousStarfish54 Sure, see scatter2d
https://allegro.ai/docs/examples/reporting/scatter_hist_confusion_mat_reporting/#2d-scatter-plots
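A minimal sketch based on that example (names and values are arbitrary):

import numpy as np
from trains import Task

task = Task.init(project_name='examples', task_name='2D scatter reporting')
logger = task.get_logger()
scatter = np.random.randint(10, size=(10, 2))
logger.report_scatter2d(
    title='example_scatter', series='series_xy', iteration=0,
    scatter=scatter, xaxis='x units', yaxis='y units')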
That's why I want to keep it as separate tasks under a single pipeline.
Hmm, yes, if this is the case then you definitely have to have two Tasks (with execution info on each one).
So you could just create a "draft" pipeline Task and report everything to it? Does that make sense ?
(By design a pipeline is in charge of spinning the Tasks and pulling the data/metric from them if needed, in your case it sounds like you need the Tasks to push the data/metric onto the pipeline Task, this is ...
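As a sketch of what I mean by pushing onto the pipeline Task, assuming the child tasks know the pipeline Task id (all ids/names are placeholders):

from trains import Task

# inside a child task: get a handle on the shared "pipeline" task
pipeline_task = Task.get_task(task_id='<pipeline_task_id>')
# report the metric onto the pipeline task instead of the current one
pipeline_task.get_logger().report_scalar(
    title='pipeline summary', series='step_1_accuracy', value=0.93, iteration=0)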
YEY
Hi GrievingTurkey78,
Yes, this is a per-file download, but I think you can list the bucket and download everything.
Try:
from trains import StorageManager
from trains.storage.helper import StorageHelper

helper = StorageHelper.get('gs://bucket/folder')
remote_files = helper.list('*')
for f in remote_files:
    StorageManager.get_local_copy(f)
You might need to play around a bit; it might be that you need StorageHelper.get('gs://bucket') and then helper.list('folder/*')
Let me know what worked
This would be my only improvement, otherwise awesome!!!
output_model.update_weights(
    weights_filename=os.path.join(training_data_path, 'runs', 'train', 'yolov5s6_results', 'weights', 'best.onnx'))
Maybe we should add it to Storage Manager? What do you think?
HandsomeCrow5 Seems like the right place would be in the artifacts, as a summary of the experiment (as opposed to ongoing reporting), is that the case?
If it is, then in the Artifacts tab clicking on the artifact should open another tab with your summary, which sounds like what you were looking for (with the exception of the preview thumbnail)
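For example, uploading a summary object as an artifact could look like this (names and values are placeholders):

import pandas as pd
from trains import Task

task = Task.init(project_name='examples', task_name='experiment with summary')
summary = pd.DataFrame({'metric': ['accuracy', 'loss'], 'value': [0.92, 0.13]})
# shows up under the experiment's Artifacts tab
task.upload_artifact('run summary', artifact_object=summary)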
Hi MelancholyBeetle72, that's a very interesting case. I can totally understand how storing a model and then immediately renaming it breaks the upload. A few questions: is there a way for pytorch lightning not to rename the model? Also, I wonder if this scenario (storing a model and then changing it) happens a lot. I think the best solution is for Trains to create a copy of the file and upload it in the background. That said, the name will still end with .part. What do you think?
DeliciousBluewhale87
Upon ssh-ing into the folders in the both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there..
Hmm that means it is working...
Do you see any *.conf files there? What do they contain? (do they point to the correct clearml-server config?)
MelancholyBeetle72 thanks! I'll see if we could release an RC with a fix soon, for you to test :)
MelancholyBeetle72 there is an RC with a fix, check the GitHub issue for details :)
That is quite neat! You can also put a soft link from the main repo to the submodule for better visibility
WickedGoat98 did you set up a machine with trains-agent pulling from the "default" queue?
Hi DeliciousBluewhale87
When you say "workflow orchestration", do you mean like a pipeline automation ?