
Try to spin up the instance of that type manually in that region to see if it is available
I am still confused though - from the Get Started page of the PyTorch website, when choosing "conda", the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (the CUDA runtime)?
I also did run sudo apt install nvidia-cuda-toolkit
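For reference, this is how I am checking what the installed wheel actually ships with (assuming a recent torch build):

    import torch

    # If the wheel bundles the CUDA runtime, torch reports a CUDA version even
    # without a system-wide cudatoolkit (a matching NVIDIA driver is still required)
    print(torch.__version__)          # e.g. '1.10.0+cu113' for a CUDA 11.3 wheel
    print(torch.version.cuda)         # CUDA runtime version the wheel was built with
    print(torch.cuda.is_available())  # True only with a working driver + GPU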
Indeed, I actually had the old configuration format that was not JSON - I converted it to JSON, and now it works 🙂
As a quick fix, can you test with auto refresh (see the top-right button with the pause sign in the video)?
That doesn’t work unfortunately
Thanks AgitatedDove14 ! I created a project with a default output destination pointing to an S3 bucket, but I don't have local access to this bucket (only agents have access to it, for security reasons). Because of that, I cannot create a task in this project programmatically from my local machine, since it tries to access the bucket and fails. And there is no easy way to change the default output location (not in the web UI, not in the SDK)
Will it freeze/crash/break/stop the ongoing experiments?
And after the update, the loss graph appears
Looking at the source code, it seems like I should call:
data_processing_task._artifact_manager.flush()
to make sure I have the latest version of the artifacts in the task, right?
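For context, the full flow I mean looks like this (a sketch; "<task-id>" is a placeholder, and _artifact_manager is a private attribute so it may break between SDK versions):

    from clearml import Task

    data_processing_task = Task.get_task(task_id="<task-id>")  # placeholder ID
    data_processing_task.upload_artifact("stats", artifact_object={"rows": 1000})
    # Block until pending artifact uploads are registered on the task
    data_processing_task._artifact_manager.flush()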
I would let the trains team answer this in detail, but as a user moving from MLflow to trains, I can share the following insights:
MLflow and trains overlap when it comes to having a system with a nice web UI to compare/log experiments/models/metrics. But MLflow lacks a crucial feature IMO, which is ML/DevOps: using MLflow, you have to take care of the whole maintenance of your machines, design the interactions between them, etc. This is where trains shines, it provides these features out-of-the-box.
So two possible cases for trains-agent-1: either:
- It picks a new experiment -> shows randomly one of the two experiments in the "workers" tab
- There is no new experiment in the default queue to start -> shows randomly no experiment, or the one that it is running
We would be super happy to have the possibility of documenting experiments (new tab in experiments UI) with a markdown editor!
AgitatedDove14 Same problem with clearml==1.1.5rc2
😞 I also tried with backend==gloo, still the same problem
I can probably have a Python script that checks whether there are any tasks running or pending; if not, it runs docker-compose down to stop the clearml-server, then uses boto3 to trigger the creation of a snapshot of the EBS volume, waits until it is finished, and then restarts the clearml-server. Wdyt?
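Something like this is what I have in mind (a minimal sketch; the volume ID, compose path, and task statuses are assumptions):

    import subprocess
    import boto3
    from clearml import Task

    VOLUME_ID = "vol-0123456789abcdef0"   # placeholder EBS volume ID
    COMPOSE_DIR = "/opt/clearml"          # assumed location of docker-compose.yml

    def tasks_in_flight() -> bool:
        # Status names are an assumption - adjust to the server's actual states
        return bool(Task.get_tasks(task_filter={"status": ["queued", "in_progress"]}))

    if not tasks_in_flight():
        subprocess.run(["docker-compose", "down"], check=True, cwd=COMPOSE_DIR)
        ec2 = boto3.client("ec2")
        snapshot = ec2.create_snapshot(VolumeId=VOLUME_ID, Description="clearml-server backup")
        # Block until AWS reports the snapshot as completed
        ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])
        subprocess.run(["docker-compose", "up", "-d"], check=True, cwd=COMPOSE_DIR)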
I am not sure what you mean by unless the domain is different
Personal Access Tokens are designed such that, to allow cloning a private repo, the user has to give the PAT full access to repos, including public repos. So it should also work with all other git repos
I mean that I have a taskA (controller) that is in charge of creating a taskB with the same argv parameters (I just change the entry point of taskB)
I mean, inside a parent, do not show the project [parent] if there is nothing inside
I finally found a workaround using cache, will detail the solution in the issue 👍
Yes, actually that's what I am doing, because I have a task C depending on tasks A and B. Since a task cannot have two parents, I retrieve one task ID (task A) as the parent ID and the other one (the ID of task B) as a hyper-parameter, as you described 👍
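Concretely, the pattern looks something like this (a sketch; the project/task names and the task_b_id parameter name are illustrative, and set_parent assumes a recent clearml SDK):

    from clearml import Task

    task_a = Task.get_task(project_name="examples", task_name="task A")
    task_b = Task.get_task(project_name="examples", task_name="task B")

    task_c = Task.init(project_name="examples", task_name="task C")
    task_c.set_parent(task_a.id)        # the single parent slot holds task A
    params = {"task_b_id": task_b.id}   # task B's ID travels as a hyper-parameter
    task_c.connect(params)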
Should I try to disable dynamic mapping before doing the reindex operation?
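For reference, something like this is what I mean (a sketch against ES7 using the REST API; the endpoint, index names, and mapping are placeholders):

    import requests

    ES = "http://localhost:9200"  # placeholder Elasticsearch endpoint

    # Create the target index with dynamic mapping disabled, then reindex into it
    requests.put(f"{ES}/events_v2", json={"mappings": {"dynamic": False}})
    requests.post(f"{ES}/_reindex", json={
        "source": {"index": "events"},
        "dest": {"index": "events_v2"},
    })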
Yes I did, I found the problem: docker-compose was using trains-server 0.15 because it didn't see the new version of trains-server. Hence I had trains-server 0.15 running with ES7.
-> I deleted all the containers and it successfully pulled trains-server 0.16. Now everything is running properly 🙂
awesome 🎉
Maybe then we can extend task.upload_artifact?

    def upload_artifact(..., wait_for_upload: bool = False):
        ...
        if wait_for_upload:
            self.flush(wait_for_uploads=True)
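Usage would then look like this (a sketch of the proposed API, not the current one):

    from clearml import Task
    import pandas as pd

    task = Task.init(project_name="examples", task_name="artifact demo")
    df = pd.DataFrame({"a": [1, 2, 3]})
    # Proposed behavior: return only once the artifact has actually been uploaded
    task.upload_artifact("dataset", artifact_object=df, wait_for_upload=True)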
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False}
seems to have a positive impact - it is running now, I will confirm in a bit
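For reference, the init call looks like this (project/task names are placeholders):

    from clearml import Task

    # Disable matplotlib and joblib auto-logging while keeping the rest enabled
    task = Task.init(
        project_name="examples",
        task_name="training",
        auto_connect_frameworks={"matplotlib": False, "joblib": False},
    )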