@<1699955693882183680:profile|UpsetSeaturtle37> good progress. Regarding the error, 0.15.0 is supposed to be out tomorrow; it includes a fix for that one.
BTW: can you run with --debug
do you have a video showing the use case for clearml-session
I totally think we should, I'll pass it along 🙂
what is the difference between vscode via clearml-session and vscode via remote ssh extension ?
Nice! Remote vscode is usually thought of as SSH: basically you have your vscode running on your machine, and using SSH vscode automatically connects to the remote machine.
Clearml-Session also adds a new capability, VSCode inside your browser, where the VSCode itself as well...
So you have two options
- Build the container from your Dockerfile and push it to your container registry. Notice that if you built it on the machine with the agent, that machine can use it as the Task's base container.
- Use the FROM container as the Task's base container and have the rest as the docker startup bash script. Wdyt?
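To make the two options concrete, here is a rough sketch; the registry URL, image name, and the packages in the startup script are all placeholders, not anything from the original thread:

```shell
# Option 1: build from your Dockerfile, push to your registry,
# then set the pushed image as the Task's base container.
docker build -t registry.example.com/ml/base:latest .
docker push registry.example.com/ml/base:latest

# Option 2: keep only the FROM image as the Task's base container,
# and move the remaining Dockerfile steps into the agent's
# "docker startup bash script", e.g. something like:
#   apt-get update && apt-get install -y git
#   pip install -r requirements.txt
```

Option 2 avoids maintaining a registry, at the cost of re-running the setup steps on every container start.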
Can you please elaborate on the latter point? My jupyterhub's fully containerized and allows users to select their own containers (from a list I built) at launch, and launch multiple containers at the same time, not sure I follow how toes are stepped on.
Definitely a great start. Usually it breaks on memory / GPU-mem, where too many containers on the same machine are eating each other's GPU RAM (which cannot be virtualized).
@<1699955693882183680:profile|UpsetSeaturtle37> can you try with the latest clearml-session (0.14.0) I remember a few improvements there
The remote machine is in Azure behind the load-balancer, we are using docker images, so directly connecting to pods.
yeah, an LB in the middle might be introducing SSH hiccups. First upgrade to the latest clearml-session; it better configures the SSH client/server to support longer connection timeouts. If that does not work, try --keepalive=true
Le...
Hi @<1727497172041076736:profile|TightSheep99>
I think you are correct! it will use the internal individual file upload retry but does not let you control it.
Could you please open a github issue so that we do not forget to add it?
Hi @<1545216070686609408:profile|EnthusiasticCow4>
My biggest concern is what happens if the TaskScheduler instance is shutdown.
good question, follow up, what happens to the cron service machine if it fails?!
TaskScheduler instance is shutdown.
And yes, you are correct: if someone stops the TaskScheduler instance,
it is the equivalent of stopping the cron service...
btw: we are working on moving some of the cron/triggers capabilities to the backend , it will not be as flexi...
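For context, a minimal sketch of what such a scheduler setup might look like; the project/queue names, task id, and schedule are hypothetical, and the exact TaskScheduler arguments may differ between clearml versions:

```python
from clearml.automation import TaskScheduler

# Hypothetical example: clone and enqueue an existing template Task
# every day at 09:30. 'template_task_id' and 'services' are placeholders.
scheduler = TaskScheduler()
scheduler.add_task(
    schedule_task_id='template_task_id',  # Task to clone and enqueue
    queue='services',                     # queue the clone is pushed to
    minute=30,
    hour=9,
    recurring=True,
)

# Blocks and keeps scheduling. If this process (or its machine) dies,
# scheduling stops -- exactly the cron-service failure mode discussed above.
scheduler.start()
```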
BTW: we are now adding "datasets chunks for a more efficient large dataset storage"
Hi @<1564785037834981376:profile|FrustratingBee69>
It's the previous container I've used for the task.
Notice that what you are configuring is the Default container, i.e. if the Task does not "request" a specific container, then this is what the agent will use.
On the Task itself (see Execution tab, down below "Container Image") you set the specific container for the Task. After you execute the Task on an agent, the agent will record there the container it ended up using. This means that ...
Hmm, if this is the case, you can add some prints here:
None
the service/action will tell you what you are sending
wdyt?
Hi PungentLouse55 ,
I think I can see how these magic lines solved it, and I think you are onto something.
Any chance what happened is multiple workers were trying to simultaneously save/load the same Model ?
I can add files to the data set, even after I finish the experiment?
Correct
https://clear.ml/docs/latest/docs/clearml_data#creating-a-dataset
https://clear.ml/docs/latest/docs/guides/data%20management/data_man_cifar_classification
https://github.com/allegroai/clearml/blob/master/docs/datasets.md#create-dataset-from-code
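As a rough sketch of that flow (project/dataset names and the path are placeholders), adding files after the fact usually means creating a new dataset version on top of the finalized one:

```python
from clearml import Dataset

# Hypothetical names/paths. The new version is created as a child of the
# already-finalized dataset, so you can add files after the experiment ended.
parent = Dataset.get(dataset_project='examples', dataset_name='my_dataset')
child = Dataset.create(
    dataset_project='examples',
    dataset_name='my_dataset',
    parent_datasets=[parent.id],
)
child.add_files(path='new_files/')  # only the delta vs. the parent is stored
child.upload()
child.finalize()
```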
Anyhow, from your response, is it safe to assume that mixing in code with the core ML task code has not occurred to you as something problematic to start with?
Correct 🙂 Actually we believe it makes it easier, as worst case scenario you can always run clearml "offline" without the need for the backend, and later, if needed, you can import that run.
That said, regarding (3), the "mid" interaction is always the challenge; clearml will do the auto tracking/upload of the mod...
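A minimal sketch of that offline round-trip (project/task names and the session path are placeholders):

```python
from clearml import Task

# Run with no backend at all: everything is stored locally.
Task.set_offline(offline_mode=True)
task = Task.init(project_name='examples', task_name='offline run')
task.get_logger().report_scalar('loss', 'train', value=0.5, iteration=1)
task.close()  # the run is saved to a local session folder

# Later, with a backend available, import that run:
# Task.import_offline_session('/path/to/session/folder.zip')
```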
UI for some anomalous file,
Notice the metrics are not files/artifacts, just scalars/plots/console
we concluded that we don't want to run it through ClearML after all, so we ran it standalone
out of curiosity, what was the conclusion and why?
Thanks BroadSeaturtle49
I think I was able to locate the issue: "!=" breaks the pytorch lookup.
I will make sure we fix asap and release an RC.
BTW: how come 0.13.x has no linux x64 support? And the same for 0.12.x:
https://download.pytorch.org/whl/cu111/torch_stable.html
CooperativeSealion8 let me know if you managed to solve the issue, also feel free to send the entire trains-server log. I'm assuming one of the dockers failed to boot...
What's the OS running the server?
Seems like something is not working with the server, i.e. it cannot connect with one of the dockers.
May I suggest carefully going through all the steps here, making sure nothing was missed:
https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md
Especially number (4)
CooperativeSealion8
when it first asks me to enter my full name
Where? in the Web?
You can do:
    task = Task.get_task(task_id='uuid_of_experiment')
    task.get_logger().report_scalar(...)
Now the only question is who will create the initial Task, so that the others can report to it. Do you have like a "master" process ?
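A sketch of that pattern, assuming a "master" process that creates the Task once and workers that attach to it by id (all names and values here are placeholders):

```python
from clearml import Task

# Master process: creates the Task once and shares its id with the workers.
master = Task.init(project_name='examples', task_name='shared metrics')
shared_id = master.id

# Each worker process: attach to the same Task and report into it.
worker_task = Task.get_task(task_id=shared_id)
worker_task.get_logger().report_scalar(
    title='loss', series='worker_0', value=0.42, iteration=1
)
```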
Yes, as long as the client is served from http://app.something.com it will look for the api server at http://api.something.com
We should probably make sure it is properly stated in the documentation...
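The convention is just a sibling-subdomain swap. Purely as an illustration (this helper is not part of clearml), the lookup amounts to:

```python
from urllib.parse import urlparse, urlunparse

def sibling_subdomain(url: str, new_label: str) -> str:
    """Replace the first host label (e.g. 'app') with another (e.g. 'api')."""
    parts = urlparse(url)
    labels = parts.netloc.split('.')
    labels[0] = new_label
    return urlunparse(parts._replace(netloc='.'.join(labels)))

print(sibling_subdomain('http://app.something.com', 'api'))
```

The same swap with 'files' would yield the fileserver address under this convention.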
Yes, actually the first step would be a toggle button for regexp in the search, the second will be even more advanced search.
May I suggest you post it on the UI suggestion issue https://github.com/allegroai/trains/issues/81 ?