Is there any known issue with Amazon SageMaker and ClearML?
On the contrary it actually works better on Sagemaker...
Here is what I did on SageMaker:
- Created a new SageMaker instance
- Opened Jupyter Notebook
- Started a new notebook (conda_python3 / conda_py3_pytorch)
- In it, I just ran "!pip install clearml" and then Task.init
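For reference, the notebook steps above come down to something like the sketch below. The wrapper function and the project/task names in the usage note are made up for illustration; it also assumes ClearML credentials are already configured on the instance.

```python
def start_tracking(project_name: str, task_name: str):
    """Create a ClearML Task for the current notebook session."""
    # Imported lazily so this cell can be defined before
    # "!pip install clearml" has been run in the notebook.
    from clearml import Task

    return Task.init(project_name=project_name, task_name=task_name)
```

In a notebook cell this would be called as, for example, task = start_tracking("sagemaker-test", "notebook run"), where both names are placeholders.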
Is there any difference ?
DeterminedToad86 were you running a jupyter notebook or a jupyter console ?
Yey!
My pleasure
I mean clone the Task in the UI (right-click, Clone), then go to the Execution tab, to the "installed packages" section, click Edit, find the torchvision http link, replace it with torchvision == 0.7.0, and save.
Then right-click and enqueue the Task (to the default queue) and see if the Agent can run it.
DeterminedToad86 Make sense ?
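To make the "installed packages" edit concrete, here is a minimal sketch of the same text replacement done in plain Python. It is not part of ClearML; the pin_package helper and the wheel URL used in the usage note are hypothetical.

```python
import re
from urllib.parse import urlparse


def pin_package(requirements: str, name: str, version: str) -> str:
    """Replace any line installing `name` (plain spec or wheel URL) with `name == version`."""
    wanted = name.lower().replace("_", "-")
    result = []
    for line in requirements.splitlines():
        stripped = line.strip()
        if stripped.startswith(("http://", "https://")):
            # Wheel file names start with the distribution name, followed by "-".
            filename = urlparse(stripped).path.rsplit("/", 1)[-1]
            found = filename.split("-", 1)[0]
        else:
            # Plain requirement: the name is everything before the version operator.
            match = re.match(r"[A-Za-z0-9_.\-]+", stripped)
            found = match.group(0) if match else ""
        if found.lower().replace("_", "-") == wanted:
            result.append(f"{name} == {version}")
        else:
            result.append(line)
    return "\n".join(result)
```

Given a requirements text containing a line such as https://download.pytorch.org/whl/cu101/torchvision-0.7.0%2Bcu101-cp36-cp36m-linux_x86_64.whl (an illustrative URL), pin_package(text, "torchvision", "0.7.0") swaps that line for torchvision == 0.7.0 and leaves the others untouched.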
LOL, okay, I'm not sure we can do something about that one.
You should probably increase the storage on your instance
BTW: from the instance name it seems like it is a VM with preinstalled pytorch. Why don't you add system site packages, so the venv will inherit all the preinstalled packages? It might also save some space
DeterminedToad86 see here:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L55
Change it in the agent's conf file to: system_site_packages: true
It's the correct way to do it, right?
Yep. That said, this is not running as a service, so you will need to spin it up on your machine. You can definitely connect it with the free SaaS server, and spin up the serving on your machine with docker-compose
ReassuredTiger98 All that said, how about opening an Issue on GitHub (feature request)? if we get a bit of support from users, we could definitely add it
What's the "working dir" ? (where in the repo the script is executed from)
No worries, cudatoolkit is not part of it. "trains-agent" will create a new clean venv for every experiment, and by default it will not inherit the system packages.
So basically I think you are "stuck" with the cuda drivers you have on the system
You can switch to docker-mode for better control over cuda drivers, or use conda and specify cudatoolkit (this feature will be part of the next RC, meanwhile it will install the cudatoolkit based on the global cuda_version).
JitteryCoyote63 There is a basic elastic license that should always be there. If for some reason it was deleted/expired then the following command should fix it:
curl -XPOST 'http://localhost:9200/_xpack/license/start_basic'
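If curl is not available on the server machine, the same POST can be sketched with the Python standard library. The helper name is made up, and the default base URL assumes Elasticsearch is listening on localhost:9200 as in the command above.

```python
import urllib.request


def build_start_basic_request(base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the POST request that asks Elasticsearch to start a basic license."""
    url = base_url.rstrip("/") + "/_xpack/license/start_basic"
    # An empty body with an explicit method keeps this a POST, like `curl -XPOST`.
    return urllib.request.Request(url, data=b"", method="POST")
```

To actually send it, pass the request to urllib.request.urlopen(...) on the machine that can reach Elasticsearch.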
- Could you explain how I can reproduce the missing jupyter notebook (i.e. the ipykernel_launcher.py)
This is exactly what I did here, and it is working
https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution
Multi-threaded multi-processes multi-nodes
Hi RobustHippopotamus53
The way "latest from branch" works:
- On the Task you specify the branch name (e.g. "master", no need to add the origin/ prefix).
- The agent then pulls the latest commit from that branch and updates the Task with the current commit ID (the latest on the branch at the time of execution).
- This process ensures reproducibility and traceability, as we can always be certain of the exact commit that was executed.
Could it be that you "force-pushed" a commit/squash, hence the "origina...
Hi ShinyWhale52
Every execution of the pipeline (by definition) will create a new job based on the pipeline steps
This is the reason you see all the steps twice (the default assumption is that you wish to re-run the step, as this is part of the processing workflow, e.g. training a model)
the model has been overwritten. I guess this is due to this instruction:
This is because you are storing it locally to the same path; it just reflects the fact that you overwrote your model.
To create a...
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production?
(I'm assuming that by event-based you mean triggered by events, not streaming data, i.e. ETL etc.)
I know of at least a few large organizations doing that as we speak, so I cannot see any reason not to.
That'd include monitoring and alerting. I'm afraid that Metaflow will look far more compelling to our teams for that reason.
Sure, then use Metaflow. The main issue with Metaflow...
Hi @<1552101458927685632:profile|FreshGoldfish34>
self-hosted, you mean the open source ? if so, then yes, totally free
That said I would recommend to have the server inside your VPN, just in case from a security perspective
Hi JitteryCoyote63
Do you have a specific example in mind ?
if I encounter the need for that, I will adapt and open a PR
Great!
ChubbyLouse32 could it be the configuration file is not passed to the agent machine itself ?
(were you able to run anything against this internal server? I mean to connect to it from code, clearml/clearml-agent) ?
I update my-private-dep to 1.8.0
Not sure how this is connected with the venv, could you expand ?
Hi @<1624941407783358464:profile|GrievingTiger47>
I think you should try to contact the sales guys here: None
trains-agent build --docker nvidia/cuda --id myTaskId --target base_env_services
It's building a gpu enabled docker...
you might want a different container, or to specify --cpu-only