Hmm good point, it should probably return the clearml python version. Is this what you mean?
Yep... something went wrong with the elastic container, I think it lost its indexes (or they got corrupted somehow)
Do you have a backup of the persistence volume attached to the container? Can you try restoring it?
I would restart the entire clearml-server (docker-compose). Then can you post the startup logs here? They should provide some info on what's wrong
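For reference, a minimal restart-and-inspect sequence (a sketch assuming the default docker-compose.yml and container names from the clearml-server repo):
```bash
# restart the full clearml-server stack
docker-compose -f docker-compose.yml down
docker-compose -f docker-compose.yml up -d

# then follow the elastic container logs and look for index errors
docker logs --follow clearml-elastic
```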
Hi AgitatedTurtle16
My question is how to use it to manage my experiments (docker containers). Simply put, let's say:
So basically once you see an experiment in the UI, it means you can launch it on an agent.
There is no need to containerize your experiment (actually that's kind of the idea, removing the need to always containerize everything).
The agent will clone the code, apply uncommitted changes & install the packages in the base-container-image at runtime.
This allows you to u...
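As a rough sketch of that flow (project/task/queue names here are placeholders, not from the original thread):
```python
from clearml import Task

# run once locally; the experiment is registered and shows up in the UI
task = Task.init(project_name="examples", task_name="my experiment")
# ... training code ...

# later (or from another script): clone the registered experiment
# and enqueue it so an agent picks it up and re-runs it
cloned = Task.clone(source_task=task.id, name="my experiment (agent run)")
Task.enqueue(cloned, queue_name="default")
```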
Could it be you have some custom SSL certificate installed, or a policy?
Can you reach other https sites? (for example your clearml-server)
Let's start small. Do you have grafana enabled in your docker compose and can you login to your grafana web ui?
Notice grafana needs to access the prometheus container directly, so the easiest way is to have everything in the same docker-compose.
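A minimal sketch of what that could look like (service names and ports are assumptions, not from the original setup):
```yaml
# docker-compose.yml: both services share the default compose network,
# so grafana can reach prometheus by its service name
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    # in the grafana UI, add a Prometheus data source
    # pointing at http://prometheus:9090
```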
DeliciousBluewhale87
node.base_task_id is the base task, which will always be in draft mode. Instead we should use node.executed, which references the currently executed node's Task.
YES, maybe we should add that to the example so it is clearer? WDYT?
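A small sketch of how that could look in a pipeline callback (the callback name, project, and queue are illustrative; node.executed is the attribute discussed above):
```python
from clearml import Task
from clearml.automation import PipelineController

def step_completed(pipeline, node):
    # node.executed holds the ID of the task that actually ran,
    # while node.base_task_id is the (draft) template task
    executed_task = Task.get_task(task_id=node.executed)
    print(executed_task.id, executed_task.status)

pipe = PipelineController(name="example", project="examples", version="1.0")
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="train task",
    post_execute_callback=step_completed,
)
pipe.start(queue="services")
```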
(sure, we can try, conda is sometimes flaky but it is supported)
1. specify conda as the package manager: https://github.com/allegroai/trains-agent/blob/9a3f950ac689c50ba3415c42749a4bd8059e89a7/docs/trains.conf#L49
2. make sure trains-agent is installed on both nodes
3. assuming you already have an experiment in the system, right click on the experiment and clone it. Then press on the ID button next to the experiment name, and copy the task ID
4. ssh to each node and run:
```
trains-agent execute --id <...
```
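For step 1, the relevant section in trains.conf looks roughly like this (a sketch based on the linked line, not verbatim):
```
agent {
    package_manager: {
        # use conda instead of the default pip
        type: conda,
    }
}
```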
GrievingTurkey78 short answer: no 🙂
Long answer: the files are stored as differentiable sets (think change sets from the previous version(s)). The collection of files is then compressed and stored as a single zip. The zip itself can be stored on Google, but on their object storage (GCS), not GDrive. Notice that the default storage for clearml-data is the clearml-server; that said, you can always mix and match (even between versions).
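For example, a short sketch of pointing a dataset version at Google Cloud Storage (project, dataset, and bucket names are placeholders):
```python
from clearml import Dataset

# create a new dataset version; the compressed zip will be stored on GCS
ds = Dataset.create(dataset_project="examples", dataset_name="my dataset")
ds.add_files("/path/to/local/files")
ds.upload(output_url="gs://my-bucket/datasets")  # hypothetical bucket
ds.finalize()
```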
it handles 2FA if my repo lies in Github and my account needs 2FA to sign in
It does not 🙂
Hi @<1561885921379356672:profile|GorgeousPuppy74>
Please use threads to ask questions, so we keep everything tidy
(and if you can, please remove your first message and merge its content into this one by editing it, for better readability)
Regarding the issue, you need to have clearml.conf in your Home folder. I'm assuming this is /root/,
not /home/ubuntu/.
Also not sure why you need to expose ports...
Obviously if you click on them you will be able to compare based on specific metrics / parameters (either as a table or in parallel coordinates)
Hi DilapidatedDucks58 ,
Are you running in docker or venv mode?
Do the workers share a folder on the host machine?
It might be a syncing issue (not directly related to the trains-agent but to the fact that you have 4 processes trying to simultaneously access the same resource)
BTW: the next trains-agent RC will have a flag (default off) for torch-nightly repository support 🙂
Makes sense to add it to docker run by default if GPUs are mentioned in agent.
I think this is an Arch thing; --privileged is not needed on the Ubuntu flavor. That said, you can always have it if you add it here:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L149
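i.e. something along these lines in clearml.conf (a sketch; the key is the one at the linked line):
```
agent {
    # extra arguments passed to every "docker run" the agent launches
    extra_docker_arguments: ["--privileged"]
}
```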
clearml-agent daemon --gpus 0 --queue default --docker
But docker still sees all GPUs.
Yes, --gpus should be enough. Are you sure regarding the --privileged flag?
VivaciousPenguin66 I have the feeling it is the first space in the URI that breaks the credentials lookup.
Let's test it:
```python
from clearml import StorageManager

uri = 'Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt'

# original
StorageManager.get_local_copy(uri)

# quoted
StorageManager.get_local_copy(uri.replace(' ', '%20'))
```
Hi @<1532532498972545024:profile|LittleReindeer37>
This is truly a great discussion to have. Personally I think the main difference is that software development is a somewhat linear process, and git captures it very well. But ML is a much wider, nonlinear process, which to me means that trying to conform the same workflow into a dev tree will end up failing. The way ClearML thinks about it (and I think the analogy to source control is correct) is probably closer to how you think about proj...
Hi YummyFish22
Looks like the task does not have a "Task.init" call in the main script (or an import of clearml)? Could that be the case?
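For reference, the minimal call that registers the script as an experiment (project/task names are placeholders):
```python
from clearml import Task

# must run in the main script so clearml can capture the execution
task = Task.init(project_name="examples", task_name="my experiment")
```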
This works.
great!
So it is still in master and should be included in 1.0.5?
correct, RC will be released soon with this fix included
The problem is due to tight security on this k8s cluster; the k8s pod cannot reach the public file server URL which is associated with the dataset.
Understood, that makes sense. If this is the case then the path_substitution feature is exactly what you are looking for.
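A sketch of the corresponding clearml.conf section (the prefixes here are placeholders):
```
sdk {
    storage {
        path_substitution = [
            {
                # the prefix registered with the dataset (unreachable from the pod) ...
                registered_prefix: "https://files.public-server.example"
                # ... is replaced with a path/URL reachable from inside the cluster
                local_prefix: "/mnt/shared/files"
            }
        ]
    }
}
```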
Sounds like something very similar, I'll try to use it,
You can set it per container with -e CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1
Or add it here:
https://github.com/allegroai/clearml-agent/blob/51eb0a713cc78bd35ca15ed9440ddc92ffe7f37c/docs/clearml.conf#L149
extra_docker_arguments: ["-e", "CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1"]
Thanks @<1657918706052763648:profile|SillyRobin38> this is still in the internal git repo (we usually do not develop directly on github)
I want to get familiar with it and, if possible, contribute to the project.
This is a good place to start: None
we are still debating whether to use it directly or as part of Triton ( None ), would love to get your feedback
Hi MagnificentSeaurchin79
This means tensorflow was not directly imported in the repository (which is odd; it might point to the auto package analysis failing to find the package, if this is the case please let me know)
Regardless, if you need to make sure a package is listed in the requirements, either import it or use Task.add_requirements('tensorflow')
or Task.add_requirements('tensorflow', '2.3.1')
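Note that add_requirements needs to be called before Task.init; a minimal sketch (project/task names are placeholders):
```python
from clearml import Task

# must be called before Task.init so the package lands in the task requirements
Task.add_requirements('tensorflow', '2.3.1')
task = Task.init(project_name="examples", task_name="my experiment")
```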
This will set more time before the timeout right?
Correct.
task.freeze_monitor()
download()
task.defrost_monitor()
Currently there isn't, but that's a good idea.
What would be the argument for using it vs. increasing the timeout?
btw: setting the resource timeout to 99999 will basically mean that it will wait until the first reported iteration, not that it will just sleep for 99999 sec 🙂
Hi GiganticTurtle0
You can have clearml follow the dictionary and auto-update the UI:
args = task.connect(args)
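i.e. a minimal sketch (the args values are illustrative):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="hyperparams")

args = {'lr': 0.001, 'batch_size': 32}
args = task.connect(args)  # values appear in the UI and stay in sync
print(args['lr'])  # when run by an agent, UI edits override these values
```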
Could I use "register artifact"
I think this is somewhat deprecated and we should probably replace it with something similar to what you mentioned (i.e. watch a file change).
Right now the easiest way would be to manually upload the trainer_state.json every checkpoint:
```
Task.current_task().upload_artifact('trainer_state.json', name='state')
```
NaughtyFish36
what's the error you are getting?
Also did you try setting force_git_ssh_protocol: true ?
https://github.com/allegroai/clearml-agent/blob/76c533a2e8e8e3403bfd25c94ba8000ae98857c1/docs/clearml.conf#L39
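i.e. in clearml.conf (matching the linked line):
```
agent {
    # force all git links to use ssh:// regardless of how the repo was cloned
    force_git_ssh_protocol: true
}
```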
Hi @<1523702932069945344:profile|CheerfulGorilla72>
Please tell me what RAM metric is tracked by ClearML?
Free RAM is the entire machine's free RAM.
Yeah, htop shows odd numbers as it doesn't "count" allocated buffers.
specifically you can see the code here:
None
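To see the difference yourself, a small sketch using psutil (which this kind of monitoring typically relies on; not necessarily the exact code in the repo):
```python
import psutil

mem = psutil.virtual_memory()
# 'free' excludes reclaimable buffers/cache; 'available' includes them,
# which is why htop-style numbers can look different
print(f"free:      {mem.free / 1024**3:.2f} GB")
print(f"available: {mem.available / 1024**3:.2f} GB")
```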