I think prefix would be great. It can also make it easier for reporting scalars in general
Actually those are "supposed" to be collected automatically by pytorch and reported by the master node.
currently we need a barrier to sync all nodes before reporting a scalar which makes it slower.
Also "should" be part of pytorch ddp
It's launched with torchrun
I know there is an effort to integrate with torchrun (the under-the-hood infrastructure), I'm not sure where it stands...
oh...so is this a bug?
It was always a bug, only an elusive one 🙂
Anyhow, I'll make sure we push a fix to GitHub; an RC is planned for later this week and it will contain it
WackyRabbit7 you can configure the AWS autoscaler with two types of instances, with priority given to one of them. So in theory you do not need two autoscaler processes; with that in mind I "think" a single IAM should suffice
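For illustration, the priority idea sketched against the aws_autoscaler example config (resource and queue names here are hypothetical; check the example config for the exact schema):
```yaml
resource_configurations:
  gpu_spot:
    instance_type: g4dn.xlarge
    is_spot: true
  gpu_on_demand:
    instance_type: g4dn.xlarge
    is_spot: false
queues:
  default:
    - [gpu_spot, 2]        # tried first (higher priority)
    - [gpu_on_demand, 2]   # fallback when spot capacity is unavailable
```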
Hi DepressedChimpanzee34 , took me a while but I think there is a solution:
In your docker-compose file, replace:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L5
with:
entrypoint: /bin/bash
command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
JitteryCoyote63 see if upgrading the packages as they suggest somehow fixes it.
I have the feeling this is the same problem (the first error might be trains masking the original error)
so that you can get the latest artifacts of that experiment
what do you mean by "the latest artifacts"? do you have multiple artifacts on the same Task, or is it the latest Task holding a specific artifact?
If this is the case, then we do not change the matplotlib backend
Also
I've attempted converting the `mpl` image to `PIL` and using `report_image` to push the image, to no avail.
What are you getting? An error / exception?
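For reference, report_image should accept a PIL image directly; a minimal sketch (the random image is just a stand-in):
```python
import numpy as np
from PIL import Image
from clearml import Logger

# report_image takes a PIL Image via the `image` argument
img = Image.fromarray(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
Logger.current_logger().report_image(
    title="debug", series="mpl_render", iteration=0, image=img
)
```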
Hi JealousParrot68
clearml tracking of experiments run through kedro (similar to tracking with mlflow)
That's definitely very easy. I'm still not sure how Kedro scales on clusters. From what I saw (and I might have missed it) it seems more like a single instance with sub-processes, with no real ability to set up a different environment for the different steps in the pipeline, is this correct?
I think the challenge here is to pick the right abstraction matching. E.g. should a node in kedro (w...
WittyOwl57 what about vm.max_map_count?
echo "vm.max_map_count=262144" > /tmp/99-clearml.conf
sudo mv /tmp/99-clearml.conf /etc/sysctl.d/99-clearml.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_linux_mac
Hmm we might need more detailed logs ...
When you say there is a lag, what exactly does that mean? If you have enough apiserver instances answering the requests, the bottleneck might be the mongo or the elastic?
SubstantialElk6 try to add -e CLEARML_AGENT_EXTRA_PYTHON_PATH=/code/app/flair
It should add it to the runtime pythonpath
(to the BASE DOCKER IMAGE on the Task itself)
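If it helps, a sketch of attaching that env var to the Task's base docker from code (the image name is just an example; the single-string form should work across clearml versions):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="flair-job")  # hypothetical names
# Base docker image plus extra docker arguments, as one string
task.set_base_docker("python:3.9 -e CLEARML_AGENT_EXTRA_PYTHON_PATH=/code/app/flair")
```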
Hi SteadySeagull18
However, it seems to be entirely hanging here in the "Running" state.
Did you set an agent to listen to the "services" queue?
Someone needs to run the pipeline logic itself; it is sometimes part of the clearml-server deployment, but that's not a must
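For reference, spinning an agent on that queue usually looks something like this (flags may vary with your setup):
clearml-agent daemon --services-mode --queue services --docker --cpu-only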
MinuteGiraffe30 if you are running the following command while your current directory is where you code is, what are you getting?
$ git ls-remote --get-url origin
The api server by default spins multiple processes (they all might be busy at the time with a huge flood of requests, but this is still multi-process). Let me check if there is an easy way to set more processes
Okay I think I found the confusion here (and it is confusing, but also very cool)
This line:
metrics_names = {"metrics": ["name", "bias", "r2"]}
task.connect(metrics_names)
When running in "manual mode" (i.e. not by an agent), it will take the dict metrics_names and put it on the Task's Hyperparameters section.
But, when executed by the Agent, it will do the opposite! It will take the data stored on the Task's hyperparameters section and put it back into the `metrics_names` variable...
That was the idea behind the feature (and BTW any feedback on usability and debugging will be appreciated here, pipelines are notoriously hard to debug 🙂)
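A minimal sketch of that two-way behavior (project/task names are hypothetical):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="connect-demo")

metrics_names = {"metrics": ["name", "bias", "r2"]}
# Manual run: writes the dict into the Task's Hyperparameters section.
# Agent run: overwrites the dict with the values stored on the Task,
# so edits made in the UI flow back into the code.
metrics_names = task.connect(metrics_names)
print(metrics_names)
```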
the ability to execute without an agent: I was just talking about this functionality the other day in the community channel
What would be the use case ? (actually the infrastructure now supports it)
EnviousStarfish54 data versioning on the open source leverages the artifacts and storage and caching capabilities of Trains.
A simple workflow:
- Upload data: https://github.com/allegroai/events/blob/master/odsc20-east/generic/dataset_artifact.py
- Preprocessing data: https://github.com/allegroai/events/blob/master/odsc20-east/generic/process_dataset.py
- Using data: https://github.com/allegroai/events/blob/master/odsc20-east/scikit-learn/sklearn_jupyter.ipynb
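Roughly, the upload/use steps boil down to something like this sketch (project, task, and file names are hypothetical):
```python
from clearml import Task

# "Upload data": store a dataset file as a Task artifact
task = Task.init(project_name="datasets", task_name="dataset-v1")
task.upload_artifact(name="dataset", artifact_object="data/train.csv")
task.close()

# "Using data": any other machine can fetch a cached local copy
dataset_task = Task.get_task(project_name="datasets", task_name="dataset-v1")
local_copy = dataset_task.artifacts["dataset"].get_local_copy()
```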
Hi @<1637624975324090368:profile|ElatedBat21>
I think that what you want is:
Task.add_requirements("unsloth", "@ git+ ")
task = Task.init(...)
after you do that, what are you seeing in the Task "Installed Packages" ?
Hi AdorableFrog70
I assume so, there's an API for everything so you can always get the data. wdyt?
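For example, a sketch of pulling task data via the APIClient (the project id is a placeholder):
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
# List the 10 most recently updated tasks in a project
tasks = client.tasks.get_all(
    project=["<project-id>"], order_by=["-last_update"], page_size=10
)
for t in tasks:
    print(t.id, t.name)
```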
just to check. Does the k8s glue install torch by default?
SubstantialElk6 what do you mean the glue installs torch ?
The glue will take a Task from the queue and create a k8s job (basically use the same docker, and inside the docker run get the agent to execute the requested Task). Where would the "torch" come into play?
What's the trains server IP? It seems everything is configured with localhost?
I still can't get it to work... I couldn't figure out how can I change the clearml version in the runtime of the Cleanup Service as I'm not in control of the agent that executes it
Let's take a step back. Let's remove the clearml-services from the docker compose for a second, and run it manually (then you can control everything). Once you have it running manually, let's try to replicate the setup back to the docker compose, make sense ?
@<1523716917813055488:profile|CloudyParrot43> yes, server upgrades deleted it 🙂 we are redeploying a copy, should take a few min
CooperativeFox72 of course, anything trains related, this is the place 🙂
Fire away
My question is what happens if I launch in parallel multiple doit commands that create new Tasks.
Should work out of the box.
I would like to confirm that current_task ...
Correct.
I think the main difference is that I can see value in having access to the raw format within the cloud vendor, and not only having it as an archive
I see, it does make sense.
Two options, one, as you mentioned use the ClearML StorageManager to upload the files, then register them as external links with Dataset.
Two, I know the enterprise tier has HyperDatasets, that are essentially what you describe, with version control over the "metadata" and "raw storage" on the GCP, including the ab...
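For the first option, a rough sketch (bucket path and names are hypothetical):
```python
from clearml import Dataset, StorageManager

# Upload the raw file yourself, keeping it in its native format on GCS
remote_url = StorageManager.upload_file(
    local_file="data/train.csv",
    remote_url="gs://my-bucket/raw/train.csv",
)

# Register the remote copy as an external link, so the Dataset
# versions the reference without duplicating the data
dataset = Dataset.create(dataset_name="raw-data", dataset_project="datasets")
dataset.add_external_files(source_url=remote_url)
dataset.upload()
dataset.finalize()
```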