Yes, that’s what I did initially, but eventually I decided it adds too much complexity for little benefit. I’d rather drop omegaconf, and if one day clearml supports it out of the box, take advantage of it then
Nope, I’d like to wait and see how the different tools improve over this year before picking THE one 😄
is it different from Task.set_offline(True)?
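For reference, offline mode in the SDK looks roughly like this (a minimal sketch; the project/task names and zip path are placeholders):
` from clearml import Task

# Record everything locally instead of sending it to the server
Task.set_offline(offline_mode=True)
task = Task.init(project_name="demo", task_name="offline-run")
# ... training code ...
# The resulting session zip can later be imported with
# Task.import_offline_session("<path-to-session-zip>") `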
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd -> wrong numpy version
Just tried, still the same issue
I hit F12 to check projects.get_all_ex but nothing is fired; I guess the web UI is just frozen in some weird state
btw CostlyOstrich36, I can see in Profile > Version: 1.1.1-135 • 1.1.1 • 2.14. What do these numbers correspond to?
extra_configurations = {'SubnetId': "<subnet-id>"} with brackets, right?
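For context, roughly where it sits in the autoscaler resource configuration (a sketch; I believe the dict is forwarded to boto3, so the keys must match its parameter names):
` "extra_configurations": {
    "SubnetId": "<subnet-id>"
} `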
I will try adding sudo sh -c "echo '\n* soft nofile 65535\n* hard nofile 65535' >> /etc/security/limits.conf" to the extra_vm_bash_script , maybe that’s enough actually
Opened an issue with the logs here > None
btw, I tried with alpine instead of ubuntu:18.04 and got:
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
df20fa9351a1: Pulling fs layer
df20fa9351a1: Verifying Checksum
df20fa9351a1: Download complete
df20fa9351a1: Pull complete
Digest: sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
Status: Downloaded newer image for alpine:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting containe...
Ha I just saw in the logs:
WARNING:py.warnings:/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A10G with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at
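(Side note for anyone hitting this: a quick way to confirm the mismatch, assuming torch imports fine:)
` import torch

print(torch.version.cuda)                   # CUDA version this PyTorch build targets
print(torch.cuda.get_device_capability(0))  # e.g. (8, 6) for an A10G
print(torch.cuda.get_arch_list())           # architectures compiled into this build,
                                            # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70'] `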
ok, and if that’s not the case, it will fall back to 3.8, right? Would it be possible to support such a use case? (have the clearml-agent set up a different python version when a task needs it?)
yes, done! Is there something more to take into account than what I shared?
I cannot share the file itself, but here are some potentially helpful points:
- Multiple lines are empty
- One line is empty but contains spaces (6, to be exact)
- The last line of the file is empty
In the controller, I want to upload an artifact and then start a task that queries it; I need to make sure the artifact exists by the time the task tries to retrieve it
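Roughly what I mean, as a sketch (assuming upload_artifact’s wait_on_upload flag; the artifact name and file are placeholders):
` from clearml import Task

controller = Task.current_task()
# Block until the upload finishes, so the child task is guaranteed to find the artifact
controller.upload_artifact(
    name="dataset",
    artifact_object="data.csv",
    wait_on_upload=True,
) `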
The cleanup service is awesome, but it would require having another agent running in services mode on the same machine, which I would rather avoid
What is the latest RC of clearml-agent? 1.5.2rc0?
line 13 is empty 🤔
Still investigating: task.data.last_iteration is correct (equal to engine.state["iteration"]) when I resume the training
Hi AgitatedDove14 , I investigated further and got rid of a separate bug. I was able to get ignite’s events fired, but still no scalars logged 😞
There is definitely something wrong with the reporting of scalars from multiple processes, because if my ignite callback is the following:
` def log_loss(engine):
    idist.barrier()  # sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
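(A rank-0-only variant that should avoid racy multi-process reporting; a sketch assuming the truncated call is report_scalar:)
` import ignite.distributed as idist
from clearml import Task

def log_loss(engine):
    idist.barrier()  # sync all processes
    if idist.get_rank() == 0:  # only the main process reports to ClearML
        Task.current_task().get_logger().report_scalar(
            title="loss",
            series="train",
            value=engine.state.output,
            iteration=engine.state.iteration,
        ) `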
Configuration:
` {
    "resource_configurations": {
        "v100": {
            "instance_type": "g4dn.2xlarge",
            "availability_zone": "us-east-1a",
            "ami_id": "ami-05e329519be512f1b",
            "ebs_device_name": "/dev/sda1",
            "ebs_volume_size": 100,
            "ebs_volume_type": "gp3",
            "key_name": "key.name",
            "security_group_ids": [
                "sg-asd"
            ],
            "is_spot": false,
            "extra_configura...
Are you planning to add a server-backup service task in the near future?
So the new EventsIterator is responsible for the bug.
Is there a way for me to easily force the WebUI to always use the previous endpoint (v1.7)? I saw in the v1.1.0 > v1.2.0 diff that the ES version was bumped to 7.16.2. I am using an external ES cluster, and its version is still 7.6.2. Could the incompatibility come from there? I’ll update the cluster to make sure that’s not the case
With my hack, yes; without it, no
For some reason, when cloning task A, trains sets an old commit in task B. I tried recreating task A to force a new task ID and a new commit ID, but I still hit the same issue
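One workaround sketch (assuming Task.create’s repo/branch/commit parameters; all values are placeholders):
` from clearml import Task

# Create task B from scratch, pinned explicitly to the commit I want,
# instead of relying on the commit recorded in the cloned task A
task_b = Task.create(
    project_name="my-project",
    task_name="task B",
    repo="https://github.com/org/repo.git",
    branch="main",
    commit="<new-commit-sha>",
    script="train.py",
) `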