It may be specific to fastai, as I cannot reproduce it with another training run using yolov5
About the caching: how does it work? Does ClearML maintain its own cache and monitor whether any of your code changes? Even code that gets changed inside an import?
The weird thing is, if it's an Azure ACA issue, it would be known, right? There are so many people who use ACA and have ACAs talking to each other.
this is really weird ...
@<1523701087100473344:profile|SuccessfulKoala55> Actually it failed now: it failed to talk to our storage in Azure:
ClearML Task: created new task id=c47dd71dea2f421db05647a21d78ed26
2024-01-25 21:45:23,926 - clearml.storage - ERROR - Failed uploading: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)
2024-01-25 21:46:48,877 - clearml.storage - WARNING - Storage helper problem for .clearml.0149daec-7a03-4853-a0cd-a7e2b295...
In summary:
Spin down the local server
Backup the data folder
In the cloud, extract the data backup
Spin up the cloud server
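A minimal sketch of the backup and extract steps above, using only the Python standard library. The function names and paths are hypothetical placeholders (the real data folder location depends on how the server was deployed), and it assumes the local server has already been spun down before archiving.

```python
import tarfile
from pathlib import Path


def backup_data_folder(data_dir: str, archive_path: str) -> str:
    """Pack data_dir into a gzip-compressed tarball and return the archive path."""
    data = Path(data_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the folder name, so the archive holds relative paths
        tar.add(str(data), arcname=data.name)
    return archive_path


def restore_data_folder(archive_path: str, target_parent: str) -> Path:
    """Extract the tarball under target_parent and return the restored folder."""
    parent = Path(target_parent)
    parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # The first member is the top-level folder added by backup_data_folder
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(str(parent))
    return parent / top_level
```

The archive would be created on the local machine, copied to the cloud host by whatever transfer tool you prefer, and extracted there before spinning up the cloud server.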
Following your example, if the seeds are hard-coded in the code, then the git hash will detect whether a change happened and whether the step needs to be run or not.
If you are on github.com, you can use a fine-grained PAT to limit access to the minimum. Although the token will be tied to an account, it's quite easy to change to another one from another account.
@<1523701087100473344:profile|SuccessfulKoala55> it is set to "all" as :
NV_LIBCUBLAS_VERSION=12.2.5.6-1
NVIDIA_VISIBLE_DEVICES=all
CLRML_API_SERVER_URL=https://<redacted>
HOSTNAME=1b6a5b546a6b
NVIDIA_REQUIRE_CUDA=cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=qua...
I don't think agents are aware of each other. Which means that you can have as many agents as you want and, depending on your task usage, they will be fighting for CPU and GPU usage ...
but afaik this only works locally and not if you run your task on a clearml-agent!
Isn't the agent using the same clearml.conf?
We have our agents running tasks and uploading everything to the cloud. As I said, we don't even have a file server running.
@<1523701087100473344:profile|SuccessfulKoala55> Should I raise a github issue ?
From what I understand, docker mode was designed for apt-based images and for running as root inside the container.
We have containers that are not apt-based and do not run as root.
We also do some "start up" that fetches credentials from Key Vault prior to running the agent.
You should be able to explicitly upload a file of your choice as an artefact.
Oh, I was assuming you were passing the entire DB backup to the cloud.
Yes, that is what I want to do.
So I need to migrate both the MongoDB database and the Elasticsearch database from my local docker instance to the equivalent in the cloud?
But when I spin up a new server in the cloud, that server will have its own MongoDB and that will be empty, no?
Following this thread, as it happens every now and then that ClearML misses some package for some reason ...
You should be able to test your credentials first using something like rclone or azure-cli.
Just saw that repo: who are Coder? That's not the VS Code developer team, is it?
then don't use clearml to look at images
I don't think ClearML is designed to visualize millions of images per task. At least not the Debug samples section. That was designed so that you can see, for a given set of images, how the model performs epoch after epoch.
For visualizing millions of images, you have tools like FiftyOne.
To "attach" that zip to the model, do you just use update_weights and point it to that zip file?
so what was the solution/hack then ?
Are the uncommitted changes in untracked files?
In other words: clearml will only save uncommitted changes from files that are tracked by your local git repo
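To see which of your local changes are in untracked files (and so, per the point above, would not be captured), one option is to look at the output of `git status --porcelain`. The helper below is only an illustrative sketch with a hypothetical name; it is not how ClearML itself detects changes. It treats `??` entries as untracked and everything else (except ignored `!!` entries) as changes to tracked files.

```python
from typing import Dict, List


def split_porcelain_status(porcelain: str) -> Dict[str, List[str]]:
    """Split `git status --porcelain` output into tracked changes and untracked files."""
    result: Dict[str, List[str]] = {"tracked_changes": [], "untracked": []}
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # too short to hold a two-character status, a space and a path
        status, path = line[:2], line[3:]
        if status == "??":
            result["untracked"].append(path)
        elif status != "!!":
            result["tracked_changes"].append(path)
    return result
```

You would feed it the raw stdout of `git status --porcelain` run in your repo, without stripping it, since the leading space in entries such as ` M train.py` is part of the status code.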
There is a tricky thing: clearml-agent should not be run from a venv itself ... I don't remember where I read that in the docs.
I don't see where you instantiate the ClearML Task in your given code. Which means that Task.current_task() will return None, thus the error you get.
We don't have a file server. The clearml.conf has: sdk.development.default_output_uri=" None "
You can use a single PC and have multiple agents running at the same time, each assigned one or more GPUs.
You are likely to hit a CPU bottleneck, depending on how much augmentation you apply when training ...
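As a rough sketch of the multi-agent setup described above, the helper below builds one `clearml-agent daemon` command line per GPU group using the `--queue`, `--gpus` and `--detached` options. The function name, the queue name and the GPU grouping are assumptions for illustration; adapt them to your own machine.

```python
from typing import List, Sequence


def agent_commands(gpu_groups: Sequence[Sequence[int]], queue: str = "default") -> List[List[str]]:
    """Build one clearml-agent daemon command per group of GPU indices."""
    commands = []
    for gpus in gpu_groups:
        # --gpus takes a comma-separated list of GPU indices, e.g. "0" or "1,2"
        gpu_arg = ",".join(str(g) for g in gpus)
        commands.append(
            ["clearml-agent", "daemon", "--queue", queue, "--gpus", gpu_arg, "--detached"]
        )
    return commands
```

Each returned list can be passed to `subprocess.run` on the machine where clearml-agent is installed. Nothing here limits CPU usage, so the agents will still compete for CPU cores and RAM.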
We use task.export_task() and a hacked version to get the console log:
def save_console_log(task: clearml.Task, fs, remote_path, number_of_reports=10000):
    from clearml.backend_api.services import events
    from clearml.backend_api import Session
    # Stolen from Task.get_reported_console_output()
    if Session.check_min_api_version('2.9'):
        request = events.GetTaskLogRequest(
            task=task.id,
            order='asc',
            navigate_earlier=True,
            ...
My code looks like this:
parser = argparse.ArgumentParser()
parser.add_argument('-c', '--config-file', type=str, default='train_config.yaml',
                    help='train config file')
parser.add_argument('-t', '--train-times', type=int, default=1,
                    help='train the same model several times')
parser.add_argument('--dataset_dir', help='path to folder containing the preped dataset.', required=True)
parser.add_argument('--backup', action='s...
When I set output_uri in the client, artefacts are sent to blob storage.
When file_server is set to azure://, models/checkpoints are sent to blob storage.
But there are still plot and metrics folders that are stored on the server's local disk. Is that correct?
