if i run clearml-agent daemon, it reads from ~/clearml.conf, right?
yes, i can do this again. i did use clearml-agent init to generate clearml.conf after generating a fresh set of keys
the issue also may have been fixed somewhere between 20.1 and 22.2, i didn’t test versions in between those two
looks like a previous user set CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY in /etc/environment and then disabled the keys in the web app. I removed the two items from /etc/environment and was able to successfully start a worker.
it seems, though, that the env vars take precedence even when a --config-file is explicitly specified?
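for anyone hitting the same thing: a quick sanity check before starting the agent is to see whether those two variables are set at all. a minimal sketch (the variable names are the ones above; everything else is just illustration):

```python
import os

# check whether the ClearML credential env vars are set in the current
# environment -- in my case they came from /etc/environment and seemed to
# override whatever clearml.conf / --config-file said
for var in ("CLEARML_API_ACCESS_KEY", "CLEARML_API_SECRET_KEY"):
    value = os.environ.get(var)
    if value:
        print(f"{var} is set ({value[:4]}...) and may override the config file")
    else:
        print(f"{var} is not set")
```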
hi SubstantialElk6 , not sure if you were successful on this but i struggled with it as well, and it looks like the information is not in the linked document anymore.
in the end i realized that i needed to download apiserver.conf from the clearml-server repo ( https://github.com/allegroai/clearml-server/blob/master/apiserver/config/default/apiserver.conf ) and then add a user/pass for myself (starting at line 82).
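for reference, the block i ended up adding looked roughly like this. i'm going from memory, so treat it as a sketch and double-check against the file in the repo; the username/password/name values are placeholders:

```
auth {
    fixed_users {
        enabled: true
        pass_hashed: false
        users: [
            {
                username: "jane"
                password: "12345678"
                name: "Jane Doe"
            }
        ]
    }
}
```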
i updated the token in ~/clearml.conf , was careful to ensure it was only specified in one place
should be posted in the “uncommitted changes” section 🙂
weird. will move forward with manually recreating the task.
yes. had to sanitize it a bit, but left the git username/key intact (since the key is invalid now)
sorry for the delay, had work and personal emergencies 😕
well, as generated by clearml-agent init —i pasted the text directly from the web app into the CLI interface, and it generated clearml.conf
in the main script, these are the first imports:
import argparse
import time
import json
import pytorch_lightning as pl
from pytorch_lightning.accelerators import accelerator
then after that we import stuff from the repo, and the listed packages are imported in those files
i don’t get why the agent init log would list the username from clearml.conf but then use the env vars
here’s the file with the keys and IP redacted: https://clearml.slack.com/files/U01PN0S6Y67/F0231N0GZ19/clearml.conf
(also, the training code, which uses pandas, worked)
getting different issues (torchvision vs. cuda compatibility, will work on that), but i’m betting that was the issue
okay, i have a few things on my todo list and they will take a while. we will call task.init in the entry point instead of where it’s done now, and we will re-try python -m . if it doesn’t work, we will file an issue. if it does work, yay!
either way, thanks much for your help today, i really appreciate it.
okay, so if i set set_default_upload_destination to a URI that’s local to the computer running the task (and the server):
- the server is “unable to load the image”—not surprising because the filesystem URI was not mounted into the container
- the files are present at the expected location on the local filesystem, but they are… blank! all white.
that tells me that report_media might have been successful, but there’s some issue… encoding the data to a jpeg?
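roughly what i’m doing, for context (project/task names and paths here are placeholders, not my real ones):

```python
from clearml import Task

# placeholder project/task names
task = Task.init(project_name="my-project", task_name="media-report-test")
logger = task.get_logger()

# upload destination that is local to the machine running the task (and the
# server) -- the web app then can't load it, since the path isn't mounted
# into the server's container
logger.set_default_upload_destination("file:///data/clearml_uploads")  # placeholder path

# the file shows up at the expected location on disk, but comes out blank
logger.report_media(
    title="debug",
    series="sample",
    iteration=0,
    local_path="figure.jpg",  # placeholder file
)
```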
yeah, it’s in one of the imports from the repo
don’t want to pester, but i am curious—did they have some thoughts on what was happening? should i make a feature request somewhere?
okay, so here’s what i found out—
- calling the training entry point directly (eg /path/to/train.py ), and not instantiating the clearml Task in train.py (eg calling a method in a different module where the task is instantiated), does work
- calling the entrypoint with python -m , but instantiating the clearml Task within train.py , also works
so the only thing that doesn’t work is calling the entrypoint with python -m and calling a method from a different module that ...
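so the plan from earlier stands: move the Task.init call into the entry script itself. a sketch of the layout that does work with python -m (package/module/project names are made up):

```python
# train.py, invoked as `python -m mypkg.train` (names are illustrative)
import argparse

from clearml import Task


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    args = parser.parse_args()

    # Task.init lives in the entry-point module itself, not in a class inside
    # one of the internal imports
    task = Task.init(project_name="my-project", task_name="train")
    print(f"training for {args.epochs} epochs, task id: {task.id}")


if __name__ == "__main__":
    main()
```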
ah, my mistake, that’s an issue in my conf file.
actually yes— task.init is called inside of a class in one of the internal imports
but hmm, report_media generates a file that is 0 bytes, whereas report_image generates a 33KB file
ugh, turns out i had a plt.show() in there, that was causing blank figs.
that said, report_matplotlib_figure did not end up putting anything into “plots” or “debug samples”
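for completeness, this is roughly the shape of the call, now that the stray plt.show() is gone (project/title/series names are placeholders):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless machine, so use a non-interactive backend
import matplotlib.pyplot as plt

from clearml import Task

task = Task.init(project_name="my-project", task_name="plot-test")  # placeholder names
logger = task.get_logger()

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("example")

# report the figure before any plt.show() -- the stray plt.show() was what was
# giving me blank, all-white images
logger.report_matplotlib_figure(title="debug", series="example", iteration=0, figure=fig)
plt.close(fig)
```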
okay, so my problem is actually that using a “local” package is not supported—i.e. i need to pip install the code i’m running, and that package must correctly specify its dependencies
yes, that call appeared to be successful—had to wrap in quotes because of the contents of the key:
$ curl -u 'J9*****':'R2*****' …
{"meta":{"id":"6db9ae72249f417fa2b6b8705b44f38a","trx":"6db9ae72249f417fa2b6b8705b44f38a","endpoint":{"name":"users.get_current_user","requested_version":"2.13","actual_version":"1.0"},"result_code":200,"result_subcode":0,"result_msg":"OK","error_stack":null,"error_data":{}},"data":{"user":{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b","name":"trains"},...
$ conda list | grep pandas
geopandas         0.9.0   pyhd8ed1ab_1    conda-forge
geopandas-base    0.9.0   pyhd8ed1ab_1    conda-forge
pandas            1.3.3   py39hde0f152_0  conda-forge
there were env variables set in .zshrc