Hi @<1523701868901961728:profile|ReassuredTiger98> when you get to it...
please download the wheel, then install it with
pip3 install -U clearml_agent-0.17.3rc0-py3-none-any.whl
Then run the daemon with the additional --debug argument, basically:
clearml-agent --debug daemon --foreground ...
Once the agent is running please send the Task's log from your console 🙂
Wtf? Can you try with = (notice single, not double)?
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
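(As a side note, not from the original thread: a quick way to sanity-check which build conda would resolve for this spec is a dry-run create; the env name here is arbitrary and conda is assumed to be on the PATH.)
# resolve only, nothing gets installed
conda create --dry-run --name cuda-test -c pytorch -c conda-forge pytorch=1.8.0 cudatoolkit=11.1.1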
Hi @<1523701868901961728:profile|ReassuredTiger98>
This should have worked; it seems conda is not fetching the correct PyTorch version (even though the conda env spec contains the specified CUDA version)
Let's try something, reset the Task, then edit the "Installed packages" and add:
cudatoolkit==11.1.1
Then try again.
Let's see what we get.
(The idea is that I think conda forgets it just installed cudatoolkit and assumes the env is CPU-only)
Thanks @<1523701868901961728:profile|ReassuredTiger98>
From the log, this is what conda is installing; it should have worked:
/tmp/conda_env1991w09m.yml:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
-...
Apparently the error comes when I try to access the pipeline component load_model from get_model_and_features. If load_model is not set as a pipeline component but only as a helper function, it works (provided it is declared before the component that calls it; I already understood that and fixed it, different from the code I sent above).
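(For reference, a minimal hedged sketch of the workaround being described: load_model stays a plain helper defined before the component that calls it, instead of being a pipeline component itself. The names and the debug-mode run below are placeholders, not the original code.)
from clearml.automation.controller import PipelineDecorator

# Plain helper (NOT decorated as a component), defined before the component that calls it.
def load_model(model_path):
    # placeholder load logic for this sketch
    return {"weights": model_path}

@PipelineDecorator.component(return_values=["model", "features"])
def get_model_and_features(model_path, data_path):
    # Calling the helper directly works; calling another *component* from inside
    # this component is what raised the error discussed above.
    model = load_model(model_path)
    features = [data_path]  # placeholder feature extraction
    return model, features

@PipelineDecorator.pipeline(name="sketch-pipeline", project="examples", version="0.0.1")
def run_pipeline():
    model, features = get_model_and_features("model.pkl", "data.csv")
    print(model, features)

if __name__ == "__main__":
    # run the whole pipeline as regular functions in this process (debug mode)
    PipelineDecorator.debug_pipeline()
    run_pipeline()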
ShallowGoldfish8 so now I'm a bit confused, are you saying that now it works as expected?
This is very odd. Can you also put the file names here? Maybe an odd character is causing it?
Can you also test it with the latest clearml version (1.8.0)?
ReassuredTiger98
Okay, but you should have had the prints "uploading artifact" and "done uploading artifact".
So I suspect something is going on with the agent.
Did you manage to run any experiment on this agent?
EDIT: Can you try with the artifacts example we have in the repo:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
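(For reference, a minimal hedged sketch along the lines of that example; the project/task names are placeholders.)
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact upload sketch")
# you should see "uploading artifact" / "done uploading artifact" printed during this call
task.upload_artifact(name="sample_dict", artifact_object={"a": 1, "b": 2})
# make sure uploads finish before the script exits
task.flush(wait_for_uploads=True)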
Is there something else in the conf that I should change?
I'm assuming the google credentials?
https://github.com/allegroai/clearml/blob/d45ec5d3e2caf1af477b37fcb36a81595fb9759f/docs/clearml.conf#L113
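(A hedged sketch of the relevant clearml.conf section; the project name and credentials path are placeholders.)
sdk {
    google.storage {
        # default credentials used for all google storage buckets
        project: "my-gcp-project"
        credentials_json: "/path/to/service_account.json"
    }
}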
ShallowGoldfish8 how did you get this error?
self.Node(**eager_node_def)
TypeError: __init__() got an unexpected keyword argument 'job_id'
Hi @<1523706266315132928:profile|DefiantHippopotamus88>
The idea is that clearml-server acts as a control plane and can sit on a different machine; obviously you can run both on the same machine for testing. Specifically it looks like the clearml-serving is not configured correctly, as the error points to an issue with the initial handshake/login between the triton containers and the clearml-server. How did you configure the clearml-serving docker compose?
I was unable to reproduce, but I added a few safety checks. I'll make sure they are available on master in a few minutes; could you maybe rerun after?
I had no idea it was going to do that and sent your servers over 1.4M API hits unintentionally
Yeah, that is way too much; I think it relates to the frequency at which it updates the console 😞
(Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac
Where is the code (agent) running? A GCP instance? Your machine?
Try the following example.env:
CLEARML_SERVING_PORT=9090
CLEARML_WEB_HOST="http://<IP>:8080"
CLEARML_API_HOST="http://<IP>:8008"
CLEARML_FILES_HOST="http://<IP>:8081"
(I think localhost is resolved inside the container and not to the host machine, hence the error)
@<1523706266315132928:profile|DefiantHippopotamus88> seems like you are missing the ports 🙂
CLEARML_WEB_HOST="
"
CLEARML_API_HOST="
"
CLEARML_FILES_HOST="
"
WackyRabbit7 my apologies for the lack of background in my answer 🙂
Let me start from the top: one of the goals of the trains-agent is to reproduce the "original" execution environment. Once that is done, it will launch the code and monitor it. In order to reproduce the original execution environment, trains-agent will install all the needed python packages, pull the code, and apply the uncommitted changes.
If your entire environment is python based, then virtual-environment mode is proba...
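(For reference, a hedged sketch of launching the agent in virtual-environment mode; the queue name is just an example.)
# listen on the "default" queue and build a virtual environment per task
trains-agent daemon --queue default --foreground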
I think so (you can also comment out the Task.init() just to verify this is not a clearml issue)
The confusion matrix shows under debug samples, but the image is empty; is that correct?
Hmm, maybe this is the issue:
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
This makes no sense: conda is saying pytorch=1.8 needs cudatoolkit <10.2 / <10.3, but actually it needs cudatoolkit 11.1.
Are you saying that in the UI you do not see "confusion matrix" at all, only on the GS bucket ?
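(For reference, a hedged sketch of reporting a confusion matrix through the ClearML Logger so it shows up under the task's plots in the UI; the values and names are made up.)
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="confusion matrix sketch")
# made-up 2x2 confusion matrix
matrix = np.array([[50, 2], [3, 45]])
task.get_logger().report_confusion_matrix(
    title="confusion matrix",
    series="validation",
    iteration=0,
    matrix=matrix,
    xaxis="predicted",
    yaxis="actual",
)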
Alternatively I understand I can also run the agent using...
No, you should not. If you are running the agent inside a container, it cannot work in docker mode and spin up its own containers.
Bottom line: use clearml-agent daemon.
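(A hedged sketch of the two usual setups; the queue name is just an example.)
# on the bare host: docker mode, so the agent can spin up a container per task
clearml-agent daemon --queue default --docker
# inside a container: virtual-environment mode (no --docker)
clearml-agent daemon --queue default --foreground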
Local IP, like 192.168.1.123
BTW: in your code, you should probably replace
dataset_task = Task.get_task(task_id=dataset.id)
with:
dataset_task = dataset._task