CrookedWalrus33 from the log it seems the code is trying to use "kwcoco" but it is not listed under "Installed packages", nor do I see any attempt to install it. Can you confirm?
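If that's the case, one option is to declare the requirement explicitly before Task.init; a minimal sketch (project/task names are placeholders):
```python
from clearml import Task

# kwcoco is only used at runtime and never imported at module level,
# so the automatic package detection cannot see it - declare it explicitly.
# Must be called before Task.init()
Task.add_requirements("kwcoco")
# optionally pin a version: Task.add_requirements("kwcoco", package_version="<version>")

task = Task.init(project_name="examples", task_name="with-kwcoco")
```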
Hi SmarmyDolphin68
I see this in between my training epochs, what could be causing this?
This is basically saying we are saving a second model on the same Task, and even though both are logged, only the last one is stored on the Task itself.
This will change, as in the next version a Task will be able to hold references to multiple models in the artifactory 🙂
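For illustration, a rough sketch of the situation (file names and project/task names are made up; I'm using OutputModel explicitly instead of a framework save call):
```python
from pathlib import Path
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="two-checkpoints")

for epoch in (1, 2):
    ckpt = Path(f"checkpoint_{epoch}.pt")
    ckpt.write_bytes(b"dummy weights")  # stand-in for a real framework checkpoint
    # each call logs a model, but the Task's "output model" reference ends up
    # pointing only at the last one that was saved
    OutputModel(task=task, name=f"model-epoch-{epoch}").update_weights(str(ckpt))

# in recent clearml versions all logged output models are listed here
print([m.name for m in task.models["output"]])
```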
Hmm, what's the OS and python version?
Is this simple example working for you?
JitteryCoyote63 are you suggesting it happens?
(obviously it should not 🙂 )
Is the fact that clearml-agent needs to be installed from the system python mentioned anywhere in the docs? If not, I suggest it gets added.
You are right, I will check and fix if not 🙂
Thank you so much for helping.
My pleasure
Hi @<1571308003204796416:profile|HollowPeacock58>
I'm assuming this is the arm support (i.e. you are running on a new Mac) fix we released in one of the last clearml-agent versions. Could you update to the latest clearml-agent?
pip3 install clearml-agent==1.6.0rc2
The agent does not auto-refresh the configuration. After a conf file change you should restart the agent; it should then present the new configuration when loading.
Hi @<1523701168822292480:profile|ExuberantBat52>
I am trying to execute a pipeline remotely,
How are you creating your pipeline? And are you referring to an issue with the pipeline logic, or is it a component that needs that repo installed?
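If you are using the PipelineDecorator approach and it's a component that needs the repo, I believe recent versions let you point a component at a repo directly; a minimal sketch (the repo URL, package name and queue below are placeholders):
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(
    repo="https://github.com/your-org/your-repo.git",  # hypothetical repo the step needs
    packages=["some-package"],                          # extra pip requirements for this step
    execution_queue="default",
)
def step(x):
    return x * 2

@PipelineDecorator.pipeline(name="example-pipeline", project="examples", version="0.1")
def pipeline_logic():
    return step(21)

if __name__ == "__main__":
    pipeline_logic()
```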
SmarmyDolphin68 sadly if this was not executed with trains (i.e. the offline option of trains), this is not really doable (I mean it is, if you write some code and parse the TB 😉 but let's assume this is way too much work)
A few options:
1. On the next run, use the clearml OFFLINE option, i.e. in your code call Task.set_offline(), or set the env variable CLEARML_OFFLINE_MODE=1 (see the sketch below)
2. You can compress and upload the checkpoint folder manually, by passing the checkpoint folder, see https://github.com...
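For the first option, a rough sketch of what the offline flow looks like (project/task names are placeholders):
```python
from clearml import Task

# enable OFFLINE mode before Task.init - nothing is sent to the server,
# everything is recorded into a local session folder instead
Task.set_offline(offline_mode=True)

task = Task.init(project_name="examples", task_name="offline-run")
task.get_logger().report_scalar("loss", "train", value=0.5, iteration=1)
task.close()  # the offline session is zipped locally when the task closes

# later, on a machine with server access, the zip can be imported with
# Task.import_offline_session("<path-to-offline-session.zip>")
```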
Of course, I used "localhost"
Do not use "localhost", use your IP; then it will be registered with a URL that points to the IP, and then it will work
MysteriousBee56 that is very strange but it definitely explains it, kudos on debugging it!!!
LOL yes 🙂
just make sure it won't be part of the uncommitted changes of the AWS autoscaler 😉
You can always specify different clearml.conf files with --config-file 🙂
So the only difference is how I log in to the machine to start clearml
The only difference I can think of is the OS environment in the two login types:
can you run export in the two cases and check the diff between them?
I suspect it's the localhost - and the trains-agent is trying too hard to access the port, but for some reason does not report an error ...
Hmmm.
could you change the api_server from http://localhost:8008 to your host IP?
for example:
api_server: http://192.168.1.11:8008
ohh right, my bad:
docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && pip install trains-agent && echo done"
I think the main issue is that for some reason the container changed one of the files inside the temp folder. Then the host machine is "stuck" with a file that the root user owned/changed, and now it cannot reuse / delete the temp folder.
I think the fix is to make sure the container deletes the temp folder when it is done
FiercePenguin76
So running the Task.init from the jupyter-lab works, but running the Task.init from the VSCode notebook does not work?
If I add the bucket to that ....
Oh no .... you should also set SSL off for the connection, but I think this is only in the clearml.conf:
https://github.com/allegroai/clearml/blob/fd2d6c6f5d46cad3e406e88eeb4d805455b5b3d8/docs/clearml.conf#L101
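For reference, a rough sketch of the relevant clearml.conf section (the host/key/secret values below are placeholders):
```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    host: "my-minio-host:9000"   # placeholder endpoint
                    key: "ACCESS_KEY"            # placeholder
                    secret: "SECRET_KEY"         # placeholder
                    multipart: false
                    secure: false                # SSL off for this connection
                }
            ]
        }
    }
}
```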
Hi FancyWhale93 you can disable the auto model uploading with:
@PipelineDecorator.component(..., auto_connect_frameworks={'pytorch': False})
def step():
    pass
Hi @<1694157594333024256:profile|DisturbedParrot38>
The dataset ID is also the task ID :)
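So for example (the dataset ID below is a placeholder):
```python
from clearml import Dataset, Task

dataset = Dataset.get(dataset_id="<dataset-id>")  # placeholder ID
task = Task.get_task(task_id=dataset.id)          # the same ID resolves to the backing Task
print(task.id == dataset.id)                      # True
```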
MysteriousBee56 when you run the trains-agent with --foreground, before it starts the docker it prints the full command line, could you send it please?
I can't figure out where the extra ' came from...
Also could you send the trains.conf file?
(feel free to redact and confidential information)
GleamingGrasshopper63 can you ping your api server?
ping api.server.here
Also what's the api server you configured ? (ip:8008 ?)
Any chance this is a local machine, i.e. the colab machine cannot get back into the clearml server running locally?
So clearml-init can be skipped, and I provide the users with a template and ask them to append the credentials at the top, is that right?
Correct
What about the "Credential verification" step in clearml-init command, that won't take place in this pipeline right, will that be a problem?
The verification test is basically making sure the credentials were copy pasted correctly.
You can achieve the same by just running the following in your python console:
` from clearml import Ta...
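Something along these lines (project/task names are placeholders) - if the credentials were pasted wrong, the call fails right away:
```python
from clearml import Task

# reads ~/clearml.conf (or the CLEARML_API_* env vars) and registers a task on the server;
# a bad key/secret or server address will raise an error here
task = Task.init(project_name="credential-check", task_name="verify")
print(task.get_output_log_web_page())  # link to the new task in the web UI
task.close()
```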
Seems like someone sitting in the middle is rerouting the request (maybe both https and port)?!
Can you send the full log as attachment?
Okay so the way it works is that it moves all the logging to a background process. But if you have a lot of data, actually pushing the data between python processes is not very efficient. This line basically tells it to just use a background thread (instead of a background process) for sending the data to the server.
The idea behind using a background process in the first place is to better support pytorch workers that spin a lot of subprocesses, and we do not want to add a thread per process and in...