Hi WackyRabbit7
First, always check the functions on the Task object; they are the most straightforward access to the system.
Then, if you need general-purpose API calls, they are currently only documented in the doc-strings of the API schema (that said, it is fairly well documented).
You can check all the endpoints here: https://github.com/allegroai/trains/tree/master/trains/backend_api/services/v2_8
And finally, if you want an easy way to use the REST API:
` from trains.backend_api.session.client impo...
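A minimal sketch of what that usage could look like (assuming the trains package is installed and a configured trains.conf; the specific calls and fields are illustrative):
from trains.backend_api.session.client import APIClient

client = APIClient()
# e.g. list tasks through the tasks service (same endpoints as the v2_8 schema linked above)
tasks = client.tasks.get_all()
for t in tasks[:5]:
    print(t.id, t.name)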
ReassuredTiger98 can you send the full log?
Also, what's the clearml-agent version?
FYI: we fixed an issue where the default order of the conda repositories caused PyTorch to be installed from conda-forge instead of the pytorch channel, making it the CPU version instead of the GPU version.
This is the correct conda repo order:
https://github.com/allegroai/clearml-agent/blob/cb6bdece39751eaef975287609b8bab603f116e5/docs/clearml.conf#L66
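For context, the channel ordering lives under the agent's package_manager section of clearml.conf and looks roughly like this (treat it as an illustration of the fixed ordering, not a verbatim copy of the file):
agent {
    package_manager {
        # pytorch must come before conda-forge so the GPU build is resolved
        conda_channels: ["pytorch", "conda-forge", "defaults"]
    }
}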
I gather there's a distinction between the two, with app.clear being the public cloud-based SaaS version
My apologies SmallDeer34, this is all some legacy domain stuff.
Actually, http://app.pro.clear.ml is not used any longer (although it is still up), and will be removed in the future.
SaaS free/pro is the same domain ( http://app.clear.ml ) and the same accounts; the only difference is whether you added a credit card. Other than that, it is the same domain and access.
Does that make sense?
If that's the case, check the resource monitoring of the experiment; you will find the free space (in GB) logged there.
I think it's because the proxy env vars are not passed to the container...
Yes, this seems correct; the errors point to a network issue, i.e. the container does not seem to be able to connect to the clearml-server.
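If that is indeed the cause, one way to pass the proxy variables into the container is through the agent's extra docker arguments in clearml.conf (a sketch; the setting name is to the best of my recollection and the proxy address is a placeholder):
agent {
    extra_docker_arguments: ["-e", "HTTP_PROXY=http://my-proxy:3128", "-e", "HTTPS_PROXY=http://my-proxy:3128"]
}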
JitteryCoyote63 any chance the trains-agent-1 is running in services mode?
That would mean it will spin up more than a single experiment at once.
Hi RoughTiger69
unfortunately, the model was serialized with a different module structure - it was originally placed in a (root) module called model ....
Is this like a pickle issue?
Unfortunately, this doesn't work inside clear.ml since there is some mechanism that overrides the import mechanism using import_bind.__patched_import3
What error are you getting? (meaning why isn't it working)
It just seems frozen at the place where it should be spinning up the tasks within the pipeline
And is there an agent for those? Usually there is one agent for running logic Tasks (like pipelines), running with --services-mode,
which means multiple Tasks can be executed by the same agent, and other agents for compute Tasks that run a single Task per agent (but you can run multiple agents on the same machine). See the example commands below.
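For example, the split usually looks something like this (queue names are placeholders):
# one agent for logic/pipeline controllers, many Tasks per agent
clearml-agent daemon --queue services --services-mode --docker --detached
# one agent per GPU for compute Tasks
clearml-agent daemon --queue default --gpus 0 --docker --detached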
Sounds good, I assumed that was the case but I was not sure.
Let's make sure that in the clearml.conf we write it in the comment above the use_credentials_chain option, so that when users look for the IAM roles configuration they can quickly search for it 🙂
OddAlligator72 FYI, in your current code you can always do:
if use_trains:
    from trains import Task
    Task.init()
Might be easier 🙂
MysteriousBee56 Okay, let's try this one:
docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && python3 -m pip install trains-agent && echo done"
os.system
Yes, that's the culprit: it actually runs a new process, and clearml assumes that there are no other scripts in the repository being used, so it does not analyze them.
A few options:
1. Manually add the missing requirement with Task.add_requirements('package_name') - make sure you call it before Task.init()
2. Import the second script from the first script. This will tell clearml to analyze it as well.
3. Force clearml to analyze the entire repository: https://g...
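For option 1, a minimal sketch (the package name is a placeholder):
from clearml import Task

# add the requirement the automatic analysis missed
Task.add_requirements('some_missing_package')
# must be called before Task.init()
task = Task.init(project_name='examples', task_name='manual requirement')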
Hi @<1657918706052763648:profile|SillyRobin38>
I'm curious to know if it's possible to prevent uploading a duplicate endpoint.
...and we attempt to upload it again without any changes to the command content,
Basically you overwrite it, and yes, possible 🙂
any other aspect, could the system prevent the duplicate upload?
so basically check the hash and say, no need to upload?
Hmm so the SaaS service ? and when you delete (not archive) a Task it does not ask for S3 credentials when you select delete artifacts ?
I'm checking the possibility of our firewall between the clearml-agent machine and the local computer running the session
Maybe... the thing is, how come the session creates a Task and pushes it into the queue, but the Task itself is empty?
Hence my request for the clearml-session console log - an actual copy-paste of what you have in the terminal, not the Task log from the UI.
Does Task.connect send each element of the dictionary as a separate api request? Has anyone else encountered this issue?
Hi SuperiorPanda77
the task.connect ends up as a single call, with all the data being sent on a single request.
That said, maybe the connect dict is not the best solution for a thousand-key dictionary...
Maybe an artifact, or connect_configuration, are better suited?
wdyt?
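For example, something along these lines (project/task/artifact names are placeholders):
from clearml import Task

task = Task.init(project_name='examples', task_name='large dict')
big_dict = {f'key_{i}': i for i in range(10000)}

# a single configuration object instead of thousands of individual hyperparameters
task.connect_configuration(configuration=big_dict, name='big_config')

# or store it as an artifact and fetch it later with task.artifacts['big_dict'].get()
task.upload_artifact('big_dict', artifact_object=big_dict)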
- In a notebook, create a method and decorate it by fastai.script's @call_parse .
Any chance you have a very simple code/notebook to reference (this will really help in fixing the issue)?
GrievingTurkey78 I see.
Basically the arguments after the -m src.train in the remote execution should be ignored (they are not needed).
Change the m entry in the Args section under the Configuration tab. Let me know if that solves it.
Could you send me the console logs of both tasks, the failing and the passing one?
eval built-in. wdyt?
eval is never recommended, as basically you could do Args/float='os.system("rm ...")'
🙂
In theory the type is stored on the hyperparameter (this is a relatively new feature the backend supports).
The casting, though, is done based on the original value type, which means Task.connect needs to be called with the original dict. Is there a specific reason for using get_parameters instead of task.connect?
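To illustrate the difference (a sketch; project/task names and values are placeholders):
from clearml import Task

task = Task.init(project_name='examples', task_name='typed params')

# the original dict defines the types used for casting when executed by an agent
params = {'lr': 0.001, 'epochs': 10, 'use_aug': True}
task.connect(params)  # values come back cast to float / int / bool on remote runs

# get_parameters() returns plain strings, so casting is left to the caller
raw = task.get_parameters()  # e.g. {'General/lr': '0.001', ...}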
BitingKangaroo95 nice work 🙂
I think that what did it was changing the sshd_config so that it allows port forwarding, agent forwarding and x11 forwarding.
But just in case, it might be that there was a pre-existing SSH identifier on your machine, and hence the error.
Clearing known_hosts under ~/.ssh is also something I would try 🙂
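For reference, the sshd_config directives in question are roughly these (assuming a standard OpenSSH server; restart sshd after editing):
# /etc/ssh/sshd_config
AllowTcpForwarding yes
AllowAgentForwarding yes
X11Forwarding yes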
How does this work in the context of a pipeline?
Is your pipeline from functions / decorators ? or is it from Tasks ?
(if this is Tasks, then just change the entry point in the overrides)
In case of functions or decorators, you have to do that manually (i.e. your function needs to do the "accelerate launch" itself), for example:
from accelerate.commands.launch import launch_command, launch_command_parser

# build the same arguments the "accelerate launch" CLI would parse
parser = launch_command_parser()
args = parser.parse_args("-command -here".split())
launch_command(args)
What's the difference between the example pipeline and this code?
Could it be the "parents" argument? What is it?
@<1540142651142049792:profile|BurlyHorse22> do you mean the one referred to in the video? (I think this is the raw data in kaggle)
WickedGoat98
The trains-agent-services docker is always CPU; the idea is to put long-lasting services there (like the auto cleanup, slack integration, HPO etc.)
To spin up an agent with GPU on any machine (regardless of where the trains-server is) you can check the trains-agent readme:
https://github.com/allegroai/trains-agent#running-the-trains-agent
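In short, something along the lines of (queue name and GPU index are placeholders):
trains-agent daemon --queue default --gpus 0 --docker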
Is there a way to document these non-standard entry points
@<1541954607595393024:profile|BattyCrocodile47> you should see the "run" in the Args section under Configuration
in case of HF you should see "-m huggingface" and then the rest in the Args section
(if this does not work, then I assume this is a bug 🙂 )
The idea is of course that you can always enqueue and reproduce, so if that part is broken we should fix it 🙂
When looking at the worker details, it says "No queues currently assigned to this worker"
Yes, I think we should have better information there. The "AWS service" is not directly pulling jobs from any specific queue, hence nothing is shown there. It is "listening" to queues and launching machines; those machines will be listening to the queue. I wonder if it is just easier to also make sure it is listed as "assigned" to those queues. wdyt?
Hmm, conda_freeze in the clearml.conf on the development machine?
Hi CurvedHedgehog15
Yes, you are correct, plots are displayed side-by-side in the UI. The reason is that since they are very generic, it is very challenging to actually be able to merge / overlay two arbitrary plots.
I can see two options:
- Allow the user to combine two plots in the UI (this way the responsibility is on the user to understand this is possible)
- Maybe add a programmatic interface to more easily access the raw data?
Wdyt?