Maybe the only thing to worry about is making sure the IP address is stable, so if k8s replaces the node, you do not have to reconfigure the clients 🙂
Hi @<1551376687504035840:profile|StraightSealion9>
AWS Autoscaler to create a new instance when you enqueue a task to the relevant queue.
Does that mean that you were able to enqueue a Task and have it launch on the remote EC2 machine ?
UI for some anomalous file,
Notice the metrics are not files/artifacts, just scalars/plots/console
TrickyRaccoon92 I'm not sure I follow, the TB plots do show, and you want to add an additional plotly plot?
- Maybe we should add an option, archive components as well ...
Okay, the odd thing is that git ls-remote --get-url origin should have returned the same...
what's your git version? (git --version)
Hi @<1523701504827985920:profile|SubstantialElk6>
I would split the first stage into two: the first one passing data to the others, the second as "monitoring", wdyt?
Thanks for the details TroubledJellyfish71 !
So the agent should have automatically resolved this line: torch == 1.11.0+cu113
into the correct torch version (based on the cuda version installed, or cpu version if no cuda is installed)
Can you send the Task log (console) as executed by the agent (and failed)?
(you can DM it to me, so it's not public)
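To make the idea concrete, here is a rough sketch of what "resolving" that line means: keep the package and version, and swap the build tag after the "+" for one matching the machine. This is only an illustration, not the agent's actual code; the function name and the way the CUDA version is passed in are assumptions made up for the example.

```python
def retarget_torch_requirement(requirement, cuda_version=None):
    """Swap the local build tag (the part after "+") for one matching the machine.

    Hypothetical helper for illustration only; it is not part of the agent.
    """
    # "torch == 1.11.0+cu113" -> "torch==1.11.0"
    name_and_version = requirement.split("+", 1)[0].replace(" ", "")
    if cuda_version is None:
        # No CUDA installed: fall back to the CPU build
        tag = "cpu"
    else:
        # "10.2" -> "cu102"
        tag = "cu" + cuda_version.replace(".", "")
    return f"{name_and_version}+{tag}"


print(retarget_torch_requirement("torch == 1.11.0+cu113", "10.2"))
print(retarget_torch_requirement("torch == 1.11.0+cu113", None))
```

Whether a wheel with the resulting tag actually exists for that torch version is a separate question that this sketch does not check.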
Hi RattyBat71
Do you tend to create separate experiments for each fold?
If you really want to parallelize the workload, then splitting it into multiple executions (i.e. passing the fold index of the same CV as an argument) makes sense; then you can compare / sort the results based on a specific metric. That said, if speed is not important, just having a single script with multiple CVs might be easier to implement?!
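A minimal sketch of the "one execution per fold" option, assuming a plain k-fold split and a script that receives the fold index on the command line. The argument names (--fold-index etc.) and the splitting scheme are made up for the example; use whatever your training code already has.

```python
import argparse


def fold_split(n_samples, n_folds, fold_index):
    """Return (train_indices, validation_indices) for one fold of a plain k-fold split."""
    if not 0 <= fold_index < n_folds:
        raise ValueError("fold_index must be in [0, n_folds)")
    indices = list(range(n_samples))
    # Fold i validates on every n_folds-th sample, starting at offset i
    val = indices[fold_index::n_folds]
    val_set = set(val)
    train = [i for i in indices if i not in val_set]
    return train, val


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run a single CV fold")
    parser.add_argument("--n-samples", type=int, default=10)
    parser.add_argument("--n-folds", type=int, default=5)
    parser.add_argument("--fold-index", type=int, default=0)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    train, val = fold_split(args.n_samples, args.n_folds, args.fold_index)
    # Train on `train`, evaluate on `val`, and report the metric for this fold here
    print(f"fold {args.fold_index}: {len(train)} train / {len(val)} validation samples")


if __name__ == "__main__":
    main()
```

Each execution then only differs in its fold index, so the runs can be launched in parallel and compared on the reported metric afterwards.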
However, there is still a delay of approximately 2 minutes between the completion of setup,
Where is that delay in the log?
(btw: it seems your container is missing clearml-agent & git, installing those might add some time)
Hi TightElk12
Are you looking for a way to set the output_uri from an environment variable? Is this it?
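If it helps, this is roughly what doing it by hand in your own script could look like. It is only a sketch: MY_OUTPUT_URI is a made-up variable name for the example (not an official setting), and the Task.init call is left commented out because it needs the clearml package and a configured server.

```python
import os

# Hypothetical variable name, chosen for this example only
OUTPUT_URI_ENV = "MY_OUTPUT_URI"


def resolve_output_uri(default=None):
    """Return the output destination from the environment, or `default` if unset/empty."""
    value = os.environ.get(OUTPUT_URI_ENV, "").strip()
    return value or default


# Usage (left commented out, it needs clearml installed and configured):
# from clearml import Task
# task = Task.init(
#     project_name="examples",
#     task_name="output_uri from env",
#     output_uri=resolve_output_uri(),
# )
```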
I was wondering about what i can do with the agent's argparse magic
You mean how to pass arguments to the components of a pipeline? btw did you check the pipeline example here?
None
Sorry, you are correct this is where the json is created:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/feature_extraction_utils.py#L470
The other links are the functions calling it. Make sense?
So the way it will work is that you will also need to have a Task.init in the main process (the one using the launch function) and the same Task.init in the main_func. This signals the subprocesses to use the main process task, so they all report to the same task. Obviously to test it you will need to wait for the RC (after the weekend :)
Is this like a local minio?
What do you have under the sdk/aws/s3 section?
but this gives me an idea, I will try to check if the notebook is considered as trusted, perhaps it isn't and that causes issues?
This is exactly what I was thinking (communication with the jupyter service is done over http, to localhost, sometimes AV/Firewall software will block it, false-positive detection I assume)
ThickDove42 Windows also works 😞
Any specifics on the setup?
OutrageousGiraffe8 so basically replacing it with: self.d1 = ReLU()
WickedGoat98
The webUI will look like the demo server 🙂 https://demoapp.trains.allegro.ai/
2. curl http://server-ip:8008 should return something like: {"meta":{"id":"78a9dc77081348e2930d1f429fd7e092","trx":"78a9dc77081348e2930d1f429fd7e092","endpoint":{"name":"","requested_version":1.0,"actual_version":null},"result_code":400,"result_subcode":0,"result_msg":"Invalid request path /","error_stack":null},"data":{}}
3. curl http://server-ip:8080 should return something like:
` <!d...
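If you would rather script the two curl checks above, something along these lines should do. It is a sketch under a few assumptions: the helper names are made up, the API check only looks for the "meta"/"data" envelope shown in the sample response, and the web check only looks for the "<!d" prefix, since that is all the expected output shows.

```python
import json
import urllib.error
import urllib.request


def fetch_body(url, timeout=5.0):
    """Return the response body as text, including the body of HTTP error responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # If the server replies with an error status, the body is still worth inspecting
        return err.read().decode("utf-8", errors="replace")


def looks_like_api_server(body):
    """True if the body is JSON with the "meta"/"data" envelope."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and "meta" in payload and "data" in payload


def looks_like_web_ui(body):
    """True if the body starts like an HTML document ("<!d...")."""
    return body.lstrip().lower().startswith("<!d")


# Usage (replace server-ip with your server address):
# print(looks_like_api_server(fetch_body("http://server-ip:8008")))
# print(looks_like_web_ui(fetch_body("http://server-ip:8080")))
```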
that is odd..
So if you have 3 agents, how many concurrent experiments are they running? (actually running, not registered as running)
As a hack you can try DEFAULT_VERSION
(it's just a flag and should basically do Store)
EDIT: sorry that won't work 😞
I can't seem to find a difference between the two, why would matplotlib get listed while pandas does not... Any other package that is missing?
BTW: as an immediate "hack", before your Task.init call add the following: Task.add_requirements("pandas")
trains-agent RC (which they tell me will be out tomorrow) will have a switch to do that, just so it is easier 🙂
is "my_package" a local package ?
what is the output of: pip freeze | grep my_package
Hmm, let me check something again.
Hi RotundHedgehog76
I think it should work out of the box, I mean at the end both spin up jupyter notebooks, which is what clearml interacts with. Are you getting any errors?
It seems stuck somewhere in the python path... Can you check at runtime what's in os.environ['PYTHONPATH']?
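Something like this at the top of the script would print it, together with the actual import search path. It uses .get() so it does not raise if the variable is not set at all; the helper name is just for the example.

```python
import os
import sys


def describe_python_path():
    """Collect the PYTHONPATH variable and the effective import search path."""
    raw = os.environ.get("PYTHONPATH", "")
    return {
        "PYTHONPATH": raw,
        # Split on the platform separator (":" on Linux/macOS, ";" on Windows)
        "entries": [p for p in raw.split(os.pathsep) if p],
        "sys.path": list(sys.path),
    }


info = describe_python_path()
print("PYTHONPATH =", repr(info["PYTHONPATH"]))
for entry in info["sys.path"]:
    print("  sys.path:", entry)
```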
Oh, and good job starting your reference with an author that goes early in the alphabetical ordering, lol:
LOL, worst case it would have been C ... 🙂
Hi @<1699955693882183680:profile|UpsetSeaturtle37>
What's your clearml-session version? where is the remote machine ?
And yes, if the network connection is bad we have seen this behavior. You can try with --keepalive=true
Notice that these are SSH networking issues, not something to do with the clearml-session layer. The --keepalive option tries to automatically detect these disconnects and make sure it reconnects for you.