Yes, actually that might be it. Here is how it works:
It launches a thread in the background to do all the analysis of the repository, extracting all the packages.
If the process ends (for any reason), it will give the background thread 10 seconds to finish and then give up. If the repository is big, the analysis can take longer, and it will quit before it is done.
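To illustrate the pattern (just a generic sketch of the behavior described above, not ClearML's actual implementation):

```python
import threading
import time

def analyze_repository():
    # placeholder for the long-running package/requirements analysis
    time.sleep(30)  # pretend the repository is large

# run the analysis in the background so the main process is not blocked
worker = threading.Thread(target=analyze_repository, daemon=True)
worker.start()

# ... main process runs and eventually exits ...

# at shutdown, wait at most 10 seconds for the analysis to finish
worker.join(timeout=10)
if worker.is_alive():
    print("Repository analysis timed out, giving up")  # results are dropped
```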
JitteryCoyote63 oh dear, let me see if we can reproduce (version 1.4 is already in internal testing, I want to verify this was fixed)
How are you getting:
beautifulsoup4 @ file:///croot/beautifulsoup4-split_1681493039619/work
Is this what you had in the original manual execution? (i.e. not the one executed by the agent) - you can also look under the "org_pip" dropdown in the "installed packages" of the failed Task
Back to the feature request: if this is taken care of (both adding a missed package and the S3 upload), do you still believe there is room for this kind of feature?
In the sidebar you get the titles of the graphs, then when you click on them you can see the different series on the graphs themselves
I'm assuming those errors are from the Triton containers? Were you able to run the simple PyTorch MNIST serving example from the repo?
ShaggyHare67 could you send the console log that trains-agent outputs when you run it?
Now the trains-agent is running my code but it is unable to import trains
Do you have the package "trains" listed under "installed packages" in your experiment?
1e876021bbef49a291d66ac9a2270705
just make sure you reset it 🙂
I remember there were some issues with it ...
I hope not 🙂 Anyhow, the only thing that does matter is the auto_connect arguments (meaning if you want to disable some of the automatic logging, you should pass them when calling Task.init)
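For example, a minimal sketch of disabling a couple of the automatic connections at init time (project/task names are placeholders; pick whichever auto_connect arguments match what you want to turn off):

```python
from clearml import Task

# disable automatic framework hooks and argparse capture for this task;
# anything not listed keeps its default (enabled) behavior
task = Task.init(
    project_name="examples",
    task_name="manual logging only",
    auto_connect_frameworks=False,   # no automatic matplotlib/torch/tf logging
    auto_connect_arg_parser=False,   # do not capture argparse arguments
)
```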
Hi SubstantialBaldeagle49
Yes, you can back up the entire trains-server (see the GitHub docs on how). You mean upgrading the server? Yes. You can change the name or add comments (Info tab / description), and you can add key/value descriptions (under the Configuration tab, see User Properties)
So in summary: subprocess calls appear to break ClearML tracking, even if I do Task.init() in both main.py and train.py.
Okay let me see if we can reproduce & fix this, it should not be long
I managed to set up my (Windows) laptop as a worker and reproduce the issue.
Any insight on how we can reproduce the issue?
I see it's a plotly plot, even though I report a matplotlib one
ClearML tries to convert matplotlib figures into plotly objects so they are interactive; if that fails, it falls back to a static image, as in matplotlib
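If you want to report the figure explicitly (rather than relying on the automatic capture), a minimal sketch (title/series/values are placeholders):

```python
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name="examples", task_name="matplotlib report")

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [10, 20, 15])

# ClearML will try to convert the figure to an interactive plotly chart,
# falling back to a static image if the conversion fails
task.get_logger().report_matplotlib_figure(
    title="my plot", series="series A", iteration=0, figure=fig
)
```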
Hi SubstantialBaldeagle49
2. Sure, follow the backup procedure and restore on the new server
3. Yes
from clearml import Task
task = Task.get_task(task_id='aa')  # 'aa' is a placeholder task id
task.get_logger().report_scalar(title='loss', series='train', value=0.5, iteration=0)
That speed depends on model sizes, right?
in general yes
Hope that makes sense. This would not work under heavy load, but e.g. we have models that are used only once a week. They would just stay unloaded until use - and could be offloaded afterwards.
but then you still might encounter timeout the first time you access them, no?
No, it is zipped and stored, so in order to open the zipfile and read the files you have to download them.
That said, everything is cached, so if the machine already downloaded the dataset there is zero download / unzipping.
Make sense?
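Concretely, a small sketch of the cached access pattern (dataset project/name are placeholders):

```python
from clearml import Dataset

# the first call downloads and unzips the dataset into the local cache;
# later calls on the same machine reuse the cached copy (no download)
dataset = Dataset.get(dataset_project="examples", dataset_name="my_dataset")
local_path = dataset.get_local_copy()
print(local_path)  # path to the cached folder with the extracted files
```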
@<1523707653782507520:profile|MelancholyElk85> I just ran a single-step pipeline and it seemed to use the "base_task_id" without cloning it...
Any insight on how to reproduce ?
Thanks PompousBeetle71
Quick question, what frameworks are you using?
Do you use the save method directly to a file stream (or any other direct storage)?
The easiest would be as an artifact (I think).
Let's assume you put it into a csv file (with pandas or manually).
To upload (from the pipeline Task itself):
task.upload_artifact(name='summary', artifact_object='~/my/summary.csv')
Then if you want to grab it from anywhere else:
task = Task.get_task(task_id='HPO controller Task id here')
my_csv = task.artifacts['summary'].get_local_copy()
If you want to store as dict it might be even easier:
task.upload_artifact(name='summary', artifa...
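Completing that thought with a sketch (assuming the same upload_artifact API; the dict contents are placeholders):

```python
# upload a plain dict as the artifact; ClearML serializes it for you
task.upload_artifact(name='summary', artifact_object={'best_accuracy': 0.93, 'best_lr': 0.001})

# later, from anywhere else:
from clearml import Task
controller = Task.get_task(task_id='HPO controller Task id here')
summary = controller.artifacts['summary'].get()  # returns the dict
```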
Could be nice to write some automation
Sounds good, I assumed that was the case but I was not sure.
Let's make sure that in the clearml.conf we mention it in the comment above the use_credentials_chain option, so that when users look for the IAM role configuration they can quickly search for it 🙂
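Something along these lines (a sketch of the comment placement, assuming the usual sdk.aws.s3 section of clearml.conf):

```
sdk {
    aws {
        s3 {
            # To authenticate via an IAM role (e.g. on EC2), enable the AWS
            # default credentials chain instead of setting key/secret here
            use_credentials_chain: false
        }
    }
}
```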
I think I found something, let me test my theory
The trend step artifact is used to keep track of the time of the data so we know the expected trend of the input data. For example, on the first data point, where trend_step = 1, the trend value is 10; then if trend_step = 10 (the tenth data point), our regressor will predict the trend value for the selected trend_step. This method is still being researched to make it more efficient so it doesn't need to upload an artifact on every request
Makes sense! I would suggest you add a GitHub issue with the feature request ...
strange ...
mostly by using Task.create instead of Task.init.
UnevenDolphin73, now I'm confused. Task.create is not meant to be used as a replacement for Task.init; it is there so you can manually create an additional Task (not the current process's Task). How are you using it?
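To make the distinction concrete, a minimal sketch (project/task names are placeholders):

```python
from clearml import Task

# Task.init attaches the *current* process to a task and enables auto-logging
current = Task.init(project_name="examples", task_name="my experiment")

# Task.create only registers a new, separate task entry on the server;
# it does not track the currently running process
extra = Task.create(project_name="examples", task_name="task to run later")
```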
Regarding the second - I'm not doing anything per se. I'm running in offline mode and I'm trying to create a dataset, and this is the error I get...
I think the main thing we need to...
Sure thing, and I agree it seems unlikely to be an issue 🙂
Hi IntriguedRat44
Sorry, I missed this message...
I'm assuming you are running in manual mode (i.e. not through the agent); in that case we do not change CUDA_VISIBLE_DEVICES.
What do you see in the resource monitoring? Is it a single GPU or multiple GPUs?
(Check the :monitor:gpu plots in the Scalars tab under Results.)
Also, what's the Trains/ClearML version you are using, and the OS?
Clearml 1.13.1
Could you try the latest (1.16.2)? I remember there was a fix specific to Datasets