Hi @<1654294828365647872:profile|GorgeousShrimp11> , long story short - you can.
Now to delve into it a bit - you can trigger entire pipeline runs via the API.
I can think of two options off the top of my head. The first is some sort of "service" task running constantly, listening for an event and then triggering pipeline runs.
The second is some external source sending a POST request via the API to trigger a pipeline.
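For the second option, here's a minimal sketch of what the triggering code could look like, assuming you clone an existing pipeline controller task and enqueue the clone (the task ID and queue name are placeholders):

```python
from clearml import Task

# Placeholders - replace with your pipeline controller task ID and target queue
PIPELINE_TEMPLATE_TASK_ID = "<pipeline_controller_task_id>"
QUEUE_NAME = "services"

# Clone the pipeline controller task and enqueue the clone so an agent picks it up
pipeline_run = Task.clone(source_task=PIPELINE_TEMPLATE_TASK_ID, name="Triggered pipeline run")
Task.enqueue(pipeline_run, queue_name=QUEUE_NAME)
```

For the first option, the TriggerScheduler in clearml.automation might also be worth a look.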
What do you think?
Hi @<1722061354531033088:profile|TroubledCamel37> , what do you see in the apiserver logs?
@<1526734383564722176:profile|BoredBat47> , that could indeed be an issue. If the server is still running, things could still be written to the databases, creating conflicts
Hi @<1526734383564722176:profile|BoredBat47> , do you see any errors in the elastic container?
This is exactly what the build command is for. I suggest reviewing the documentation
@<1545216070686609408:profile|EnthusiasticCow4> , I think add_files always generates a new version. I mean, you add files to your dataset, so the version has changed. Does that make sense?
I think as long as they have different hashes you will have two different files
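If it helps, here's a minimal sketch of creating a new dataset version on top of a parent (names and paths are placeholders):

```python
from clearml import Dataset

# Create a new version as a child of the existing dataset
ds = Dataset.create(
    dataset_name="my_dataset",
    dataset_project="my_project",
    parent_datasets=["<parent_dataset_id>"],
)
ds.add_files(path="local_data/")  # only new/changed files (different hashes) are uploaded
ds.upload()
ds.finalize()
```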
Hi @<1749965229388730368:profile|UnevenDeer21> , can you add the log of the job that failed?
Also, note that you can set these arguments from the webUI on the task level itself as well - Execution tab and then the Container section
Hi @<1749965229388730368:profile|UnevenDeer21> , an NFS is one good option. You can also point all agents on the same machine to the same cache folder. Or, just like you suggested, point all workers to the same cache on a mounted NFS
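A sketch of what that could look like in each agent's clearml.conf, assuming the NFS is mounted at the same path on every machine (the path here is a placeholder):

```
sdk {
    storage {
        cache {
            # shared cache folder - e.g. an NFS mount available on all agent machines
            default_base_dir: "/mnt/shared_nfs/clearml_cache"
        }
    }
}
```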
Hi @<1750327622178443264:profile|CleanOwl48> , you need to set the output_uri in Task.init() - for example to True to upload to the files server, or to a string if you want to use S3.
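For example, a minimal sketch (project/task names and the bucket path are placeholders):

```python
from clearml import Task

# Upload output models/artifacts to the ClearML files server
task = Task.init(project_name="examples", task_name="my_task", output_uri=True)

# ...or to an S3 bucket instead:
# task = Task.init(project_name="examples", task_name="my_task", output_uri="s3://my-bucket/models")
```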
You have the open source repository of the documentation - None
I think you could generate a PDF from that with some code.
Hi @<1753589101044436992:profile|ThankfulSeaturtle1> , what sort of materials do you think you're missing?
Hi @<1752501940488507392:profile|SquareMoth4> , you have to bring your own compute. ClearML only acts as a control plane allowing you to manage your compute. Why not use AWS for example as a simple solution?
Hi @<1755038652741718016:profile|LuckyRobin32> , how are you pointing to the folder?
Did you run the code locally first? I don't see the agent installing the packages itself - did you remove that from the log, or how are the packages being installed?
You can add torch to the installed packages section manually to get it running, but I'm curious why it wasn't logged. How did you create the original experiment?
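If you'd rather do it in code than through the webUI, something like this should also work - just note that add_requirements has to be called before Task.init (names here are placeholders):

```python
from clearml import Task

# Make sure torch ends up in the task's installed packages
Task.add_requirements("torch")
task = Task.init(project_name="examples", task_name="my_task")
```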
@<1664079296102141952:profile|DangerousStarfish38> , can you provide logs please?
Hmmm I would guess something would be possible but I'm not sure how to go about it. Maybe @<1523701087100473344:profile|SuccessfulKoala55> or @<1523701994743664640:profile|AppetizingMouse58> can give some more input.
Hi @<1751777178984386560:profile|ConfusedGoat3> , I think you might need to run a migration script on the database, basically changing the registered artifact paths to point at the new IP
I think you'd have to write it yourself. Basically, the artifact paths in experiments are saved in mongo. You would need to write a script that would modify those values in Mongo directly
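Very roughly, something along these lines - but treat it strictly as a sketch: the database/collection/field names ("backend", "task", "execution.artifacts", "uri") are assumptions that can differ between server versions, so verify them against your own MongoDB (and back it up first):

```python
from pymongo import MongoClient

OLD_PREFIX = "http://old-ip:8081"  # placeholder - old files server address
NEW_PREFIX = "http://new-ip:8081"  # placeholder - new files server address

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
tasks = client["backend"]["task"]  # assumed database/collection names

for doc in tasks.find({"execution.artifacts": {"$exists": True}}):
    artifacts = doc["execution"]["artifacts"]
    items = artifacts.values() if isinstance(artifacts, dict) else artifacts
    changed = False
    for art in items:
        uri = art.get("uri", "")
        if uri.startswith(OLD_PREFIX):
            art["uri"] = NEW_PREFIX + uri[len(OLD_PREFIX):]
            changed = True
    if changed:
        tasks.update_one({"_id": doc["_id"]}, {"$set": {"execution.artifacts": artifacts}})
```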
I think this is what you're looking for - None
Hi JitteryCoyote63 ,
Regarding Edit 2: This seems like a nice idea.
Regarding adding option to only stop them - Please open a feature request on GitHub 🙂
Hi JitteryCoyote63 , can I assume you can ssh into the machine directly?
@<1523701977094033408:profile|FriendlyElk26> , try upgrading to the latest version - I think it should be fixed there
ClearML only keeps links to the data, so if you simply put the copy you have in the same paths, everything will work as before
Also, what if you try using only one GPU with pytorch-lightning? Still nothing is reported - i.e. console/scalars?
Hi PanickyMoth78 ,
What version of ClearML are you using?
Strange, I'm not familiar with the tensorboard_logger package. I see its latest release on PyPI is also 0.1.0, with the latest supported Python being 3.5.
Scalars are usually reported and auto-captured through SummaryWriter if I'm not mistaken. I found an example here:
https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_tensorboard.py
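Along the lines of that example, a minimal sketch of reporting scalars through SummaryWriter so ClearML auto-captures them (project/task names are placeholders):

```python
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

task = Task.init(project_name="examples", task_name="tensorboard_scalars")
writer = SummaryWriter("runs")

# Anything reported through SummaryWriter should be auto-captured by ClearML
for step in range(10):
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)
writer.close()
```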
Anyhow I'll take a look into it 🙂
Is it possible that you don't have permissions for deletion on that Azure account with your credentials?