<_niklas> "x liked your post and added: <comment>"
antomati_ is now known as antomatic
<erin> turns out tmux does not like creating 100 panes
<mtntmnky> yeah, I think writing the media URLs out to a text file is a better way
<_niklas> (those comments don't show up anywhere else. usually they're short though)
<teej_> _niklas: So those are essentially responses? And comments?
<_niklas> it's _one_ way of responding to posts
<_niklas> generally if you're serious about it you'd reblog instead, letting you respond with a full blog post and showing your response to your own followers
<erin> you can respond to posts by reblogging, commenting, or private messaging
<erin> tumblr is
<erin> really weird in terms of interaction dynamics
<teej_> Oh. Were the notes taking a long time to archive for everyone?
<_niklas> yes
<teej_> I see.
<pnJay> impossibly long
<marked> most of the notes are reblogs and likes, which is like what FB has, but FB compacts them into a like count number
<_niklas> because you can only get 50 per request
<cslashacc> notes are yuge
<JAA> There can be millions of notes on a post.
<pnJay> Flashfire: yeah, this is our warmup for the whole site I guess.
<JAA> On popular blogs, that is.
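A rough sketch of why notes dominate crawl time, using the 50-notes-per-request limit and the "millions of notes" figure mentioned above. The one-second-per-request latency is a made-up placeholder, not a measured value, and the helper names are hypothetical.

```python
import math

NOTES_PER_REQUEST = 50  # page size quoted in the chat


def requests_needed(note_count, per_request=NOTES_PER_REQUEST):
    """Number of paginated requests needed to fetch every note on one post."""
    return math.ceil(note_count / per_request)


def hours_needed(note_count, seconds_per_request):
    """Serial fetch time in hours; seconds_per_request is an assumed latency."""
    return requests_needed(note_count) * seconds_per_request / 3600


if __name__ == "__main__":
    for notes in (500, 100_000, 1_000_000):
        print(notes, requests_needed(notes), round(hours_needed(notes, 1.0), 2))
```

Under those assumptions a single post with a million notes needs 20,000 sequential requests, which is consistent with notes alone taking many hours on popular blogs.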
<cslashacc> 4 days boys
<cslashacc> n grils
<marked> so it's the same as FB but designed it well, and their comments are threaded.
<JAA> Yup, arkiver mentioned earlier that we'll probably just continue after the adult content is gone.
<teej_> So I'm assuming that the jobs will run much faster now,
<erin> teej_: we saw blogs where the notes alone took >24h
<marked> so practically, notes on tumblr are a way to find links to other users
<cslashacc> @JAA how to continue after they are gone? doesnt that mean we cant archive?
<pnJay> JAA: thats very exciting cause I get new infrastructure end of the month :D
<JAA> cslashacc: Continue with the rest of Tumblr, i.e. non-adult blogs.
<teej_> But I have at least 20 blogs currently downloading. Should I stop them?
<cslashacc> ah I see
<marked> Hm, VZ bought a failure, so is that news ?
<Flashfire> So should I start adding non adult blogs to that google form
<pnJay> I think at that point Flashfire we just do *.tumblr.com
<pnJay> o_o
<marked> right now it's priority queue, most likely to disappear
<teej_> I think we should stick to the ones that will get removed soon, first. Then we can do the rest.
<JAA> We can think about how to do that after the adult material is gone. There should be a fair amount of time before Tumblr goes down entirely.
<JAA> But until at least Monday, let's just focus on the adult blogs.
<marked> this is my longest notes example http://squirrel.tumblr.com/post/138861850930
<pnJay> Is there a way to setup a ping if we get a new update, I have scripts running and dont want those boxes to idle
<caff> yeah, priority is adult then I'd say we do art blogs since they are at a very high risk of losing content as well
<_niklas> speaking of "after", are there plans to give people an easy tool to submit their own blogs to the archive?
<_niklas> for when they go private but not yet deleted
<JAA> pnJay: Watch the GitHub repo I guess?
<teej_> That's confusing. Wouldn't they be deleted afterwards if they are NSFW?
<JAA> Nothing gets deleted for now, but the adult blogs won't be publicly viewable after Monday.
<JAA> The owners can still see them and have more time to download everything presumably.
<Jens> Uploading to fos is slow :/
<SketchCow> How are thiiiiing
<SketchCow> s
<pnJay> Much faster.
<marked> energy was high before the code push, then people got quiet after
<pnJay> Thats because the terminals are scrolling fast and everyone is now content.
<boutique> computer go vroom
<diggan> indeed, items/hour seems to be going up
<caff> yeah, less chat more downloading
<caff> we're past "lend us your strength" and into actually using it
<pnJay> Cutting was heated, not taken lightly, but I think we did the right thing
<pnJay> like Mr Scott said, hard decisions time
<erin> were we going to do a tracker reset now that we have notes cut
<marked> jobs will still spiral. it's not actually us, but tumblr allows stylesheets that are crazy and outdated
<caff> I'd rather have half notes half no notes than breaking everything
<Jens> <100 KB/s upload speed.
RavenWork has joined #tumbledown
<erin> i'm getting 1.337MBs up / ~20MBs down from a 45GB DO droplet
<diggan> Jens: probably on your end, I've topped 30 MB/s
<Jens> I've got ~60 mbit upload.
<RavenWork> My warrior is now on day three of a single tumblr blog "item", and it seems to have just hit page 26 out of 510 for that blog
<diggan> RavenWork: which version are you running? Is it currently fetching notes?
<RavenWork> yes
<Jens> What's the fos address? I wanna do a traceroute.
<RavenWork> I don't know how to tell what version, but it's fetching notes
<erin> chauffer: how did you end up setting up your k8s stuff, and how did it end up working?
<JAA> Jens: fos.textfiles.com?
<marked> notes was only cut on the last push
<Jens> JAA: Right, thanks.
<diggan> RavenWork: was an update a while ago (hour maybe) that started skipping the notes
<erin> now that i have one big stable box i'm tempted to look into diversifying onto GCP
<RavenWork> how do I update my warrior
<_niklas> might want to update. in any case, "26/510" doesn't mean much, it's not entirely first page -> last page
<diggan> should auto-update, but if it's a running job, it's probably gonna finish it first
<RavenWork> right, but my point is, at this rate I'm not sure this blog will finish before the hammer comes down
<_niklas> if it's a warrior (not manually setup scripts), just force shutdown then start again
<erine> I got too many jobs on the old version :(
<marked> I would suggest restarting jobs that look notes dominated
<diggan> yeah, shutdown the warrior and start it again, will fetch the latest on start
<RavenWork> does it seem worth it even though I'll also lose another item that's been going for 10 hours
<RavenWork> (and is also collecting notes)
<caff> I wonder, were all the current out jobs reset with note pruning?
<erin> caff: i was wondering that too
<RavenWork> (are half-finished jobs sent back at all or just lost forever?)
<erin> they have to be manually reset
<marked> I'd prefer using a RBL like email servers do, but that would violate the property of versioning unless the version autoincrements from the RBL version
<marked> ^Real Time Blacklist
<RavenWork> I guess I have to do "stop immediately" or else these items will keep running for days more, right?
<erin> lol i've got 20GB down in the last 10 minutes
<pnJay> RavenWork: correct, and don't worry, we recycle lost jobs
<marked> it's been known to happen. there's no reason to think the rest of that blog is any better than the first part of it.
<erin> one big server is working soooooo much better than tons of smol ones
<boutique> diggan: an awful lot of your items seem to be coming in with 0MB, is that a problem?
<Jens> Trace to fos doesn't look bad, really.
<diggan> boutique: no, happens when I have to restart the warrior, doing changes in the infra...
<boutique> ah! that makes sense.
<diggan> after a while they all start working on bigger jobs, but in the beginning there seems to be a lot of ~0MB jobs
<erine> What's a good concurrency for a big box with a few hundred GB free?
<erin> > few hundred GB
<erin> disk?
<erine> ye
<erin> hmmmmmmm, how much RAM?
<erine> I've been running two C15 warriors on each box
<erine> 32GB
<erine> two boxes
<teej_> I'm going to stop most of my jobs and fetch the latest repo change. Is that fine?
<RavenWork> ok, it's rebooted and seems to be skipping the notes this time, thanks
<Jens> teej_: Yes.
<teej_> Will we add notes back after the NSFW blogs are done/gone?
<pnJay> RavenWork: thanks for running a warrior :D
<erine> gluster is cool for slapping in disk nodes with VMs with an unlimited internal bandwidth limit
<RavenWork> out of curiosity -- does this script skip images that it knows another blog has already picked up (i.e. reblogs) or is it grabbing everything and sorting out the redundancy later?
<erine> Just don't use striped since that makes the warriors D lock for some reason
<pnJay> everything and we dedup later
<teej_> So sad that I had 101k requests for 20ish blogs each...
<RavenWork> fair enough!
<marked> it's grabbing images, no small avatars, no videos
<teej_> They're never ending.
<caff> I think notes just take up way too much time and aren't priority vs actually getting content
<diggan> getting 141MB/s in total now! https://i.imgur.com/jCMC3D7.png
<diggan> let's see that items/hour raaaaiiiisseeee!
RavenWork has quit [Leaving]
<caff> I've got 5 warriors with 4 crawls each
<caff> modded to a gig of ram
<boutique> diggan: whereabouts are you hosted at?
schnits has quit [Read error: Connection reset by peer]
<erine> woah, is that grafana loader script open source?
<foureyes> oh dear... 11 jobs with almost 200GB of uncommitted data, running since 3,5 days - sit and wait or rip it up and start again?
<erin> ugh, i had to tear everything on my big box down and start again (config glitch) but otherwise, holy _fuck_
<erin> this is working so much better than before
<diggan> erine: yeah, posted before, look for pastebin links
<JAA> Indeed, a lot better.
<diggan> foureyes: Is it fetching notes? was an update a while ago (hour maybe) that started skipping the notes. should auto-update, but if it's a running job, it's probably gonna finish it first. shutdown the warrior and start it again, will fetch the latest code on start
kbtoo_ has joined #tumbledown
<erin> *slaps top of server* this bad boy can fit so many concurrencies in it
<_niklas> is that megabytes per second?
<_niklas> with 30 concurrencies total?
<erin> _niklas: 200 concurrency on a /big/ machine
<erin> 1T disk, 48GiB ram
<pnJay> I have 3 DO VPS running 200 concurrent per box
<erin> that's what i'm doing, yeah
<foureyes> diggan: lots of notes, but also still fetching posts, unfortunately
<erin> it's not actually 200 concurrent
<erin> it's um
<erin> 2 concurrent per worker
<_niklas> oh I confused you with someone else
<erin> and i'm running 100 workers
<erin> ah! haha
<pnJay> Yeah what Erin said
<pnJay> I’m doing that as well. XD
<erin> it was based on trvz's recommendations
<pnJay> Omg twinsies.
<erin> haha!
<erin> i really wish i could run this on our work cluster at uni
<_niklas> I kiiinda wanna go 200 on this shiny new box I just ordered but I only have 500g disk
<pnJay> I was running 2 instances on our work hyper-v cluster but weirdly it was killing our gateway cpu
<erin> every server has like, 2-16 TiBs of locally-mounted /scratch/, plus an attachment to hundreds of TiBs of glusterfs
<erin> all the compute machines have 128-288GiB of ram
eprillios has joined #tumbledown
<erin> it's like, if i wanted to download porn fast? that's what i'd use
<erin> that would also definitely get me kicked out of uni
<pnJay> I have always wanted to learn gluster. I will harass you about that later if that’s okay. XD
<Flashfire> Run BOINC for a while then when a SFW warrior project comes up then run that
<Flashfire> URLteam is always looking for more people
<marked> is URLteam ever "done"
<JAA> Nope
<pnJay> Tracker IP limits me on url team
<Flashfire> Not that I would think
<JAA> Yeah, URLTeam needs *lots* of IPs, but nearly zero resources.
<eprillios> Heya all. Copying this from #warrior. I am running a Warrior instance, currently on the Tumblr project. A few successive power outages occurred, which have resulted in jobs being lost. Those outstanding claims would need to be freed. My name on the tracker is the same.
<marked> where'd their name go ?
<JAA> arkiver, Kaz, kiska: ^ (Also, I can haz access?)
<erin> pnJay: i have NO CLUE how gluster works
<zeraT> while we're on the subject, in 5 minutes or so here an upload of mine will finish and then I'm going to restart my warrior with the current patch
<erin> i just put data on /mnt/gluster and everything is magic :p
<arkiver> JAA: you have access now
<arkiver> to the tracker
* arkiver is off to bed
<JAA> Thanks.
<Fusl> does someone have the link to the google form?
<JAA> eprillios: I released your claims.
<eprillios> Thanks!
<erin> Fusl: for future reference, it's the goo.gl link in the /title
<zeraT> there's something ironic about using a link shortener to fit it in the title
<marked> should we save that for URLteam?
<JAA> goo.gl will be scraped from April on.
<JAA> When no new links will get created anymore.
<mbp> almost forgot about that one
<Fusl> oh nice
<Fusl> zeraT, erin: thanks
<Fusl> JAA: i smell hard rate limiting by google recaptcha :'(
<zeraT> JAA while you're at it I'm restarting now for the new patch, if you would like to clear my claims also, now would be the perfect moment
<JAA> Fusl: Yeah, it'll be a PITA most likely.
<marked> would it make sense for restarting clients to pick up their last job assignment?
<JAA> zeraT: All of your claims? 18 currently.
<JAA> I see some claims from just a few minutes ago.
<zeraT> yeah I think there's a few from a poweroutage yesterday also
<zeraT> I just finished uploading, no work is actually lost
<zeraT> it's not a huge contribution but I'm proud of my 30GB
<JAA> So you're sure nothing is running anymore, not even the stuff that was claimed 5 minutes ago?
<zeraT> the warrior is off
<JAA> Ok, released.
<zeraT> thanks, I'll grab that patch now
<kvikende> i know i have some lost ones, ran out of disk space so i remade my warrior with larger diskspace... hemicoupe and oddjuice definitely was lost probably some others i cant recall
<zeraT> turns out running into a yahoo dumpster fire to rescue everything was a messy process, who knew
<zeraT> looks like I'm back up and running, thanks much
<JAA> Yeah, weird, right?
<Fusl> 2009
<Fusl> [Yahoo!] found the way to destroy the most massive amount of history in the shortest amount of time with absolutely no recourse
<Fusl> 2018
<Fusl> [Yahoo!] found the way to destroy the most massive amount of history in the shortest amount of time with absolutely no recourse
<zeraT> I'd recognize that quote anywhere
<JAA> 2019 also.
<Fusl> oh god no
<Fusl> what are they going to kill in 2019?
<JAA> GeoCities Japan
<JAA> Or is that not Yahoo JP?
<zeraT> can't they just hurry up and go bankrupt smielyface
<kvikende> im just glad i didnt lose sexualconfessions because it seems like a place where teens could get proper sexed judging by the titles
<mtntmnky> yahoo answers in 2019 probably
<JAA> Yeah it is.
<JAA> We did save a good part of Yahoo Answers a while ago.
<JAA> Might need to revisit that project sometime.
<erin> JAA: while you're at it, can you release all my (`iwxzr`) claims from before 6:26 PM CST today?
<erin> there's probably a decent number of dead ones matching those criteria
<zeraT> if I've learned anything from watching the output of the console on this project, it's that people are really into all kinds of weird shit
<urjaman> :D
rpl has joined #tumbledown
<kvikende> hehe, look at the blog names on the tracker
<kvikende> so much weird stuff XD
<Frinkel> So quick question, I'm running four instances of the script outside of the archive warrior thing, is there any way i can make the scripts autoupdate while they're running that doesn't involve shutting the instance down, pulling updates from git, and starting again?
<JAA> erin: Hrm, yeah, there's a good number of them, but I don't think it's possible to filter by both username and date, and I don't feel like clicking on a hundred buttons one at a time.
<erin> JAA: agh, i'm sorry!
<erine> I have a feeling my graph should not look like this! https://i.imgur.com/ZoCeGys.jpg
<JAA> No worries, we'll requeue them at some point.
<kvikende> i dont think so frinkel, but i might be wrong
<JAA> Frinkel: Nope, not possible.
<kvikende> we love jesus?
<Frinkel> darn. well, maybe it's time to see if i can bind docker images to other IPs than the default, that's honestly why i was running all of the scripts
<erine> Internal group meme on my shared grafana, haha
n00b709 has joined #tumbledown
n00b709 has quit [Client Quit]
<kvikende> jaa, could you release hemicoupe and oddjuice from kvik131? i know those were definitely lost :/
<JAA> kvikende: Done.
<kvikende> thanks :D btw thanks for the monitor script. works wonders on my poor rpi doing its best
<JAA> :-)
<JAA> I'm glad you find it useful.
<erine> Where is this monitor script?
boutique has quit [Quit: Leaving]
<erin> the monitor script is indeed wonderful
kbtoo has quit [Read error: Connection reset by peer]
<mbp> that is really good
<mbp> thanks
BrickGras has quit [Read error: Operation timed out]
<mbp> is there a way to run it against a warrior?
<JAA> No idea.
<JAA> You'd have to log into the warrior somehow.
<JAA> I think that's possible, but I've never used the VM, so I don't know how it works.
<mbp> i have asked before whether there is a way to ssh into the vm
boutique has joined #tumbledown
<mbp> you should have just said RTFM :)
kbtoo has joined #tumbledown
<kvikende> reading is hard tho
<marked> RTFWiki . The wiki is so friendly and wants your love
<kvikende> bet thats why it calls me princess
<Frinkel> hot single wikis in your area
<Frinkel> looking for reads
<Frinkel> ...i'll see myself out
<kbtoo> Apparently my computer doesn't like running vbox in the background that's the second time this computer has crashed since i've been running it real trippy looking graphics error tho.
<boutique> hot single wikis in your vents
<kvikende> thats weird kbtoo
<marked> touch .me
<kvikende> push .me
<boutique> get: satisfaction
<zeraT> looks like the items/hour is already noticeably improved
upshift has joined #tumbledown
<mbp> the VM has wget, the VM doesnt have bash
<_niklas> I added a new box with 150 total concurrency 20 mins ago, I would hope mine goes up :^)
<mwfc_> how much concurrency on one ip address makes sense in this case?
<kvikende> i think the vm runs busybox so ash
<_niklas> we're not running into ratelimit issues
<Frinkel> wait so 6 isn't the max concurrency for a single instance?
<_niklas> 6 is the max for a warrior
<mwfc_> _niklas: good :) i spin up 50 too
<Frinkel> ohhhh, so if you're running outside of a warrior, you can go hogwild
<_niklas> you can go above 6 with the raw scripts or by running more than one warrior (preferred)
<Frinkel> is there a way to run more than one warrior on the same IP? it seems to get a bit angry when i try to do that
<JAA> mbp: Well yeah, the VM is very minimal, so that's not too surprising. You might be able to get my script to run on sh/dash/whatever, but it definitely needs some changes for that to work. Also, some of the other dependencies might be missing.
<_niklas> start multiple VMs/containers
<rpl> is the overall crawl being throttled on tumblr's side? it seems like there are quite distinct pauses on fetches even for small pages on the warrior-run wget.
<mbp> i just installed bash
xzcvb12 has joined #tumbledown
<zeraT> just had this strange task happen, here's a paste if there's some debugging to do with it https://pastebin.com/VY87GazB
<_niklas> rpl: probably more down to parsing
<rpl> (like if I run a manual crawl on my workstation using an authenticated client, it goes dozens of times faster)
<mbp> but ill have to adjust the script a little bit more
<marked> I have had a persistent connection before that was running slow. but when I opened a parallel connection the new one was back to normal speed.
<rpl> _niklas: no way, it does not take measurable time to parse a web page for URLs..
<_niklas> if they were seriously throttling, roflscaling your concurrent items wouldn't help, but it evidently does
<rpl> yeah, I know. that's what I'm trying to figure out. I have approached it by just running lots in parallel but .. _something_ is making it far slower than need be.
<rpl> on a per-wget basis
<rpl> they're taking hours-to-days to crawl just a few tens of gb per item
<rpl> they should be cruising through that in minutes
<_niklas> yeah, I agree it's odd. but I don't know the stack in play
<rpl> mhm
<kvikende> been wondering if wget does some throttling internally or something
<_niklas> nah
<marked> I'm going to ask for volunteers for benchmarking after I write a way to switch from dev to test
<_niklas> on a fast connection you'll find that media downloads actually scroll past quicker than html
<kvikende> that i have noticed. jpgs are superquick
<rpl> much
<rpl> which is weird
<_niklas> well, they're on cdn
<upshift> I have the default 2 concurrent items on a warrior vm, but it has been wgetting urls from the same 2 items for about the past 10 hours; is that normal? It's up to 23000 urls on one and 28000 on the other.
<_niklas> entirely normal, unfortunately
<zeraT> upshift some blogs are huge, some run for days unfortunately
<kvikende> some blogs are really large, think 8000+ pages. takes a long time to crawl
<upshift> ok, I think it's downloaded something like 13 GB
<zeraT> that wouldn't be unheard of
josho4933 has joined #tumbledown
<kvikende> i have a job running for over 1 day XD
<teej_> diggan: Where do you get your metrics page from? That's very nice.
<rpl> it's not the CDN-ness. we're just not turning around from each fetch very fast.
<SketchCow> Looks pretty fast
<JAA> I have a job running for 3 days, 10 hours. It seems to be getting close to finishing though, finally.
<rpl> I'm throwing a fairly beefy server machine at it in a datacenter with real bandwidth, and even at 40-way concurrency it's not even keeping up with a 5yo laptop running a different (non wget-lua) tumblr-grabber on a consumer connection
<rpl> something's seriously off
<marked> did warrior restart at once? if so it's normal to have a spike as small jobs come in before the large ones.
<kvikende> been wondering whether invoking wget is the bottleneck but i have no way of proving that .. i could theoretically try to replace wget with one of the python scrapers or something but thats a bit too much effort to test a hypothesis :P
<foureyes> rsyncing a finished ~600MB warc to fos.textfiles.com hasn't finished in 50 minutes :-/
<marked> join my benchmarking project
<rpl> profiling says it's lua: https://pastebin.com/34L1mCsR
<marked> or maybe it's done before it's started
<marked> does overhead equal wall time? I can't tell if this takes into account io wait
<marked> ^blocking
<kiska> Holy scrollback, 1.5k messages
<JAA> Lua match() should have zero iowait though.
<JAA> But I'm not surprised that it's taking long.
<marked> is this the same scrollback SketchCow talks about?
<marked> there's so many lua string match options it's not obvious which are faster than which
<JAA> kiska is talking about how much was posted in this channel. SketchCow is talking about how fast things scroll by on the tracker.
<mtntmnky> it might be the image matching that's causing problems
<mtntmnky> I should be able to get started on a fix for that in a bit
<kvikende> time to nest them XD
<SketchCow> Tumblr is now at: 459,450,268 archived URLs
<SketchCow> (This is what has gone through FOS and been uploaded and ingested by wayback)
<rpl> marked: that's not iowait, just cpu time
<mbp> JAA you could add a very very inaccurate "estimated time of completion" column to the script :p
<JAA> mbp: Hell no. :-D
<marked> you mean, make up numbers ?
<erine> Modified version of the Prometheus loader that dumps output messages to Redis! https://pastebin.com/pikuKGrH https://imgur.com/a/VV2HZzZ
<JAA> Yeah, I could put a 'cat /dev/urandom' there.
<Flashfire> Holy shit
<marked> #1, #2, #4, #5 all have the word match in it
<JAA> matchbracketclass sounds like it's [abc], so maybe we should try to get rid of that if possible.
<marked> If I mention people's handle, they won't get paged will they?
<JAA> I'd like to know how the hell PyEval_EvalFrameEx ended up in that perf report though.
<_niklas> has anybody investigated just dropping in luajit
<marked> I was told to switch my [%d] to %d
<marked> voltagex_ and I have a 1.20 wget-lua tree, it compiles but nobody's tested let alone benchmarked it yet
<marked> oh I thought how to test this. still have to write a dev2test patch
boutique has quit [Quit: Leaving]
<marked> do a crawl without lua. do a parse without the network.
horkermon has quit [Leaving]
<marked> kinda like a replay attack
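A minimal sketch of the "parse without the network" half of that idea: time URL extraction over a page already held in memory, so any slowness measured is CPU-only. This is a Python analogue for illustration only; the project's real matching runs as Lua patterns inside wget-lua, and the synthetic page, regex, and function names here are all assumptions.

```python
import re
import time

# Synthetic page standing in for a saved response body (hypothetical content).
PAGE = "".join(
    '<a href="https://example.tumblr.com/post/%d">post</a>\n' % n
    for n in range(20000)
)

# Illustrative pattern; not the one used by the actual grab script.
POST_URL = re.compile(r'href="(https://[^"]+/post/\d+)"')


def extract_post_urls(html):
    """Return every post URL found in the given HTML string."""
    return POST_URL.findall(html)


def time_parse(html, repeats=5):
    """Best-of-N wall time in seconds for parsing only; no network involved."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        extract_post_urls(html)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    urls = extract_post_urls(PAGE)
    print(len(urls), "URLs, best parse time %.4fs" % time_parse(PAGE))
```

Comparing a number like this against the wall time of a live fetch of the same page would show how much of each request is spent waiting on the server rather than matching.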
horkermon has joined #tumbledown
<_niklas> is there a one-liner to benchmark an item? I would like to do a real quick unscientific test of liblua vs luajit
<marked> well, allowed() is our function instead of a callback
<marked> and has a bunch of matches
xzcvb12 has quit [http://www.mibbit.com ajax IRC Client]
<rpl> yeah I am checking lua-vs-non
<rpl> I think the lua might be a red herring. like it's taking time but we're just totally swamped in iowait.
<rpl> tumblr is not feeding most pages back in a very timely fashion
<_niklas> hm
<marked> so my proposed benchmark is completion time vs network latency
<_niklas> how do your crawler's requests differ rpl?
<kvikende> it seems to be quick if i visit the blog in my regular webbrowser
<marked> australia versus NY
<rpl> _niklas: unsure, digging.
<erine> and with the redis watcher mod, combined output watching is a thing now for multiple warriors. https://pastebin.com/LixCpGfQ https://i.imgur.com/13L3Hjr.png
<josho4933> Is there an estimate of how many blogs remain to be crawled?
<marked> they have 10million daily active users
<JAA> rpl: Does your other crawler have parallelism?
<voltagex_> rpl: benchmark against warcs if possible
<voltagex_> rpl: take network out of the equation
<_niklas> OHHH
<_niklas> I haven't taken enough samples to be sure
<_niklas> but I think it's throttling spiders
<_niklas> and we're identifying as googlebot
marked2go has quit [Read error: Operation timed out]
<rpl> yeah, I think so
<rpl> (googlebot)
Darkeneto has joined #tumbledown
<rpl> voltagex_: how would I benchmark against a warc?
superKB has joined #tumbledown
<JAA> That sounds plausible.
<marked> I'll test that
<voltagex_> rpl: unknown, perhaps there's a http proxy that can be fed a warc
<rpl> oh, ok
<voltagex_> marked: remember we hit GDPR and login issues when we're not Googlebot
<JAA> Yeah right, we can't get rid of the Googlebot UA easily.
<erine> JAA: Just made a terrible mistake. Could all tasks assigned to me be put back except for tasks assigned to IP
<JAA> We'd have to switch to login-based crawls I assume.
<rpl> yeah I was running other crawler under an actual account
<mtntmnky> I wonder how it does the throttling
<mtntmnky> we could have pipeline.py pick from random user-agents
<mtntmnky> like maybe it will allow BingBot too
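A minimal sketch of the random user-agent idea, assuming the pipeline builds a wget argument list and can append to it. The candidate list and both helper names are hypothetical; whether Tumblr treats any of these crawler strings differently is exactly the open question here, and how pipeline.py actually assembles its arguments is not shown in this log.

```python
import random

# Illustrative crawler user-agent strings (assumed candidates, not a tested list).
CANDIDATE_USER_AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
]


def pick_user_agent(rng=None):
    """Pick one candidate at random; pass a seeded random.Random for repeatability."""
    rng = rng or random
    return rng.choice(CANDIDATE_USER_AGENTS)


def wget_user_agent_args(user_agent):
    """Build the wget option that sets the User-Agent header."""
    return ["--user-agent=" + user_agent]


if __name__ == "__main__":
    ua = pick_user_agent()
    print(wget_user_agent_args(ua))
```

Picking once per item rather than once per request would keep a single crawl consistent, which makes it easier to compare completion times between user agents.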
<_niklas> I tried slurp
<JAA> erine: Hmm, not sure if I can filter for IP addresses.
<_niklas> it seems to throttle that too
<voltagex_> mtntmnky: if (bot) then (slow)
<mtntmnky> does it throttle per IP address, per user-agent, or per IP/user-agent combo?
<erine> looks like it's time to start over from 0!
VerifiedJ has quit [Read error: Connection reset by peer]
<_niklas> by throttle I just mean underprioritise
<JAA> erine: Don't worry about it. We still have plenty of work to do for now.
Seong has joined #tumbledown
<erine> Does it matter if I leave that warrior running after getting my tasks flushed?
<JAA> The items you still have running might get duplicated.
<erine> ah, probably a bad thing
<JAA> Yeah
<JAA> Well, we'll probably requeue everything sometime on the weekend anyway.
<erine> purge me now then
<JAA> everything = all items that haven't completed
<erine> all warriors are down
<kvikende> heh im at 101% progress according to your script jaa XD
edgivesup has joined #tumbledown
<kvikende> dont think thats entirely correct
<JAA> erine: Ok, released your claims.
<JAA> kvikende: Yeah, I'm not sure why it happens. It might be links to deleted posts or something like that.
<JAA> I had one finish at 128 %.
<voltagex_> JAA: I can't find it right now, but I posted a list of jobs that likely failed when one of my big boxes went down
<JAA> voltagex_: I don't think I can easily release a list of items or items for a user before a certain date. So unless it's a very short list, releasing those is a PITA.
<voltagex_> What's the time out?
<voltagex_> Is there any way I can set run-pipeline to go through a certain list?
<JAA> There is no timeout.
<JAA> Nope
<voltagex_> Fuck. So they're gone gone?
<voltagex_> That's a pretty big issue IMO
<JAA> They're marked as in progress for now. As mentioned, we'll requeue all of those sometime on the weekend probably, so it'll get grabbed then.
<voltagex_> Guessing the fail rate is higher on this project than others
<JAA> Yes, that's why we have 29k "out", i.e. claimed items.
telnoratt has joined #tumbledown
<voltagex_> Okay. Will check my budget and make a big push on the weekend
<JAA> We definitely don't have 29k items actually in progress currently.
<mbp> JAA ive got the script running on the VM but it only outputs 1 line
riley has joined #tumbledown
<voltagex_> mbp: what does pgrep wget-lua | wc -l say?
<mbp> there are 3 running
<mbp> thats not the problem
<mbp> had to install a bunch of dependencies and add docker exec -i lucid_feynman before every command that accesses the log and warcfile
<marked> voltagex_: you can use the test tracker if it helps
<voltagex_> What is the test tracker?
<mbp> the cache file only has 1 entry as well
<mtntmnky> mbp: oh you need docker support for the script?
<JAA> mtntmnky: I think the warrior VM runs the Docker container internally.
<mtntmnky> try this https://linx.li/8f32h495.sh
<mtntmnky> that's what I've been using
urjaman has quit [Ping timeout: 260 seconds]
horkermon has quit [Read error: Connection reset by peer]
<mbp> looks like the blogs its not listing are deleted
horkermon has joined #tumbledown
<mbp> or not, hm
<edgivesup> Given how little my machine uploaded today, i wonder if the VM paused execution from starving the disk again
<teej_> I don't understand why macOS has terminal commands that operate differently from regular Linux/Unix environments.
<marked> on a parallel crawl, our Chrome's user agent finished when our current user agent was 1/5 into it
<teej_> I'm trying to use JAA's awesome monitor script.
<kbtoo> Least NSFW i've seen so far cutekittensarefun.tumblr.com
<JAA> Because Apple.
<mtntmnky> marked: maybe we could have pipeline.py do a check to see if the blog requires safe mode
<mtntmnky> and change the user-agent based on the result of that check for the actual crawl
<_niklas> don't forget the cookie for GDPR crap
<mtntmnky> oh yeah, that
<erine> Oh fuck, GDPR?
<mtntmnky> mbp: thanks!
<erine> is germany one of those countries?
<_niklas> all EU
<JAA> Germany was in the EU last I checked, so yes.
<mbp> just tried your script and that works
<_niklas> don't worry our crawls are fine for now
<_niklas> this is the other thing spiders bypass
edgivesup has quit [Read error: Connection reset by peer]
<mbp> and shows all current jobs
<mtntmnky> mbp: excellent
<erine> ah, was about to panic about uploading 172 GB of GDPR crap
urjaman has joined #tumbledown
<mtntmnky> I thought you found a bug at first :)
<mbp> i wonder why my version doesnt work
edgivesup has joined #tumbledown
<mbp> but im neither an awk nor a docker wizard to find out why
<marked> I don't even get GDPR, have IP geo databases gotten accurate over the years?
<marked> UK's almost out of the EU
<mbp> i think you can pay big bucks for a somewhat accurate db
Darkeneto has quit [http://www.mibbit.com ajax IRC Client]
<teej_> When you do `ps -C wget-lua --format 'pid,rss,etime,cmd' --no-headers`, what does it show?
edgivesup has quit [Remote host closed the connection]
edgivesup has joined #tumbledown
horkermon has quit [Read error: Connection reset by peer]
horkermon has joined #tumbledown
<mbp> heh im a dummy
<mbp> why did i use an interactive shell in the first place
<teej_> mbp: I don't understand.
<mbp> my problem with the monitoring script was that i just got one line of output
<kiska> I have yet to finish reading the 1.5k messages...
<kiska> I just tabbed into my aws instance running 82 concurrency, and I see this: 15.1G/15.7G
<teej_> Hi kiska. Good morning.
schnits has joined #tumbledown
<mbp> going from "docker exec -i" to "docker exec" fixed it
<marked> Morning. some wanted to wake you. it was kinda exciting but sleep <> exciting
boutique has joined #tumbledown
<superKB> I feel like a n00b. I was searching more, and noticed the deadline. I got mere 7 vm's up.
<teej_> superKB: Don't worry. It happens.
<teej_> I'm also a newbie.
<superKB> Well, at least they're setup and at the ready... I'll just spin em when the next panic happens so I don't break the bank
<Flashfire> .........
<Flashfire> Kiska Its not morning its afternoon
<Flashfire> Fix your sleep schedule dude
<voltagex_> What is the deadline
<superKB> Looks like monday
<superKB> I have enough power to pull a few hundred blogs at a time, is there a way to identify how many blogs remain?
<voltagex_> No, just run the pipeline and go
<superKB> I figured
<voltagex_> Should we be dissuading those on 1GB VPSes at this point?
<superKB> I don't know. This is just what is running in my basement
<superKB> I've been in the scene for two days now
Ryz has joined #tumbledown
<teej_> voltagex_: This Monday.
<Flashfire> entremulheres release this claim
<teej_> If you're not running the latest release, just terminate your jobs and pull the changes.
<teej_> superKB: ^
<erin> superKB: we will continue archiving tumblr after this because there's a fair chance of it dying soon
<erin> so
<erin> it's not like
<erin> bad if you've set stuff up
<superKB> When was the latest release
<teej_> Yeah, we'll eventually want to archive all of Tumblr.
<teej_> superKB: The latest release is always here: https://github.com/ArchiveTeam/tumblr-grab
<superKB> I got 11 warriors going
<superKB> Do they update on their own?
<teej_> superKB: "Latest commit 2b9be95 4 hours ago"
<voltagex_> teej_: do I lose the data if I terminate?
<teej_> When did you start them?
<superKB> 9 in the last few hours
<superKB> 2 are about 20 hours old
<teej_> voltagex_: Yes, but at this point, it doesn't matter. In the latest changes we ditched the 'notes' downloads, so it will finish the blogs much faster.
<teej_> superKB: Yeah, so if you started more than 4 hours ago, cancel everything and restart the VMs.
<superKB> 1 hr is the 9, and 22 is the first two
<voltagex_> Okay, I'm going to start a large amount in the next few hours (120-180 concurrent across 3 big boi boxes) lmk if I should hold off
<superKB> Ill restart the two
<superKB> Will warriors run on citrix xen easily?
<teej_> voltagex_: I don't think you need to hold off for now, unless kiska makes some massive changes/improvements to the code.
<superKB> I have all that setup and ready too for a work demo
<teej_> Uh, citrix?
<Ryz> When submitting links via Google Forms, how is it determined if the Tumblr account link is already submitted before and/or is archived?
<voltagex_> Isn't xen something different or have they renamed again?
<erine> they should run on xen since they're just VMs
Frinkel has quit [Read error: Connection reset by peer]
<marked> Ryz, who ever loads it is responsible for deduplication
<erine> but for this one off project, if you wanna scale, just setup a warrior manually on ubuntu VMs
<teej_> If it can run a VM, then it should be fine.
<marked> you don't need to worry about that
<superKB> It just saves OS install time
<superKB> Because that's what is installed
<teej_> superKB: I recommend just terminating any running Warriors because they're older than 4 hours.
<Ryz> Although a ton of stuff is already saved as of right now, can't hurt to submit more regardless if it's archived or not~
<_niklas> so is anyone working on / planning to work on login? given the predicted much better crawling speed
<teej_> superKB: Unless I misunderstood...
<superKB> teej_ I have two warriors at 22 hours and nine warriors at an hour
<marked> would that just be a cookie ?
<teej_> _niklas: I guess so. We will eventually have to do it for the rest of Tumblr.
<_niklas> I can test if you can use tumblr session cookies on multiple IPs simultaneously
<teej_> superKB: Oh okay. So you can restart the 22 hour Warriors.
<superKB> Acknowledged.
<_niklas> note, I believe being logged in affects the HTML you get
<Fusl> any admin on who can drop my items back into the todo queue?
<marked> I'm looking for a way to not use cookies but cookies will be needed for segment anyway so they're not exclusive solutions
<teej_> _niklas: The logins require a username and password. So where do we get those?
<Fusl> i'll need to shut down the dockers due to one of the SSDs failing right now :(
<teej_> Fusl: I thought kiska was on a few minutes ago.
<superKB> teej_: thank you
<teej_> superKB: Anytime.
<kiska> Huh? I was pinged?
<diggan> teej_: guessing a shared account logged in once would give us a cookie to reuse across many instances
<teej_> kiska: Fusl wanted to ask if his/her items can be dropped.
<kiska> Fusl: Ummm.... so many items
<Fusl> yeah
<teej_> diggan: Should someone make a 'fake' Archive Team Tumblr account?
<kiska> I'd rather not try that again
<Fusl> kiska: :D
<kiska> I am still reading messages :(
<_niklas> >note, I believe being logged in affects the HTML you get
<mbp> nice to be there live after a 12h birth of an item
<_niklas> okay just checked, nevermind
<teej_> Does Flickr require email verification for account creation?
<_niklas> that's all ajax now
cslashacc has quit [Remote host closed the connection]
<Fusl> well i guess i'll just restart my dockers, make a zfs raidz3 or so and not a raid0 and whenever you feel like it, kiska, drop my items into the todo queue
<teej_> I'm not exactly sure if this will work well if the HTML is being changed because of the login.
<_niklas> I just said it isn't
<kiska> Fusl: Drop all your items now?
<marked> i noticed the same url will change data sizes on back to back requests btw
<Fusl> kiska: yes
<kiska> Fusl: Released
<Fusl> thanks
<kiska> Fusl: If you had asked me to release certain items I would have responded with
<kiska> [2018-12-14 01:09:28] <JAA> erin: Hrm, yeah, there's a good number of them, but I don't think it's possible to filter by both username and date, and I don't feel like clicking on a hundred buttons one at a time.
<Fusl> thats fine
<Fusl> and we all know what happened last time
yano has quit [Read error: Operation timed out]
<teej_> What happened? I don't know. Lol.
<kiska> It released everyone's claim
<Fusl> someone clicked a button and everything broke apart
<erine> always a feature for the future!
<erine> Future feature, it rhymes :P
<bmcginty_> Is there a way to get a list of all pending and unpending items in the queue? I'm doing a personal archive, and I'd love to be able to start from the other end of archiveteams' list to cover the most possibilities. (I'm also going to run a grabber shortly as well.)
<phirephl-> bmcginty_, but why not just run more grabbers for AT?
<teej_> Lol!
<teej_> Fusl: That's a funny story.
<mtntmnky> working on ignoring media files when matching now
<mtntmnky> I'm seeing duplicate visits here
<bmcginty_> phirephl-: Images are unusable to me, so I'd aim to grab audio for myself first. Not that I'm not wanting to assist AT, but no reason not to do both.
<mtntmnky> happened after following a 301
<mtntmnky> not sure if there's a good fix for that
kode54 has joined #tumbledown
<marked> if you're in the code, I want to log media and vtt urls to a file
<marked> maybe we're not ready to run that now but it's something to think about.
<hook54321> my laptop shut off while the warrior was running. Is there anything I need to do about it?
<marked> I have an issue open for that
<mtntmnky> what filename?
<riley> after some delay items will be manually re-added to the queue, hook54321
<hook54321> k
<endrift> 100 items, 17 GB :)
<endrift> shut down a few of my AWS nodes to save on costs
<mtntmnky> ah okay
<endrift> it hasn't gotten too big yet
<mtntmnky> yeah so you just want the media URL logged to something like _media.txt ?
<marked> since wget is single threaded and there's nothing to parse in media files, we could pull those files away from wget-lua and be done in about half the time
<endrift> but if I'm gonna be running it for four more days, uh, yeah
<marked> yeah
superKB has quit [Ping timeout: 265 seconds]
<endrift> I may leave one or two up indefinitely, we'll see
<marked> then we send it to a 2nd spawning of wget which would be quick to run
<marked> then the crawl should end in about half the time, i haven't done the calculation yet
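The idea discussed above (log media URLs to a text file and hand them to a second fetcher) could be sketched roughly as follows. This is only an illustration: recognising media purely by file extension, the extension list, and the function names are all assumptions, not what tumblr-grab actually does.

```python
from urllib.parse import urlparse

# Assumption: media files are recognised purely by file extension.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp4", ".mp3", ".vtt")


def is_media_url(url):
    """Return True when the URL path ends in a known media extension."""
    path = urlparse(url).path.lower()
    return path.endswith(MEDIA_EXTENSIONS)


def split_urls(urls):
    """Split URLs into (pages, media), keeping the original order in each list."""
    pages, media = [], []
    for url in urls:
        (media if is_media_url(url) else pages).append(url)
    return pages, media


def write_media_list(media_urls, path):
    """Write one media URL per line, e.g. to a file such as _media.txt."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in media_urls:
            handle.write(url + "\n")
```

The page list would stay with the parsing crawler, while the media list is plain input for a second, non-parsing downloader.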
<sep332> when was the last time code got updated?
boutique has quit [Quit: Leaving]
<mtntmnky> marked: testing both changes right now
superKB has joined #tumbledown
<erin> sep332: earlier today, blacklisting notes
<erin> earlier = 5 hours ago ish
arkhive has joined #tumbledown
<mtntmnky> I will submit these separately though
<marked> yeah there's something to coordinate with it
<erine> sanity check my current rate
<erine> 144 concurrency should be around 12 MB/s?
<sep332> erin: ok. i still have a job downloading a ton of ?fromc= but as soon as that's finished I'll update
<phirephl-> erine, yeah
<erin> erine: sounds very reasonable
<_niklas> sep332: that could still take a day
<arkhive> hi. how many connections should i do on a 1Gigabit/s connection dedicated to the grab
<erine> and nobody has gotten rate limited yet?
<phirephl-> arkhive, 1200 and about 1TB of RAM
<erine> I might do something stupid like 300C per box
<phirephl-> erine, I've been running 140 threads on one IP for two days
<riley> phirephl-: how much disk are you using?
<erin> arkhive, phirephl-: don't do that, people ran 600 on one IP and got ratelimited
<erine> ah ok
<arkhive> how many concurrent then?
<_niklas> really?
<erin> arkhive: it's mostly disk and ram limited
<erin> how much of those do you have
<arkhive> and rsync threads
<arkhive> oh
<phirephl-> riley, it's an 8TB VM, using about 700GB
<_niklas> rsync threads is fine as it is
<_niklas> marked, about login, there's two cookies (one for gdpr, one for session), I can browse with those same cookies from two different countries just fine, dunno if it would balk at us for dozens of IPs using the same ones though
<arkhive> the disk on my 32GB RAM computer died so i'm on my old alienware laptop. 8GB RAM :(
<arkhive> about 600GB free
<arkhive> lol i was downloading gangbangsissy.tumblr.com hahaha
<marked> the GDPR one is a click but no user/pass required to generate one?
qw3rty117 has joined #tumbledown
<_niklas> yeah
<_niklas> but, y'know, doesn't deal with safe mode
<marked> kinda one thing at a time, as we get deeper into it, we learn more and it preps us for the next level.
qw3rty116 has quit [Read error: Operation timed out]
<_niklas> huh if I specify an invalid session cookie, I get a bunch of binary garbage
<marked> lol
<_niklas> a lot less data than with a valid one, so it's not encrypted
<marked> it's allergic to that cookie and threw up. might be nuts
<kbtoo> Wow I just saw a upload run at 2MBps nice!
<phirephl-> yeah, I see 2Mbps uploads periodically
<_niklas> oh okay it's a bunch of binary garbage that comes with a 302 for some reason
yano has joined #tumbledown
<riley> _niklas: gzipped error page?
<_niklas> could be but considering I'm not specifying accept-encoding...
<_niklas> anyway. would it be acceptable if I patched my scripts to use a hardcoded session cookie and a non-bot UA? fairly confident this should work fine
<kbtoo> I think i'm finally getting to my connection capacity, browsing is starting to be noticeably slower.
<_niklas> or should I run this against the test tracker?
<marked> you want to work on the login problem?
odemgi_ has joined #tumbledown
<marked> you should probably use the test tracker by the rules
<_niklas> alright
<marked> it's a lonely tracker
<_niklas> anything I need to do to use it besides replacing the url in pipeline.py?
<marked> no it's a 1 line change
<marked> and then type stuff into the webUI
<marked> to load your queue
odemgi has quit [Ping timeout: 252 seconds]
odemg has quit [Ping timeout: 246 seconds]
<teej_> marked: That's brilliant!!!
<teej_> marked: Your idea of spawning wget for media files.
<marked> I'm afraid to ask what I did
<marked> ah, theoretically should work. might have some kinks that's minor for the talent around here
<teej_> marked: So after the separate wget instance downloads the media file, how does it go back into the warc file?
<marked> if we use wget-lua it can create a 2nd .warc.gz and either rsync send them together or they tell me they're cat able together
<teej_> marked: Do you think doing that will be faster than just normally downloading the media files on a single thread?
<marked> it might speed up the crawl quite a bit, but it also means we could add avatars or videos without slowdown
<marked> do you mean grabbing the media files on a single centralized server?
<marked> if you were running during the avatar 16 days, no matter how fast the media server got things, it interfered timewise
<arkhive> does this grab RAWs?
<marked> we don't know the raw urls anymore
<marked> i hear the old urls for raw don't work anymore
<teej_> arkhive: Tumblr stopped allowing anyone to obtain raw files.
<teej_> A few months ago.
<teej_> marked: I don't understand the "single server centralize" part.
<marked> I wasn't sure if I understood your question right. running 2 features would get the job done faster by using more of the network capacity.
<marked> ^fetchers
<teej_> marked: I agree.
<marked> a reason to not get them on warrior is if we delay getting media until it's centralized and sorted we can be sure to download each one only once
<teej_> So the time saved from multiple threads of wget will be more than the time lost for more rsync/cat?
<_niklas> ok so for using the test server, what's the URL I need to type stuff into?
odemg has joined #tumbledown
<marked> rsync up is the same amount of data
<riley> wishlist: smaller tasks so killing a warrior instance doesn't take three days. eg tumblr:someblog:2018-01
<teej_> marked: How about the cat?
<teej_> I'm just trying to understand the cpu cost of the extra processes.
<marked> pipeline.py goes to server5.kiska.pw:9080 You go to http://server5.kiska.pw:9080/tumblr/admin/queues where you type tumblr-blog:$YOUR_DESIRED_TARGET
<_niklas> thanks
<marked> you're right, cat-ing the big files would use more HD for no benefit
<marked> so rsync 2 files is the most efficient
<marked> well most efficient is not DL too many files
<teej_> Granted, we want to archive everything.
<teej_> Lol.
<marked> I mean not DL the same thing 2x of 3x of the 5000x reblogs
<yano> has anyone run in to this with the docker container? https://pastebin.com/Rzih2MzK
<yano> that's what I get when I run `sudo docker run -p 8001:8001 archiveteam/warrior-dockerfile`
<teej_> yano: I don't remember exactly, but if it continues to run properly, it should be fine.
<teej_> I probably saw something like that.
<yano> it probably is running but when i connect to i get a "connection reset"
<yano> well, "Firefox can’t establish a connection to the server at"
<teej_> marked: You're right. So a queue list can be made and at the end, it can be deduplicated and then downloaded in parallel. Right?
<marked> yeah, but it takes extra man power
<marked> to sort out that logistically
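As a rough illustration of the "queue, deduplicate, then download" idea above, the per-job media lists could be merged like this. The function name and the assumption that each job produces a simple list of URL strings are hypothetical.

```python
def dedupe_urls(url_lists):
    """Merge several URL lists, keeping only the first occurrence of each URL."""
    seen = set()
    unique = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()  # tolerate trailing newlines from text files
            if url and url not in seen:
                seen.add(url)
                unique.append(url)
    return unique
```

With heavily reblogged media, the merged list would be much shorter than the sum of the per-job lists, which is where the saving would come from.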
<teej_> yano: Oh! Let me give you my code snippet to make it run without a web interface.
<yano> teej_: well, i want a web interface :-\
raingloom has joined #tumbledown
<marked> which docker file did you DL, I'll give it a run
<yano> i've blown away my installation of Docker
<yano> `sudo docker pull archiveteam/warrior-dockerfile`
<yano> `sudo docker run -p 8001:8001 archiveteam/warrior-dockerfile`
<yano> that's what i'm doing
<yano> and by "blown away" i mean `sudo apt purge docker.io`
<yano> i'm on Ubuntu 18.10 btw
<yano> i did add "graph": "/media/hdd01/docker" in /etc/docker/daemon.json in an attempt to move the installation of docker to my external hard drive
<yano> but even when i removed that file and started over with a `apt purge` and then `apt install` it still fails :-\
<yano> i'm assuming i've borked something along the way
<teej_> yano: I used `docker run --publish 8001:8001 --env DOWNLOADER="NICKNAME" --env SELECTED_PROJECT="tumblr" --env CONCURRENT_ITEMS=6 archiveteam/warrior-dockerfile` and replace `NICKNAME`.
<erin> hey, not -terrible- for one machine http://0x0.st/skod.png
<teej_> yano: I do remember still being able to access, though. So something could be wrong with your instance config.
<yano> teej_: tried that, same dice :-\
<teej_> Oh...
<yano> teej_: yeah, i've rebooted, and etc, and tried docker via `snap` on Ubuntu
<yano> but ended up reverting back to apt
<marked> wow, what's with that ram
<erin> marked: hmm?
<teej_> marked: I think the Docker container should be fine, because it's supposed to run the same. I think it might be the Docker configuration to connect to the external drive or something.
<marked> it's steadily growing
<erin> i'm running at 200 concurrency lol
<kiska> Did someone say memory leak?
<erin> and the majority of jobs are long, inflight ones
<erine> yano: it should work like that lol
<teej_> marked: On Docker? Or the script?
<yano> i found some stuff about modifiying the Dockerfile to add a line to give root permission to dnsmasq
<erine> I've been running my warriors like this
<erin> it's not monotonically increasing! it's just that most of the jobs are building up RAM usage because they're fucking gigantic
<erine> NUM="00" bash -c 'docker run -d -e DOWNLOADER="erine" -e SELECTED_PROJECT="tumblr" --publish "80$NUM:8001" -e CONCURRENT_ITEMS=6 -e HTTP_USERNAME=username -e HTTP_PASSWORD=hunter2 --volume "/mnt/gv1/warrior-$HOSTNAME-$NUM:/data/data" archiveteam/warrior-dockerfile'
<teej_> erine: Shouldn't it be `--publish 8001:8001`?
<marked> everything's an environment variable except the password
<erine> nah
<yano> oooh
<erine> can't have every warrior on one port!
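The point about ports can be sketched as a small helper that gives each warrior its own host port while the container port stays 8001. It only reuses flags that appear in the commands quoted above; the function name, the base port of 8000 and the defaults are assumptions for illustration.

```python
def warrior_command(index, downloader, base_port=8000, concurrent_items=6):
    """Build a docker argument list for warrior number `index`.

    Each warrior publishes a different host port (base_port + index) that maps
    to port 8001 inside the container, so several can run on one machine.
    """
    host_port = base_port + index
    return [
        "docker", "run", "-d",
        "-e", f"DOWNLOADER={downloader}",
        "-e", "SELECTED_PROJECT=tumblr",
        "-e", f"CONCURRENT_ITEMS={concurrent_items}",
        "--publish", f"{host_port}:8001",
        "archiveteam/warrior-dockerfile",
    ]
```

The returned list could be passed to `subprocess.run`; it is only built here, not executed.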
<yano> so i got docker to stop complaining about dnsmasq by using --privileged
<teej_> Oh.
<yano> but it still complains about, `stty: 'standard input': Inappropriate ioctl for device`
<erine> You using a modded warrior that lets you run 25 concurrent?
<yano> but, hot damn now it is binding to 8001
<yano> ¯\_(ツ)_/¯
<teej_> Oh, does that mean you give Docker root access?
<yano> teej_: basically :-\
<erine> also weird, your docker is not letting you run it without --privileged?
<teej_> As long as those blogs don't have some hacker code, you should be fine.
<yano> erine: yeah, tho i have been running all my docker commands with sudo prepended, so maybe that's it?
<yano> teej_: hehe
<teej_> I never run Docker with sudo.
<erine> nah, as long as your docker cli can reach the docker socket, anything is fine
<erine> but making your user reach the docker socket = pseudo root anyways
<teej_> But I'm using macOS, so it's probably slightly different.
<erine> docker run -it --rm -v /:/lol_root ubuntu
<erine> Free access to your system's root!
<erine> that vector can be made very dangerous by exposing the docker API port/socket outside of root:root that is unauthenticated
<teej_> I've found Docker on macOS to be very memory/cpu intensive. So I've stopped using it. I've heard that it is more efficient on Linux machines.
* yano starts over without `sudo`
<yano> :x
<teej_> marked: So what lman power" are you talking about?
* kbtoo CBA to fix it for machines that were cobbled together just for this project
<teej_> marked: "man power"
<erine> running on macOS?
<erine> I got something dirty just for you
* teej_ is testing the /me thing.
<erine> just install py3 with homebrew, setup a py3 virtualenv in your home dir
<teej_> It works!
<erine> and install this inside your virtualenv https://github.com/ArchiveTeam/seesaw-kit
<teej_> erine: I'm already doing that.
<erine> more power!
<teej_> It took me hours to figure that out. Lol.
<erine> I wish I had the upload speeds at home to do that, haha
<erine> fucking comcast lol
<teej_> erine: How did you get wget-lua to compile?
<teej_> On macOS.
<erine> doing a compile right now to test my theory
<erine> think it works the same way that warrior does it https://github.com/ArchiveTeam/warrior-dockerfile/blob/master/get-wget-lua.sh
<erine> but maybe move the wget-lua binary to /usr/local/bin/wget-lua for sanity's sake
* erin should get her iscsi bay at home working so she can get her xeon workstation pulling stuff
<yano> woot, got it working
<yano> trying everything without sudo still produced the same errors but this time it actually binds to localhost and i can access the web portal on localhost
<yano> ¯\_(ツ)_/¯
<yano> so, fyi, don't be an idiot and run the `docker` commands with `sudo` prepended
<erine> wha, that should not break
dr3gs has quit [Read error: Operation timed out]
<erine> docker on raw hardware and ubuntu?
<yano> it's an Ubuntu Desktop 18.10
<yano> not a server
<yano> also Ubuntu is whacky because i did run in to a permission issue while doing things without `sudo`
<yano> had to add my user to the `docker` group
<superKB> Oh sudo, when you like to live dangerously
<erine> weird black magic
<yano> but Ubuntu makes you log out of your entire session in order for those changes to take effect
<mtntmnky> alright I'll open a PR for the media exclude soon, seems to work
<mtntmnky> however right now there's an issue
<erine> teej_: so the theory was installing lua and gnutls from homebrew
<ranma> is there an email someone can send nsfw blogs?
<mtntmnky> if it can't write the text files it fails with "Lua runtime error: tumblr.lua:322: attempt to index local 'file' (a nil value)."
<erine> _lua_open and _lua_strlen are objects that couldn't get resolved
<mtntmnky> that happens before I made any changes
<kbtoo> @yano it's actually part of the official docker install docs as "optional", and it implies adding users to the docker group is more of a security risk than just running docker with sudo, as docker always runs as root.
<teej_> yano: Haha. I try not to use sudo when I get the chance.
<teej_> erine: I tried that, and then got compile errors.
<erine> on another hand, I found lua@5.1 on homebrew and I'm trying again with that
<marked> mtntmnky : is your dev environment setting ENV variables?
<teej_> ranma: Email? What do you mean?
<yano> kbtoo: huh, hm, i got a permission denied to access the docker socket
<yano> that's why i added my user to the docker group
<yano> that was the first solution i found on SO :-\
<mtntmnky> marked: yes
<teej_> erine: I have both lua and lua@5.1 installed.
<mtntmnky> maybe I missed one
<marked> I'm looking at the code now
<mtntmnky> yeah, I do have warc_file_base set
<erine> got it!
<erine> replace -llua with -llua5.1 in src/Makefile
<marked> I kinda agree with lua. i don't see where file is defined
<erine> and set LDFLAGS and CFLAGS for homebrew lua@5.1
<teej_> erine: Oh, where is the src/Makefile?
<ranma> teej_: i have a friend with a nsfw blog. never used irc before
<ranma> any place he can send the url to to get backed up?
<marked> oh it's in the call back params
<erine> teej_: In the wget-1.14.lua folder
<marked> wget.callbacks.get_urls = function(file, url, is_css, iri)
<teej_> ranma: Oh. Yes there is a URL.
<teej_> ranma: https://goo.gl/RtXZEq
<marked> so that's whatever the cmd line args were
<marked> are you using pipeline.py ?
<ranma> thanks!
<teej_> erine: Okay I will try that right now.
<erine> Your mileage may vary with these CFLAGS and LDFLAGS but this is how I got it to work CFLAGS=-I/usr/local/Cellar/lua@5.1/5.1.5_8/include/lua-5.1 LDFLAGS=-L/usr/local/Cellar/lua@5.1/5.1.5_8/lib
<yano> kbtoo: so it seems like either way you have to trust the docker container you are running
<erine> also huh @ ubuntu desktop's docker
<yano> "Running containers (and applications) with Docker implies running the Docker daemon. This daemon currently requires root privileges, and you should therefore be aware of some important details."
<kbtoo> @yano looks like.
<teej_> erine: I found the get-wget-lua.tmp/ folder. Where is the wget-1.14.lua folder?
<ranma> could someone maybe put that link in the topic?
<erine> oh, using the script?
<erine> get-wget-lua.tmp is that folder
<ranma> the blog-adding one https://goo.gl/RtXZEq
<yano> so i guess it boils down to: either run docker with `sudo` all the time or run as your regular user but add your regular user to the docker group
<teej_> ranma: The lonk is already in the topic.
<teej_> link*
<erine> yano: some clarification about the docker root access thing. it just means that anything that can access the socket will basically be root without the sudo password check
<erine> containers will only have privileges and volumes that you explicitly feed them
<kbtoo> @yano everywhere I have docker running is either in a VM or a machine thats been cobbled together just for this project so i'm not really concerned with it at the moment.
<diggan> "dont run containers like they are sandboxes, use vms"
<teej_> erine: I'm still slightly lost. I need step by step instructions. I cloned the repo. Then what?
<ranma> oh, sorry
<ranma> this is my first few months using KVIrc... used to it in the bar
<marked> mtntmnky : thanks much, those are going to be really useful
<teej_> ranma: No worries.
<kbtoo> I had never heard of KVIrc until this week
<teej_> erine: Thanks. Ikm trying it.
<teej_> I'm*
<teej_> Sorry for the typos.
<erine> \o/
<erine> out of curiosity, what are your home bandwidth rates?
<erine> Seems like it may be good enough to run a warrior on your laptop?
<teej_> Well, my laptop is the limiting factor.
<teej_> My speed is 100 Mbps down / 35 Mbps up.
<teej_> Well that's what my ISP caps it at.
<erin> that should be good enough
<marked> what kind of link is that?
<teej_> I really want a symmetrical 1G connection.
<gchcetiH> probably cable
<teej_> Yes, cable.
<erin> yeah those speeds are cable-flavored
<yano> heh, i got the docker thing installed and running much quicker on debian
<erin> haha
<_niklas> we have a cable provider offering 100 down 6 up
<gchcetiH> trying to archive tumblrs on a 10Mbps DSL connection = :(
<marked> my first broadband was DSL, it had its unique perks
<teej_> I would use -j 8 if I had 8 threads, right?
<erine> yes, or any number below your amount of threads
<teej_> I have a measly 2 cores.
<erine> D:
<marked> ATT DSL back then you could grab unlimited IPs because the login servers were configured to ignore duplicate sessions
<gchcetiH> somewhere I heard you should use number of cores + 1
<teej_> My laptop is almost 8 years old.
<erine> as long as it compiles!
<teej_> gchcetiH: I heard the same thing... You might be right.
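The "number of cores + 1" rule of thumb mentioned above for `make -j` can be written down as a tiny helper. It is only the heuristic from the chat, not a tuned recommendation, and the function name is made up for this sketch.

```python
import os


def make_jobs(cores=None):
    """Return a job count for `make -j` using the cores + 1 rule of thumb."""
    if cores is None:
        # os.cpu_count() can return None, so fall back to a single core.
        cores = os.cpu_count() or 1
    return cores + 1
```

For the 2-core laptop mentioned above, `make_jobs(2)` gives 3.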
<teej_> marked: I didn't know that. That's interesting.
<erin> cool, just profiled my local cable connection. 100/11 mbps
<erin> the fuck is this asymmetricality
<teej_> erine: It's in the protocol.
<ranma> cableco (sometimes) overprovisions the speed
<ranma> either they're stingy in the upload
<gchcetiH> guess it doesn't really matter too much
<ranma> or there's something 100mbps in the path
<teej_> I just want a symmetrical 1G connection, and I'll be happy.
<marked> if you login to your cable modem you can see how many channels you have bonded
<ranma> i'll take 100 symmetrical
<erin> oh i should figure out how to do that
<ranma> hell, 50
<erin> i have a shit Arris box
<teej_> erine: Me too.
<ranma> you could have american DSL!
<erin> oh lol right our core switch in this house is 10/100, so. that would explain Some Things
<teej_> Lol.
<ranma> 35/1.8 here on centurylink
<ranma> roomies downgraded from >200/>10
<teej_> ranma: Are you near a university?
<gchcetiH> wow, what a shit upload speed
<marked> I prefer university living just for the pipe and food
josho4933 has quit [Remote host closed the connection]
<erin> i used to be on att dsl that was so throttled that getting .4mbps down was a good day
<ranma> not really near a uni
josho4933 has joined #tumbledown
<marked> Comcast is too powerful
<erine> You can always have comcast and have your internet die now!
<ranma> or Charter
<teej_> erine: This is brilliant! It worked,
<teej_> It worked!
<marked> you'd be surprised I'd be surprised, you can take your authed cable modem to other places in the city and service quality will change by the lines or distance or users or weather or something god knows
<teej_> marked: ISPs are being really cheap now-a-days. They'll charge a premium for anything.
<marked> I know some people don't like Google but I'm glad GFiber exists
<kiska> Meanwhile I have nbn
<marked> is nbn owned by the government?
secuuuu has joined #tumbledown
josho4933 has quit [Ping timeout: 492 seconds]
<_niklas> ok so status report of logged-in downloading testing: yeah it's about 5x faster at least
<_niklas> I'm pulling 11 random blogs concurrently (on test tracker of course) at like 45mbit/s right now
<kiska> marked: yes
<mbp> what is this test tracker thing?
<marked> is there something that makes you think it's logged in vs useragent?
<_niklas> 1 concurrency pulling staff yielded 1350 fetches in 10 minutes (logged-out googlebot did 327)
<_niklas> nah, it's useragent
<_niklas> shoulda clarified, sorry
secuuuu has quit [Quit: http://chat.efnet.org (Ping timeout)]
<_niklas> 10 minutes of downloading with concurrency 11 => 2.6gb data directory
<mbp> tumblr-blog:slangwang 32216 251215872 3325830764 75.53 1-00:19:25 5184/172396 3.01 25
ultraMLG1 has joined #tumbledown
<mbp> 3% done after 24h, nice
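A back-of-the-envelope projection for that job, as a sketch. It assumes the `1-00:19:25` field is a `ps`-style elapsed time (`[[dd-]hh:]mm:ss`), that `5184/172396` means done/total, and that the rate stays constant; the meaning of the other columns is not stated in the chat and is not used.

```python
def etime_to_seconds(etime):
    """Convert a ps-style elapsed time ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def projected_total_seconds(etime, done, total):
    """Linear projection of the total run time from progress so far."""
    return etime_to_seconds(etime) * total / done


days = projected_total_seconds("1-00:19:25", 5184, 172396) / 86400
print(f"projected total: {days:.1f} days")
```

Under those assumptions the job would need on the order of a month in total, which is consistent with the complaint above.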
<ultraMLG1> so where can I find the actual stuff being archived?
<marked> mbp : the test tracker is for people playing with code so untested code and real jobs don't intersect
<marked> ultraMLG1 : if you can wait DL it from IA
<_niklas> 10 minutes of downloading staff => 37mb data directory with googlebot UA, 187mb with chrome UA
<teej_> erine: Thanks! How did you know how to compile it properly?
<_niklas> staff as in staff.tumblr.com
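The figures reported above can be turned into ratios with a few lines. The numbers are taken directly from the messages (fetches and data directory size per 10 minutes on staff.tumblr.com); the dictionary layout and names are just for this sketch.

```python
# Measurements quoted in the chat, per 10 minutes of crawling staff.tumblr.com.
measurements = {
    "fetches": {"chrome": 1350, "googlebot": 327},
    "data_mb": {"chrome": 187, "googlebot": 37},
}


def speedup(metric):
    """Ratio of the Chrome user-agent result to the Googlebot result."""
    values = measurements[metric]
    return values["chrome"] / values["googlebot"]


for metric in measurements:
    print(f"{metric}: {speedup(metric):.1f}x")
```

This gives roughly 4x by fetch count and roughly 5x by data volume, from a single 10-minute sample on one blog.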
<mbp> thx marked
<erine> dockerfile and warrior setup files
<erin> enhanced stats over the last 24h (times are UTC) http://0x0.st/skHZ.png
<erine> dockerfiles are cool for figuring out WTF is going on with alien codebases
<_niklas> logged in / chrome UA seemed to see more posts on that, not sure why, given those posts are accessible by url when logged out too
<marked> alien code is half the fun
<horkermon> what inbox rate will overwhelm the infra?
<ultraMLG1> where's IA located at?
<ultraMLG1> I remember seeing it
<marked> California
<marked> SF
<ultraMLG1> the URL I mean
<_niklas> think I should post these findings in an issue?
<teej_> Lol.
theshowmu has joined #tumbledown
<marked> kiska what do you think?
<teej_> erine: So you're a genius!
<teej_> I would have taken a week to figure this out.
<erine> I'm just good at reading the ~~manual~~ deployment scripts!
<teej_> Nope. You're a genius.
theshowmu has quit [Client Quit]
<marked> erine : thanks we promised teej a Mac build after fires but you delivered
ultraMLG1 has quit [Quit: Leaving]
<marked> niklas : I'm thinking it's a tough choice, but yeah arkiver reads Issues consistently and likes a lot of detail
<marked> there's already an issue for login-required
<marked> fill in what you figured out with login-required
<_niklas> alright
<marked> then sure open a new issue with the UA data before we forget the details tomorrow
<marked> erin: are you in data viz or stats by chance?
<teej_> Poor 7-year-old MacBook Air with a finicky power cable and a screen that has a loose internal connection causing it to flicker, with a fan clogged with dust from almost running 24/7 for the past 4-5 years, with its internals replaced and downgraded after I shorted the motherboard with liquid damage by using the laptop as an umbrella when running to the class to take an exam, and no remaining battery capacity whatsoever...
<teej_> With only 2 USB ports that are always used, so no free ones... with the aluminum chassis literally forming holes (little craters) somehow, I think I used to put stapled papers between the screen when I would carry it around, so it's totally scratched up.
<teej_> I'm still surprised the SSD is still working. It should have died years ago.
<Flashfire> lol
<teej_> I remember putting my MacBook in the fridge to keep it cool.
<Flashfire> my 7 year old refurbished macbook pro aint so bad now
<Flashfire> 250GB SSD and 4GB of RAM
<Flashfire> MacBook Pro (13-inch, Late 2011)
<Flashfire> 2.4 GHz Intel Core i5
<Flashfire> 4 GB 1333 MHz DDR3
<Flashfire> Intel HD Graphics 3000 384 MB
<Flashfire> 250 GB
<Flashfire> Solid State SATA Drive
<erine> System Information: Model: MacBook Pro (15-inch, Retina, Touch Bar, Mid-2017) • CPU: Intel Core i7-7820HQ (8 Threads, 4 Cores) @ 2.90 GHz • Memory: 16.00 GB • Uptime: 18 days • Disk Space: 499.31 GB • Graphics: Intel HD Graphics 630, Radeon Pro 560 • OS: macOS Mojave (Version 10.14.2, Build 18C48a)
<Flashfire> wow i have to run high sierra still
<teej_> Flashfire: My MacBook Air is pretty much identical except it's an Air!
<erine> OK, maybe this is a little #archiveteam-ot haha
<teej_> Let's go there.
<rpl> _niklas: but absent account cookies, most of these pages won't serve to a non-googlebot UA right?
<_niklas> dunno about *most*
<_niklas> but a lot of them won't, yeah
<rpl> I think basically anything marked "adult" already
<rpl> or sensitive or whatever
<_niklas> yep
<horkermon> ><horkermon> what inbox rate will overwhelm the infra?
<horkermon> is FOS the primary/only target currently
<_niklas> there's another one in europe
<rpl> how bad would it be to just create a ton of accounts? or like .. does it even notice if one account accesses from multiple IPs?
<horkermon> harc?
<rpl> archiveteam-shared-account.tumblr.com?
<_niklas> one account two IPs is fine, no idea about one account for all of us though
<kiska> Should be fine
<rpl> it's possible their infrastructure is just not clever enough to know the difference
<Flashfire> I think its a bad idea to be honest at least wait until we have all the ones not login protected
<kiska> SketchCow HCross how's the rsync targets holding up?
<Flashfire> If we do this we are revealing ourselves even worse than with the warriors
<psi> rpl: i can say with some level of confidence that it doesn't
<_niklas> it's clever enough to not let you use a session cookie made with chrome on a browser identifying as firefox
<Kaz> god dammit you guys chat a lot of shit
<_niklas> (but not clever enough to invalidate the session)
<psi> Kaz: you're welcome
<kiska> So if we do use a session cookie, we'll need a matching UA...
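A minimal sketch of the point above, assuming the only rule is "send the session cookie only with the User-Agent it was created under". The `Session` class, the `build_headers` helper, and the example strings are hypothetical and not part of tumblr-grab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """A session cookie together with the User-Agent it was created under."""
    user_agent: str
    cookie: str


def build_headers(session: Session, user_agent: str) -> dict:
    """Build request headers, attaching the cookie only when the UA matches.

    Per the chat, a mismatched UA is treated as logged out, so sending the
    cookie under a different UA would gain nothing.
    """
    headers = {"User-Agent": user_agent}
    if user_agent == session.user_agent:
        headers["Cookie"] = session.cookie
    return headers


# Illustrative values only.
session = Session(user_agent="ExampleBrowser/1.0", cookie="sid=abc123")
matching = build_headers(session, "ExampleBrowser/1.0")
mismatched = build_headers(session, "OtherBrowser/2.0")
```

Here `matching` carries a `Cookie` header and `mismatched` does not, which mirrors the "logged out on UA mismatch" behaviour described in the channel.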
<Flashfire> I vote we make this account and try that shit after we run the warrior through all non login blogs
<_niklas> yeah
<Kaz> kiska: FOS 7TB free, HCross 5TB free
<bmcginty_> Can someone give me an example of a login-required blog please?
<rpl> Flashfire: assumes you can tell which are non-login-required
<kiska> blog:hentaidjinni
<rpl> bmcginty_: pick any of the ones floating past, 90% odds
<psi> that was suspiciously fast
<Flashfire> the ones that give 0MB I would assume
<bmcginty_> Thanks.
<kiska> Yes ones that give 0MB
<Flashfire> That was one that kiska already had open in another tab ahahahahaha
<horkermon> i'm going to scale enough that i need my own targets but a ballpark for what the straight-to-IA ones will max out on would be useful
<marked> I'll make a test branch but it's safer if you have IPs to burn
<Kaz> horkermon: how hard do you plan to scale
<Kaz> I can almost guarantee you don't need another target
<Flashfire> I have 60GB of windscribe but I didnt think we could use VPNs
<Flashfire> Thats the only way I have IPs to burn
<Kaz> don't use VPNs
<Kaz> fuck
<edgivesup> Something weird is going on
<Flashfire> I haven’t so far
<Flashfire> I’m saying it’s a suggestion but a bad one
<erine> No VPNS :(
<marked> with the data so far, I'm not hearing a great way to use this until all the NSFW are turned off
<marked> because of the login wall
<edgivesup> My super long jobs haven't gone through, yet I'm no longer working on them
<edgivesup> Several of my two day jobs just seem to have either vanished, or compressed from 100GB down to 2GB
<edgivesup> Which I'm not believing is possible
<marked> but at least the GDPR wall is overcome
<mbp> someone go on a shady forum and buy 100k tumblr logins
<kiska> lol
<kiska> No thanks
<erine> That would not be necessary.
<Flashfire> I can do that actually
<rpl> (I'm not intending to use a VPN but I'm curious why not?)
<Flashfire> I still have contacts in places
theshowmu has joined #tumbledown
<mbp> i was not being totally serious
<erine> Chance for VPNs to add their own JS, use different IPs, or not resolve at all.
<kiska> rpl: Because I don't want to debug that issue
<rpl> mm, fair
<Flashfire> Guy I used to talk to sold accounts via steemit
<theshowmu> i'm running a warrior instance on my virtualbox thing on my laptop for this but
<erine> Also all the other issues that comes with having an unknown VPN in the middle.
<Flashfire> So I can do it if need be
<marked> so I guess, there's some blog on our list that's not behind a login wall
<marked> so if we could separate them out, we could use it
<rpl> hrmph memory pressure remains the dominant issue here. if I kill a wget it starts from scratch, huh?
<marked> we haven't done a comparison of the crawl URLs
<marked> we should probably do that now
<marked> before we expend more energy on it
<theshowmu> if tumblr really does have like, actual child pornography on it, and one of the blogs that's scraped has some images of that nature
<theshowmu> is there like, a process to deal with that?
<psi> rpl: yep
<rpl> grum
<Flashfire> Theshowmu brings up a good pint
<Flashfire> point
<psi> i'll take a good pint as well
<marked> we could check for a login wall and choose between 2 UAs
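A rough sketch of that idea, under loud assumptions: the login-wall check below (a redirect whose target mentions "login") is a hypothetical heuristic, not confirmed Tumblr behaviour, and both User-Agent strings are placeholders. Only the general shape comes from the chat: fall back to the Googlebot UA when a blog is walled, otherwise use a browser UA.

```python
# Placeholder UA strings for illustration; a real crawl config may differ.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
)

REDIRECT_CODES = (301, 302, 303, 307)


def looks_login_walled(status: int, location: str) -> bool:
    """Hypothetical heuristic: a redirect whose target mentions 'login'."""
    return status in REDIRECT_CODES and "login" in location.lower()


def pick_user_agent(status: int, location: str = "") -> str:
    """Choose the UA for the rest of a blog based on a first probe request."""
    if looks_login_walled(status, location):
        return GOOGLEBOT_UA
    return BROWSER_UA
```

The probe result would come from a single request made before the crawl starts; how that request is issued is left out here.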
<theshowmu> wait, does efnet cut usernames off at 9 characters?
theshowmu is now known as showgo_on
edgivesup has quit [Read error: Connection reset by peer]
<marked> efnet, that's my experience
<Kaz> yes
horkermon has quit [Read error: Operation timed out]
josho4933 has joined #tumbledown
<showgo_on> that's a little closer to what I intended
<showgo_on> the rest of my username was supposed to be "stgoon" as in "the show must go on" but
<rpl> showmu is a better name. THE shomu.
showgo_on is now known as showm
<kiska> Kaz: Do we have more lists we can add to the tracker?
horkermon has joined #tumbledown
horkermon has quit [Connection closed]
<teej_> edgivesup: 100 GB compressed down to 2 GB?
<kiska> I feel like that job restarted because it hit an issue
horkermon has joined #tumbledown
<horkermon> back sorry for connection bullshit there.
<teej_> horkermon: When did you leave? Did you switch nicks?
edgivesup has joined #tumbledown
<horkermon> Kaz: I can scale by multiples of 10 of what trvz has been ballparking as tumblr's rate limit per IP (~300-350 concurrent)
<_niklas> ok, done writing down my findings in #2 and a new "Googlebot UA makes our requests low-priority #42" (nice)
<kiska> Yeah I had a feeling that the UA is giving us low-priority
<Kaz> horkermon: and so what's your total concurrency?
<Kaz> kiska: yes
<Kaz> I've just got on the train, how much do we have left in queue?
<kiska> 64k left, not urgent
<Kaz> Ok cool
<kiska> Oh yeah I guess I need to go to the items on IA to see the text files that arkiver made?
<marked> _niklas that's a good write up, better than mine
<Kaz> No idea where he put item lists if that's what you mean
<kiska> This is the line: local file = io.open(item_dir..'/'..warc_file_base..'_data.txt', 'w')
<kiska> And they're rsync'd to the targets, but I don't know if they still exist on the targets or in the items on IA
<Kaz> Ah yeah
<Kaz> In the warcs I'd assume
<kiska> No as a file
<erin> okay what the _fuck_ http://0x0.st/skHQ.png
<_niklas> whoops?
<kiska> Kaz: This is the listing from one of the dirs tumblr-tumblr-blog_ao3feed-ds9-20181214-041703.warc.gz tumblr-tumblr-blog_ao3feed-ds9-20181214-041703_data.txt wget.log wget.tmp
<erin> idek what happened!
<kiska> Some big thing was downloaded
<Kaz> If it's going up to the targets with the warcs, it'll be in the megawarcs
<marked> is the limiting factor for running more instances RAM or storage?
<Kaz> marked: yes
<_niklas> limited on whose end?
<marked> on the PCs or Warrior instances
josho4933 has quit [Remote host closed the connection]
<_niklas> they can both become a problem with big blogs
<_niklas> RAM usage grows slowly but it does grow
<erin> hahahahaha
<erin> yes
<erin> it does indeed
<kiska> My lightsail instance: RAM: 15.3G/15.7G
<erin> kiska: what is your concurrency atm
<kiska> Running 80 concurrent
<erin> i have a lot of blogs that are really big on disk already and it's only at like 8% of posts pulled
<erin> kiska: wow i'm surprised you're not OOMing
<kiska> It is
<marked> _niklas does the version of chrome need to be consistent?
<Kaz> Lol
<teej_> erine: Have you gotten JAA's monitor script to work?
<erine> Yup, already made my changes to it
<horkermon> Kaz: total concurrency easy to scale to 4-figures, still practical in low 5-figures
<Kaz> Humour me - what does your infra look like
<erine> With Redis and a companion watcher! https://pastebin.com/LixCpGfQ https://i.imgur.com/13L3Hjr.png
<Kaz> Because those are serious figures
<teej_> I found out to use greadlink.
<_niklas> marked: yes
<_niklas> I wonder if they have an upgrade path then
<marked> I'm not sure how to put multiple cookies in here
<marked> though there's only 3 Chrome channels
<_niklas> for letting people supply their own cookies?
<horkermon> Kaz: can discuss specifics in pm
<horkermon> primarily concerned with how much should route thru AT's staging rn and how much I should stage myself
<_niklas> you say only 3 chrome channels
<_niklas> but there's multiple operating systems
<teej_> What is the code doing? Oh you have your own monitor?
<erine> Yes. I've evolved from having 8 chrome windows on my screen! :P
<erine> Same functions of the prometheus scraper but it pushes all the console output messages to Redis
<erine> and I have a script that listens to the Redis pubsub channel I'm pushing all those messages to
<kiska> horkermon: Just use whichever target the tracker sends you, we aren't limited by them, its the speed that tumblr lets us use GoogleBot's UA
<teej_> erine: Oh, so this way you can see the progress of everything on one screen? Neat!
<erin> ah, i should do that with my system
<erine> exactly!
<erin> i'm running 100 separate pipelines so that might be
<erin> a good idea
<erine> I have no clue how to scale shit but I'm learning :D
<teej_> You're doing a good job.
<marked> _niklas, could you see if they'll create an account without email verification ?
<erine> only limited by money right now
<_niklas> they do, I created one earlier
<_niklas> err
<_niklas> they don't, I mean D:
<_niklas> you might be able to use it without verifying, didn't check that
<_niklas> if you're thinking about signing up automatically - they have recaptcha
<marked> I'm not sure if we need that many
<marked> but it would be a lot easier without generating emails too
<teej_> erine: So you run the Python file in every project root directory?
<_niklas> oh
trc has joined #tumbledown
<_niklas> you can use tagged emails
<erine> nope, just run the python file on your Redis host.
<erine> and have the prometheus scraper running on your redis/stats host too
<marked> bots@internetarchive.org
<erine> I should probably use environs to set the redis host anyways
<marked> bot+army@
<teej_> Let me try.
<_niklas> yeah, that'll work if it works on internetarchive.org's side
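For illustration, tagged ("plus") addresses like the `bot+army@` example can be generated as below. This is only a sketch: `tagged_address` is a hypothetical helper, `example.org` is a stand-in domain, and whether a given mail host actually delivers plus-tagged mail is exactly the open question raised above.

```python
def tagged_address(base: str, tag: str) -> str:
    """Insert '+tag' before the '@' of a base address."""
    local, sep, domain = base.partition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a simple address: {base!r}")
    return f"{local}+{tag}@{domain}"


def tagged_addresses(base: str, count: int) -> list:
    """Generate `count` distinct tagged variants of one mailbox."""
    return [tagged_address(base, f"acct{i}") for i in range(1, count + 1)]


addresses = tagged_addresses("bot@example.org", 3)
```

All variants land in the same mailbox if the host supports tagging, so no extra mailboxes need to be created.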
<kiska> _niklas: You did let both test runs complete right?
<marked> do we know who has the largest storage ?
<_niklas> nope
<_niklas> those are 10 minute runs
<_niklas> I can do a complete one of a small blog if you have any on hand?
<marked> our test blogs have been 9volt-art
<teej_> erine: Is it Python 3?
<kiska> I am just surprised at how much of a difference it is
<_niklas> alright, gonna run that one
edgivesup has quit [Read error: Operation timed out]
<marked> I wouldn't call it small , 500 posts.
<erine> 3 but I believe it should be 2 compatible
<teej_> I keep getting connection refused stuff.
<erine> I haven't been using any 3 features IIRC
<teej_> "ConnectionRefusedError: [Errno 61] Connection refused"
<erine> Redis running?
<teej_> Oh. I don't know how to use Redis.
<teej_> I just installed it from pip3.
<_niklas> ok, just started 9volt-art
<ranma> there's no way to use multiple connections on a single blog is there?
<marked> not as of yet
<marked> we're thinking about it
<_niklas> how many URL fetches has 9volt-art been previously?
<teej_> ranma: There is, but it hasn't been made for the wget-lua configuration we have yet.
<ranma> this is annoying getting stuck on large-seeming blogs
<erin> just hit 100GB uploaded! yay :)
<teej_> ranma: That's understandable. I'm also annoyed.
<teej_> erin: Congratulations!
<erin> tyty! using a Huge Box is paying off :)
<marked> let us know if a crawl goes sideways or death-spirals
<erin> i am going to go to sleep and hope that nothing OOMs overnight
<marked> OOMs can be your alarm clock
<erin> and then slam together something based on erine's scripts tomorrow to monitor crawls more closely
<erine> how huge is your box?
<erine> I've been trying to make things work with my tight wallet and slamming together some big scaleways but it hasn't been able to match up?
<erin> erine: 48gb ram, 960gb disk
<showm> also, once something has been uploaded to the main archive, does it get deleted from my side?
<erine> dedi host?
<ranma> mine's 32gb ram 640GB disk
<teej_> Fusl hit 10000 items.
<erine> or something like EC2
<kiska> Once it gets uploaded it gets deleted, and you start a new thing
<ranma> vps
<erin> erine: DigitalOcean
<erine> !
<erin> i was initially doing 10x smaller ones + Volumes block storage
<erin> but i back-of-the-enveloped it
<erin> and this was cheaper
trc has quit [Quit: AndroIRC - Android IRC Client ( http://www.androirc.com )]
<ranma> how much concurrency are you running, erin?
<erin> 200
<_niklas> regarding the difference: it's clearly not throttling, it's requests getting shoved to the back of the queue or something like that. the fastest googlebot fetches are as fast as the slowest chrome fetches
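One way to check that observation against logged fetch times, as a sketch: compare the fastest fetch under one UA with the slowest under the other. The function names are hypothetical and the sample numbers are made up for illustration, not measurements from this crawl.

```python
from statistics import median


def summarize(samples_ms):
    """Min / median / max of a list of fetch times in milliseconds."""
    return {"min": min(samples_ms), "median": median(samples_ms), "max": max(samples_ms)}


def slow_starts_where_fast_ends(fast_ms, slow_ms):
    """True when even the best 'slow' fetch is no faster than the worst 'fast' one."""
    return min(slow_ms) >= max(fast_ms)


# Made-up illustrative timings, in milliseconds.
chrome_ms = [310, 340, 380, 400]
googlebot_ms = [400, 900, 1500, 2600]

chrome_summary = summarize(chrome_ms)
googlebot_summary = summarize(googlebot_ms)
barely_overlapping = slow_starts_where_fast_ends(chrome_ms, googlebot_ms)
```

A fixed rate limit would tend to show up as a uniform delay; a wide spread that begins where the other distribution ends is more in line with the "back of the queue" reading, though this check alone does not prove it.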
<erin> 2 per run-pipeline invocation
<ranma> lol, shit, i should bump mine up
<erin> i used trvz's suggestions for setting stuff up
<ranma> this was weird
<erin> unfortunately i don't really know how to scale this up more cheaply
<kiska> _niklas: I meant what it grabbed
<kiska> I had a look at the logs
<erin> i'd like to get more throughput on this but i don't really want to spend a ridiculous amount of money on DO
<_niklas> ah, yeah
<erine> my setup currently is two C2Ls and one C2S with 4 150GB volumes on scaleway https://www.scaleway.com/pricing/
<erine> is this bad?
<teej_> kiska _niklas marked: Do you know the answer to showm's question? I think it doesn't get deleted.
<kiska> Once it finishes uploading, it does remove the files from your instance
<erin> erine: definitely cheaper than my setup :p
<showm> great
<teej_> Never mind. kiska answered it.
<erine> gluster overhead is fucking me but it's still lower than the 1 gigabit internal bandwidth :P
<_niklas> I should try it on chrome UA, no session cookie, just GDPR cookie as well
<ranma> anyone seen a file like this?
<erine> just a link post
<teej_> Don't run it!
<_niklas> probably the link detector malfunctioning
<erin> the C2Ls look pretttttty cheap
<erin> i am considering either hacking more stuff together on scaleway now
<erine> wouldn't suggest the C2Ls if you want space
<teej_> It could be a hacker.
<erin> or trying weird kubernetes things on the GCP free trial
<erine> haven't been breaking the 32GB RAM or hell, 16
<erin> i'm chewing through ram like crazy
<erine> if I knew about this a few days ago, haha would use C2S or C2M instead
<erine> more than 8 or 16?
<erin> erine: oh, why?
<mbp> someone should put JAA's monitoring script in the topic
wp494 has quit [Ping timeout: 492 seconds]
<erin> i have 48 gigs of ram and ~20 are currently used
<erine> Fuck.
wp494 has joined #tumbledown
<erin> tfw 200 concurrent threads (´;ω;`)
<erine> you absolute mad lad
* erin is a mad lass confirmed
<erin> why would you use C2S or C2M instead, though, btw?
<erine> my wallet (╥_╥)
<ranma> using 3GB on 12 concurrency
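A back-of-the-envelope sketch for sizing concurrency from a figure like "3GB on 12 concurrency". The helper names and the 80% headroom are assumptions; since RAM use is reported to grow over time, the result is an upper bound to monitor, not a safe setting.

```python
import math


def per_worker_gb(used_gb: float, workers: int) -> float:
    """Average RAM per concurrent worker, from one observed snapshot."""
    return used_gb / workers


def max_workers(ram_gb: float, gb_per_worker: float, headroom: float = 0.8) -> int:
    """Workers that fit if only `headroom` of total RAM is budgeted."""
    return math.floor(ram_gb * headroom / gb_per_worker)


observed = per_worker_gb(3, 12)      # 3 GB across 12 workers
estimate = max_workers(32, observed)  # a 32 GB box, as mentioned in the channel
```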
<erine> I'm limiting myself to $30 total spending
<erin> ohh
<erin> yeah
<erine> So far for 5 days, just 15 dollars!
<erin> i figured running this thing for 5 days is not great but
<erin> i'll probably replace it with something else halfway through anyway
<erine> but dear god having this much space with a VM provider is insane
<erine> Filesystem Size Used Avail Use% Mounted on
<erine> localhost:/gv1 1.1T 117G 920G 12% /mnt/gv1
<erin> is that a scaleway attached "SSD Volume"
<erine> gluster
<mbp> giving the warrior just 400MB memory is stingy
<erine> but yeah, with scaleway volumes in RAID0
<erin> did you manually set up gluster or is that just how scaleway volumes are backed
<erine> manual
<erin> why are you using gluster lol
<erine> to glue together their volumes + "dedicated" SSDs
<erin> and not just like, a fixed size block device per dedi
<erine> at my peak, I was at 350GB used
<ranma> if i kill some blog backups to restart with more concurrency, should i readd them to the goo.gl link?
<erine> so I slapped on another node for disk
<erin> hmmm i guess i just don't see a compelling reason to have the overhead of gluster
<erine> Extra 600!
<erine> good point
<erin> do you have a separate vps running the fs server for it, then?
<kiska> ranma: They'll be released in due time
<ranma> 350GB with how much concurrency, erine?
<erine> two warriors at 15 concurrency, pre note patch
<_niklas> ok 9volt-art is done
<kiska> "Premium outbound bandwidth just $0.05/GB" *sigh*
<ranma> when was that patch?
<teej_> ranma: No. They will automatically get re-queued by kiska.
<_niklas> anywhere I should upload data/ from that to?
<erine> kiska: looking at packet?
<kiska> Yes
<teej_> ranma: What patch?
<erine> RIP
Seong has quit [Leaving]
<erine> that's one reason I've went with scaleway too
<erin> i am really tempted by scaleway after having had a very nice experience with them during 500pieces
<erine> "unmetered"
<mbp> JAA did you think about extending the monitoring script to also tracking likes and followers?
<erin> what's up with scaleway's C* instances being cheaper than the X64-* ones
<erin> i'd think bare metal would be /more/ expensive
<erine> the bare metals are atoms
<_niklas> marked: how long does 9volt-art usually take? just went through in ~17 minutes here
<erin> OH LOL
<erin> OKAY
<erin> good to know
<erine> Intel(R) Atom(TM) CPU C2750
<_niklas> to be fair those are the actually somewhat useful in some cases kind of atoms
<erine> that is the C2Ls
<erin> i have 12 vcpus on this DO box and it's Definitely More Than I Need
<erine> C2S is C2550
<ranma> erine said "pre patch" @ teej_
<marked> 17m is good
<ranma> was that patch in the last 24 hours, erine?
<kiska> I wanna transition myself to something better than the $340/mo DO thing
<erin> with more reasonable concurrency i could totally fit stuff on C* instances
edgivesup has joined #tumbledown
<erin> kiska: hahaha, same
<edgivesup> So if i want to run more than six threads per instance
<kiska> I'll likely stop using lightsail as well
<erin> scaleway is extremely compelling
<edgivesup> What's the best way around the cap
<_niklas> multiple warriors
<edgivesup> Docker directly
<erin> or just run the pipeline directly
<edgivesup> I've ruled out warriors for GCP
<erine> go try scaleway, maybe you two can find a better setup than mine
<mbp> edgivesup check the scripts link in the topic
<erine> one that is less money constrained
<kiska> vultr is shall we say small on the initial disk they give
<erin> vultr's memory is also eh
<erine> also another questionable choice that I believe is questionable is using btrfs raid0 to back my gluster bricks
<erin> that's a lotta damage but it's that's a lotta overhead
<horkermon> kiska: does the items repo reflect the latest done with the lists
<ranma> anyone think 50 concurrency on 640GB is ... not a *terrible* idea?
<_niklas> if you're monitoring it
<erine> knuckles would not be proud of my setup but it works, somehow?
<mbp> dooooo itttttttttt
<erin> erine: gotta learn to scale the wobs somewhere, questionable choices are just... um, learning experiences in disguise?
<erine> wobscale on the cheap
<erine> and probably designed like a fortune 500 bad decision factory
<erin> wob! scale! wob! scale! wOB! SCALE! WOB! SCALE!
<marked> _niklas : i remember it took an hour on the old scripts
<erine> also yeah, I'm on that IRC :D
<erin> it's weird having my nick just one character off from another person
<erine> but as my normal closeted name
<_niklas> alright
<erine> i've only made that decision because "oh fuck this person has the name I'm gonna take"
<erine> and "oh fuck it's too far to go back on efnet"
<erin> oh no!! sorry
<erine> it's ok LUL
<erin> i was hoping ilianaw would get bandwidth on this, she did a really awesome job with 500px
<erine> but on a more on topic note, I just added two more workers onto my data box, yolo
<psi> how much overhead does docker have as opposed to, say, kubernetes
<erine> extra 16C probably won't be a problem since we aren't in the hell of notes anymore
<erine> k8s has a hilariously large administrative overhead
<erine> just use raw docker
<_niklas> so anyway, again, I have a warc.gz from 9volt-art now, what should I do with it
<psi> yeah, i'm doing that now
<psi> Although I guess k8s scales better
<marked> what's your geography?
<erin> the only reason to use k8s would be if your hosting provider runs the cluster management stuff for you
<erin> cf DO, GCP
<_niklas> mine? server's in france, but the file's 276mb so whatever
<erin> oooooh ram usage is leveling off
<erin> comfy with this
* erin wanders off to sleep
<erine> my inner oh no just went off because I just spun up those workers with username:hunter2 from my example
<marked> I'm not sure of the filesize limit at https://transfer.sh/
<marked> but worth a try
<mbp> says 10gb right there
<teej_> Out of curiosity, will we ever archive all of GitHub?
<marked> would that just be git clone everything
<marked> well, I'm not saying you should do this, but the test to know is whether they'll ban your cookie or IP for going faster
<_niklas> hm?
<psi> erin: I just have a OVH VPS with an ubuntu install
<_niklas> unfortunately I don't have an idle box with enough disk space I'd feel comfortable running like dozens of workers on
<marked> I'd feel bad if your IP got banned
<psi> The VPS handles 3 dockers with 6 workers each pretty well
<psi> Mostly limited by bandwidth tbh lol
<kiska> I'll be back in 1 hr getting breakfast
<kiska> I also want to start using Chrome UA + generic login
<teej_> marked: IP banned from GitHub? Yeah.
<kiska> _niklas: Can you test with a single page how long UA: "ArchiveTeam" takes to process a page, then try GoogleBot and then Chrome UA?
<kiska> If ArchiveTeam UA takes the same as Chrome UA or similar time, I can start implementing a faster crawl
<ranma> is https://gitlab.com/snippets/1789114 any faster than just taking tumblr-grab script and running concurrency [whatever]?
<marked> like the same login cookie for everyone?
<kiska> We did it for some project not too long ago
Atom-- has quit [Read error: Connection reset by peer]
<_niklas> wait a minute
<_niklas> the GDPR cookie is user agent specific
<kiska> Huh?
<marked> when the login occurs the cookie needs to match
<marked> the later requests
<marked> idk, but that's what niklas testing seemed to imply
<_niklas> yeah
<marked> consistent UA or it'll that a HTTP code
<_niklas> for the session cookie that makes *some* sense
<marked> &throw
<_niklas> inconsistent UA and it'll treat you just as if you're logged out
<edgivesup> Good god there's like 50 different depends for building wget-lua
<edgivesup> xD
<kiska> edgivesup: have fun!
<_niklas> kiska: with an appropriate cookie for ArchiveTeam it takes the same time as with chrome for me
<_niklas> 300-400ms for fetching https://staff.tumblr.com from my home connection
<kiska> Oooh
<marked> how did you create the cookie?
<psi> also wow, Fusl hit 10k items
<_niklas> switched UA in my browser
<hook54321> Is it supposed to grab URLs like that?
<marked> maybe not, but if it doesn't interrupt the crawl, don't worry about it yet
<erine> erin: Also one thing about Scaleway and scaling at 125C. All 8 threads will get pegged by wget-lua
<marked> we have bugs that cause problems
<kiska> Bugs that cause problems are more of a concern than bugs that don't
<marked> could you screencap what a GDPR page looks like ?
<kiska> A better solution wget that and rsync it to me
<kiska> I need to see what requests it makes
<marked> most accurate would be a wget with warc output
<_niklas> you get 301'd to that URL like immediately
<marked> I believe you, but in the details we might find an edge case
<Fusl> marked: here is a HAR of surfing to that web page and then clicking the accept button: http://xor.meo.ws/JO9JbpuycOSq_xSjECtcrZZzq8p-YZZ7.txt
<_niklas> ty
<teej_> If I remove the STOP file, will the pipeline fetch new jobs again?
<Jens> Nope.
<Fusl> search for `"url": "https://www.tumblr.com/svc/privacy/consent",` in that har file, that's the first POST request sent off when clicking the accept button
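The same lookup can be done programmatically, since a HAR file is JSON with requests under `log.entries[].request`. A small sketch follows; `find_requests` is a hypothetical helper and the inline sample is a trimmed stand-in, not the contents of the linked HAR.

```python
import json

CONSENT_URL = "https://www.tumblr.com/svc/privacy/consent"


def find_requests(har: dict, url: str, method: str = "POST") -> list:
    """Return HAR entries whose request matches the given URL and method."""
    entries = har.get("log", {}).get("entries", [])
    return [
        entry
        for entry in entries
        if entry.get("request", {}).get("url") == url
        and entry.get("request", {}).get("method") == method
    ]


# Trimmed stand-in for a real HAR export, for illustration only.
sample_text = """
{"log": {"entries": [
  {"request": {"method": "GET",  "url": "https://www.tumblr.com/privacy/consent"}},
  {"request": {"method": "POST", "url": "https://www.tumblr.com/svc/privacy/consent"}}
]}}
"""
har = json.loads(sample_text)
consent_posts = find_requests(har, CONSENT_URL)
```

For a real file, `json.load(open(path))` would replace the inline sample; the matched entries then expose the request headers and POST body for inspection.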
<teej_> Then how do I make it continue?
<Jens> teej_: Restart pipeline.
<teej_> Darn.
<teej_> Okay. Thanks.
<_niklas> or start a second one
<teej_> I don't want to go to sleep and wake up finding that my computer froze because it ran out of available memory.
<edgivesup> Perhaps I'm missing something obvious, i found a build of wget-lua, got that installed, but now am running into "run-pipeline" not being a command
<_niklas> did you install seesaw
<teej_> edgivesup: Try run-pipeline3
<marked> my ideal would be to use niklas's work for the login required blogs
<teej_> edgivesup: `pip3 install seesaw`
<marked> and find a user agent that does bot bypass with a fast return
tiedsnug has joined #tumbledown
<_niklas> pretty sure you get either bot bypass or fast responses
<marked> but I haven't found that yet so it's a bit of a dream at the moment
<_niklas> not both with one UA
<teej_> Okay. Goodnight.
<marked> there's a type of scrape used in real time ads
tiedsnug has quit [Client Quit]
<_niklas> I've tried some oembed proxy UA's
<marked> whomever provides ads on tumblr needs priority access
<edgivesup> Ah yes, ye olde python3 requiring i append 3 to bloody everything
<marked> when an ad is called for, the ad network does a crawl of the page and finds something matching based on keywords
<mbp> you are going to have the best process on dec 17th when its too late :P
<_niklas> there's always the rest of tumblr
<ranma> so does https://gitlab.com/snippets/1789114 seem to be safe?
<marked> ranma, we're not as fast as some download apps yet but we're more permanent
<Fusl> try "Mediapartners-Google" or "(compatible; Mediapartners-Google/2.1; +http://www.google.com/bot.html)" as user-agent
<marked> what happens on links and lynx ?
<edgivesup> Gah why isn't it happy now
<marked> is there yahoo ad network ?
<ranma> i tried concurrency 50, then 40, but i couldn't start tumblr-grab that high
<marked> they're the most likely to be whitelisted
<marked> or verizon now
<ranma> i just got a screen terminated
<marked> either companies
<_niklas> mediapartners-google: same behaviour as googlebot
<Fusl> "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
<Jens> edgivesup: Which OS are you using?
<_niklas> slurp was the first one I tried
<edgivesup> Ubuntu
<edgivesup> 18.04
tiedsnug has joined #tumbledown
<Jens> pip3 install --user seesaw requests
<Jens> Should get what you need.
tiedsnug has quit [Client Quit]
<Jens> Did you get wget-lua compiled?
<edgivesup> I used the prebuilt ones linked to in the git
<kiska> Repo has wget-lua compiled for Ubuntu 18.04
<edgivesup> There's an 18.04 specific build?
<voltagex_> sigh, lost the other spot instance box
<Jens> Last I checked, it was a shared lib build, so you'd still need various libs installed.
<marked> the thing I don't get is the Privacy warning always talks about ads but I've never seen one on their site
<kiska> Wget-lua was built using 18.04 since that is what I used
<_niklas> I think they're covering all their bases
<edgivesup> The only packages i could find on the git link are pretty old
<_niklas> they do have ads but maybe not banner ads
<_niklas> just sponsored posts
<edgivesup> The "newest" is built against 13.04
<edgivesup> I feel like there's a disconnect here
<Jens> http://www.goatse.sx/wgetlua.txt <-- works on any debian
<Jens> and ubuntu
<kvikende> that domain name brings back memories
<Jens> People never trust what I put there :D
kev23 has joined #tumbledown
<_niklas> yeah, this is what tumblr ads look like: https://www.reddit.com/r/tumblr/comments/87xsjd/good_ol_tumblr_ads/dwhdey9/
<psi> I'm only using about 2G of RAM for 18 workers :thinking:
<_niklas> psi: *so far*
<psi> Probably lol
<edgivesup> As far as i can tell, the file worked
<edgivesup> But run-pipeline is still jargon to the console
<psi> But, at least I don't have massive disk usage anymore thanks to the PR
<voltagex_> wait, what the hell are all those gcc arguments?
<Jens> voltagex_: Hardening flags.
<voltagex_> for what purpose? 1.14 still has known CVEs in it
<voltagex_> also, gratz on that domain
<Jens> Yea, but it's Good Practice to compile things properly anyway.
<Jens> Original goatse was .cx, but .sx was close enough :D
<mbp> .cz would be the best
<Jens> As mentioned in scrollback long ago, wget-lua needs to be rebased on later wget soon, since it currently lacks >TLSv1 support.
<edgivesup> I don't suppose it would be simpler to find the configuration file for the docker and lift the concurrency cap?
<Jens> Which was added in some time in 1.16.
<kiska> edgivesup: https://github.com/ArchiveTeam/tumblr-grab/blob/master/wget-lua <- this was built 8 days ago
<marked> i wonder if the UA changes the rendering
<kiska> marked: Looks like you volunteered for that xD
<mbp> edgivesup where is the problem
<marked> what did I sign up for this time
<marked> kiska: would it be plausible that 1.20 fixes GC or memory leak ?
<Jens> Neat.
<marked> oops
<marked> Voltagex_ _ would 1.20 fix a memory leak or GC problems
<voltagex_> not a clue
<voltagex_> you seem to mistake me for someone who did not just repeatedly hit wget with a hammer until it compiled.
<mbp> is there another way?
<marked> well, you at least saw the change log more than anyone else
<Jens> The good old percussive maintenance.
edgaveup has joined #tumbledown
<voltagex_> marked: quite possible - https://www.google.com.au/search?q=wget+leak
<edgaveup> Ah fuck off Wi-Fi
<ranma> the upside to that snippet is that you can see how many posts a blog has
<edgaveup> Just to be clear, "run-pipeline" is supposed to be a command i can just launch, with the appropriate arguments
<edgaveup> Right?
<voltagex_> edgaveup: yes
drcd has joined #tumbledown
<voltagex_> edgaveup: after installing seesaw in your PATH using pip/pip3
<mbp> if it complains about the version being too old, be sure to do a git pull in the tumblr-grab directory
<marked> we should build it for warrior / debian / fedora
<edgaveup> Up to date
<edgaveup> It's a fresh clone anyway
<voltagex_> build what, marked? and a static build may increase memory pressure.
edgivesup has quit [Read error: Operation timed out]
<kiska> marked: After this grab, maybe I can test...
<edgaveup> Pip3 has no complaints either
<kiska> We have more urgent things to do
<voltagex_> agreed kiska
<voltagex_> edgaveup: what errors are you getting?
<marked> ok what's urgent
<voltagex_> edgaveup: what distro are you on
<edgaveup> Clearly i just can't get shit to work today xD
<edgaveup> Ubuntu 18.04
<voltagex_> edgaveup: I'm happy to help you if you can help me troubleshoot
<voltagex_> edgaveup: at a guess, run-pipeline is somewhere under ~/.local/ and that's not in your PATH
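A quick way to confirm that guess, as a sketch. It assumes `pip3 install --user` put the scripts in `~/.local/bin`, which is the usual location on Linux but is an assumption about this particular setup; both helper names are hypothetical.

```python
import os


def dir_on_path(directory: str, path_env: str) -> bool:
    """Check whether `directory` is one of the entries in a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_env.split(os.pathsep)
        if entry
    ]
    return target in entries


def missing_path_hint(directory: str = "~/.local/bin", path_env=None):
    """Return a shell line that would add `directory` to PATH, or None if present."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    if dir_on_path(directory, path_env):
        return None
    return f'export PATH="{os.path.expanduser(directory)}:$PATH"'


hint = missing_path_hint()
```

If `hint` is not `None`, adding that line to the shell profile and opening a new shell should make `run-pipeline` resolvable, provided the script really lives in that directory.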
<kiska> raingloom: Unfortunately the way the items are laid out, we can't split it, since we don't know how many posts are in each blog
<kiska> edgaveup: What happens when you run "pip install --upgrade seesaw"
<_niklas> <marked> I wonder if the UA changes the rendering
<_niklas> here's the googlebot/chrome difference on blog index I mentioned earlier
<kiska> So the tracking script changes *slightly*
<marked> and someone got a reblog in between your two requests
<_niklas> yeah
<_niklas> the one that has newlines in it and is only in the chrome response is https://www.cedexis.com/radar/
<_niklas> wonder why they're excluding specifically that
<kiska> "Real User Monitoring"... So tracking
<marked> the first line is what looks important to me
<marked> what wraps that 1st line ?
<edgaveup> Seems like it was the PATH screwing up
fred77 has joined #tumbledown
<kiska> Time for me to beautify the first <script> line, since I want to be able to read it
<_niklas> what wraps it is that script tag with newlines
<_niklas> that's just missing from the googlebot response
<fred77> hey, any way to see what blogs are being archived? I submitted a few a few hours ago and am curious if they are on the list now?
kev23 has quit [Quit: Leaving]
<marked> we dont reload continuously
<kiska> fred77: It's likely we haven't included them yet since we are working on quite old lists
<kiska> And neither do we refresh the form every time someone inserts something
<edgaveup> Finally
<edgaveup> Success
<fred77> kiska: so will they get on the list or is it too late?
<edgaveup> Hopefully, i can now do that all again on a fresh GCP instance
<kiska> fred77: Once todo is <5k we should start inserting more things: https://tracker.archiveteam.org/tumblr/
<fred77> okay thanks
<marked> nobody knows fred77, we'll do as much as possible until it cuts off
JL421 has joined #tumbledown
<marked> if you need a personal archive, you should do that
<voltagex_> marked: is there a better script to do a personal archive?
<_niklas> the first parameter in that first /impixu href gets a lot more data with googlebot, for some reason
<voltagex_> Jens: what wget-lua source are you using when you build?
<Jens> Ancient sources linked in the tumblr project
<marked> I haven't used any. I've heard of three. 2 are listed on the wiki
<kiska> _niklas: I wish I could tell you why it does that
<marked> someone here said they have a web server that will do a crawl by API
<Jens> I tried briefly to make a static compile, but I'm way out of my comfort zone here, and my hammer approach isn't working.
<marked> If non tech users
<_niklas> for less tech users, there's tumblthree
<_niklas> GUI app
<kiska> marked: I made a new branch called UAtesting
<marked> cool
<edgaveup> Thanks for the help, I'm going to go and configure some instances
<edgaveup> Oh wait
<marked> we'll look for you on the board
<edgaveup> Is there an argument to tell it where to dump its scraped data?
<_niklas> does the tracking thing seem important? the crawl differences surely come from elsewhere
<marked> no tracking is better actually but it doesn't matter
<edgaveup> I'd prefer it dump to the physical NVME scratch disk instead of the virtual system boot drive
<marked> IA breaks all that stuff
<kiska> edgaveup: It'll put the data wherever you have tumblr-grab
<edgaveup> Fair enough
<edgaveup> I'll just copy tumblr-grab directly into the scratch disk
<marked> so how many types of blogs do we need to test
<Jens> You can symlink "data/" to wherever you want.
<marked> sfw, nsfw, login required ...
<edgaveup> How much RAM was ideal per thread again?
<kiska> My recommendation was 1GB per job
<marked> cd data
<kiska> You can scale all you want
<kiska> Just be aware of the OOM-killer
<edgaveup> So doing concurrency 25 on a 4GB instance might not end well?
<edgaveup> xX
<edgaveup> xD
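The arithmetic behind kiska's 1GB-per-job rule of thumb, as a small Python sketch. The 0.5 GB held back for the operating system is an assumption added for illustration, not a figure from the chat.

```python
def max_safe_jobs(ram_gb, gb_per_job=1.0, reserved_gb=0.5):
    """Estimate how many concurrent jobs fit in RAM before the OOM-killer is a risk.

    gb_per_job follows the 1GB-per-job recommendation; reserved_gb is an
    assumed allowance for the OS and everything else on the machine.
    """
    usable = max(ram_gb - reserved_gb, 0)
    return int(usable // gb_per_job)


if __name__ == "__main__":
    # A 4GB instance leaves room for roughly 3 jobs under these assumptions,
    # so a concurrency of 25 is far over budget.
    print(max_safe_jobs(4))
```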
<JAA> mbp: No, I didn't think of extending my script to likes/followers yet. I'll look into it later.
<JAA> Also, if anyone has changes that could be beneficial to others, e.g. Docker and/or warrior VM support, I'm happy to integrate them into my upstream script so we don't end up with a dozen different versions.
<kiska> JAA: Who has access to the archiveteam.org email?
<mbp> JAA https://linx.li/8f32h495.sh this works for the docker/warrior, credit to mtntmnky
<mbp> only difference is accessing the log and warcfile inside the docker
<mbp> oh and the dependencies, obviously
<mbp> apk add bash procps sed gawk curl grep coreutils
<mbp> that should cover everything
<mbp> also maybe divide the warcfile size by 1024² to get a more human-readable output
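mbp's suggestion amounts to converting a byte count to MiB. A short Python sketch follows; the monitor script itself is a shell script not shown here, so these function names are illustrative only.

```python
def bytes_to_mib(size_bytes):
    """Convert a size in bytes to MiB by dividing by 1024 squared."""
    return size_bytes / 1024 ** 2


def format_mib(size_bytes):
    """Render a byte count as a short human-readable MiB string."""
    return f"{bytes_to_mib(size_bytes):.1f} MiB"


if __name__ == "__main__":
    print(format_mib(149880832))
```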
<Ryz> According to the tracker, there are currently around less than 58,000 items left
<marked> we have more unloaded
<Ryz> Is there going to be more to archive that would increase that number?
<kiska> Definitely, we estimated like 250m blogs that tumblr hosts
<kiska> The amount of nsfw is unknown but I'd say 90% or more
<Ryz> That's beyond Tumblr NSFW stuff?
<_niklas> I'm looking at the crawl URL differences
<psi> At least we're doing >1k/h now
<_niklas> I *think* this is related posts
<_niklas> suggestions varying by UA
<ranma> here's what that snippet with tumblr-monitor looks like https://pastebin.com/raw/B3uzZEuF
<ranma> nice seeing how many things there are to leech
<mbp> feels like the scripts are running pretty well now
<kiska> It'll probably run better with a different UA
<psi> mbp does that script run inside or outside the docker
<kiska> I am going to try "ArchiveTeam"
showm has quit [Leaving]
<mbp> i am running it in the VM
<mbp> which would still be outside of the docker
<psi> hmm
<JAA> kiska: SketchCow does.
<mbp> apk add openssh
<mbp> and so on
<kiska> I was going to point a new tumblr account at an archiveteam.org email
<psi> I'm running it on a VPS using the dockerfile
<mbp> then i dont see why it wouldnt work
<marked> did we figure out if we need to verify accounts ?
<kiska> marked: Making an account now
<JAA> mbp: Thanks. I probably won't include the installation directly in the script but add it to the readme.
<marked> it didn't seem to me to matter which email you gave it, it was just a login method
<_niklas> reminder tumblr lets you do email+tag@ if the mailserver supports it
<psi> oh yeah that works beautifully mbp
<mbp> run it with | sort -n -k 8
<psi> oh there's a blog with 116k posts in there
<psi> nice
<kiska> marked: Yes, we do need to verify emails
<JAA> Just use some email you control. I created one on some random free provider in the past for this.
<kiska> Apparently I already used archivebot01@gmail.com....
<trvz> append +something
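The "+something" trick that _niklas and trvz mention can be sketched in Python as below. As noted above, it only works if the mail server supports plus-addressing; the addresses used here are placeholders.

```python
def plus_address(email, tag):
    """Insert +tag before the @ of an address, e.g. user@host -> user+tag@host."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a usable email address: {email!r}")
    return f"{local}+{tag}@{domain}"


if __name__ == "__main__":
    # example.org is a reserved placeholder domain.
    print(plus_address("someone@example.org", "tumblr01"))
```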
<ranma> where's that script, psi?
<ranma> or how do you get notes?
<ranma> er mbp
<psi> lol
<ranma> ah kk
<psi> hmm
<mbp> 10minutemail.com
<psi> could i have this constantly running to act as a dashboard
<_niklas> okay I diffed a post fetch between googlebot and chrome UA
<psi> or is it too memory intensive for that
<_niklas> chrome gets a lot more "related posts" crap
<_niklas> that's where the crawl differences are from
<voltagex_> use watch -n 600 to run it every 10 minutes or so
<JAA> psi: That probably won't work well. I run it once per minute on my machine.
<psi> more importantly, could i expose it to the internet so I don't have 3 warrior tabs open constantly ^-^
<mbp> in a perfect world it would be integrated in the webui :)
<psi> sadly it isn't a perfect world
<kiska> So more bs... Right we might just want to grab them to discover more, I guess....
<psi> but one overarching web dashboard would be amazing
<_niklas> that might not actually make much of a difference for a full crawl
<_niklas> related posts are mostly (only?) on the same blog
<_niklas> I forget if they redirect to the canonical URL or cause duplication, but you'll be getting the same posts still
<psi> related posts are always same blog
<_niklas> it does make the responses an awful lot bigger
<_niklas> 132kb with googlebot vs 228kb with chrome
teej_ has quit [Ping timeout: 252 seconds]
<edgaveup> That only took like 15 mins of setup
<edgaveup> It's off scraping now
pnJay has quit [Ping timeout: 260 seconds]
diggan has quit [Ping timeout: 260 seconds]
voltagex_ has quit [Ping timeout: 260 seconds]
diggan has joined #tumbledown
psi has quit [Ping timeout: 260 seconds]
Dj-Wawa has quit [Ping timeout: 260 seconds]
<edgaveup> Nice! First successful upload from the instance on the tracker
HCross has quit [Ping timeout: 260 seconds]
pnJay has joined #tumbledown
voltagex_ has joined #tumbledown
psi has joined #tumbledown
Dj-Wawa has joined #tumbledown
HCross has joined #tumbledown
trvz has quit [Ping timeout: 260 seconds]
echarlie has quit [Ping timeout: 615 seconds]
levelch has quit [Ping timeout: 615 seconds]
octarine has quit [Ping timeout: 260 seconds]
<mbp> these kinds of links are being hit a lot, just with different unix times
trvz has joined #tumbledown
<horkermon> deploying a tracker for testing the massive scaling i'm intending isn't going great via https://www.archiveteam.org/index.php?title=Dev/Tracker
<horkermon> is there more updated info on that
<kiska> Tracker testing is meant to be small
octarine has joined #tumbledown
<kiska> No it was written for Ubuntu 12.04 I think
<horkermon> is yours running on that?
teej_ has joined #tumbledown
<kiska> Yes in a 256MB VM
<kiska> The official tracker is run by someone else and has different configurations
echarlie has joined #tumbledown
levelch has joined #tumbledown
<kiska> _niklas: What cookies are required to get past the login stuff?
<horkermon> i want to get the queueing of giant lists handled asap and it's only sane for me to scale at 4 and 5-figure concurrency with that in mind, so that's what i mean by testing
<kiska> I'll wait for Kaz to get on so he can add them to the queue
<kiska> I don't think it'll be a problem for the tracker nor the rsync targets
<voltagex_> few hours late, but I've got another 120 concurrent running
<Jens> voltagex_: How tested is your 1.20 wget-lua?
<Jens> I'm assuming "barely".
<voltagex_> none
<voltagex_> went by the wayside when I realised when the deadline was
<Fusl> i just built wget-lua against wget 1.20.34-eaeef-dirty, /looks/ like everything is working the way it should
<Jens> Well, it'll be useful in the future.
<Jens> 1.14 is embarrassingly old.
<Fusl> "embarrassingly old"?
<Fusl> 1.14 is SHOCKINGLY old.
<Jens> It was originally written on stone tablets.
<voltagex_> Fusl: how many bloody patched wget trees do we have? Remind me not to spend 4 hours on it next time
<Jens> I didn't know there was any.
<psi> Oh my.
<Fusl> voltagex_: mine is actually just wget-lua with fast forward of the official wget tree
* voltagex_ is off to play BFV and not think about this for a while
<psi> refreshed every minute and a half
<voltagex_> Fusl: I'd be interested in how you did that because I had to fight git every second of those hours.
<Fusl> `git merge`
<voltagex_> that's what I did too.
<Fusl> welp
<voltagex_> I had so many merge conflicts.
<Fusl> i had just a few
<Fusl> like 4
<Fusl> you might have git merged it the wrong way around
<Fusl> you know. it makes a difference which way you merge it
<voltagex_> okay, thanks
<Fusl> so what i did was clone the mirror/wget repo, then add the alard/wget-lua as "lua" remote, then just "git fetch lua" and "git merge lua/lua", that gave me just a few files to deconflict
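Fusl's merge order, written out as a Python sketch that only builds the git command lines. The repository URLs are not given in the chat, so they are passed in as arguments; the function names and the working directory name are made up for illustration.

```python
import subprocess


def merge_plan(upstream_url, lua_url, workdir="wget"):
    """Return the git commands for merging the lua branch into a fresh upstream clone.

    The direction matters: upstream wget is cloned first, then the wget-lua
    remote (named "lua", branch "lua") is merged into it.
    """
    return [
        ["git", "clone", upstream_url, workdir],
        ["git", "-C", workdir, "remote", "add", "lua", lua_url],
        ["git", "-C", workdir, "fetch", "lua"],
        ["git", "-C", workdir, "merge", "lua/lua"],
    ]


def run_plan(plan):
    """Run each command in order, stopping at the first failure (e.g. merge conflicts)."""
    for cmd in plan:
        subprocess.run(cmd, check=True)
```

Calling run_plan(merge_plan(<mirror/wget URL>, <alard/wget-lua URL>)) would stop at the merge step if there are conflicts, leaving them to be resolved by hand.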
<voltagex_> balls
<HCross> kiska: how is FOS holding?
<kiska> HCross: I have no clue
<HCross> Ah right. I got an offer from DataPacket to replace the HDDs with smaller ssds
<kiska> *smaller* SSD's
<Fusl> your "Fix duplicate definitions of http_stat" was just an incorrectly solved merge conflict, i happened to fall into that trap as well and just had to move the updated struct from http.c to http.h
<kiska> I don't think that would be productive, since we are about to ramp up
<psi> by the way mbp would it be possible to add the # of requests done by the warrior to the tumblr monitor or is that too much to ask
<kiska> I want to change the UA from GoogleBot to ArchiveTeam, and that should speed up the transfers significantly
<HCross> kiska: it's the fact that we're utterly slamming the HDD array
<kiska> We are?
<Fusl> get some ssds as write cache :P
<HCross> I only have 4 bays
<kiska> _niklas: What are the names of the cookies I need to pass to wget?
<Fusl> HCross: pcie ssds it is then :D
<mbp> psi, JAA wrote the script, im just a user
<psi> fair enough
<HCross> Fusl: £££
<kiska> How much smaller would the array be? And what is the downtime?
<HCross> Going from 8tb to 2tb, probably an hour or so
<kiska> Hrm...
<kiska> RAMdisk?
<HCross> Not enough ram
<Fusl> zfs set sync=disabled --all
<kiska> HCross: What is your traffic like? I can disable your rsync target
<kiska> Or do you think that 2TB is going to be enough?
<HCross> I'm seeing 800Mbps in, 300Mbps out
<kiska> Ok the other question is how full are the disks?
<_niklas> kiska: pfg (gdpr) and pfx (session), you'll need someone with an EU IP to generate the former
<_niklas> I'm afk now and will be for a while, sorry
<kiska> Then I shall use a VPN to the EU!
<kiska> Grab those cookies then we shall crawl with speed!
<Fusl_> i posted a HAR file earlier
<HCross> kiska: pull me out asap please
<HCross> SSD array is full too
<HCross> I'll let this all wash down, then re add me in a bit
<kiska> Deactivating "rsync://archiveteam.hawc.eu/tumbledown/:downloader/"
<HCross> Thank you. Traffic is dropping slowly now, but the ssds are 99% full
<kiska> Hope SketchCow target can keep up
<HCross> I was seeing nearly a gigabit in
<edgaveup> I was about to ask if that meant he was about to see all the traffic at his doorstep
<HCross> Once this has pushed off to the HDD then we can revisit
<HCross> SketchCow: another 1Gbps on the way to you
<kiska> Should I try and spin up a target on vultr's dedicated SSD instances?
<HCross> You'll hammer bandwidth
<kiska> 40TB enough?
<kiska> It'll only run for a few days at most
<kiska> HCross: Wanna ask them for a 8 bay server?
kode54 has quit [Quit: ZNC 1.7.1 - https://znc.in]
<HCross> I will, but the cost will be huge
<erine> First time with ncurses!
kode54 has joined #tumbledown
<kiska> We have traced where the slowness is, I am implementing a fix shortly, just testing
pornmogul has joined #tumbledown
<kiska> SketchCow: How are your drives? And how does bandwidth look?
Knuckx has joined #tumbledown
<SketchCow> So, I split creation of megaWARCs and uploading
<SketchCow> And didn't make it restart
<SketchCow> As a result, there's a backlog
<psi> erine: real shit?
<erine> but a little emphasis on the shit part
Knuckx has quit [Client Quit]
<psi> oh now i'm just sad
<psi> but yeah that's pretty sexy
<kiska> SketchCow: How full are your drives? HCross has to pull out
<SketchCow> 50%
<SketchCow> But I'm running the uploads now
<erine> pairs with the modded prometheus importer https://pastebin.com/pikuKGrH
horkermon has quit [Quit: Leaving]
<psi> hmm
<psi> if i run this code manually it works
<psi> put if i plug it into gotty it doesn't
<psi> but*
horkermon has joined #tumbledown
<erine> it also likes crashing when your terminal is too small vertically
<psi> no not your code dw
<erine> ah ok
<psi> `$GOBIN/gotty -w -c ThePsionic:RuneScape12 --title-format "Psi's AT Warriors" watch -n 90 "bash /home/psi/tumblr-monitor.sh | sort -n -k 8"`
<trvz> erine: wtf is that font
<erine> comic sans ms
<psi> oops
<psi> time to change my PW :^]
<erine> what do you mean, I see *********?
<psi> funny
<erine> but don't worry, we won't hijack your warrior settings
<erine> can't say the same about the lurkers
<odemg> HCross, your nic intf is enp2s0 right?
<SketchCow> So, first, fucking congratulations, I can see the spike
<psi> but yeah why would `$GOBIN/gotty --title-format "Psi's AT Warriors" watch -n 90 "bash /home/psi/tumblr-monitor.sh | sort -n -k 8"` not output any data
<SketchCow> We're going to miss so much but we've gotten so much
<psi> while it runs properly manually
<kiska> We should hopefully speed this up
<SketchCow> Second, I've got two batches now running to draw the 5tb on the box into the archive
<SketchCow> FOS is FOS, it has always been this way
<SketchCow> Dependable in some ways, slow in others
<SketchCow> I've again brought up how this is critical stuff and I wish it had more power
<HCross> odemg: yes
<odemg> HCross, sound
<HCross> I've got a quote from datapacket to upgrade the box
<trvz> so people shouldn't donate to the IA until FOS is better?
<kiska> And?
<odemg> HCross, info?
<HCross> $10 a month more to replace the 2x4tb drives with 2x1tb SSD
<kiska> And how much are you already paying for it?
<odemg> does that make sense? you're on 1Gbit so not bottlenecked by the rust?
<HCross> A lot
<odemg> HCross, you got a btc address?
<kiska> odemg: He has 44% iowait xD
<HCross> odemg: it is the rust, because it can't cope with being hammered with megawarcs coming off the SSDs, onto the rust, then off to the world
<odemg> ahh
<odemg> fucked up shit is I've got 2x8TB nvme drives sat feet from your server, been there 5 months waiting to be installed in the-eyes server
<kiska> So do we know any warc diff'ing tools? I need to diff the warc from grabbing with GoogleBot and with ArchiveTeam and cookies
<odemg> their remote hands/your hardware policy and cost is ballocks
<Nemo_bis> On a machine with 16 cores, 120 GiB RAM and 4 TiB disk, increasing the dirty ratio worked wonders for me to increase disk throughput https://unix.stackexchange.com/a/41831/59808
<Nemo_bis> Otherwise those warc files keep bouncing from memory to disk
<SketchCow> kiska: Remember, you can always enjoy the hell of watching the FOS directory directly: http://fos.textfiles.com/tumblr/
<trvz> HCross: are you on E3 v5/v6 and 32GB RAM?
<JAA> odemg: 2 x 8 TB NVMe? Wow, that must be expensive.
<SketchCow> current is "being built up" set from uploads. archive is where the magic happens
<SketchCow> ALTERNATE and DOWN are two separate directories I set so three different sets of files can go up
<odemg> JAA, free actually SG/Samsung like me a little bit, sometimes
<JAA> Oh, nice.
<HCross> odemg: how much is remote hands
<odemg> 80eur last I asked, however that was just first line and I didn't follow it up with my rep
<kiska> JAA: I want to change the UA so we can get better speed + login-walled blogs, so I am making a grab with GoogleBot as the UA, and I am going to make another grab with "ArchiveTeam" and cookies
<SketchCow> So, I'm going to go back to bed for a tad
<SketchCow> Since it's 3am
<odemg> zzz
<SketchCow> And I happened to wake up and I just realized FOS was probably done with the last batch
<erine> One thing I would suggest testing with logged in cookies
<SketchCow> (It will continue to work while I sleep if it gets through the back)
<odemg> and I'm going crimbo shopping, catch you all later <3
<erine> Will 200-300+ concurrency get us rate limited faster?
<SketchCow> But yeah, if there are other hosts like HCross who can be targets, have that at the ready
<horkermon> How is snatching FOS contents best done
<SketchCow> Explain snatching FOS contents
<HCross> SketchCow: I'm out for a few hours at least, while I smash through this 2tb queue and then potentially get hardware upgrades sorted
<SketchCow> IN THEORY we will hold up
<horkermon> They get megawarc'd and upped to IA but download from after that point is going to be prohibitively slow
<trvz> is there a chance that a lot more warriors will join the pool?
<horkermon> my overall aim rn is to aggregate as much as possible without much concern for whether it's IA-bound, bc I have an archival nonprofit so later aggregation is a given
<SketchCow> Well, first, I just killed http://fos.textfiles.com/tumblr in the off chance someone is direct downloading, that's bad
<SketchCow> People who are not talking crazytalk can look at http://fos.textfiles.com/pipeline.html to see how disk space is going
<SketchCow> (You want to be watching /2, that has the stuff)
<SketchCow> And now I have to ask horkermon exactly what the dilly-fuck they are up to.
<SketchCow> Horkermon, what the dilly-fuck are you up to
<Nemo_bis> horkermon: if you want to download down the road, you should use the torrents (and keep seeding them so that other downloaders have a chance too)
<horkermon> trying to maximize collection before the deadline is all
<horkermon> nothing dodgy
<trvz> think you should read the tutorial again
<erine> all data will eventually be uploaded to the IA
<SketchCow> If you are doing something FROM FOS, then you are certainly not
<erine> FOS is the holding place for all the shit we've downloaded
<erine> as in it's already safe
<horkermon> no i'm not concerned about the contents
<HCross> My Prague machine is streaming data out to the IA rapidly now
<horkermon> the metadata says what's already done
<erine> torrent the finalized WARCs!
<erine> also it's sorta sad the new warcs don't have david karp thumbnails anymore
<SketchCow> They'll get the thumbnails when I have time
<erine> <3
<horkermon> eh the CDX etc will say enough
<psi> welp, the docker based tumblr-monitor doesn't play nice with watch on my machine
<psi> oh well
<SketchCow> OK, anything else?
<SketchCow> I'll be back up in another 4-5 hours anyway
<foureyes> my clients won't upload to fos... @ERROR: max connections (120) reached -- try again later
<SketchCow> (I just went to bed at 8pm for some reason, hence up again)
<kiska> SketchCow: Before you go, can you fix the rsync issue?
<SketchCow> Oh
<SketchCow> You mean
<SketchCow> How the machine is strapped
<SketchCow> So we can "fix it"
<SketchCow> By doubling connections
<SketchCow> That issue?
<kiska> xD
<SketchCow> Hey, this waitress seems really weighed down by 120 glasses
<SketchCow> Let's put 120 more glasses on her
<Nemo_bis> But on a nice plate!
<SketchCow> MUCH nicer plate
<kiska> foureyes: It'll retry every 30 seconds until rsync can upload, we are having some.... capacity issues
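The behaviour kiska describes (retry every 30 seconds until the upload goes through) looks roughly like the Python sketch below. This is not the actual seesaw code, which is not shown in the chat; the function and parameter names are illustrative.

```python
import time


def upload_with_retry(upload, delay_seconds=30, sleep=time.sleep, max_attempts=None):
    """Call upload() until it returns True, pausing delay_seconds between tries.

    Returns the number of attempts used. max_attempts=None retries forever,
    matching the behaviour described above; sleep is injectable for testing.
    """
    attempts = 0
    while True:
        attempts += 1
        if upload():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError(f"upload still failing after {attempts} attempts")
        sleep(delay_seconds)
```

Because a failed attempt only costs a wait, a full target delays the data rather than losing it.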
<JAA> Yeah, ArchiveBot's blocked by it as well.
<SketchCow> We need another target!
<JAA> Indeed
<SketchCow> Also, jesus the archivebot thing pisses me off
<foureyes> kiska: i see. i'll just let it retry then.
<SketchCow> If I stop and start rsync, what happens
<SketchCow> It's entirely possible we could have a pile of dead connections
yeah568 has quit [Read error: Connection reset by peer]
<JAA> Probably the existing connections are broken and the clients will retry the upload. Whether there will be any partial uploads remaining (with those random temporary filenames rsync uses) I don't know.
yeah568 has joined #tumbledown
<SketchCow> Let's find out!
<JAA> rsync: read error: Connection reset by peer (104)
<JAA> rsync error: error in socket IO (code 10) at io.c(785) [sender=3.1.1]
<JAA> Process RsyncUpload returned exit code 10 for Item tumblr-blog:awesomeabduction
<JAA> Failed RsyncUpload for Item tumblr-blog:awesomeabduction
<JAA> Yup, so far so "good". :-)
<SketchCow> There you go, weasels
<SketchCow> Rsync restarted
<JAA> Care to elaborate what pisses you off about ArchiveBot?
<SketchCow> When I say "Man, Tumblr is really eating resources, the machine is slow"
<SketchCow> And someone goes and backs up google.com
<SketchCow> Wingman'd by his pals backing up everything.com and whynot.com and heyitsaidIcoulddoit.com
<SketchCow> No coordination
<trvz> couldn't you put up a dedicated rsync target for archivebot?
<SketchCow> I do and did
<SketchCow> Normally archivebot is the only major thing
<SketchCow> So I don't want to constrict it to a 1.5tb drive
<kiska> Failed LimitConcurrent(<shared:rsync_threads:20> x Upload ) for Item tumblr-blog:erotic-eye-candy18
<SketchCow> That shares the system
<kiska> Uhh what?
<JAA> Yeah true. Some people seem to have missed the message although I mentioned it a dozen times in #archivebot and also in #archiveteam for visibility.
<Sanqui> the internet is too large
<SketchCow> So that's what pisses me off
<trvz> no, I meant dedicated as in metal
<trvz> doesn't IA have servers and HDDs lying around?
<SketchCow> Oh, that'd be a no
<SketchCow> Ha ha no no no no no
<SketchCow> Internet Archive runs very close to the bone
<SketchCow> Occasionally going forward
<JAA> SketchCow: I'll limit !a to ops. That should prevent certain people from queueing bullshit.
<SketchCow> For example, we have 3.6 petabytes of empty space purchased right now
<kiska> rsync: change_dir "/root/tumblr-grab/data/15447610682e078106879a9dda-18/" failed: No such file or directory (2)
<kiska> rsync error: errors selecting input/output files, dirs (code 3) at flist.c(2122) [sender=3.1.2]
<kiska> I see
<SketchCow> JAA: That would be appreciated.
raingloom has quit [http://www.mibbit.com ajax IRC Client]
<SketchCow> Helps if you have an error message
<SketchCow> Like "Go ask someone to do !a for you"
<kiska> HCross: ping me when you have prague up and running again, if I don't respond, ask JAA to call me
<HCross> Will do
<SketchCow> Yeah, FOS will do what it can
<SketchCow> But it's going to be tough
<SketchCow> Finally, a fuckin' nailbiter!
<JAA> kiska: FYI, I also have tracker access now. (Cc HCross)
<JAA> So I can add the target when it's ready.
<SketchCow> teamarchive2:/etc# ps -ax | grep rsync | wc -l
<SketchCow> 302
<SketchCow> I upped it from 120 to 150
<SketchCow> Everyone will pay the price
<kiska> JAA: You just need to reactivate it
<JAA> Yeah, or that.
<SketchCow> avg-cpu: %user %nice %system %iowait %steal %idle
<SketchCow> 24.10 0.00 21.33 50.69 3.32 0.55
<SketchCow> avg-cpu: %user %nice %system %iowait %steal %idle
<SketchCow> 35.22 0.00 20.31 32.13 12.34 0.00
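For anyone wanting to track those figures over time, here is a small Python sketch that turns an avg-cpu header line and its value line, as pasted above, into a dictionary. The function name is made up, and it assumes the header and values appear in the same column order.

```python
def parse_avg_cpu(header_line, value_line):
    """Pair the %-prefixed column names of an avg-cpu header with its values."""
    # The first token of the header is the "avg-cpu:" label, so skip it.
    names = [name.lstrip("%") for name in header_line.split()[1:]]
    values = [float(v) for v in value_line.split()]
    if len(names) != len(values):
        raise ValueError("header and value columns do not line up")
    return dict(zip(names, values))


if __name__ == "__main__":
    stats = parse_avg_cpu(
        "avg-cpu: %user %nice %system %iowait %steal %idle",
        "24.10 0.00 21.33 50.69 3.32 0.55",
    )
    # High iowait with almost no idle time points at the disks as the bottleneck.
    print(stats["iowait"], stats["idle"])
```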
<SketchCow> OK, back in a bit to clean up all your bodies
urjaman has quit [Ping timeout: 260 seconds]
<SketchCow> One last thing
<SketchCow> .... I see twitter DMs faster than any other methods right now
<SketchCow> So DM me for near -insta reaction when I wake up
urjaman has joined #tumbledown
revi has joined #tumbledown
Ryz has quit [Remote host closed the connection]
kiska1 has joined #tumbledown
<pnJay> Just waking up now, did we patch again after the notes nix? If not I think my warriors are broke for some other fancy reason.
<JAA> pnJay: Nope, that was the last code update.
<pnJay> How exciting. Nothing like the smell of troubleshooting in the morning
<JAA> If your uploads are failing, that's normal.
<JAA> It'll keep retrying until it succeeds.
<upshift> My warrior vm has slowed to a crawl, 96% iowait
<upshift> Does it need more RAM? Only has 400 MB
<urjaman> my experience was that it ran with 400MB but swapped enough to be annoying so I gave them all 1GB ...
<JAA> Running at what concurrency?
<upshift> I am using the defaults, 2 concurrency
drcd has quit [Leaving]
<JAA> Any idea how old the jobs are?
<upshift> 21 hours
<JAA> Mhm, that's the old scripts then. Yeah, that won't end well probably.
<JAA> With the new scripts from ~14 hours ago, RAM usage has decreased significantly.
<kiska> Hrm.. that is some serious RAM usage
<upshift> Should I shut it down and restart? Any way to keep what it has so far?
<trvz> yes and no
einswenig has joined #tumbledown
<upshift> Ok I will shut it down and also increase the RAM on the VM
<einswenig> I guess you know already that fos.textfiles.com is at its limit for rsync connections?
<psi> erine: you 'round?
<kiska> yep we are
Frinkel has joined #tumbledown
<Frinkel> Hey, I'm running into a problem with one of my warriors - it's trying to upload an archive, but it's constantly running into an error stating that the max connections - 150 of them - have been reached. Is this a server-side issue or is there something wrong on my end?
<kiska> This is an issue with FoS, it'll *eventually* upload
pornmogul has quit [Quit: Lost terminal]
<Frinkel> Alright, just wanted to make sure
<kiska> I've got one finishing in 25 mins
<psi> Does uploading keep retrying infinitely if it fails?
<kiska> Yes, it retries indefinitely
<kiska> And I am very hesitant to push my ArchiveTeam UA
<psi> Ok, at least the data doesn't get lost after X failed retries then
trc has joined #tumbledown
<kiska> RIP I've killed my lightsail instance xD
bmcmath has joined #tumbledown
<kiska> Yep machine OOM kill'd itself
<upshift> To increase the concurrency do I need to shut down and restart again?
<urjaman> i think it will eventually notice it being upped, though i dont know when...
<urjaman> (atleast it did for me once, but maybe that was a side effect of an update or something)
<kiska> What do you mean up the concurrency? Like run more jobs?
<HCross> odemg: did you know what datapacket remote hands are?
<urjaman> if you change the concurrency limit in the warrior settings, when does it actually launch more jobs?
<pnJay> Okay yeah theyre running fine, I just finally got big jobs I guess. Thanks tumblr-monitor!
<upshift> I changed concurrent items from 2 to 6 in the settings and clicked save but it still shows only 2 items in the current project tab
<kiska> It should eventually
<upshift> ok great
<bmcmath> If I want to run two warrior vms on the same host box, how do I access the web interface on the second one to choose the tumblr project?
<urjaman> bmcmath: change the port forwarding settings for the other VM (see wiki FAQ)
<bmcmath> thanks
<bmcmath> perfect.
<bmcmath> Does anyone know if it looks like all the items will be complete by the 17th deadline at the current rate?
<JAA> FYI, the default project is Tumblr anyway, at least for now.
<kiska> bmcmath: Hell no, we are through almost 100k, and there is an estimated 250m blogs of which I estimate 90% are nsfw
<JAA> I have one job here that still hasn't managed to upload in 1.5 hours.
<kiska> Good news! In 10 mins I have a slot opening!
<JAA> I'm sure I'll get lucky and grab that one. Not.
<kiska> I forgot how bad bandwidth is from the EU to AU: 149,880,832 52% 189.55kB/s 0:11:57
<trvz> where's that 250m number from
<trvz> is that all the active ones?
<JAA> In other news, my longest-running job at almost 4 days is nearly finished. :-)
<bmcmath> kiska: I guess I assumed the 48k to go on the tracker was the total left.
<kiska> And sorry 450m blogs
<kiska> https://www.tumblr.com/about 451m sorry
<psi> 2 blogs waiting to upload now ;P
<pnJay> So am I going to break anything if I throw more hardware at this?
<pnJay> I know fos is eating it right now
<trvz> tumblr is having more new blogs created *now* than we can archive
<trvz> 300k in two days
<kiska> pnJay: probably break fos even more
<JAA> Everyone switch to more and smaller pipelines so that we can hit FOS even harder. ;-)
<kiska> JAA: Doesn't PurpleSym have access to the collection?
<JAA> No, it won't break FOS even more. FOS only accepts 150 connections, everything above that simply fails with an error.
<urjaman> it would be nice if the warriors could "decouple" the downloading from the uploading ... (i would have enough disk space to just download until the 17th...)
<PurpleSym> Which one?
<JAA> So it won't speed up the archival in any way.
<JAA> pnJay: ^
<kiska> PurpleSym: archiveteam_tumblr
<pnJay> Roger roger.
andx has quit [Read error: Operation timed out]
argus has joined #tumbledown
Fusl_ has joined #tumbledown
<kiska> Yes
Tenebrae has joined #tumbledown
PurpleSym has joined #tumbledown
noirscape has joined #tumbledown
uberushax has joined #tumbledown
<JAA> You could use the extraction tool from warcat and then diff -r the directories, probably.
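Once both WARCs have been extracted into directories (that step is left to warcat and not shown here), the "diff -r" part of JAA's suggestion could be approximated in Python with the standard filecmp module. This is only a sketch; the function name is illustrative.

```python
import filecmp
import os


def differing_files(left, right):
    """Recursively list relative paths that differ or exist on only one side.

    Note: filecmp.dircmp compares files shallowly by default (type, size and
    mtime first), so it is a quick check rather than a byte-for-byte proof.
    """
    found = []

    def walk(cmp, prefix):
        changed = cmp.left_only + cmp.right_only + cmp.diff_files + cmp.funny_files
        for name in changed:
            found.append(os.path.join(prefix, name))
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left, right), "")
    return sorted(found)
```

Running it on the GoogleBot extraction and the "ArchiveTeam" extraction would give the list of URLs (as file paths) worth inspecting by hand.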
sHATNER has joined #tumbledown
scrottie has joined #tumbledown
mtntmnky has quit [Read error: Operation timed out]
yano has quit [Read error: Operation timed out]
yano has joined #tumbledown
schnits has quit [Quit: Yaaic - Yet another Android IRC client - http://www.yaaic.org]
trc_ has joined #tumbledown
fred77 has quit [Read error: Operation timed out]
pew has quit [Read error: Operation timed out]
kbtoo_ has quit [Read error: Operation timed out]
VoynichCr has joined #tumbledown
klg has quit [Read error: Operation timed out]
<einswenig> does the tumblr-grab client send heartbeats for active jobs to the tracker?
<einswenig> or rather, when are jobs rescheduled?
<trvz> no, and manually
qw3rty117 has quit [Read error: Operation timed out]
trc has quit [Read error: Operation timed out]
fred77 has joined #tumbledown
jbroome has quit [Read error: Connection reset by peer]
moufu has joined #tumbledown
sep332 has quit [Read error: Operation timed out]
<einswenig> should i report aborted jobs? I killed one after the change for (not) backing up notes
rpl has quit [Read error: Operation timed out]
upshift has quit [Ping timeout: 960 seconds]
bizzy__ has quit [Read error: Operation timed out]
moufu_ has quit [Read error: Connection reset by peer]
bmcmath has quit [Read error: Operation timed out]
qw3rty117 has joined #tumbledown
pew has joined #tumbledown
bmcmath has joined #tumbledown
kiska1 has quit [Read error: Connection reset by peer]
riley has quit [Ping timeout: 600 seconds]
paul2520 has joined #tumbledown
Stoner_Sl has joined #tumbledown
<scrottie> I've got a worker that's been doing "@ERROR: max connections (150) reached -- try again later" "rsync error: error starting client-server protocol (code 5) at main.c(1653) [sender=3.1.1]" etc for over two hours. advice?
<scrottie> in Upload status.
<urjaman> we know, wait...
<pnJay> Yeah, leave it running please. Our ingest box is on fire
<pnJay> and the fire is also on fire
<scrottie> whee. can do.
<kvikende> what does FOS stand for? all i can come up with is freedom of speech but i dont think thats right :P
<pnJay> fortress of solitude :D
<scrottie> free-open-source?
kbtoo_ has joined #tumbledown
klg has joined #tumbledown
sep332 has joined #tumbledown
<kvikende> ah. and everything is on fire
mhazinsk has joined #tumbledown
martinell has joined #tumbledown
kiska1 has joined #tumbledown
yano has quit [Remote host closed the connection]
riley has joined #tumbledown
yano has joined #tumbledown
jbroome has joined #tumbledown
bizzy__ has joined #tumbledown
VerifiedJ has joined #tumbledown
raingloom has joined #tumbledown
Plusi has joined #tumbledown
raingloom has quit [Client Quit]
mtntmnky has joined #tumbledown
<Plusi> Hello! One of my warriors is reporting a rsync problem, stating that it can't upload because it reached max connections.
<Plusi> rsync error: error starting client-server protocol (code 5) at main.c(1653) [sender=3.1.1]
<urjaman> just wait, it'll eventually work
<_niklas> the server is getting hammered, your upload will succeed eventually
<kiska> FoS is being overloaded currently just wait for it to work
<urjaman> maybe we should put that in the topic? ...
<kiska> I think we are at efnet character limit for topic
<JAA> No space in the topic. EFNet has stupid limits.
<Plusi> Oh, okay. I asked because one thread has been saying this for over two hours...
andx has quit [Ping timeout: 492 seconds]
<scrottie> Plusi, likewise.
<JAA> PurpleSym: I think you might've missed kiska's reply regarding which collection due to the network issues. It's the archiveteam_tumblr collection.
<Plusi> Alright then, I'll be patient. Thanks for the quick reply.
<kvikende> apparently FoS is on fire and the fire is on fire
<kiska> If I get access to the collection, I'll try and put this fire out
blueacid has joined #tumbledown
<urjaman> apparently that one i had waiting got a slot ... and it'll only take something between 15 and 45 minutes or so to upload :P (the speed is being well not my full uplink and all over the place)
<JAA> We could also upload to a different collection and move everything later since we're making so little progress at the moment. But not sure if Jason would like that.
<blueacid> Is the destination server currently dying / overloaded?
<blueacid> Because that's what's prompted me to join here!
<kiska> Yep we are slamming 1gbit through Jason's pipe
<JAA> Yeah, overloaded.
<_niklas> has anybody thought about adding one or more rsync targets on IaaS? packet probably has the perf on short notice, but their traffic is probably $$$
<kiska> Someone will have to have access to the collection, that is what I am trying to do
<blueacid> is it a sensible workaround, for those with enough disk space locally, to spin up more and more warriors?
<blueacid> Ie, yes, there'll be more of the warriors sat trying for a connection / uploading slowly
<JAA> That'll slow it down even more for everyone else though. :-/
<blueacid> but if the tumblr data is "safely" on my local drive, waiting for a slot, that's better than if it were still on tumblr and in line for deletion
Frinkel has quit [Read error: Operation timed out]
<blueacid> Ah good point there
<JAA> We should have another target up in a few hours, so it's not like it'll be like this until Tumblr goes down.
<Plusi> I've got four warriors at home (in four different machines), should I shut down some?
<JAA> Yay, I'm in the top 100 for the first time. :-P
<kiska> xD
<urjaman> if they're only waiting on upload it might not hurt to suspend to ram them, but ... i'd say dont shutdown them because that would uselessly lose data
<scrottie> JAA: that means everyone is gunning for you ;) but congrats!
<blueacid> ... yeah, surely if they're just sat in the 60-second retry loop then that's almost not a bad place to be?
<blueacid> I just noticed that a load of mine are sat trying to upload & those which ARE uploading are going very slowly, which prompted me to come find out what's going on
<trvz> people with less space will run out of it faster though
<Plusi> Okay then. AFAIK only one is sitting with now two threads waiting on uploads.
<blueacid> Yeah.. I've got loads of space, so don't mind holding a load of them in that state for longer
<JAA> Plusi: Just let them sit, I'd say.
<blueacid> I don't think there's any way to express that to the warriors though
<foureyes> something's wrong here... my clients have trouble reaching FOS but delete the warcs like they had been uploaded. nevertheless they keep on retrying every 60s, only to print "rsync: change_dir ... failed: No such file or directory" when they manage to get one of the 150 slots
<JAA> Uhh, wat?
<JAA> Deletion should only happen on successful upload.
<JAA> And that's working fine on my end at least.
<blueacid> Yeah that's what I'm finding too; deletion doesn't seem to happen early
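The behaviour described above (retry every 60 seconds, delete the local WARC only once the upload has succeeded) can be sketched roughly as follows. This is a minimal illustration, not the seesaw or tumblr-grab code; `upload_with_retry`, its parameters, and the `upload` callable are hypothetical names introduced here.

```python
import os
import tempfile
import time


def upload_with_retry(path, upload, delay=60, sleep=time.sleep, max_attempts=None):
    """Retry upload(path) until it returns True, then delete the local file.

    Hypothetical helper, not the seesaw implementation. `upload` is any
    callable returning True on success and False on a retryable failure
    such as "max connections (150) reached".
    """
    attempts = 0
    while True:
        attempts += 1
        if upload(path):
            os.remove(path)  # delete only after a confirmed upload
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            return None  # give up, but keep the file on disk
        sleep(delay)


if __name__ == "__main__":
    # Simulate a target that rejects the first two attempts.
    outcomes = iter([False, False, True])
    fd, demo_path = tempfile.mkstemp(suffix=".warc.gz")
    os.close(fd)
    tries = upload_with_retry(demo_path, lambda p: next(outcomes), sleep=lambda s: None)
    print(tries, os.path.exists(demo_path))
```

In this sketch a failed attempt never touches the file, so a deletion before a successful upload (as foureyes reports) would have to come from somewhere else.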
<blueacid> foureyes - are you using the virtual machine image, or the docker image?
<JAA> Or the scripts.
<trvz> what would the instructions be to set up an rsync target?
<kiska> JAA: This should give me the desired result of embedding the cookies into the job
<kiska> '--header', '"Cookie:"',
<foureyes> blueacid: neither. seesaw from pip and tumblr-grab from github.
<_niklas> why the extra quotes?
<JAA> foureyes: Hmm, I'm using the same setup, and it works fine here. That's very strange and concerning.
<kiska> I'll try it without the double quotes
<horkermon> foureyes: --keep-data
<JAA> kiska: Yeah, without quotes.
<JAA> But the way we usually did it was a cookie.txt file, I think.
<kiska> Right I don't remember if wget requires the --header field to be ""
<JAA> With --load-cookie-file or whatever it's called.
<_niklas> I ran my tests with '--header', 'Cookie: pfx=...; pfg=...'
<kiska> Oooh? Let me find a sample project with that I guess
<psi> Oh, I managed to snag an upload spot ^^
<_niklas> that option would be --load-cookies, and it takes a netscape cookie jar as parameter
<JAA> wget simply wants the plain value. That's why you need to wrap it in quotes *in a shell*. But Python simply executes that command directly.
<JAA> I.e. passes each string directly as an argument.
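To illustrate the quoting point: when a program is started from an argument list, each string reaches wget as-is, so extra quotes become part of the header value. A small sketch, with placeholder cookie values because the real ones are elided in the log:

```python
import shlex

# Placeholder cookie values: the real ones are elided in the log.
cookie = "Cookie: pfx=PLACEHOLDER; pfg=PLACEHOLDER"

# Argument list as handed directly to the process (no shell involved).
good_args = ["wget", "--header", cookie]

# Wrapping the value in extra quotes makes the quotes part of the header.
bad_args = ["wget", "--header", '"' + cookie + '"']

# A shell, by contrast, needs quoting to keep the value as one argument.
shell_line = shlex.join(good_args)  # requires Python 3.8+
print(shell_line)
print(shlex.split(shell_line) == good_args)
print(repr(bad_args[2][0]))
```

The shell strips its own quotes before wget sees the value; an argument list has no such step, which is why the quoted variant would send a malformed header.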
<kiska> Right....
<kiska> Let me find a reference implementation, and go off that
<JAA> So regarding --load-cookies, I know we did that on SPUF, for example.
<JAA> That cookie file was generated dynamically by logging in, which may also be of interest here.
ultramage has joined #tumbledown
<foureyes> JAA: the setup has worked before and has uploaded more than 120gb data. the problem appeared sometime within the last 12 hours.
<ultramage> "error: max connections (150) reached" - can you please
<_niklas> if there's exactly one hardcoded set of cookies we're gonna use, I don't see a benefit in --load-cookies really
<JAA> ultramage: The machine's overloaded. Just let it retry until it succeeds.
<Plusi> ultramage: Known issue. Server is overloaded.
<ultramage> it's been trying for 30 minutes D:
<urjaman> others have been trying for hours :P
<scrottie> ultramage: welcome to the club =)
<ultramage> ^^
<JAA> _niklas: Yeah, the only benefit I can think of is that it would be bound to a particular domain, whereas --header would cause it to be sent everywhere. Not sure if that matters though.
<_niklas> oh
<_niklas> right
<_niklas> arguably that does matter
<_niklas> since we do access non-tumblr domains
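For the domain-binding point, a Netscape-format cookie file of the kind `--load-cookies` reads can be produced with Python's standard `http.cookiejar`. A minimal sketch: the helper names, the one-week expiry, and the placeholder values are assumptions for illustration, and the `.tumblr.com` domain is a guess at what the project would want.

```python
import http.cookiejar
import os
import tempfile
import time


def make_cookie(name, value, domain):
    """Build a cookie bound to `domain` (a leading dot covers subdomains)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False,
        expires=int(time.time()) + 7 * 24 * 3600,  # arbitrary one-week lifetime
        discard=False, comment=None, comment_url=None, rest={},
    )


def write_cookie_file(path, cookies):
    """Write the cookies to `path` in Netscape cookie-jar format."""
    jar = http.cookiejar.MozillaCookieJar(path)
    for cookie in cookies:
        jar.set_cookie(cookie)
    jar.save()
    return path


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "cookies.txt")
    write_cookie_file(out, [
        make_cookie("pfx", "PLACEHOLDER", ".tumblr.com"),
        make_cookie("pfg", "PLACEHOLDER", ".tumblr.com"),
    ])
    with open(out) as handle:
        print(handle.read())
```

A file like this would be passed as `--load-cookies cookies.txt`, and the cookies would then only go to matching hosts rather than to every domain the crawl touches.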
<scrottie> ultramage: there are reports that "the fires are on fire". people are working to add capacity.
<Nemo_bis> Over 400 GiB of my downloads have just been deleted before they could be uploaded, for one reason or another. Now scaling up again.
<scrottie> eeeeeek
<PurpleSym> JAA: No, I got disconnected afterwards. What kind of access are we talking about? I don’t have admin superpowers on archive.org
<kiska> I require upload perms on archiveteam_tumblr
<ultramage> sounds like people are having bigger trouble than my tiny default 400MB vm under 1GB of memory pressure (luckily suspending some of the jobs is possible)
<JAA> Nemo_bis: Oh shit. :-|
<JAA> So if anyone has any details about those cases where data gets deleted (storage setup, special error messages, whatever), please do share.
<PurpleSym> kiska: I don’t have access to the ACL management, sorry.
<ultramage> if the machine is overloaded then that means the updated scripts are doing their job? :)
<scrottie> ultramage: that we're collectively burning this %&@%#$ down is good though!
<hook54321> I'm assuming it's not grabbing videos, correct?
<foureyes> JAA: i've just seen the data loss happen live and i managed to grab a python traceback. i'll report details as soon as possible (currently @work and with the hubbub going here it's hard to concentrate)
<kiska> PurpleSym: Drat! Oh well
<JAA> foureyes: Excellent, thanks!
ultramag1 has joined #tumbledown
Atom has joined #tumbledown
ultramage has quit [Read error: Connection reset by peer]
<kiska> I was about to spin up scaleway, then I saw this... https://cdn.discordapp.com/attachments/217293255237173248/523145149006479372/unknown.png
<kvikende> uploading with sub 100 kb/s, this is gonna take a while...
<trvz> kiska: you'd need to do https://romanrm.net/mhddfs
<kiska> Uploading at 1MB/s
<JAA> kiska: Yes, ScaleWay is weird. We had an ArchiveBot pipeline which constantly ran out of disk space because it only used one of the volumes.
<kvikende> it jumps to 1 mb/s occasionally. but i have to go on a date :D
<fireglow> Is there a tool to just scrape images from a tumblr blog? A few months ago I found some blog import tool that could do it, but I forgot the name now
<ultramag1> aww yea, going at 330kB/s now... which is the usual non-pipelined latency to east coast US
<blueacid> Enjoy your date, leave the computer to work on tumblr while you're out!
<blueacid> hook54321: Not sure about videos, but I know that one of the blogs that my warrior is grabbing is filled with GIF images, so they're pretty large
<_niklas> just call it off
<scrottie> fireglow, I have something that just scrapes images and videos but has --no-image --no-video options.
<scrottie> fireglow, requires a login.
<urjaman> hook54321, blueacid: yeah afaik no videos but yes pictures (and that's gifs too ...)
<hook54321> k
<_niklas> if you tell them "sorry I need to download as much pornography as I possibly can within the next 3 days" I'm sure they'll understand kvikende
<urjaman> :D
<scrottie> there are a lot of 403 videos. I wonder what in general gets the ugo-r treatment.
<_niklas> they started killing off videos a few months ago already
<_niklas> that's why
<scrottie> argh. figures. thanks.
<_niklas> nsfw videos that is
<blueacid> Oh wait, so are the archive efforts trying to get the videos but getting 403/404? Or are these scripts making no efforts to obtain videos?
<_niklas> they used to try but not anymore, I forget the exact reason why but it was deemed impractical
<scrottie> I'm doing some archiving of my own for what it's worth.
<urjaman> blueacid: i think scrottie was talking about some other scraping script (not the warriors or whatever)
<scrottie> right.
<blueacid> I wonder if there's time to make scrottie's script for the videos into a warrior project as well? I've not much time on my hands but my gigabit connection is "only" sat at around 200mbit down
<_niklas> there's bigger issues right now
<scrottie> blueacid, IMO, if you want to, grab videos for things. discussion can happen later about merging those back in or having them separate on the archive or hosted elsewhere or whatever.
<fireglow> scrottie: nice, is the tool open source?
<adinbied> Any ETA on adding a new pipeline?
* scrottie mumbles to itself for a minute
<scrottie> okay let me create a gist or something.
<scrottie> not sure i want to support my crappy perl and I swore to myself that I'd start rewriting/writings I show other people in SNOBOL.
<trvz> kiska: you can add additional rsync targets without having write access to the IA collection, right?
<scrottie> and also documenting one-offs.
<hook54321> _niklas: I'm guessing it either couldn't successfully grab them, or they were simply too big.
<kiska> I can it'll just fill up the drives very quickly
<hook54321> Since we have limited time we have to cut some corners
<kiska> The thing is, for every reblog the video will be grabbed again, so we need to dedupe
<kiska> That issue will come up soon™
<kiska> I'll be back in 1-2 hrs
<hook54321> ah, i see
<psi> Oh, Scaleway has cheap storage :o
<trvz> it's cheap for a reason
<_niklas> I'm not willing to spend more on this than I already am tbh but if someone else is... how quickly can you fill up https://www.packet.com/cloud/servers/s1-large/ :^)
<psi> what's the reason
<_niklas> scaleway's storage is weird-ass network block storage
<psi> ah
<_niklas> and comes in volumes of at most 150 gb
<kiska> _niklas: Take a look at: http://fos.textfiles.com/pipeline.html and special look at /2
<_niklas> they do have boxes with a directly attached scratch ssd tho
<kiska> That is where tumblr data lives
<psi> but doesn't mhddfs solve that problem like trvz just said?
<blueacid> kiska: Is that box the one that's got the saturated 1gbit link?
<kiska> Yes pipeline.html shows stats on fos
<kiska> Updated per hour
<blueacid> Guessing that this server is desperately sending data TO the internet archive as well, or are we going to run out of space in 5.6TB's time?
<blueacid> (which, at 1gbit, isn't all that long from now)
<blueacid> For those looking for cheap storage, take a look at seedboxes as well - feralhosting.com's largest box is £60 for a month but has 8TB of storage, and a 20Gbit pipe
<blueacid> (the disks are capable of 40Gbit/s)
<kiska> Lets not talk about this here thanks
<kiska> Yes desperately is one word
<blueacid> Gotya, apologies
<kiska> I might use one of those as a rsync target if it supports me using terminal
<_niklas> they do, but at those prices I can't imagine they like the idea of someone actually exhausting their capabilities
<kiska> Well xD
<_niklas> can always drop them an email I guess
<ultramag1> any idea if / how much it's possible to burst across the atlantic? (thinking point to point udp without any flow control)
<blueacid> _niklas: Absolutely, worth dropping them a line, they might be able to grant more, slower storage (presumably that's the order of the day)
<blueacid> ultramag1: Totally and utterly depends on your ISP and what transit network they're connected to / what their peering is like / etc
ultramag1 is now known as ultramage
<JAA> Actually, Feral might be fine with it. I know we were in contact with them last year regarding another project, and they were even willing to donate some boxes at the time.
<ultramage> I see... I'm familiar with tcp transfers, with sliding window in-flight packets enabled, it's usually 330kB/s max for me... always wondered if that's the cap
<_niklas> you've never seen more than 330kbyte/s us<>eu?
<kiska> So I'll sign up for their mid tier box?
<foureyes> JAA: problem description and traceback... https://paste.geekosphere.org/paste/RCgFNbBaRK4hKwdqC5
<trvz> kiska: I'll have 8TB ready shortly
<ultramage> _niklas: over tcp no... over udp dunno, I was never in a situation where a transfer like that would happen
<urjaman> ultramage: i get more than that uploading to fos ...
<kiska> -_- I was going to sign up...
<urjaman> which afaik is in the US (and i'm in eu)
<ultramage> ah, you can, it'll just take 10 hours to submit a 10GB blog
<_niklas> if they're friendly, might as well get back in touch?
<kiska> I got 2 mbit from the EU
<kiska> But that was to Aus
<ultramage> I was prepared to use my full 100/50 on this, but the webserver is so slow that 6 jobs average at 200kB/s... so this becomes a slow game of patience
<erine> about scaleway: it's actually usable in raid0
<erine> that's how I've been able to glue together 1 TB for a brief moment
<kiska> ultramage: Be prepared for the UA update, it'll speed things up
<ultramage> not really possible, I expect these jobs to take several more days to finish
<_niklas> probably worth restarting for that
<ultramage> not really unless you reprogrammed wget to fork-bomb the tumblr servers
<urjaman> and you can run more warriors
<ultramage> if it can't do at least 100 items per second then it's not worth it
<kiska> GoogleBot UA is rate limited by tumblr's servers, if i use any other UA, I get much better throughput
<ultramage> atm it's doing 1 item per second
<erine> Could I get some warning about the UA update before it drops? :P
<_niklas> not rate limited, exactly, but getting lowest priority
<blueacid> THAT's what's making it slow!
<kiska> It'll drop when it drops?
<erine> I'm gonna spin down 15 of my warriors before I get rate limited :P
<blueacid> I wondered about tumblr themselves being a bit sluggish, but I ignored it until I couldn't also upload
<erine> I'm assuming that 150C per server may be too much with the updated agent
<_niklas> yeah
<hook54321> upload issues are related to FOS
<blueacid> I guess though that this rate limiting at tumblr's end is actually a good thing currently - since the FOS server is currently the bottleneck
<erine> I'll pray for the other AFK megawarriors that haven't been warned.
<blueacid> (and its free disk space dropped by 400GB in the past hour)
<kiska> I'll probably push the update once I get another rsync target up
<klg> I don't think any other ua can bypass GDPR consent so effortlessly; though curl and libwww are enough to bypass safe-mode
<_niklas> erine: what I've measured is ~3.5mbit/s per agent with browser UA
<kiska> There is a thing called Cookies
<ultramage> last evening it was 90000 blogs left, now it's 45000... I think you're good on the number of downloaders... the question is if these 100-gig blogs finish downloading by the deadline
<_niklas> klg: curl and libwww user agents bypass safe mode?
<_niklas> that's new
<klg> btw I think the figure of 450M on /about page is the highest blogid, there are fewer blogs that are live
<klg> yes, they can
<hook54321> More blogs exist than just what's left in the queue
<kiska> Yep and redis can't do 450m items, otherwise the tracker will drop to a crawl
<ultramage> where did the current list come from? can the server be probed to discover other blogs?
<kiska> List came from user submissions
<hook54321> and other sources
<blueacid> is the goal to try and get all of tumblr? Or only those blogs which are likely to suffer from the deletion?
<urjaman> NSFW first
<ultramage> ah alright... too bad there's no numeric id thing to resolve the blog subdomains
<pnJay> blueacid: for right now, we're just grabbing stuff we know is gone by monday. This is a clear sign that the rest of tumblr will go though, we're gonna learn from here how to do it better at scale
BrickGras has joined #tumbledown
<_niklas> went ahead and verified that
<_niklas> unfortunately curl UA is also slow
<kiska> How slow?
<_niklas> same as googlebot
<kiska> Urgh
<hook54321> I thought the UA patch is gonna change it to Chrome
<kiska> No I am changing it to "ArchiveTeam"
<hook54321> ah
<kiska> I just restarted my PC so it'll take me 5-10 mins to get everything back in order
<hook54321> If they don't know what we're doing already, I'm sure they'll find out after we switch to that UA
<psi> erine: I tried your Prometheus + python watcher combo - looks good! Are you running all of your jobs on different ports?
<kiska> If they ban that UA, then I am using Chrome
<_niklas> btw, I also tried mobile UAs earlier to see if they get shorter responses for posts - nope
<erine> C
<kiska> :(
<psi> k8s or something else?
<hook54321> What if they ban all ips that were using that UA?
<erine> just plain docker
<psi> I see
<mbp> Starting RsyncUpload for Item tumblr-blog:unpeacekeepers
<kiska> Then that is a problem
<mbp> @ERROR: max connections (150) reached -- try again later
<mbp> oops
<horkermon> i'm down to help with any massive scaling efforts. makes no difference to me if AT's infra can't be adapted in time
<mbp> whats going on
<psi> Just all dockers with 1 conc warriors then
<pnJay> we're having ingest problems
<scrottie> hi mbp, it's a dogpile!
<ultramage> I'll be hogging one for the next 12 hours at 300kB/s... better adjust that number
<JAA> foureyes: Thanks. Interesting, I've never seen that error before.
<kiska> I have
<kiska> 40 mins ago
<horkermon> tracker def won't be able to keep up with a much higher throughput
<mbp> too many archivers :p
<erine> nah, 6
<horkermon> can it be sidestepped?
<psi> ...how much hardware do you have lmao
<erine> just two scaleways
<erine> C2Ls
<_niklas> btw ultramage to answer your earlier question: I can do a couple dozen mbit/s eu<>us over http on a meh cable network
blahblah has joined #tumbledown
<scrottie> mbp, word on the street is that fortress-of-solitude is maxed out. as far as I've gathered, it's a buffer that then pushes to the archive, and it's full and overloaded. I'm probably wrong on that. but you aren't the only one. suggestion is to leave it trying and people are adding capacity or trying to.
<erine> 32GB, 250 GB actual SSDs (Samsung), 8 atom cores
<psi> That's not super pricy actually
<erine> only regret is that I'm about to run out of space because stuff isn't uploading fast enough
<JAA> ultramage, _niklas: I managed to push close to 1 Gb/s to FOS the other day from one of my ArchiveBot pipelines in Germany. Although that's rare, it's usually less than 100 Mb/s. But yeah, that's definitely not the issue. The machine's just getting hammered right now.
<psi> Is that 32 warriors per scaleway actually or 32 overall
<psi> I imagine per but
<erine> 32 / scaleway
<marked> _niklas: might not be relevant to us in any way but there's a mobile endpoint at /mobile
<psi> yeah that's insane
<pnJay> Didn't we get a bunch of archivebot spam? or did i read that wrong
<erine> brings me to 192 concurrency
<foureyes> JAA: one by one all my clients started running into this... I've now killed everything. can you please release all claims from "Archibald"?
blahblah is now known as super2K
<psi> I'm on 18 rn so it seems very big ^^
<erine> I blame @erin for giving me the idea of overstuffing being a semi feasible idea
<kiska> foureyes: All claims released
<psi> I have 18 conc on what I wouldn't exactly call a high end box
<erine> cheap digitalocean?
<psi> cheap OVH
<kiska> OVH network is ok
<mbp> 16 on an atom @ ovh too
<erine> ah, they're unlimited right
<erine> for bandwidth
<mbp> 0fucks@ovh
<psi> I like my unlimited bandwidth yeah
<kiska> xD
<super2K> We envy your bandwidth limitation
<JAA> Only issue with OVH is that the IPs are sometimes blacklisted.
<scrottie> heh
<JAA> Because, well, some shady people like OVH as well.
<mbp> 5€ a month for a dedicated is too good
<JAA> Oh, that's Kimsufi then though, right?
<psi> Does Scaleway also have unlimited transfer?
<_niklas> online.net (ovh's biggest competitor, who also run scaleway) is also surprisingly good given the bottom-of-the-barrel prices
<mbp> yes the smallest atom they have
<JAA> I know we got kicked off Online for high traffic usage before.
<mbp> isgenug/kimsufi/soyoustart or whatever it is
<kiska> RIP ST perf
<psi> Yeah, I just saw the price of the C2L, and whoof. That's sharp
<scrottie> furfur.tumblr.com is far more surreal than I'd have predicted from its name.
<psi> Plus the bonus of extra space at €1/50G is not bad at all
<JAA> mbp: Isgenug == Kimsufi are the ultra-cheap servers. So you Start is the middle-class offer. OVH is the professional one.
<_niklas> online do tell you how much you can use though
<JAA> Oh, do they?
<psi> scaleway advertises unmetered bandwidth :thinking:
<psi> I might just get me one of those
<kiska> Out of stock...
<JAA> TIL. Well, they like to change things more often than most people change their clothes.
<JAA> It used to be separated like that.
edgaveup has quit [Ping timeout: 600 seconds]
<kiska> FeralHosting no good, 150 mbit test out to SF
<erine> oh man the ram
<mbp> there are a bunch of scripts that track the availability and alert you
<erine> scrottie: that is an amazing blog
<_niklas> dunno about scaleway's usage policies, but online.net has (or at least had) very cheap servers with proper unlimited bandwidth, which I take to mean they don't care if the network goes to shit for the really cheap customers
<trvz> hetzner isn't a long term solution, their upload to San Francisco seems to be 150Mbps at best
<endrift> has anyone else seen @ERROR: max connections (150) reached -- try again later
<kiska> Yes we know
<JAA> _niklas: Yeah, they advertise that, but they do kick when you use too much.
<endrift> oh ok
<_niklas> I meant the ones that are <20€/mo. the more reasonably priced ones have "premium bandwidth x00 mbit/s"
<JAA> I don't know the details, but I believe we had one of their servers as an upload or deduplication target for Newsgrabber for a while. And it wasn't one of the extremely cheap ones either.
<Nemo_bis> kiska: you may need someone who peers with spectrumnet.us to get decent uplink to FOS
<_niklas> i.e. you get a gigabit port but if you're consistently using over the few hundred mbit/s listed there they tell you to pay up
<_niklas> (they have a Really Unmetered upgrade nowadays but at that point the price isn't that good anymore)
<hook54321> Nemo_bis: Is there a way to know who does and doesn't peer with them?
<_niklas> I should drop it, I suppose it's offtopic at this point
<scrottie> endrift: hopefully more capacity soon. so, clap yourself on the back for helping burn this shizz down.
<erine> Not really, when the topic is about scaling the warrior infrastructure to scrape faster.
<erine> Something we need for that 17th deadline!
<_niklas> for warriors they're as good as anybody, but they're not the place for a (lasting) rsync target
<erine> crap, this was not about the warriors?
<kiska> Nemo_bis: Did you say spectrum?
<JAA> horkermon: So regarding the tracker, we can't get rid of it, no. There needs to be coordination between the workers so we don't retrieve things multiple times etc. But what exactly do you mean that it doesn't scale well? It's known that we can't have more than 400k or so items in the queue, but otherwise, it should probably hold up pretty well, especially considering how long scraping an individual blog
<JAA> takes. We'd just need to reload items quickly enough.
<kiska> Thanks, I'll take a look
<Nemo_bis> kiska: or Wave, same thing; https://monitor.archive.org/weathermap/weathermap.html
<JAA> (kiska: FYI, guest4792 in #warrior is human4565 from a few days ago.)
<marked> can the tracker be reloaded from the command line?
<JAA> In theory it's probably possible. But we can also ^C ^V *click* once every few hours or whatever.
<horkermon> JAA: either of the rate per item being cut substantially or an order of magnitude increase in the number of items considered to be the target worth hitting asap will cause the tracker's redis etc to be another bottleneck, right?
<kiska> Right now the bottleneck is SketchCow rsync target
<kiska> Because we are doing maintenance on HCross target
<JAA> horkermon: I think it should be able to handle a higher rate of item claims and completions fine. An increase in the number of items would mean we'd need to add more items frequently, but it shouldn't be an issue unless we get to extremely high rates.
<JAA> As far as I see it on my machine, each blog still takes a couple hours to grab. So even if you throw enough hardware at it to process 100k items at once, the claim/completion rate would still be only ~14 per second assuming 2 hour average job time.
<JAA> And that would mean we'd have to add new items every 6-8 hours.
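The arithmetic behind those two estimates can be checked quickly. The function names are made up for this sketch; the inputs (100k concurrent items, 2 hour average job time, a queue limit of roughly 400k items) are the figures quoted in the discussion.

```python
def claim_rate_per_second(concurrent_items, avg_job_hours):
    """Steady-state claims (or completions) per second."""
    return concurrent_items / (avg_job_hours * 3600)


def hours_until_queue_empty(queue_size, rate_per_second):
    """How long a full queue lasts at the given claim rate."""
    return queue_size / rate_per_second / 3600


rate = claim_rate_per_second(100_000, 2)
print(round(rate, 1))                                     # 13.9
print(round(hours_until_queue_empty(400_000, rate), 1))   # 8.0
print(round(hours_until_queue_empty(300_000, rate), 1))   # 6.0
```

So a queue filled to somewhere between 300k and 400k items would need reloading every 6 to 8 hours at that rate.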
<JAA> But yeah, we need more rsync target capacity primarily. That's going to remain the primary bottleneck for now.
<foureyes> JAA: tried again with python 3.5 rather than 2.7 and now the pipeline keeps on trying without failing and deleting the files. I'll keep an eye on it.
<marked> it'd be nice if the pipeline script understood it could rsync later
<JAA> foureyes: Huh. Well, I'll take that as another reason that we should fully migrate to Python 3 already. 3.0 was released over 10 years ago, dammit.
<horkermon> do we know whether these large UL values that've been coming through for a while now are indicative of anything
<_niklas> marked: disagree, the data would *really* pile up then
<horkermon> cascade of heavy items piling on would be expected whenever there's rsync target slowdown
<_niklas> well that's already happening
<_niklas> it would be worse if the script went as far as to defer uploading
<horkermon> better if prioritizing by size though
<urjaman> _niklas: as i see it that would be okay tho, if the goal is to get as much out of tumblr as possible before the deadline ...
<_niklas> no good if I have to drop jobs that had been running for a day because my disks are filling up
<urjaman> oh yeah the downloaders would obviously need to understand to pause when disk full
<yano> heh, so it seems that a 1 GB of RAM droplet from DO wasn't enough memory
<yano> kept getting "Cannot fork"
<yano> upgraded to 3gb
<erine> go for scaleway! :P
<marked> how many processes?
<yano> marked: i couldn't check
<yano> i couldn't even run `uptime` or `top`
<marked> do you know how many you asked for?
<yano> how many what?
<yano> rsync threads or concurrent items?
<erine> concurrent items
<yano> 6
<yano> i did max that out
<yano> that was my fault
<JAA> Yeah, don't go above 2 jobs per GB of RAM.
<yano> ah
<yano> gotcha
<yano> well, i'm up to 3gb now so that should work for 6
<yano> and the spare desktop i have it running on has 12gb
<marked> if someone has spare mental cycles we have a 1.20 build that might fix a memory leak
<marked> we won't know until someone runs it
<erine> on a scale of "don't do this" to "it can scale", would you recommend stuffing that in a test docker container
<marked> someone probably knows a way to watch it for real, but since they're not here. one way to find out is running the same job set while reducing memory allowed in a VM
<marked> and seeing if it'll complete when the old version would not
<marked> maybe turn of virtualmemory
<marked> off
<_niklas> this is starting to get tense https://i.imgur.com/koBQ2p8.png
<marked> if moved concurrency from into pipeline instead of the OS they could start preempting each other, very hypothetically
<marked> from OS into pipeline
<ultramage> you can kill -TSTP (and later -CONT) individual PIDs
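The same pause/resume trick can be scripted. A minimal POSIX-only sketch using the standard library; `pause` and `resume` are hypothetical helper names, and the demo pauses a throwaway child process rather than a real wget job.

```python
import os
import signal
import subprocess
import sys


def pause(pid):
    """Equivalent of `kill -TSTP <pid>`: ask the process to stop."""
    os.kill(pid, signal.SIGTSTP)


def resume(pid):
    """Equivalent of `kill -CONT <pid>`: let a stopped process continue."""
    os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Throwaway child standing in for a download job.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    pause(child.pid)
    resume(child.pid)
    print(child.poll() is None)  # the child is still alive after resuming
    child.terminate()
    child.wait()
```

A stopped process keeps its memory and disk usage, so this helps with CPU and bandwidth pressure more than with RAM.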
<blueacid> Regarding the bandwidth and peering conversations, especially from a european drop-box (like feral etc), could we not open multiple connections to the destination?
<blueacid> EG if each end was on, say, 1gbit, and any one connection would get ~130mbit, surely you could open 10 connections and make sure to max things out?
<marked> so that would be TSTP those that are starting fresh, and let the ones about to finish finish
<ultramage> the bandwidth / rate you get depends on what their rate limiting policy is... if it's by IP, more instances won't help much
<mbp> paging the man with the big pipes, big pipes please, to the front desk, please
<erine> agreed about the tenseness https://i.imgur.com/lx3SGh8.png
<pnJay> we usually do have a second rsync target if that's what youre asking about blueacid
<blueacid> I mean i'm in the UK and have 1gbit internet... just, I don't have more than around 2TB free disk space, otherwise I'd offer!
<_niklas> it's probably not intentional throttling
<_niklas> just bad network
<JAA> blueacid: The pipeline already uploads in parallel if possible. The problem is simply that the receiving machine can't handle more.
<_niklas> that's a pretty tmux erine
<erine> I'm spinning down a lot of my warriors but boi, 75% is starting to get too close to wanting to shut it all down
<ultramage> I have 100mbit and getting 200kB/s on 6 parallel tasks... either they're rate limiting queries to 1/second, or the webserver page load delay dominates the time (individual items tend to be small and download instantly)
<_niklas> I thought you were talking about rsync there
<ultramage> I'm used to shitty new websites taking several seconds to even begin loading, so I wouldn't be surprised
<_niklas> the page load delay absolutely does dominate
<marked> macOS is prettier even if it's the same data
<mbp> google bot is throttled yes
<JAA> Yes, our user agent (Googlebot) is getting rate-limited.
<ultramage> individual wget processes are set to only download 1 item at a time? no parallelism?
<_niklas> yup
<ultramage> because that would be my strat... instead of doing 6 blogs at once at 1 item each, I'd just do 1 blog with 20 threads
<blueacid> So to try and link together the thoughts, it seems there are a few bottlenecks. If i'm not wrong we've got 3 places where stuff is slowing down
<JAA> 3?
<blueacid> 1) For each tumblr blog, one warrior job is wget'ing in a single thread all the posts, images and gifs. Tumblr rate limits that one process to 1/sec
<blueacid> 2) Once a warrior job has got everything, it's trying to upload everything to the staging server, which is getting the shit kicked out of it, and has only 150 upload slots.
<blueacid> 3) Once it's on the staging server, that then needs to pass the data up to the internet archive, but given the staging server's free disk space is dropping, that's also a bottleneck
<blueacid> Is.. is that right?
<JAA> 2 and 3 can both be solved by the same thing: more staging servers.
<horkermon> staging also is megawarc step
<JAA> I think the plan is to get HCross's machine back online once its disk usage is down again.
<marked> does the IA ever run out of space ?
<super2K> blueacid: what are requirements for staging?
<super2K> Err JAA, that should go to you
<blueacid> super2K: I don't know, I'm just trying to make sense of where the limits are - my list of 3 wasn't so much me asking as me going "is.. is this right?" and hoping that someone more knowledgeable than me chipped in
<marked> can we move a staging server into IA HQ ?
<horkermon> jason said earlier that they have ~3PB unallocated
<JAA> That's not the problem. The problem is that we don't have access to the IA collection.
<super2K> JAA, Im superkb - Just I left chat on in my basement
<JAA> marked: FOS is at IA HQ.
<super2K> I didn't register it, so I cannot ghost it
<JAA> lol look at this cutie thinking you can register anything on EFNet. ;-)
<JAA> There is no nickserv, chanserv, or anything like that here. Only anarchy.
<super2K> I haven't irc'd much since the Dal's twisted server got taken offline by the massive bots of early 00
<hook54321> I'm assuming we could theoretically upload stuff without it going into the collection and then put it in there later, but that isn't ideal.
<erine> i just sent nickserv "help" and i have no clue what i even expected
<hook54321> EFNet doesn't have nickserv
<erine> pretty sure that user is getting boatloads of people's passwords from people that don't know better
<hook54321> lol
<blueacid> Hah
<JAA> hook54321: Yeah, I mentioned that possibility earlier. On the other hand, Jason will be back in a few hours, so then we should definitely be able to add more targets.
<marked> the doors that those passwords would open would be startling
<blueacid> something something, hunter2
<mbp> Fusl is hogging all the upload slots
<mbp> :p
<blueacid> I've noticed a bug in the warrior as well, but it's minor
<blueacid> If a job takes longer than 24 hours, the time elapsed wraps around
<blueacid> I've a job on the go which has been going for around 30 hours but it's showing 6hrs
<urjaman> yeah i noticed that also (i suppose it's splitting it to days, hours, mins and secs but showing only the hours and so on part)
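A minimal sketch of the wraparound described above, assuming the display splits the elapsed time into days, hours, minutes and seconds but only prints the hours-and-below part, as urjaman guesses. `format_elapsed` is a hypothetical helper, not the actual seesaw-kit code.

```python
def format_elapsed(total_seconds):
    """Format a duration, keeping the day count so long jobs don't appear to wrap."""
    days, remainder = divmod(int(total_seconds), 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    clock = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    # Dropping the "days" part here is what would make a 30-hour job read as 6 hours.
    return f"{days}d {clock}" if days else clock


if __name__ == "__main__":
    print(format_elapsed(30 * 3600))
```

With the day count kept, a job that has been running for 30 hours reads as one day and six hours instead of six hours.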
<marked> lots of us have been there. that reminds me, a fix to the log rotation problem could be saving the last log in javascript before a restart
<horkermon> dumb storage could also work for workaround targets. i'm going afk but could offer some of that if the bottleneck's still a problem later
<_niklas> marked: I sent you the chrome crawl warc of 9volt-art earlier, so you have test data for https://github.com/ArchiveTeam/tumblr-grab/issues/42#issuecomment-447348215
<marked> thanks for reminding me.
<_niklas> a thing that's likely to come up is should we blacklist /post/<id>/<slug>?is_related_post=1
<_niklas> pretty sure those are always gonna be duplicated content
<_niklas> (and that's probably most if not all URLs those extra 100kb add)
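A minimal sketch of the blacklist check _niklas proposes above, assuming the URLs to skip have a path under `/post/` and carry the query parameter `is_related_post=1`. It is written in Python for illustration only; `is_related_post_url` is a hypothetical name, and this is not the filter used in the tumblr-grab scripts.

```python
from urllib.parse import parse_qs, urlsplit


def is_related_post_url(url):
    """Return True for /post/<id>/<slug>?is_related_post=1 style URLs."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return parts.path.startswith("/post/") and query.get("is_related_post") == ["1"]


if __name__ == "__main__":
    # The post id and slug below are made up for the example.
    candidates = [
        "http://9volt-art.tumblr.com/post/123/some-slug",
        "http://9volt-art.tumblr.com/post/123/some-slug?is_related_post=1",
    ]
    for url in candidates:
        print(url, "-> skip" if is_related_post_url(url) else "-> fetch")
```

Parsing the query string rather than matching a suffix keeps the check working if other parameters are appended to the same URL.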
<erine> related PR for that wraparound issue https://github.com/ArchiveTeam/seesaw-kit/pull/107
<marked> how do the base and extra post relate, is it an iframe to each other?
<_niklas> the extra related post content is baked into the primary post html
<marked> from what you saw, is it always a different account or the same account?
<_niklas> always the same, someone else confirmed this from experience using tumblr
wp494 has quit [Read error: Operation timed out]
wp494 has joined #tumbledown
boutique has joined #tumbledown
upshift has joined #tumbledown
boutique has quit [Remote host closed the connection]
<marked> knowing what we know now about response times redundant requests to the endpoint should be minimized
<marked> maximize coverage with minimal redundancy
<odemg> HCross, did I know what remote hands are?
<odemg> kiska, we doing another code update today?
<kiska> odemg: Once HCross has his machine back up
<odemg> okie dokie
argus has joined #tumbledown
<marked> kiska : do you have a feeling on whether lua's memory usage is from something in the script vs lua being broken?
<kiska> I have no clue, it might be from the downloaded table being so full
scrottie has quit [Ping timeout: 600 seconds]
<jbroome> "@ERROR: max connections (150) reached -- try again later" known issue when uploading?
<psi> yes
<psi> server is overloaded, just wait for it to complete
<marked> kiska: i'm disappointed I can remember how to spell fluffy wolf strife now
<kiska> xD
<jbroome> no worries. i'll let it sort itself out. thanks!
<JAA> Yaay, it took over 4 days, but tumblr-blog:asianmansex is finally done! \o/ And now I get to wait for the upload.
<mbp> archiving tumblr has a radicalising effect on the warriors
<pnJay> Ive never been so happy for a grab JAA xD
Ninjoon has left #tumbledown [#tumbledown]
boutique has joined #tumbledown
<boutique> anyone getting rsync errors when uploading?
<boutique> apologies if this has been discussed already
<mbp> yes, patience
<boutique> fair enough :)
<mbp> not your fault, someone should put it in the topic
<mbp> :>
<urjaman> topic is full
<marked> we need a recently frequently asked questions
<boutique> nobody would read it :p
<diggan> woho, woke up to over 1TB uploaded!
<diggan> 1.5TB actually
<jbroome> dang
<jbroome> sorry, i should have done a /lastlog rsync before asking.
<_niklas> >redundant requests to the endpoint should be minimized
<_niklas> with browser UA, that's less of a big deal
<_niklas> though time to first byte remains the majority of the request time
<_niklas> even for dumb canonical url redirects
<sep332> we can remove the tracker link from the topic, at least for now
<pnJay> one of my warriors just archived "seinfeldtv" wtf lol.
<kiska> But not pushed to master
<kiska> Neither is it tested
JAA changed the topic of #tumbledown to: https://archiveteam.org/?title=Tumblr | Need saving? https://goo.gl/RtXZEq | Scripts: https://git.io/tumblr | UPLOAD TARGET OVERLOADED, PLEASE BE PATIENT
<kiska> I am surprised you were able to put that in
<JAA> I removed the tracker link.
<kiska> Ah xD
<JAA> :-)
<diggan> JAA: fine to just leave the warriors running or we need to do something to pause?
<Jens> ArchiveTeam using URL shorteners...
<JAA> diggan: Yeah, just leave them running.
<diggan> 👍
boutique_ has joined #tumbledown
<JAA> Jens: At least they're service-internal shorteners, not bit.ly or some shit like that.
<_niklas> the only cookies we _need_ are pfg and pfx. dunno if the others hurt anything?
tungol has joined #tumbledown
<diggan> anything we can do to make the upload target less overloaded?
boutique has quit [Read error: Connection reset by peer]
<Nemo_bis> probably, avoiding slow uploaders
<marked> I'd feel more comfortable with the minimum cookie needed, but I'll defer to those that had to do cookie crawls before
<blueacid> Nemo_bis: I thought it was overloaded in terms of the 1Gbit connection the target has that's maxed out as well
<urjaman> i didnt get to use all of my (lol) 10 mbit/s upload with the two last things i uploaded, so...
<blueacid> sure, it has 150 slots, but between all 150 concurrent uploads, its pipe is full up as well... so those with slow upload speeds aren't really the problem as such
Lady has joined #tumbledown
<hook54321> marked: Wouldn't we want a cookie as close to what users would normally get as possible?
<urjaman> blueacid: yeah that's what i mean ... having the connections limited keeps the thing going (instead of being overwhelmed by more connections) but the bandwidth is maxed out already so...
<Lady> hey all. I wanted to get a ballpark estimate on what the upper range of # of files to download for a tumblr account might be. Got a couple that are like 30, 40k files and just want to make sure that's not unheard of
<marked> my thought was if we don't know what the cookie means we could have an undesirable setting vs if its blank it'll choose a sane default
<pnJay> we have a second rsync target coming back up, unfortunate timing and a disk problem I think
<pnJay> Lady: totally reasonable sizes
<Lady> too much shitposting :] thanks
<pnJay> :D
<boutique_> shitposting broke the upload server :<
mib_qhxj5 has joined #tumbledown
GDorn has joined #tumbledown
<JAA> diggan: We'll get another upload target online soon™.
<diggan> great!
<Blokatt> @ERROR: max connections (150) reached -- try again later
<Blokatt> can't upload
<GDorn> how long does an 'out' item stay reserved? if it takes three weeks to finally upload a really big warc.gz?
<ultramage> pro strat would be to grab some sort of size estimate from each blog, maybe from the number of pages or posts or whatever metric is available, and have the heaviest ones be done on dedicated hosts, maybe with tweaked parameters
<GDorn> Blokatt: read room info
<jbroome> Blokatt: known issue listed in /topic
<Blokatt> ah, I see
<Blokatt> thanks!
<jbroome> i ran into the same thing. :)
<diggan> I hope that extra upload target will be able to handle the floodgates once they open again, I think ~20% of my currently worked on items are waiting for uploading right now
<_niklas> hmm
<blueacid> diggan: Presumably the second target will also have a slot limit as well
<boutique_> same, i've got 200GB+ waiting :p
<_niklas> I'm getting 500s on an image from the cdn
<blueacid> _niklas: uhoh
<_niklas> it *is* just one image for now
<marked> if it's just one image that's common
<marked> link?
<Blokatt> Should I suspend the warrior VM until the issue is resolved or should I just let it continue trying to upload?
drcd has joined #tumbledown
<blueacid> 500 internal server error here, too
<urjaman> Blokatt: just let it continue
<Blokatt> alright
<jbroome> Blokatt: conventional wisdom in here is let it run and stack up stuff
<GDorn> I set up my VPS to run multiple copies of the script so if I end up with a huge number of uploads waiting, I can still download more. after all, we can keep uploading things after the 17th, we can't download more.
<JAA> GDorn: Claims are not released automatically, i.e. they stay in "out" until someone does that manually.
<marked> they have a bug in assuming every image comes in 1280
<blueacid> JAA: Uhoh, I think I shut down my warrior a couple of times, figuring it 'checked in' regularly... there might be some claims against my username which aren't being worked on
<JAA> diggan: It was fine when we had the other machine previously; it had to be taken down because it was running out of disk space due to too slow processing/uploading to IA.
<JAA> So it should be fine when that machine comes back up.
<ultramage> my 66. cdn downloads are getting 200s, hopefully it's not a widespread problem
<blueacid> But a load of the ones I'm working on are long-running (i've got some which are up to 300K items)
<JAA> ultramage: 200 means success...
<diggan> JAA: OK. could we donate to have another disk purchased and inserted? Or is just a problem of time and someone actually doing it?
<JAA> blueacid: Don't worry too much about that. Let me know if everything under your user is dead, but otherwise, I can't easily release just those claims which are old. We'll requeue all claims probably on the weekend anyway.
<marked> 2 targets is not redundant architecture if we need both
<super2K> Http has to use an error (200) to let you know that it worked
<blueacid> JAA: OK, thanks - sorry about that! If an item is queued but then a warrior uploads it, does that automatically skip the queue and remove it from the queue?
<JAA> diggan: The problem is that we need upload permissions to the IA collection where the data belongs. I believe there are several people who have said they'd be willing to run a target. So we just need to wait until someone who can give us that upload permission (i.e. Jason) comes back. Or until the Prague machine (the one I mentioned above) is done with the previous data.
<titanous> how many parallel jobs can I run on a single IP? (assuming enough disk/bandwidth/CPU/memory)
<JAA> blueacid: I've been wondering about that too. I'm not sure what happens in that case.
<ultramage> um, I'm seeing 2 of my jobs seemingly trapped in a few specific /notes/ downloading various from_c= ... are these doomed, should I kill them? any way to tell the server that the job's been aborted?
<_niklas> titanous: disk/memory/cpu/bandwidth are the bigger problems, in that order
<diggan> JAA: I see. Thanks for explaining :)
<urjaman> ultramage: we started skipping notes sometime yesterday ...
<_niklas> tumblr doesn't seem to care how much you hammer it
<titanous> cool, I have about 10TB of disk I can use for this
<JAA> ultramage: If it's only doing notes, then you might want to let it finish. If it's still retrieving other stuff though, it might be worth killing and restarting with the most recent scripts.
<marked> if we have an rsync target now without permissions, can't it just drain when permission is granted ?
<Kaz> right
<Kaz> i hear you want items?
<diggan> currently I'm running 10 warriors per IP, with 6 in concurrency on each. not having any issues
<caff> I think the cap is like 700-800 crawlers per IP?
<einswenig> kiska: I tried the UA branch, my server in europe bounces on consent-redirect.
<JAA> Kaz: Nope, we need an upload target. FOS is getting slammed.
<Kaz> wild
<ultramage> yes I know, the job's been at it for 20 hours now, I started it before the fix got in and you didn't put any 'abort' button into the UI
<JAA> Harry's machine ran out of disk space, so it was disabled as a target. I'm not sure what the status on it is.
<einswenig> tumblr-blog:randomreduxsmut and tumblr-blog:hockeywives need reset
<Kaz> when did it run out? and who said it was a disk space issue?
<JAA> Someone was reporting rate-limiting issues at 600 crawlers on one IP IIRC.
<kiska> It wasn't out of space
<Kaz> it's sitting at 5tb free
<kiska> He wanted to replace the HDD with SSD's since the HDD array is getting thrashed
<Kaz> hah
<JAA> Oh, sorry, misunderstood that.
<JAA> And now FOS is getting thrashed. :-)
<kiska> xD
<Kaz> fos is in a permanent state of abuse
<kiska> Right, I'll need to regenerate that cookie
<JAA> Yeah, it's a very abusive relationship.
<arkiver> kiska: do we want logged in tumblrs?
<Kaz> HCross: ETA on if/when yours will be available?
<arkiver> just noticed the branch UAtesting
<arkiver> or is that for some other reason than logging in
<_niklas> way faster
<kiska> arkiver: Its not just logged in tumblr's, GoogleBot UA is also rate limiting us
<_niklas> googlebot UA gets us low priority responses
<arkiver> I see
<kiska> Silently, by making responses take a long time
<arkiver> nice find
<HCross> Kaz: later this evening
<marked> is this dangerous if that cookie gets invalidated?
<Kaz> cool cheers
<_niklas> as in used incorrectly?
<_niklas> doesn't seem to invalidate it
<marked> say, reset, then all crawlers will die simultaneously
<kiska> Well I could make it that it gets a new cookie every time pipeline.py runs, I guess
<marked> that sounds safer
<HCross> Kaz: at least 4 hours from here
<caff> y'all got enough logins for archiving logged in blogs?
<marked> then the u:p goes in the code?
<arkiver> I´m not sure if we want to go after blogs that require a login
<chauffer> diggan, how do you have so many compooters to run this on
<marked> user : pass
<arkiver> kiska: this is about not getting limited right? not about getting blogs requiring a login
<marked> _niklas remind us the difference between the two cookies
<marked> while arkiver is here
<marked> and which blog types are affected
<JAA> arkiver: Why wouldn't we want to get login-required blogs? It's only a registration, not a manual approval or similar.
<hook54321> true
<kiska> ^
<JAA> So that content is practically still public.
<arkiver> SketchCow: what do we think of blogs that require a login to be viewed? also see JAA comments a few lines up
<marked> if we change UA from bot to browser, the crawl goes fast
<marked> but then it'll require 2 extra steps
<hook54321> Also, iirc some NSFW blogs require a login to be viewed as well?
<marked> GDPR and login required blogs
<_niklas> the pfg cookie is EU cookie consent
<_niklas> you absolutely need one if you're not presenting as a spider
<marked> so one combination is GDPR + browser UA, that gets us fast and non-login required blogs
<_niklas> it's user-agent specific (you can't use it with a different user agent than the one you had when tumblr gave one to you)
<hook54321> arkiver: I think he's asleep
<marked> it'll redirect on required login blogs
<_niklas> more optionally there's pfx, the session cookie, for login-required and marked nsfw blogs
<_niklas> the latter of which we can crawl without login by presenting as googlebot
<marked> some details were added to issue #2 https://github.com/ArchiveTeam/tumblr-grab/issues/2
<_niklas> >Well I could make it that it gets a new cookie every time pipeline.py runs
<_niklas> distributing the credentials then?
<kiska> Yes
<kiska> Unless you're willing to make throwaway ones
<kiska> At that point it'll also be restricted to people who run scripts
<Kaz> kiska: do you have an rsync target
<hook54321> If this were a smaller project with less people involved distributing credentials could maybe be fine, but it's attracted lots of new people.
<kiska> I need collection upload perms, but can set one up
<_niklas> _personally_ I'd be willing to provide my own account
<hook54321> Should probably be a throwaway, not personal
<_niklas> shouldn't be too hard to hack up the dockerfile to support that too
<_niklas> yeah I already made a throwaway for testing anyway
<kiska> Ask someone else to do that, since I have no clue how the docker stuff works
<Kaz> upload perms are the least important bit, is there a machine available
scrottie has joined #tumbledown
<_niklas> I can take care of that if that ends up being a thing we want to do and there's support in the scripts for it
<hook54321> Maybe we could have the script fetch login credentials from a specified URL every time it runs. That way we could update it without people restarting the whole warrior.
<marked> where the credentials are the cookie?
<hook54321> Wasn't thinking that, but that could work too
<hook54321> probably better that way
<marked> that might solve both problems, we can generate as many cookies as needed and not risk the U:P
<erin> oh yay! nothing OOMed overnight
<_niklas> ehh
<marked> if someone messes with a cookie the other ones are still valid
<_niklas> you risk the u/p anyway
<_niklas> actually hmm
<marked> we could also load multiple u/p in to the cookie factory
<_niklas> tumblr probably does require email confirmation for password changes
<_niklas> so that would be better
<kiska> I'll leave it up to arkiver to implement login + UA
<scrottie> tmgioct and pfx cookies change frequently. I think all of the pf* ones are involved in login. not sure how the dance works.
<kiska> It is 4:40am and I need to get a target up
<kiska> SketchCow: Can kiska3 have write perms to the collection? I am setting up a rsync target
bmcmath has quit [Remote host closed the connection]
<super2K> could one of the admins check blog:loves-little-girls and make sure there's no cp, and remove if necessary - uploaded by imperfectionest
<Kaz> pass
<SketchCow> hard pass
bmcmath has joined #tumbledown
<SketchCow> Also, fake blog name
<SketchCow> So likely nonexistent and removed a billion years ago and sitting in a list somewhere and we scooped all lists up
<super2K> Whew
bmcmath has quit [Remote host closed the connection]
<super2K> I was afraid I just uploaded something real bad
<ultramage> leaving just a single job running makes it go much faster... I'm seeing bursts up to 20
<marked> what units?
<urjaman> and "bursts" just makes me think those are the media parts
<caff> yeah, I'd think tumblr zapped all the cp blogs because those caused this entire mess
<diggan> chauffer: it's my secret sauce :)
<super2K> Makes sense. But the paranoia is strong
<boutique_> diggan: do we want to know about your secret sauce? :p
<SketchCow> As http://fos.textfiles.com/ARCHIVETEAM/ can show, tumblr sets are definitely coming in from FOS (obviously this does not track when a second or third host is also injecting)
<chauffer> diggan, if it's secret sauce i can use i hope you share it for the sake of archiving this shit
<chauffer> <:)
<diggan> i can post you some other secret sauce if you want?
<chauffer> sure
<ultramage> how's the staging server doing now?
<diggan> chauffer: not really, just have bunch of hardware
<SketchCow> FOS is slowly filling and hopefully will have other pals soon
<chauffer> kk
fuzzy8021 has joined #tumbledown
<_niklas> I wonder how many rsync slots are just clogged with large blog uploads
<_niklas> I have 10 gigs left on this one upload and it's going at 150kbyte/s right now D:
<diggan> 3 days left and 42444 (so far) items left. That's just 14148 items per day. Seems doable. We should get more items!
<erine> Just made btrfs do something very write intensive to hold off my out of space doom for a while, haha
<Lady> I've been linking the submission sheet everywhere I can think of
<erine> I'm about to lose a node that's doing a large upload!
* erine sweats
<_niklas> lose?
<marked> someone said the workers respond to the pause signal
<erine> 98% disk usage
<kiska> Urgh it takes like a billion years for scaleway to boot
<erine> I'm tip toeing the line between "oh fuck" and utter doom
<ultramage> oh damn... need to restart the network, will the rsync upload survive the outage? will it if I suspend the vm?
<_niklas> just kill a worker that's not uploading shit
<erine> kiska: Give your server 2-4 150G volumes while you still can
<ultramage> yes you can kill -TSTP <pid> to suspend individual jobs
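A minimal sketch of what `kill -TSTP <pid>` and `kill -CONT <pid>` do, written in Python for POSIX systems. The helper names `suspend` and `resume` and the injectable `send` parameter are hypothetical and exist only to make the example easy to exercise. Whether a given worker actually stops depends on how it handles SIGTSTP.

```python
import os
import signal


def suspend(pid, send=os.kill):
    """Ask a process to stop, like `kill -TSTP <pid>`."""
    send(pid, signal.SIGTSTP)


def resume(pid, send=os.kill):
    """Let a stopped process continue, like `kill -CONT <pid>`."""
    send(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Dry run with a made-up PID: print the signals instead of sending them.
    def show(pid, sig):
        print(f"would send {signal.Signals(sig).name} to {pid}")

    suspend(12345, send=show)
    resume(12345, send=show)
```

Suspending the items that have just started and leaving the nearly finished ones running, as marked suggests earlier, frees memory and disk pressure without losing the work already done.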
<kiska> ... I gave them like 8 block devices, it didn't even mount them, time to delete and remake
<erine> Assigned in the volumes page of the node?
<urjaman> ultramage: uhh, i think the upload connection would be lost but it would retry if you suspend it...
<erine> "Attach an existing volume" should work
<urjaman> (if you shutdown it, all is lost)
<erine> given that you power off the node first
<marked> _niklas : your crawled URLs are slightly larger than GoogleBot and 'ArchiveTeam' UA
tungol has quit [Quit: Leaving.]
<_niklas> slightly as in?
<psi> it apparently also takes 5000 years for scaleway to send a validation email
<super2K> If anyone is within an hour of Omaha, I can deliver a physical server
<erine> welcome to scaleway - you'll eventually get an account and servers but that'll take a good while
<_niklas> I noticed some few kb differences between non-spider UAs
<kiska> It also takes them 5000 years to provision a node
<psi> heh
<super2K> SketchCow: I know you collect software, but does an imac g3 or an emac g4 or powermac g4 spark your interest?
<kiska> This is the second time I had to reprovision cause the block devices don't mount and won't mount
<kiska> I had this issue during their "beta" testing phase
<erine> adding the volume does not work?
<_niklas> it takes them about 50,000 to suspend and resume one
<kiska> Hopefully try number 3 works
<kiska> Yes "Attach existing device" refuses to work
<erine> What node type are you using? I believe that some node types have a 100G limit instead of a 150G
haveagr8d has joined #tumbledown
<kiska> C2L
<trvz> wait, are you seriously scaleway'ing this now
<kiska> Yep
<erine> ah, lol scaleway then
<kiska> If this fails, I will spin up a AWS instance with 100gbit link
<erine> I'll pray for your wallet.
Seong has joined #tumbledown
<_niklas> learning experience I suppose
<mbp> are you doing the rsync target?
<kiska> Yes
<kiska> I also have write perms
<_niklas> (really though, I'm curious how that'll perform)
<kiska> Luckily I have $100 in aws credit, so might not be that bad for my wallet
<kiska> I'll also try and use spot instancing
<trvz> enough to upload 1TB
<pnJay> I can PayPal you some too kiska
<erine> actually good idea with the spots; prices are at 1.16/h instead of 3.88/h!
<psi> ...
<psi> of course you can only pay with card at scaleway
<_niklas> can you just decommission an rsync target without data loss?
<SketchCow> super2K: No hardware please and this is not the channel!
<kiska> Of course an error has occurred...
<kiska> Right.... AWS time!
<erine> c5n.18xlarge @ us-east-2 is at $.684/h but that seems like a huge trap since the other regions are at $3.88/h
<trvz> did you try digitalocean or vultr with their block volumes?
<Fusl> diggan: you're getting very close into my personal space http://xor.meo.ws/5a006ce8/c94d/4fc3/95e9/c9a4ec819d1f.png
<diggan> haha, I know! was looking at that as well
<kiska> DO and Vultr have lowish transfer, from memory so I want to avoid being charged extra for bandwidth
<diggan> you need to add some more warriors soon ;)
<mbp> the 1TiB club
<diggan> once another upload target is added, things will fly by is my guess
<trvz> perhaps you two could stop while there's this bandwidth issue going on?
<Fusl> kiska: BUT you're spinning up an AWS instance. have you ever looked at their bandwidth pricing?
<kiska> I would hold off on adding more warriors
<diggan> trvz: someone told me to keep it running in the meantime
<kiska> I am using AWS lightsail I have the credit for that
<diggan> i have not added/removed anything
<Fusl> better go with hetzner cloud
<trvz> it had 150Mbps upload to IA
<kiska> Not good upload
<super2K> SketchCow: My apologies
<SketchCow> I appreciate being offered hardware.
<blueacid> For AWS, really really really use lightsail for this
<blueacid> you get included bandwidth
<blueacid> otherwise egress is like $0.08/GB
<blueacid> which looks cheap at a glance but, run the numbers for 2TB of upload and get back to me :P
<erine> enough money for a dedi with OVH!
<blueacid> Hint: It's going to annihilate your $100 credit on that alone, without even paying for a VM or disk
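The numbers blueacid asks for can be run as a small sketch, assuming 1 TB = 1000 GB and a flat rate of $0.08 per GB with no free tier or volume discount. `egress_cost_usd` is a hypothetical helper, and the rate is the rough figure quoted above, not a current price list.

```python
from decimal import Decimal


def egress_cost_usd(terabytes, usd_per_gb):
    """Flat-rate egress cost, assuming 1 TB = 1000 GB and no free tier."""
    gigabytes = Decimal(terabytes) * 1000
    return gigabytes * Decimal(usd_per_gb)


if __name__ == "__main__":
    cost = egress_cost_usd(2, "0.08")
    credit = Decimal("100")
    print(f"2 TB at $0.08/GB costs ${cost:.2f}, which is ${cost - credit:.2f} over a $100 credit")
```

Under these assumptions 2 TB of egress comes to $160, so the $100 credit is gone before any VM or disk charges are counted.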
<erine> enough money for two dedis with OVH!
<erine> or more?
Ryz has joined #tumbledown
<_niklas> enough money to just buy tumblr straight up so we don't have this problem? :^)
<super2K> Google offers a $300 credit with their cloud services when you sign up
<scrottie> Verizon declared it worthless which is I guess always how old people feel about youth culture
<erine> great for warriors, probably not great for the rsync ingress
<erine> google egress is just as bad
<Fusl> also do note that lightsail counts inbound and outbound data to your data transfer limit
<Fusl> unlike others where only outbound is calculated
<trvz> I expect you'll run into IO issue with lightsail
<trvz> as in, rate limitation by AWS
<erine> 0.01/GB in the US, 0.08-0.19/GB depending on the other region
<Fusl> that as well. they are general purpose EBS volumes with lots of iops limit
<Fusl> but burstable at a very high speed
<_niklas> erine: that 0.01 is within google cloud
<Kenshin> if there's a need for target, use rsync.archiveteam.kenshin.sg::archiveteam/tumblr/
<trvz> kiska: if you want, I can get a dedi at OVH - let me know in 30 min in which of their locations
<erine> 0.085/GB :(
<erine> just as bad for 2 TB of egress
<marked> what does this have to do besides receive rsync files ?
<pnJay> if we're signing up for cloud services we should be using referral accounts too, to maximize the free credit within the project
<Kenshin> kiska: rsync.archiveteam.kenshin.sg::archiveteam/tumblr/
Lady has quit [Quit: Leaving]
<Fusl> how much TB of storage would i need to run an rsync target for this?
<Fusl> wow, that's a lot of storage needed
<marked> _niklas : in 3 crawls it varies: 3956, 3958, 3957, but nobody's checked whether crawls with the same code are exactly the same. I believe they're not, with each page varying slightly
<Fusl> trvz: you happened to run one until it got full, right?
<marked> googlebot, google chrome, 'ArchiveTeam'
<_niklas> you're talking about req numbers in 9volt-art, yeah?
<marked> yes, approximately
<_niklas> my 'google chrome' crawl also hit one of those
<_niklas> (from memory, didn't save the log)
<marked> when I compared the urls coverage there was only 1 entry different
<_niklas> what was it?
<Kaz> Kenshin: thanks
<Kaz> Will add it now - is it just one box or the global set?
<Kenshin> Kaz: it's the usual global, 6x10TB spread globally, with a 60TB fallback in SG
<Kaz> thank you!
<_niklas> nice
<Kenshin> all nodes 10G connected
<Kenshin> i also have a server in SG that has 7TB SSDs for packing later on, someone just needs to deal with the upload later
<marked> WARC-Target-URI: http://9volt-art.tumblr.com/tagged/
<marked> our current master grabs
<kiska> How is bandwidth to IA or at least SF?
<Kaz> I can deal with upload if you want.. think my key was on the box if it's the same one used in the past?
<marked> but the chrome and googlebot do not hit on
<_niklas> holy shit look at the ticker go, did you add it already?
<Kenshin> Kaz: same box, but it's not the one with the SSD
<Kenshin> but you can still use it. that's the 60TB fallback box
<mbp> all of the bullshit 1mb blogs are flying in
<super2K> Holy microblog pukeout
<kiska> Yep new rsync target added
<kiska> I am going to add mine shortly as well
<Kaz> FOS disabled temporarily to smooth things out
<Kenshin> kiska: 10G to IA via LAX
<erine> hold my drink, i'm gonna drop a 12 GB warc
<kiska> Yay!
<Kenshin> though it usually tops out at around 2G or so
<super2K> erine: YOU CAN DO IT!
<Kaz> guess when the target got enabled.. https://usercontent.irccloud-cdn.com