| GordonFreeman | hi |
| rindolf | Hi GordonFreeman |
| GordonFreeman | grep -Po '(?<=<a )(?<! href=)(?<= href=["]*)[^">]+' <<< '<a gfasg href=asdf>' |
| GordonFreeman | grep: lookbehind assertion is not fixed length |
| rindolf | GordonFreeman: grep -P is PCRE - it's not Perl. |
| rindolf | perlbot: pcre |
| Altreus | GordonFreeman: don't use regex for HTML |
| perlbot | rindolf: PCRE is not Perl. It lacks several features of Perl regexes. Don't bother asking for help with a PCRE pattern in a Perl channel as the answers will not be relevant. Try #regex, or the channel for your language. See also http://en.wikipedia.org/wiki/PCRE#Differences_from_Perl and LPBD. |
| GordonFreeman | but this should work i think. |
| mauke | no, it shouldn't |
| GordonFreeman | though it fails at the second lookbehind ... |
| mauke | no, it doesn't |
| GordonFreeman | and fails at "* too |
| GordonFreeman | (grep -Po '<a +.* +href="*[^" >]+' | grep -Po '(?=<a ).*' | grep -Po '(?<= href=)["]*[^" >]+') <<< '<a gfasg href=asdf><a fgfgg="hi> " href="link" >' |
| GordonFreeman | this works. |
| mauke | GordonFreeman: dude. |
| anno | don't paste! |
| GordonFreeman | hi mauke |
| apeiron | where's mauke's car? |
| rindolf | apeiron: :-) |
| mauke | it's a cdr |
| Altreus | I watched that the other day |
| rindolf | pkrumins: what's up? |
| Altreus | I don't really know why |
| mauke | GordonFreeman: go to a channel where that is on-topic |
| GordonFreeman | mauke<< like? |
| mauke | no idea |
| Altreus | where on earth is parsing HTML with regexes on topic? |
| GordonFreeman | ahem ok |
| Altreus | except ##php lolol |
| GordonFreeman | well i think one can see it's logical and it works like this |
| rindolf | GordonFreeman: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 |
| shorten | rindolf's url is at http://xrl.us/bf4jh6 |
| apeiron | GordonFreeman, also, -P isn't perl. |
| thrig | Altreus: some special level of hell, between the angry ghosts and the hungry ghosts |
| rindolf | perlbot: html |
| apeiron | the grep docs lie to you. |
| perlbot | rindolf: Don't parse or modify html with regular expressions! See one of HTML::Parser's subclasses: HTML::TokeParser, HTML::TokeParser::Simple, HTML::TreeBuilder(::Xpath)?, HTML::TableExtract etc. If your response begins "that's overkill. i only want to..." you are wrong. http://en.wikipedia.org/wiki/Chomsky_hierarchy and http://xrl.us/bf4jh6 for why not to use regex on HTML |
| LeoNerd | Altreus: Why, surely in #html-parsing-by-regexp |
| Altreus | if you want perl regex use ack |
| Altreus | surely |
| rindolf | LeoNerd: sounds like programmers' hell. |
| anno | perl regex doesn't support variable-length lookbehind either |
| Altreus | apeiron: actually it says it's highly experimental and hence not working |
| Altreus | it could well be Perl and not PCRE when finished :) |
| Altreus | not that "perl regex" is a defined term, the speed Perl is moving |
| yrlnry | That's why you should never use Perl's builtin regexes. Just write your own package, it's sure to be more reliable. |
| rindolf | yrlnry: :-) |
| talexb | Heh. |
| LeoNerd | use re::engine::vim; |
| rindolf | yrlnry++ |
| Altreus | LeoNerd: is it core? |
| yrlnry | HOP has a nice implementation. It works by generating a list of every string matched by the regex, and looking to see if your target string is in the list. |
| LeoNerd | I can't help thinking that may not be optimal in terms of CPU or memory usage |
| talexb | yrlnry, no doubt they have a Cray working on generating the list .. |
| yrlnry | LeoNerd: Depends; unlike Perl regexes, it has no trouble handling languages higher up the Chomsky hierarchy |
| yrlnry | It is guaranteed to return the right answer for any recursive language, and guaranteed to return correct 'matched' answers for any recursively enumerable language. |
| LeoNerd | Oh sure... |
| LeoNerd | In terms of CS guarantees it's very nice |
| yrlnry | So if you are in a big hurry to get the wrong answer... |
| LeoNerd | But I live in the practical pragmatic world |
| LeoNerd | E.g. Parser::MGC is horribly slow at backtracking and whatnot, but I write parsers in it because those are still fast for "reasonably" sized inputs, parsers are fast to write, and I like having lots of side-effects and dynamic logic -in- Perl |
| Altreus | Unfortunately my universe doesn't have infinite processing speeds and data storage |
| anno | a universe with infinite processing speed would have processed you by now |
| Altreus | and |
| Altreus | would have processed my grandchildren too |
| yrlnry | This algorithm doesn't need infinite speed or storage. |
| yrlnry | It works slowly, but finitely. |
| Altreus | what |
| yrlnry | The infinite list is lazily generated and you never have more than one of its elements in memory at any time. |
| rindolf | yrlnry: is it sorted by length? |
| yrlnry | You will learn this sort of technique after you have been programming in Perl for eight months or so. |
| Altreus | how do you know when it doesn't match |
| Altreus | yrlnry: :D |
| yrlnry | rindolf: it is sorted by length, and lexicographically among strings of the same length. |
| rindolf | yrlnry: ah. |
| yrlnry | Of course, you cannot do the length-sorting thing for arbitrary languages, but for regex languages there is no trouble. |
| yrlnry | http://hop.perl.plover.com/book/pdf/06InfiniteStreams.pdf |
| LeoNerd | Eh.. |
| LeoNerd | I dunno. I just dislike purely RE-based parsing |
| LeoNerd | I much prefer code doing it |
| GordonFreeman | why can't perl regexp do variable length lookbehind matching? |
| Altreus | See originally I ignored you because it sounded like you were talking shit |
| LeoNerd | Limit of the implementation |
| Altreus | mainly because it is possible to construct a regex with an infinite range that nevertheless won't match a particular string |
| anno | GordonFreeman: who knows? looks like it's hard to implement with the given engine |
| mauke | GordonFreeman: unclear semantics and no one's bothered to write the code |
| GordonFreeman | i see |
| Altreus | Plus, there's a fucking lot of Unicode to create strings out of |
| LeoNerd | It's not "hard" to implement. It's impossible given the algorithm being used |
| mauke | LeoNerd: why impossible? |
| yrlnry | LeoNerd: I don't think that's true. It could be done using a recursive call to the regex engine now that that is possible. |
| GordonFreeman | but lookbehind is cool |
| LeoNerd | Oooh.. yes.. I suppose it could do that now |
| GordonFreeman | it's like a reverse regexp that can be excluded |
| anno | vim re's do it |
| LeoNerd | vim uses a different type of engine |
| anno | right |
| yrlnry | Altreus: I was talking shit. After eight months you get a license to do that. |
| mauke | really? |
| Altreus | yrlnry: but there's a pdf |
| yrlnry | where's a PDF? |
| Altreus | 17:10 < yrlnry> http://hop.perl.plover.com/book/pdf/06InfiniteStreams.pdf |
| yrlnry | Yes. |
| Altreus | I didn't open it or anything |
| mauke | no one opens PDFs |
| yrlnry | PDFs are for cowards and Slavs. |
| Altreus | but it lent enough credence to your words that I decided to believe your spurious claims |
| Altreus | Actually someone did a test the other day |
| yrlnry | Oh, does "talking shit" mean "making up nonsense"? Then I was not talking shit. |
| Altreus | He linked someone to articles supporting his viewpoint and they changed their mind |
| yrlnry | It is in section 6.5, "regex string generation". |
| Altreus | but one of the articles was an argument against himself |
| Altreus | Showing that it is enough to cite your sources to be believed; not many people will actually bother to check them |
| Altreus | yrlnry: what do you normally think "talking shit" means? |
| Altreus | are you confusing it with shooting the shit |
| yrlnry | I'm not sure. |
| Altreus | are you foreign |
| yrlnry | Yes. |
| Altreus | ok then |
| mauke | hahaha |
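
perlbot's advice above (use a real HTML parser instead of a regex) can be sketched roughly as follows. The channel recommends Perl's HTML::Parser family; this sketch uses Python's standard-library `html.parser` instead, purely for illustration. The names `HrefCollector` and `extract_hrefs` are hypothetical, and the input is the string GordonFreeman pasted.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return all <a href=...> values found in the given HTML text."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# The input from the chat: an unquoted href, and a quoted attribute containing '>'.
sample = '<a gfasg href=asdf><a fgfgg="hi> " href="link" >'
print(extract_hrefs(sample))
```

The parser tokenizes attributes itself, so quoted values that contain `>` and unquoted values are handled without any lookbehind.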
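
yrlnry's "generate every string the regex matches, in order of length, and look for the target" idea can be sketched as below. This is not the HOP implementation: HOP builds it from lazy streams, whereas this sketch (in Python, with hypothetical helper names `lit`, `alt`, `cat`, `star`, `strings_of_length`, `enumerate_language`, `matches`) simply computes the matching strings one length at a time. It does show how Altreus's question "how do you know when it doesn't match" is answered: because output is sorted by length, the search can stop once the candidates would be longer than the target.

```python
from functools import lru_cache
from itertools import count, islice


# A tiny regex AST made of hashable tuples (hypothetical representation).
def lit(s):
    return ("lit", s)


def alt(a, b):
    return ("alt", a, b)


def cat(a, b):
    return ("cat", a, b)


def star(a):
    return ("star", a)


@lru_cache(maxsize=None)
def strings_of_length(node, n):
    """Return the set of strings of exactly length n matched by node."""
    kind = node[0]
    if kind == "lit":
        return frozenset([node[1]]) if len(node[1]) == n else frozenset()
    if kind == "alt":
        return strings_of_length(node[1], n) | strings_of_length(node[2], n)
    if kind == "cat":
        # Split the length n between the left and right parts in every way.
        return frozenset(
            x + y
            for k in range(n + 1)
            for x in strings_of_length(node[1], k)
            for y in strings_of_length(node[2], n - k)
        )
    if kind == "star":
        if n == 0:
            return frozenset([""])
        # One non-empty repetition of the body, followed by the star again.
        return frozenset(
            x + y
            for k in range(1, n + 1)
            for x in strings_of_length(node[1], k)
            for y in strings_of_length(node, n - k)
        )
    raise ValueError(f"unknown node kind: {kind!r}")


def enumerate_language(node, max_len=None):
    """Yield matched strings by length, then lexicographically within a length."""
    lengths = count() if max_len is None else range(max_len + 1)
    for n in lengths:
        yield from sorted(strings_of_length(node, n))


def matches(node, target):
    """Search the enumeration; strings longer than target can never equal it."""
    return any(s == target for s in enumerate_language(node, max_len=len(target)))


# a(b|c)* as an AST
regex = cat(lit("a"), star(alt(lit("b"), lit("c"))))
print(list(islice(enumerate_language(regex), 7)))
print(matches(regex, "abcb"), matches(regex, "ba"))
```

As LeoNerd notes in the log, this is nowhere near optimal in CPU or memory; the point is only that the search terminates for regular languages.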