GordonFreeman | hi |
rindolf | Hi GordonFreeman |
GordonFreeman | grep -Po '(?<=<a )(?<! href=)(?<= href=["]*)[^">]+' <<< '<a gfasg href=asdf>' |
GordonFreeman | grep: lookbehind assertion is not fixed length |
rindolf | GordonFreeman: grep -P is PCRE; it's not Perl. |
rindolf | perlbot: pcre |
Altreus | GordonFreeman: don't use regex for HTML |
perlbot | rindolf: PCRE is not Perl. It lacks several features of Perl regexes. Don't bother asking for help with a PCRE pattern in a Perl channel as the answers will not be relevant. Try #regex, or the channel for your language. See also http://en.wikipedia.org/wiki/PCRE#Differences_from_Perl and LPBD. |
GordonFreeman | but this should work i think. |
mauke | no, it shouldn't |
GordonFreeman | though it fails at the second lookbehind ... |
mauke | no, it doesn't |
GordonFreeman | and fails at "* too |
GordonFreeman | (grep -Po '<a +.* +href="*[^" >]+' | grep -Po '(?=<a ).*' | grep -Po '(?<= href=)["]*[^" >]+') <<< '<a gfasg href=asdf><a fgfgg="hi> " href="link" >' |
GordonFreeman | this works. |
mauke | GordonFreeman: dude. |
anno | don't paste! |
GordonFreeman | hi mauke |
apeiron | where's mauke's car? |
rindolf | apeiron: :-) |
mauke | it's a cdr |
Altreus | I watched that the other day |
rindolf | pkrumins: what's up? |
Altreus | I don't really know why |
mauke | GordonFreeman: go to a channel where that is on-topic |
GordonFreeman | mauke<< like? |
mauke | no idea |
Altreus | where on earth is parsing HTML with regexes on topic? |
GordonFreeman | ahem ok |
Altreus | except ##php lolol |
GordonFreeman | well i think one can see it's logical and it works like this |
rindolf | GordonFreeman: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 |
shorten | rindolf's url is at http://xrl.us/bf4jh6 |
apeiron | GordonFreeman, also, -P isn't perl. |
thrig | Altreus: some special level of hell, between the angry ghosts and the hungry ghosts |
rindolf | perlbot: html |
apeiron | the grep docs lie to you. |
perlbot | rindolf: Don't parse or modify html with regular expressions! See one of HTML::Parser's subclasses: HTML::TokeParser, HTML::TokeParser::Simple, HTML::TreeBuilder(::Xpath)?, HTML::TableExtract etc. If your response begins "that's overkill. i only want to..." you are wrong. http://en.wikipedia.org/wiki/Chomsky_hierarchy and http://xrl.us/bf4jh6 for why not to use regex on HTML |
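[Editor's note] The advice above is to hand the markup to a real parser instead of a regex. The bot names Perl's HTML::Parser family; as a rough sketch of the same idea, here is the equivalent using Python's standard-library `html.parser`. The names `HrefCollector` and `extract_hrefs` are made up for this illustration, and the sample input is the string GordonFreeman pasted earlier.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all <a> tags in html, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Sample input from the log: a bare attribute, an unquoted value,
# and a quoted attribute value that itself contains '>'.
sample = '<a gfasg href=asdf><a fgfgg="hi> " href="link" >'
hrefs = extract_hrefs(sample)
```

The parser tracks quoting itself, so the `>` inside `fgfgg="hi> "` does not end the tag early, which is the case the chained greps had to work around.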
LeoNerd | Altreus: Why, surely in #html-parsing-by-regexp |
Altreus | if you want perl regex use ack |
Altreus | surely |
rindolf | LeoNerd: sounds like programmers' hell. |
anno | perl regex doesn't support variable-length lookbehind either |
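[Editor's note] The error GordonFreeman hit comes from the quantifier inside the lookbehind (`"*` can match zero or more characters). Python's `re` module is neither Perl nor PCRE, but it has a similar fixed-width restriction, so it can serve as a stand-in to show the problem and one common workaround: match the prefix normally and capture only the part you want. The helper names `compiles` and `href_values` are hypothetical.

```python
import re


def compiles(pattern):
    """Return True if Python's re accepts the pattern, False if it rejects it."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Fixed-width lookbehind: always exactly six characters.
FIXED = r'(?<= href=)[^" >]+'
# Variable-width lookbehind: the optional quotes make the width unknown.
VARIABLE = r'(?<= href="*)[^" >]+'

fixed_ok = compiles(FIXED)
variable_ok = compiles(VARIABLE)


def href_values(text):
    """Workaround: consume the prefix and capture the value instead of looking behind."""
    return re.findall(r' href="*([^" >]+)', text)


values = href_values('<a gfasg href=asdf><a fgfgg="hi> " href="link" >')
```

This is still regex-on-HTML, so it inherits every objection raised in the channel; it only illustrates why a capture group sidesteps the lookbehind limit.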
Altreus | apeiron: actually it says it's highly experimental and hence not working |
Altreus | it could well be Perl and not PCRE when finished :) |
Altreus | not that "perl regex" is a defined term, the speed Perl is moving |
yrlnry | That's why you should never use Perl's builtin regexes. Just write your own package, it's sure to be more reliable. |
rindolf | yrlnry: :-) |
talexb | Heh. |
LeoNerd | use re::engine::vim; |
rindolf | yrlnry++ |
Altreus | LeoNerd: is it core? |
yrlnry | HOP has a nice implementation. It works by generating a list of every string matched by the regex, and looking to see if your target string is in the list. |
LeoNerd | I can't help thinking that may not be optimal in terms of CPU or memory usage |
talexb | yrlnry, no doubt they have a Cray working on generating the list .. |
yrlnry | LeoNerd: Depends; unlike Perl regexes, it has no trouble handling languages higher up the Chomsky hierarchy |
yrlnry | It is guaranteed to return the right answer for any recursive language, and guaranteed to return correct 'matched' answers for any recursively enumerable language. |
LeoNerd | Oh sure... |
LeoNerd | In terms of CS guarantees it's very nice |
yrlnry | So if you are in a big hurry to get the wrong answer... |
LeoNerd | But I live in the practical pragmatic world |
LeoNerd | E.g. Parser::MGC is horribly slow at backtracking and whatnot, but I write parsers in it because those are still fast for "reasonably" sized inputs, parsers are fast to write, and I like having lots of side-effects and dynamic logic -in- Perl |
Altreus | Unfortunately my universe doesn't have infinite processing speeds and data storage |
anno | a universe with infinite processing speed would have processed you by now |
Altreus | and |
Altreus | would have processed my grandchildren too |
yrlnry | This algorithm doesn't need infinite speed or storage. |
yrlnry | It works slowly, but finitely. |
Altreus | what |
yrlnry | The infinite list is lazily generated and you never have more than one of its elements in memory at any time. |
rindolf | yrlnry: is it sorted by length? |
yrlnry | You will learn this sort of technique after you have been programming in Perl for eight months or so. |
Altreus | how do you know when it doesn't match |
Altreus | yrlnry: :D |
yrlnry | rindolf: it is sorted by length, and lexicographically among strings of the same length. |
rindolf | yrlnry: ah. |
yrlnry | Of course, you cannot do the length-sorting thing for arbitrary languages, but for regex languages there is no trouble. |
yrlnry | http://hop.perl.plover.com/book/pdf/06InfiniteStreams.pdf |
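[Editor's note] A minimal sketch of the idea yrlnry describes: lazily enumerate the strings of a regex language in length order, then lexicographically, and look for the target in that stream. This is not the code from the HOP chapter. As a simplifying assumption it enumerates every string over a small explicit alphabet and filters with `re.fullmatch`, rather than generating strings from the regex itself. The names `shortlex`, `language`, and `is_member` are hypothetical.

```python
import re
from itertools import count, product, takewhile


def shortlex(alphabet):
    """Lazily yield every string over alphabet: shortest first, then lexicographic."""
    letters = sorted(alphabet)
    for length in count(0):
        for chars in product(letters, repeat=length):
            yield "".join(chars)


def language(pattern, alphabet):
    """Lazy, possibly infinite stream of the strings fully matched by pattern."""
    compiled = re.compile(pattern)
    return (s for s in shortlex(alphabet) if compiled.fullmatch(s))


def is_member(target, pattern, alphabet):
    """Scan the language in shortlex order; stop once candidates outgrow the target."""
    compiled = re.compile(pattern)
    # Because the stream is sorted by length, nothing past len(target) can equal it.
    bounded = takewhile(lambda s: len(s) <= len(target), shortlex(alphabet))
    return any(s == target for s in bounded if compiled.fullmatch(s))
```

The length ordering is what answers Altreus's question about non-matches: once the enumeration passes the target's length, the search can stop and report failure. Only one candidate is held at a time, but the running time grows exponentially with the target's length.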
LeoNerd | Eh.. |
LeoNerd | I dunno. I just dislike purely RE-based parsing |
LeoNerd | I much prefer code doing it |
GordonFreeman | why can't perl regexp do variable length lookbehind matching? |
Altreus | See originally I ignored you because it sounded like you were talking shit |
LeoNerd | Limit of the implementation |
Altreus | mainly because it is possible to construct a regex with an infinite range that nevertheless won't match a particular string |
anno | GordonFreeman: who knows? looks like it's hard to implement with the given engine |
mauke | GordonFreeman: unclear semantics and no one's bothered to write the code |
GordonFreeman | i see |
Altreus | Plus, there's a fucking lot of Unicode to create strings out of |
LeoNerd | It's not "hard" to implement. It's impossible given the algorithm being used |
mauke | LeoNerd: why impossible? |
yrlnry | LeoNerd: I don't think that's true. It could be done using a recursive call to the regex engine now that that is possible. |
GordonFreeman | but lookbehind is cool |
LeoNerd | Oooh.. yes.. I suppose it could do that now |
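[Editor's note] One way to picture the "recursive call to the regex engine" suggestion: at each candidate position, run a second, separate match to check whether some substring ending there satisfies the lookbehind pattern. The sketch below emulates that from outside the engine in Python. It is an illustration of the idea only, not how Perl or any real engine implements it, and the function names are hypothetical.

```python
import re


def lookbehind_holds(behind, text, pos):
    """True if some substring of text ending at pos fully matches behind."""
    compiled = re.compile(behind)
    # Try every possible start, nearest first; each try is a nested engine call.
    return any(compiled.fullmatch(text, start, pos) for start in range(pos, -1, -1))


def search_with_lookbehind(behind, main, text):
    """Find the first match of main whose start is preceded by a match of behind."""
    main_re = re.compile(main)
    for pos in range(len(text) + 1):
        if lookbehind_holds(behind, text, pos):
            match = main_re.match(text, pos)
            if match:
                return match
    return None


quoted = search_with_lookbehind(r' href="*', r'[^" >]+', '<a x href="link" >')
bare = search_with_lookbehind(r' href="*', r'[^" >]+', '<a gfasg href=asdf>')
```

Each position can trigger up to `pos + 1` extra matches, so this costs far more than a fixed-width lookbehind, which only has to step back a known number of characters.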
GordonFreeman | it's like a reverse regexp that can be excluded |
anno | vim re's do it |
LeoNerd | vim uses a different type of engine |
anno | right |
yrlnry | Altreus: I was talking shit. After eight months you get a license to do that. |
mauke | really? |
Altreus | yrlnry: but there's a pdf |
yrlnry | where's a PDF? |
Altreus | 17:10 < yrlnry> http://hop.perl.plover.com/book/pdf/06InfiniteStreams.pdf |
yrlnry | Yes. |
Altreus | I didn't open it or anything |
mauke | no one opens PDFs |
yrlnry | PDFs are for cowards and Slavs. |
Altreus | but it lent enough credence to your words that I decided to believe your spurious claims |
Altreus | Actually someone did a test the other day |
yrlnry | Oh, does "talking shit" mean "making up nonsense"? Then I was not talking shit. |
Altreus | He linked someone to articles supporting his viewpoint and they changed their mind |
yrlnry | It is in section 6.5, "regex string generation". |
Altreus | but one of the articles was an argument against himself |
Altreus | Showing that it is enough to cite your sources to be believed; not many people will actually bother to check them |
Altreus | yrlnry: what do you normally think "talking shit" means? |
Altreus | are you confusing it with shooting the shit |
yrlnry | I'm not sure. |
Altreus | are you foreign |
yrlnry | Yes. |
Altreus | ok then |
mauke | hahaha |