Extracting links from an html document

  Asked By: Bakir    Date: Jan 14    Category: Java    Views: 872
  


I'm trying to extract links from an HTML document using HTMLEditorKit,
but it seems so fragile that it throws exceptions whenever the page is
not well-formed.
Is there a way to make it more tolerant of malformed markup? Or is there
another good approach, maybe a regex?


1 Answer Found

 
Answer #1    Answered By: Cay Nguyen     Answered On: Jan 14

Do try a regex. A quick Google search turned up this pattern at
http://sastools.com/b2/index.php?m=2002&w=46

(?:[hH][rR][eE][fF]\s*=)
(?:[\s"']*)
(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)
(.*?)(?:[\s>"'])
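In case it helps, here is a minimal sketch of how that pattern might be used from Java with java.util.regex. The class name LinkExtractor and the method extractLinks are made up for illustration. It assumes the four fragments are meant to be concatenated into one pattern and that the doubled quotes in the posted version stand for a single " character. Because the quote-skipping group can backtrack, the pattern can still produce an empty capture for an excluded link such as "#top" or "mailto:", so the sketch skips empty captures. Note also that the .*css and .*this\. alternatives look at the rest of the current line, so a wanted link followed by "css" on the same line would be rejected too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // The pattern from the answer, concatenated into one expression
    // (assumption: the fragments belong together, and "" means ").
    private static final Pattern HREF = Pattern.compile(
          "(?:[hH][rR][eE][fF]\\s*=)"
        + "(?:[\\s\"']*)"
        + "(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\\.)"
        + "(.*?)(?:[\\s>\"'])");

    // Returns the non-empty href values captured by the pattern, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            // Excluded links (#anchor, mailto:) can still match with an
            // empty capture after backtracking, so drop those.
            if (!link.isEmpty()) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
              "<link href=\"style.css\">\n"
            + "<a href=\"http://example.com/a.html\">A</a> "
            + "<a HREF='b.html'>B</a> "
            + "<a href=\"#top\">Top</a> "
            + "<a href=\"mailto:x@example.com\">Mail</a>";
        // Prints http://example.com/a.html and b.html, one per line.
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

A regex like this never throws on malformed markup, but it also knows nothing about comments, scripts or unusual quoting, so treat the result as a best-effort list rather than a faithful parse.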

 