Tuesday, October 30, 2018

Web scraper now working

I posted it to the right. There is some spaghetti, but it works, though it is still untested on large, complex files; that is next. It came in at 200 lines of code, no headers, all in one file. It works from a filename. Some of the spaghetti can be optimized away, so I am not worried. Most of the spaghetti is about finding those tags.

I will be testing it on ever larger files in a while. Here is the test file:

 <html>

 htmltext starting
 <body>
 <font> Font text

 <b> Bold text </b>
 <h1> header is now <b> bold</b></h1>
 </font>
 <script> Script text to be ignored</script>
 <h1> Another header is now <h> header embedded</></>
 Body text
 </body>
 html ending
 </html>

Simple, but it includes beginning and trailing text, skipped text, and nested html blocks. It returns all the correct text. I notice that my Notepad++ performs the same trick on html files, identifying plain text and bolding it while keeping script text unbolded. So I am not the first, and we prove that you do not need a full html parser to extract the plain text.

You do not actually need the tag name in the closing tag; a bare "</>" is enough, although browsers may require the name. My system assumes a consistent html grammar, so there is no need to identify which closing tag belongs to which opening tag.
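
Here is a rough sketch of the idea in Python. It is not the 200-line program posted to the right, just an illustration of the same trick: push every opening tag, pop on any closing tag without looking at its name, and keep text only when no skipped tag is open. The names extract_text and SKIP_TAGS are made up for this example, and treating style like script is my assumption. It also assumes every opening tag gets a close, so unclosed tags such as a bare br would throw the count off.

```python
# Hypothetical sketch: extract plain text without a full html parser.
# Assumption: tags whose text should be ignored.
SKIP_TAGS = {"script", "style"}


def extract_text(html):
    """Return the plain-text chunks of html, ignoring text inside SKIP_TAGS."""
    stack = []    # names of currently open tags
    chunks = []   # collected plain text, one entry per text run
    pos = 0
    while pos < len(html):
        lt = html.find("<", pos)
        gt = html.find(">", lt) if lt != -1 else -1
        if lt == -1 or gt == -1:
            # No further complete tag: the rest of the input is text.
            text, tag, pos = html[pos:], None, len(html)
        else:
            text, tag, pos = html[pos:lt], html[lt + 1:gt].strip(), gt + 1

        # Keep the text unless some open tag is one we skip.
        if text.strip() and not any(name in SKIP_TAGS for name in stack):
            chunks.append(" ".join(text.split()))

        if tag is None:
            continue
        if tag.startswith("/"):
            # Closing tag: pop the most recent open tag, name not checked,
            # so "</>" works as well as "</b>".
            if stack:
                stack.pop()
        elif not tag.endswith("/") and not tag.startswith("!"):
            # Opening tag: remember its lowercased name.
            stack.append(tag.split()[0].lower() if tag else "")
    return chunks


if __name__ == "__main__":
    sample = ("<html> start <body> <b> Bold </b> <script> skip me</script> "
              "<h1> Head <h> inner</></> Body </body> end </html>")
    print(extract_text(sample))
```

On the small sample in the sketch, the script text is dropped and the two anonymous closers pop the h and h1 tags in order, so Body and end still come out as plain text.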
