I will be testing it on ever larger files in a while. Here is the test file:
<html>
htmltext starting
<body>
<font> Font text
<b> Bold text </b>
<h1> header is now <b> bold</b></h1>
</font>
<script> Script text to be ignored</script>
<h1> Another header is now <h> header embedded</></>
Body text
</body>
html ending
</html>
Simple, but it includes leading and trailing text, skipped text, and nested html blocks. It returns all the correct text. I notice that my Notepad++ performs the same trick on html files, identifying plain text and bolding it while keeping script text unbolded. So I am not the first, and it proves that you do not need a full html parser to extract the plain text.
You do not actually need the tag name in the closing tag (a bare "</>" works, as in the test file), although browsers may require it. My system assumes a consistent html grammar, so there is no need to identify which closing tag belongs to which opening tag.
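
To make the idea concrete, here is a minimal Python sketch of this kind of extractor. It is not my actual code; the function name extract_text and the skipped tuple are made up for illustration. It assumes every opening tag gets exactly one closing tag, so void tags like <br>, comments, and a ">" inside script text would all confuse it.

```python
def extract_text(html, skipped=("script", "style")):
    """Return the plain-text runs of html, ignoring text inside `skipped` tags."""
    open_tags = []  # stack of open tag names; any closing tag simply pops it
    chunks = []
    pos = 0
    while pos < len(html):
        lt = html.find("<", pos)
        end = len(html) if lt == -1 else lt
        text = html[pos:end].strip()
        # Keep the text unless some enclosing tag is in the skipped set.
        if text and not any(name in skipped for name in open_tags):
            chunks.append(text)
        if lt == -1:
            break  # no more tags: the trailing text was handled above
        gt = html.find(">", lt)
        if gt == -1:
            break  # unterminated tag: stop scanning
        tag = html[lt + 1:gt].strip()
        if tag.startswith("/"):
            # Closing tag: its name is never checked, so "</>" is enough.
            if open_tags:
                open_tags.pop()
        elif tag and not tag.endswith("/"):
            # Opening tag: remember its name (attributes are dropped).
            open_tags.append(tag.split()[0].lower())
        pos = gt + 1
    return chunks


sample = ("<html>intro<body><b>Bold</b><script>skip me</script>"
          "<h1>Head<h>inner</></></body>outro</html>")
print(extract_text(sample))
# prints ['intro', 'Bold', 'Head', 'inner', 'outro']
```

The stack is only there to know whether we are inside a skipped block. The closing tags never get matched by name, which is why the anonymous closers go through without trouble.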