
Removing Styling From Html

I have a database full of product descriptions that have been entered riddled with horrible computer generated HTML and littered with different styling information...style attribut

Solution 1:

I would recommend using XSLT to strip off all unwanted content. A simple identity template would be a good starting point.

Solution 2:

What about PHP's strip_tags function?

The annoying part is that you have to list every tag you want to preserve in the second argument (a string such as '<p><br>', or an array as of PHP 7.4), but you only have to write it once.

Note that strip_tags leaves the attributes of the tags it keeps untouched. To remove tag attributes (bgcolor, etc.), somebody made this function here, which could be worth a look, but mind the dodgy double quotes on that page. There's a link at the bottom to download the code without WordPress formatting.
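To illustrate the combined idea (keep an allow-list of tags and throw away every attribute), here is a rough sketch in Python rather than PHP, using only the standard library. The class name AllowListCleaner, the function clean, and the default tag list are made up for this example; it is not the function linked above.

```python
from html import escape
from html.parser import HTMLParser


class AllowListCleaner(HTMLParser):
    """Re-emit HTML, keeping only allow-listed tags and dropping all attributes."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored on purpose: style, bgcolor, etc. all disappear.
        if tag in self.allowed:
            self.out.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit a bare opening tag only.
        if tag in self.allowed:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is kept even when its enclosing tag is dropped
        # (this includes the text inside <script>, so add handling if that matters).
        self.out.append(escape(data, quote=False))


def clean(html, allowed=("p", "br", "ul", "li", "b")):
    """Return html with non-allowed tags and all attributes removed."""
    parser = AllowListCleaner(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Called as clean('<p style="color:red"><font face="Arial"><b>Mont Blanc</b></font></p>'), this keeps the p and b tags, drops the font tag, and removes the style attribute. HTML comments are silently discarded by the parser.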

Solution 3:

Thanks to @Paul's idea here is an example in Excel. This is very rough and also needs to be modified depending on how you are storing your HTML in Excel; but hopefully it will get you started.

This example supposes a few things:

  1. You have first installed the TidyATL COM object (click the link that says 'wrapper'; you can register it on 64-bit Win 7 by first copying the DLL into C:\Windows\SysWOW64 and running regsvr32 C:\Windows\SysWOW64\TidyATL.dll).

  2. Your Excel Project has references to Microsoft XML 6.0, and Tidy 1.0 Type Library

  3. Your HTML is stored in Cell A1 of Sheet 1. Results are put into Cell B1. You can easily extend this idea to iterate through all used cells in a column and process all the HTML at once.

  4. I have zero experience writing XSLT. I ripped the 'identity template' directly from here. I had never used XSLT before today; so maybe someone who knows it can edit the XSLT to strip out the <font> nodes. This example just strips out all of the attributes.

This uses Tidy HTML to convert your ugly HTML into XHTML, then applies an XSLT template to the result.

EDIT: sorry, screwed up the "match" attribute in the XSLT. Was: match='@*|node()' should be: match='node()'

Here's the code I used:

Sub TidyUp()

    Dim t As TidyATL.TidyDocument

    Dim sXSLT

    sXSLT = "<?xml version='1.0' encoding='ISO-8859-1'?>" & _
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" & _
        "<xsl:template match='node()'>" & _
        "  <xsl:copy>" & _
        "    <xsl:apply-templates select='node()'/>" & _
        "  </xsl:copy>" & _
        "</xsl:template>" & _
        "</xsl:stylesheet>"

    Set t = New TidyATL.TidyDocument
    t.ParseString Sheet1.Range("A1").Value
    t.SetOptBool TidyXmlOut, True
    t.SetOptBool TidyXhtmlOut, True
    t.SetOptBool TidyNumEntities, True
    t.SetOptBool TidyXmlDecl, True

    t.CleanAndRepair


    Dim x As MSXML2.DOMDocument
    Dim x2 As MSXML2.FreeThreadedDOMDocument
    Dim xe As MSXML2.IXMLDOMParseError
    Set x = New MSXML2.DOMDocument
    Set x2 = New MSXML2.FreeThreadedDOMDocument

    'Load XHTML into a DOM
    x.LoadXML t.SaveString
    Set xe = x.parseError
    If xe.ErrorCode <> 0 Then
        MsgBox "Err: " & xe.reason
        End
    End If

    'Load XSLT into a DOM
    x2.LoadXML sXSLT
    Set xe = x2.parseError
    If xe.ErrorCode <> 0 Then
        MsgBox "Err: " & xe.reason
        End
    End If

    Dim xt As XSLTemplate
    Set xt = New XSLTemplate
    Set xt.stylesheet = x2

    Dim xp As IXSLProcessor

    Set xp = xt.createProcessor
    xp.input = x
    xp.transform

    Sheet1.Range("B1").Value = xp.output
End Sub

Here's the result (still ugly but with no attributes):

<?xml version="1.0" encoding="UTF-16"?><html xmlns="http://www.w3.org/1999/xhtml"><head><meta></meta><title></title></head><body><div><table><tbody><tr><td><p><font><b>Mont
Blanc Scott Roof mounted cycle bike carrier<br></br><br></br>
 Part Number: 728540</b></font></p></td><td><a><img></img></a></td></tr><tr><td><b><font><script>
//
    &lt;!--function click() { if (event.button==2) { alert('All graphics, descriptions and other information, including the HTML code of this listing are the property of XXXX Limited and may not be reproduced in any form without the express permission of XXXX Limited. Email us: sales@XXXX.com'); } } document.onmousedown=click // --&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt;&lt;!----&gt; --&gt;
//</script></font></b><div><center><table><tbody><tr><td><p><img></img></p></td><td><p><font><u><strong>Mont Blanc</strong></u></font><u><strong><font>Scott Roof
Bar Rack 1 Cycle Carrier</font></strong></u></p></td><td><img></img></td></tr><tr><td><hr></hr><p><img></img></p><p><font><b>Scott</b></font></p><ul><li>Stylish, easy to use roof mounted cycle carrier, distinctive
oval carrying bar.<br></br></li><li>Extra Soft Frame clamps hold cycle safely and gently<br></br></li><li>Extra wide wheel holders take the fattest tyres<br></br></li><li>Strong Webbing straps fasten wheels securely to
carrier<br></br></li><li><font>Upright, roof bar mounted, locking
cycle carrier<br></br></font></li><li><font> Locks to roof rails and
locks bikes<br></br></font></li><li><font> Quick and easy to
use<br></br></font></li><li><font>Adjustable for most cycle
styles</font></li></ul><center><table><tbody><tr><td><p><a><img></img></a></p></td><td>To view Fitting Instructions in
PDF format please click the spanner</td></tr></tbody></table><table><tbody><tr><td><font>Technical data</font></td><td><p><font>Mont</font> Blanc Scott</p><p><img></img></p></td></tr><tr><td><div>Max number of bikes</div></td><td><div>1</div></td></tr><tr><td><div>Load capacity (kg)</div></td><td><div>15 KG</div></td></tr><tr><td><div>Weight (kg)</div></td><td><div>2.2KG</div></td></tr><tr><td><div>Fits frame-dimensions (mm)</div></td><td>Up to 80mm</td></tr><tr><td><div>Fits wheel-dimensions</div></td><td><div>All</div></td></tr><tr><td><div>Locks bikes to carrier</div></td><td><div>Yes</div></td></tr><tr><td><div>Locks carrier to car</div></td><td><div>Yes</div></td></tr><tr><td><div>Tilt function, with bikes</div></td><td><div>NA</div></td></tr><tr><td><div>TÜV/EuroBE approved</div></td><td><div>NA</div></td></tr><tr><td><div>Fullfills City Crash norms</div></td><td><div>NA</div></td></tr><tr><td><div>Miscellaneous</div></td><td><div><p>Fits all types of Roof Bars,</p></div></td></tr></tbody></table><p><font>The cycle carrier
is guaranteed for Five year from date of purchase.<br></br><br></br>
 We stock a wide range of towbars and towing accessories.
<a><br></br>
Click here to email us</a> if you require details of our other
towing equipment.</font></p><hr></hr></center></td></tr></tbody></table></center></div><b><br></br>
 Please note that with the Type of cycle carrier where you mount
it<br></br>
 onto a flange ball you may need the long reach ball which
will<br></br>
 allow you enough clearance from the bumper</b></td></tr><tr><td><a><img></img></a><b><font>Not from the UK ? Click
   the flag to purchase this item from our EU site</font></b><a><img></img></a></td></tr></tbody></table></div></body></html>

EDIT: This XSLT seems to do the trick; it removes some tags with their content, and some tags without their content, whichever you specify. Again maybe someone with some XSLT knowledge can elaborate.

<?xml version='1.0' encoding='ISO-8859-1'?>
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <xsl:template match='node()|@*'>
    <xsl:copy>
      <xsl:apply-templates select='node()'/>
    </xsl:copy>
  </xsl:template>

  <!--these tags will be removed with their content-->
  <xsl:template match='xhtml:script|xhtml:head'/>

  <!--these tags will be removed but keep their content-->
  <xsl:template match='xhtml:font|xhtml:p|xhtml:b|xhtml:u|xhtml:i|xhtml:center|xhtml:a|xhtml:img|xhtml:strong'>
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>

Result:

<?xml version="1.0" encoding="UTF-16"?><html xmlns="http://www.w3.org/1999/xhtml"><body><div><table><tbody><tr><td>Mont
Blanc Scott Roof mounted cycle bike carrier<br></br><br></br>
 Part Number: 728540</td><td></td></tr><tr><td><div><table><tbody><tr><td></td><td>Mont BlancScott Roof
Bar Rack 1 Cycle Carrier</td><td></td></tr><tr><td><hr></hr>Scott<ul><li>Stylish, easy to use roof mounted cycle carrier, distinctive
oval carrying bar.<br></br></li><li>Extra Soft Frame clamps hold cycle safely and gently<br></br></li><li>Extra wide wheel holders take the fattest tyres<br></br></li><li>Strong Webbing straps fasten wheels securely to
carrier<br></br></li><li>Upright, roof bar mounted, locking
cycle carrier<br></br></li><li> Locks to roof rails and
locks bikes<br></br></li><li> Quick and easy to
use<br></br></li><li>Adjustable for most cycle
styles</li></ul><table><tbody><tr><td></td><td>To view Fitting Instructions in
PDF format please click the spanner</td></tr></tbody></table><table><tbody><tr><td>Technical data</td><td>Mont Blanc Scott</td></tr><tr><td><div>Max number of bikes</div></td><td><div>1</div></td></tr><tr><td><div>Load capacity (kg)</div></td><td><div>15 KG</div></td></tr><tr><td><div>Weight (kg)</div></td><td><div>2.2KG</div></td></tr><tr><td><div>Fits frame-dimensions (mm)</div></td><td>Up to 80mm</td></tr><tr><td><div>Fits wheel-dimensions</div></td><td><div>All</div></td></tr><tr><td><div>Locks bikes to carrier</div></td><td><div>Yes</div></td></tr><tr><td><div>Locks carrier to car</div></td><td><div>Yes</div></td></tr><tr><td><div>Tilt function, with bikes</div></td><td><div>NA</div></td></tr><tr><td><div>TÜV/EuroBE approved</div></td><td><div>NA</div></td></tr><tr><td><div>Fullfills City Crash norms</div></td><td><div>NA</div></td></tr><tr><td><div>Miscellaneous</div></td><td><div>Fits all types of Roof Bars,</div></td></tr></tbody></table>The cycle carrier
is guaranteed for Five year from date of purchase.<br></br><br></br>
 We stock a wide range of towbars and towing accessories.
<br></br>
Click here to email us if you require details of our other
towing equipment.<hr></hr></td></tr></tbody></table></div><br></br>
 Please note that with the Type of cycle carrier where you mount
it<br></br>
 onto a flange ball you may need the long reach ball which
will<br></br>
 allow you enough clearance from the bumper</td></tr><tr><td>Not from the UK ? Click
   the flag to purchase this item from our EU site</td></tr></tbody></table></div></body></html>
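For anyone who would rather not use XSLT, the same two rules (some tags removed together with their content, others removed while their content is kept, and all attributes dropped) can be sketched in a few lines of Python with the standard library. This is only an illustration and assumes the input is already well-formed XHTML, for example the output of Tidy; the names DROP, UNWRAP and render are invented here, and the sketch also drops the xmlns declaration, which the XSLT keeps.

```python
import xml.etree.ElementTree as ET
from html import escape

# Removed together with everything inside them.
DROP = {"script", "head"}
# Tag removed, but its text and child elements are kept.
UNWRAP = {"font", "p", "b", "u", "i", "center", "a", "img", "strong"}


def render(elem):
    """Serialise elem recursively, without attributes, applying DROP and UNWRAP."""
    tag = elem.tag.rsplit("}", 1)[-1]  # strip the "{namespace}" prefix, if any
    if tag in DROP:
        return ""
    inner = escape(elem.text or "", quote=False)
    for child in elem:
        # A child's tail text belongs to the parent, so it survives
        # even when the child itself is dropped or unwrapped.
        inner += render(child) + escape(child.tail or "", quote=False)
    if tag in UNWRAP:
        return inner
    return f"<{tag}>{inner}</{tag}>"


def clean_xhtml(xhtml):
    """Parse a well-formed XHTML string and return the cleaned markup."""
    return render(ET.fromstring(xhtml))
```

Comments in the input are skipped by the default ElementTree parser, so they do not appear in the output.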

Solution 4:

This regex should give you expected results, but I haven't tested it:

preg_replace('/(<[^>]*?)\s+style="[^"]*"([^>]*>)/i', '$1$2', $yourhtml);
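The same idea, sketched in Python for illustration and equally unproven against real-world data: find each tag first, then delete any style attribute inside it, so that a literal "style=" in the visible text is left alone. The names TAG, STYLE_ATTR and strip_style are made up here, and the sketch assumes attribute values are quoted and never contain a ">" character.

```python
import re

# Anything that looks like a tag: "<", then no ">" characters, then ">".
TAG = re.compile(r"<[^>]+>")

# A style attribute with a double- or single-quoted value, plus the
# whitespace in front of it.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_style(html):
    """Remove style="..." attributes, but only inside tags."""
    return TAG.sub(lambda m: STYLE_ATTR.sub("", m.group(0)), html)
```

Other presentational attributes (bgcolor, align and so on) would need their own patterns, which is where a proper parser starts to look more attractive than regexes.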

Solution 5:

I think that the regex needed could be much simpler than you are imagining, but then again, I don't know what the product descriptions are like. What are the chances of encountering < and > in the descriptions, aside from as part of HTML tags? If the chances are very small, could something like this not do the trick?

$new_description = preg_replace('/<[^>]+>/', '', $description);
