Splitting HTML String Into Two Parts With HtmlAgilityPack

July 29, 2022 Post a Comment

I'm looking for the best way to split an HTML document over some tag in C# using HtmlAgilityPack. I want to preserve the intended markup as I'm doing the split. Here is an example.

Solution 1:

Definitely not by regex. (Note: this was originally a tag on the question—now removed.) I'm usually not one to jump on The Pony is Coming bandwagon, but this is one case in which regular expressions would be particularly bad.

First, I would write a recursive function that removes all siblings of a node that follow that node—call it RemoveSiblingsAfter(node)—and then calls itself on its parent, so that all siblings following the parent are removed as well (and all siblings following the grandparent, and so on). You can use an XPath to find the node(s) on which you want to split, e.g. doc.DocumentNode.SelectNodes("//a[@href='#']"), and call the function on that node. When done, you'd remove the splitting node itself, and that's it. You'd repeat these steps for a copy of the original document, except you'd implement RemoveSiblingsBefore(node) to remove siblings that precede a node.

In your example, RemoveSiblingsBefore would act as follows:

<a href="#"> has no siblings, so recurse on parent, <li>.
<li> has a preceding sibling—<li>Bullet 1</li>—so remove, and recurse on parent, <ul>.
<ul> has no siblings, so recurse on parent, <p>.
<p> has a preceding sibling—<p>Stuff</p>—so remove, and recurse on parent, <div>.
and so on.

Solution 2:

Here is what I came up with. This does the split and removes the "empty" elements of the element where the split happens.

    private static void SplitDocument()
    {
        var doc = new HtmlDocument();
        doc.Load("HtmlDoc.html");
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        var firstPart = GetFirstPart(doc.DocumentNode, links[0]).DocumentNode.InnerHtml;
        var secondPart = GetSecondPart(links[0]).DocumentNode.InnerHtml;
    }

    private static HtmlDocument GetFirstPart(HtmlNode currNode, HtmlNode link)
    {
        var nodeStack = new Stack<Tuple<HtmlNode, HtmlNode>>();        
        var newDoc = new HtmlDocument();
        var parent = newDoc.DocumentNode;

        nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(currNode, parent));

        while (nodeStack.Count > 0)
        {
            var curr = nodeStack.Pop();
            var copyNode = curr.Item1.CloneNode(false);
            curr.Item2.AppendChild(copyNode);

            if (curr.Item1 == link)
            {
                var nodeToRemove = NodeAndEmptyAncestors(copyNode);
                nodeToRemove.ParentNode.RemoveChild(nodeToRemove);
                break;
            }

            for (var i = curr.Item1.ChildNodes.Count - 1; i >= 0; i--)
            {
                nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(curr.Item1.ChildNodes[i], copyNode));
            }
        }

        return newDoc;
    }

    private static HtmlDocument GetSecondPart(HtmlNode link)
    {
        var nodeStack = new Stack<HtmlNode>();
        var newDoc = new HtmlDocument();

        var currNode = link;
        while (currNode.ParentNode != null)
        {
            currNode = currNode.ParentNode;
            nodeStack.Push(currNode.CloneNode(false));
        }

        var parent = newDoc.DocumentNode;
        while (nodeStack.Count > 0)
        {
            var node = nodeStack.Pop();
            parent.AppendChild(node);
            parent = node;
        }

        var newLink = link.CloneNode(false);
        parent.AppendChild(newLink);

        currNode = link;
        var newParent = newLink.ParentNode;

        while (currNode.ParentNode != null)
        {
            var foundNode = false;
            foreach (var child in currNode.ParentNode.ChildNodes)
            {
                if (foundNode) newParent.AppendChild(child.Clone());
                if (child == currNode) foundNode = true;
            }

            currNode = currNode.ParentNode;
            newParent = newParent.ParentNode;
        }

        var nodeToRemove = NodeAndEmptyAncestors(newLink);
        nodeToRemove.ParentNode.RemoveChild(nodeToRemove);

        return newDoc;
    }

    private static HtmlNode NodeAndEmptyAncestors(HtmlNode node)
    {
        var currNode = node;
        while (currNode.ParentNode != null && currNode.ParentNode.ChildNodes.Count == 1)
        {
            currNode = currNode.ParentNode;
        }

        return currNode;
    }

Html5 Developer

Splitting HTML String Into Two Parts With HtmlAgilityPack

Solution 1:

Solution 2:

Post a Comment for "Splitting HTML String Into Two Parts With HtmlAgilityPack"