Using CefSharp.Offscreen to retrieve a web page that requires Javascript to render

I know I am doing some archaeology reviving a 2yo post, but a detailed answered may be of use for someone else.

So yes, Cefsharp.Offscreen is fit to the task.

Here under is a class which will handle all the browser activity.

using System;
using System.IO;
using System.Threading;
using CefSharp;
using CefSharp.OffScreen;

namespace [whatever]
{
    public class Browser
    {

        /// <summary>
        /// The browser page
        /// </summary>
        public ChromiumWebBrowser Page { get; private set; }
        /// <summary>
        /// The request context
        /// </summary>
        public RequestContext RequestContext { get; private set; }

        // chromium does not manage timeouts, so we'll implement one
        private ManualResetEvent manualResetEvent = new ManualResetEvent(false);

        public Browser()
        {
            var settings = new CefSettings()
            {
                //By default CefSharp will use an in-memory cache, you need to     specify a Cache Folder to persist data
                CachePath =     Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData), "CefSharp\\Cache"),
            };

            //Autoshutdown when closing
            CefSharpSettings.ShutdownOnExit = true;

            //Perform dependency check to make sure all relevant resources are in our     output directory.
            Cef.Initialize(settings, performDependencyCheck: true, browserProcessHandler: null);

            RequestContext = new RequestContext();
            Page = new ChromiumWebBrowser("", null, RequestContext);
            PageInitialize();
        }

        /// <summary>
        /// Open the given url
        /// </summary>
        /// <param name="url">the url</param>
        /// <returns></returns>
        public void OpenUrl(string url)
        {
            try
            {
                Page.LoadingStateChanged += PageLoadingStateChanged;
                if (Page.IsBrowserInitialized)
                {
                    Page.Load(url);

                    //create a 60 sec timeout 
                    bool isSignalled = manualResetEvent.WaitOne(TimeSpan.FromSeconds(60));
                    manualResetEvent.Reset();

                    //As the request may actually get an answer, we'll force stop when the timeout is passed
                    if (!isSignalled)
                    {
                        Page.Stop();
                    }
                }
            }
            catch (ObjectDisposedException)
            {
                //happens on the manualResetEvent.Reset(); when a cancelation token has disposed the context
            }
            Page.LoadingStateChanged -= PageLoadingStateChanged;
        }

        /// <summary>
        /// Manage the IsLoading parameter
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void PageLoadingStateChanged(object sender, LoadingStateChangedEventArgs e)
        {
            // Check to see if loading is complete - this event is called twice, one when loading starts
            // second time when it's finished
            if (!e.IsLoading)
            {
                manualResetEvent.Set();
            }
        }

        /// <summary>
        /// Wait until page initialization
        /// </summary>
        private void PageInitialize()
        {
            SpinWait.SpinUntil(() => Page.IsBrowserInitialized);
        }
    }
}

Now in my app I just need to do the following:

public MainWindow()
{
    InitializeComponent();
    _browser = new Browser();
}

private async void GetGoogleSource()
{
    _browser.OpenUrl("http://icanhazip.com/");
    string source = await _browser.Page.GetSourceAsync();
}

And here is the string I get

"<html><head></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">NotGonnaGiveYouMyIP:)\n</pre></body></html>"

If you can't get a headless version of Chromium to help you, you could try node.js and jsdom. Easy to install and play with once you have node up and running. You can see simple examples on Github README where they pull down a URL, run all javascript, including any custom javascript code (example: jQuery bits to count some type of elements), and then you have the HTML in memory to do what you want. You can just do $('body').html() and get a string, like in your pseudo code. (This even works for stuff like generating SVG graphics since that is just more XML tree nodes.)

If you need this as part of a larger C# app that you need to distribute, your idea to use CefSharp.Offscreen sounds reasonable. One approach might be to get things working with CefSharp.WinForms or CefSharp.WPF first, where you can literally see things, then try CefSharp.Offscreen later when this all works. You can even get some JavaScript running in the on-screen browser to pull down body.innerHTML and return it as a string to the C# side of things before you go headless. If that works, the rest should be easy.

Perhaps start with CefSharp.MinimalExample and get that compiling, then tweak it for your needs. You need to be able to set webBrowser.Address in your C# code, and you need to know when the page has Loaded, then you need to call webBrowser.EvaluateScriptAsync(".. JS code ..") with your JavaScript code (as a string) which will do something as described (returning bodyElement.innerHTML as a string).

Using CefSharp.Offscreen to retrieve a web page that requires Javascript to render

Tags:

Javascript

C#

Headless Browser

Cefsharp

Cefsharp.Offscreen

Related

Recent Posts