How to Get a Webpage's Title When Pages Are Encoded Differently

I have a method that downloads a webpage and extracts the title tag, but depending on the website, the result can be encoded or in the wrong character set. Is there a bulletproof way

Solution 1:

The charset is not always present in the Content-Type header, so we must also check the meta tags and, if it is not there either, fall back to UTF-8 (or something else). The title itself might also be HTML-encoded, so it needs to be decoded.

The code below comes from the GitHub project Abot. I have modified it a little.

private string GetUrlTitle(Uri uri)
{
    string title = "";

    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage response = client.GetAsync(uri).Result;

        if (!response.IsSuccessStatusCode)
        {
            throw new Exception(response.ReasonPhrase);
        }

        var contentStream = response.Content.ReadAsStreamAsync().Result;
        // ContentType is null when the server sends no Content-Type header.
        var charset = response.Content.Headers.ContentType?.CharSet ?? GetCharsetFromBody(contentStream);

        Encoding encoding = GetEncodingOrDefaultToUTF8(charset);
        string content = GetContent(contentStream, encoding);

        Match titleMatch = Regex.Match(content, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase);

        if (titleMatch.Success)
        {
            title = titleMatch.Groups["Title"].Value;

            // HTML-decode the title in case it contains entities such as &amp;
            title = WebUtility.HtmlDecode(title).Trim();
        }
    }

    if (string.IsNullOrWhiteSpace(title))
    {
        title = uri.ToString();
    }

    return title;
}

private string GetContent(Stream contentStream, Encoding encoding)
{
    contentStream.Seek(0, SeekOrigin.Begin);

    using (StreamReader sr = new StreamReader(contentStream, encoding))
    {
        return sr.ReadToEnd();
    }
}

/// <summary>
/// Try getting the charset from the body content.
/// </summary>
/// <param name="contentStream"></param>
/// <returns></returns>
private string GetCharsetFromBody(Stream contentStream)
{
    contentStream.Seek(0, SeekOrigin.Begin);

    StreamReader srr = new StreamReader(contentStream, Encoding.ASCII);
    string body = srr.ReadToEnd();
    string charset = null;

    if (body != null)
    {
        //find expression from : http://stackoverflow.com/questions/3458217/how-to-use-regular-expression-to-match-the-charset-string-in-html
        Match match = Regex.Match(body, @"<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s""']*)?([^>]*?)[\s""';]*charset\s*=[\s""']*([^\s""'/>]*)", RegexOptions.IgnoreCase);

        if (match.Success)
        {
            charset = string.IsNullOrWhiteSpace(match.Groups[2].Value) ? null : match.Groups[2].Value;
        }
    }

    return charset;
}

/// <summary>
/// Try parsing the charset or fall back to UTF-8.
/// </summary>
/// <param name="charset"></param>
/// <returns></returns>
private Encoding GetEncodingOrDefaultToUTF8(string charset)
{
    Encoding e = Encoding.UTF8;

    if (charset != null)
    {
        try
        {
            e = Encoding.GetEncoding(charset);
        }
        catch
        {
            // Unknown or unsupported charset name: keep the UTF-8 default.
        }
    }

    return e;
}
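
The fallback chain described above (header charset, then meta tag, then UTF-8) can be tried without any network access. The following is only a minimal sketch under stated assumptions: `TitleDemo` and `ExtractTitle` are hypothetical names that do not appear in the answer, the meta regex is a simplified stand-in for the one used above, and the sample page is built in memory.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class TitleDemo
{
    // Hypothetical helper: pick an encoding (header, then meta tag, then UTF-8),
    // decode the raw bytes with it, and return the HTML-decoded <title> text.
    public static string ExtractTitle(byte[] rawHtml, string headerCharset)
    {
        string charset = headerCharset;

        if (string.IsNullOrWhiteSpace(charset))
        {
            // HTML markup is ASCII-compatible, so an ASCII pass is enough to find the meta tag.
            string sniff = Encoding.ASCII.GetString(rawHtml);

            // Simplified pattern (an assumption, not the regex from the answer).
            Match meta = Regex.Match(sniff, @"<meta[^>]*?charset\s*=\s*[""']?([^\s""'/>;]+)", RegexOptions.IgnoreCase);
            if (meta.Success)
            {
                charset = meta.Groups[1].Value;
            }
        }

        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrWhiteSpace(charset))
        {
            try
            {
                encoding = Encoding.GetEncoding(charset);
            }
            catch (Exception ex) when (ex is ArgumentException || ex is NotSupportedException)
            {
                // Unknown or unsupported charset name: keep the UTF-8 default.
            }
        }

        string html = encoding.GetString(rawHtml);
        Match title = Regex.Match(html, @"<title\b[^>]*>\s*([\s\S]*?)\s*</title>", RegexOptions.IgnoreCase);

        return title.Success ? WebUtility.HtmlDecode(title.Groups[1].Value).Trim() : "";
    }

    public static void Main()
    {
        // A page encoded as ISO-8859-1 that declares its charset only in a meta tag.
        string source = "<html><head><meta charset=\"iso-8859-1\"><title>Caf\u00e9 &amp; Cr\u00e8me</title></head></html>";
        byte[] page = Encoding.GetEncoding("iso-8859-1").GetBytes(source);

        // No header charset is supplied, so the meta tag decides the encoding.
        Console.WriteLine(ExtractTitle(page, null));
    }
}
```

Passing `null` as the header charset forces the meta-tag lookup. Decoding the same bytes as UTF-8 instead would turn the accented characters into replacement characters, which is the symptom described in the question.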

Solution 2:

You can read all the bytes and convert them to a string with whatever encoding you want, using the Encoding class. It would be something like this:

private async Task<string> GetUrlTitle(Uri uri)
{
    string title = "";

    using (HttpClient client = new HttpClient())
    {
        // Download the raw bytes so the decoding step is under our control.
        byte[] byteData = await client.GetByteArrayAsync(uri);

        // UTF-8 is used here; substitute any other Encoding as needed.
        string html = Encoding.UTF8.GetString(byteData);

        title = Regex.Match(html, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
    }

    return title;
}

I hope it helps; if it does, please mark it as the answer.

Solution 3:

This may help you out. Use System.Globalization:

using System;
using System.Globalization;

public class Example
{
    public static void Main()
    {
        string[] values = { "a tale of two cities", "gROWL to the rescue",
                            "inside the US government", "sports and MLB baseball",
                            "The Return of Sherlock Holmes", "UNICEF and         children"};

        TextInfo ti = CultureInfo.CurrentCulture.TextInfo;
        foreach (var value in values)
            Console.WriteLine("{0} --> {1}", value, ti.ToTitleCase(value));
    }
}

Check this out: https://msdn.microsoft.com/en-us/library/system.globalization.textinfo.totitlecase(v=vs.110).aspx
