The Google Language API used to be the usual choice for language detection, but it is now a paid service. An alternative is the Text_LanguageDetect PEAR package, which supports 52 languages. Here is an example that detects the language of a Lithuanian text and prints the list of supported languages:
<?php
header('Content-Type: text/plain; charset=utf-8');
require_once 'Text/LanguageDetect.php';

$l = new Text_LanguageDetect();

// Example text for language detection
$text = 'O mergina, vienplaukė, palaidomis kasomis, atlapu kaklu, smulkiu, bet tvirtu žingsniu ėjo toliau, artyn prie ežero.';

// Detect how close the sample is to the known languages (4 best matches)
$result = $l->detect($text, 4);
print_r($result);

// Distribution of Unicode blocks in the given UTF-8 string
$blocks = $l->detectUnicodeBlocks($text, true);
print_r($blocks);

// Language name
$l->setNameMode(0);
echo $l->detectSimple($text)."\n";

// ISO 639-1 two-letter language code
$l->setNameMode(2);
echo $l->detectSimple($text)."\n";

// ISO 639-2 three-letter language code
$l->setNameMode(3);
echo $l->detectSimple($text)."\n";

// List of supported languages
$l->setNameMode(0);
echo "Supported languages:\n";
$langs = $l->getLanguages();
sort($langs);
print_r($langs);

// Total number of supported languages
echo count($langs);
Output:
Array
(
    [lithuanian] => 0.24584192439863
    [latvian] => 0.19567010309278
    [estonian] => 0.11316151202749
    [dutch] => 0.11240549828179
)
Array
(
    [Basic Latin] => 89
    [Latin Extended-A] => 4
)
lithuanian
lt
lit
Supported languages:
Array
(
    [0] => albanian
    [1] => arabic
    [2] => azeri
    [3] => bengali
    [4] => bulgarian
    [5] => cebuano
    [6] => croatian
    [7] => czech
    [8] => danish
    [9] => dutch
    [10] => english
    [11] => estonian
    [12] => farsi
    [13] => finnish
    [14] => french
    [15] => german
    [16] => hausa
    [17] => hawaiian
    [18] => hindi
    [19] => hungarian
    [20] => icelandic
    [21] => indonesian
    [22] => italian
    [23] => kazakh
    [24] => kyrgyz
    [25] => latin
    [26] => latvian
    [27] => lithuanian
    [28] => macedonian
    [29] => mongolian
    [30] => nepali
    [31] => norwegian
    [32] => pashto
    [33] => pidgin
    [34] => polish
    [35] => portuguese
    [36] => romanian
    [37] => russian
    [38] => serbian
    [39] => slovak
    [40] => slovene
    [41] => somali
    [42] => spanish
    [43] => swahili
    [44] => swedish
    [45] => tagalog
    [46] => turkish
    [47] => ukrainian
    [48] => urdu
    [49] => uzbek
    [50] => vietnamese
    [51] => welsh
)
52
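The scores for related languages can be close (Lithuanian and Latvian above), so it may be worth checking how far the best match is ahead of the runner-up before trusting it. Here is a minimal sketch, assuming a score array shaped like the detect() output shown above and PHP 7.1 or later. The helper name top_language_if_clear and the 0.03 default margin are hypothetical and not part of Text_LanguageDetect:

```php
<?php
// Hypothetical helper (not part of Text_LanguageDetect): return the
// best-scoring language only when it beats the runner-up by at least
// $minGap, otherwise return null.
function top_language_if_clear(array $scores, float $minGap = 0.03): ?string
{
    if (!$scores) {
        return null;                 // nothing was detected
    }
    arsort($scores);                 // highest score first, keys preserved
    $names  = array_keys($scores);
    $values = array_values($scores);
    if (count($values) > 1 && $values[0] - $values[1] < $minGap) {
        return null;                 // too close to call
    }
    return $names[0];
}

// Scores copied from the output above.
$scores = array(
    'lithuanian' => 0.24584192439863,
    'latvian'    => 0.19567010309278,
    'estonian'   => 0.11316151202749,
    'dutch'      => 0.11240549828179,
);
var_dump(top_language_if_clear($scores));        // default margin 0.03
var_dump(top_language_if_clear($scores, 0.10));  // stricter margin
```

With these scores the gap between the first two languages is about 0.05, so the first call returns lithuanian and the second returns null. A suitable margin depends on the length and kind of the texts, so treat the values here as placeholders.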
The second example detects the language of a web page:
<?php
header('Content-Type: text/plain; charset=utf-8');
require_once 'Text/LanguageDetect.php';

$l = new Text_LanguageDetect();
mb_internal_encoding("UTF-8");

// Example page
$url = "http://lt.wikipedia.org/wiki/Kalba";
$page = file_get_contents($url);

// Parse the page charset from the meta tag
preg_match('/<meta[^>]+charset=[\'"]*([a-z0-9\-]+)[\'"]*/i', $page, $a);
print_r($a);
if (!$a) {
    $charset = "UTF-8";
} else {
    $charset = strtoupper($a[1]);
}

// Remove whitespace, HTML tags and JavaScript from the page content
$search = array(
    '#<script[^>]*?>.*?</script>#si',   // Strip out javascript
    '#<style[^>]*?>.*?</style>#siU',    // Strip style tags properly
    '#<[\/\!]*?[^<>]*?>#si',            // Strip out HTML tags
    '#<![\s\S]*?--[ \t\n\r]*>#',        // Strip multi-line comments including CDATA
    '#\s\s+#'                           // Strip whitespace
);
$content = preg_replace($search, '', $page);

// Convert to UTF-8 first if necessary, so that mb_substr() below counts
// characters rather than bytes of another encoding
if ($charset != "UTF-8") {
    $content = iconv($charset, "UTF-8", $content);
}

// The first 200 characters of text should be enough for language detection
$content = mb_substr($content, 0, 200);

// Output content
echo $content."\n";

// ISO 639-1 two-letter language code
$l->setNameMode(2);
echo $l->detectSimple($content)."\n";

// Closest languages
$result = $l->detect($content, 4);
print_r($result);

// Distribution of Unicode blocks
$blocks = $l->detectUnicodeBlocks($content, true);
print_r($blocks);
Output:
Array
(
    [0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"
    [1] => UTF-8
)
Kalba – VikipedijaKalbaStraipsnis iš Vikipedijos, laisvosios enciklopedijos.Peršokti į: navigaciją,paieškąVikisritis: KalbosKalba – lingvistinių ženklų sistema. Svarbiausia kalbos paskirtis – būti žmo
lt
Array
(
    [lt] => 0.23644295302013
    [lv] => 0.14548098434004
    [et] => 0.14234899328859
    [la] => 0.12302013422819
)
Array
(
    [Basic Latin] => 161
    [General Punctuation] => 3
    [Latin Extended-A] => 11
)
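The meta tag is not the only place where a page declares its charset. Servers often send it in the Content-Type HTTP header as well. Here is a sketch, assuming PHP 7 or later and that the response header lines are available as an array of strings (after file_get_contents() on an HTTP URL, PHP fills the local variable $http_response_header with them). The function name declared_charset is hypothetical:

```php
<?php
// Hypothetical helper: prefer the charset from the HTTP Content-Type header,
// then fall back to the <meta> tag, then to a default.
function declared_charset(array $headers, string $html, string $default = 'UTF-8'): string
{
    $fromHeader = null;
    foreach ($headers as $line) {
        // Keep the last match: after redirects the array holds the headers
        // of every response, and the final one describes the page itself.
        if (preg_match('/^Content-Type:.*?charset=["\']?([a-z0-9_\-]+)/i', $line, $m)) {
            $fromHeader = strtoupper($m[1]);
        }
    }
    if ($fromHeader !== null) {
        return $fromHeader;
    }
    // Same idea as the meta tag pattern in the example above.
    if (preg_match('/<meta[^>]+charset=[\'"]*([a-z0-9\-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Made-up inputs for illustration.
$headers = array('HTTP/1.1 200 OK', 'Content-Type: text/html; charset=windows-1251');
$html    = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

echo declared_charset($headers, $html), "\n";                  // header wins
echo declared_charset(array(), $html), "\n";                   // meta tag only
echo declared_charset(array(), '<p>no declaration</p>'), "\n"; // default
```

In the page example this would be called as declared_charset($http_response_header, $page) right after file_get_contents(). Preferring the header over the meta tag is a design choice in this sketch. Either declaration can be wrong.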
Download text language detection source code.
Thanks for this code!
Hello,
Is there a way to detect charset encoding (not only language)?
The problem is that a single language can use several encodings. Russian text, for example, may be in windows-1251, KOI8-R, UTF-8, and so on.
So we need to know the exact charset in order to convert the text with iconv.
Thank you.
Hello,
I just extract the charset from the meta tag; see the preg_match call in the second example. If the page declares no charset, the text itself has to be analyzed. For example, I found this function for detecting the encoding of Cyrillic text: http://forum.dklab.ru/viewtopic.php?t=37830
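To illustrate what "analyzing the text" can mean, here is a small heuristic sketch. It is not the function from the linked forum thread, and the name guess_cyrillic_encoding is hypothetical. It assumes the mbstring extension, PHP 7 or later, a source file saved as UTF-8, and input that is Russian text in UTF-8, Windows-1251 or KOI8-R. It relies on the fact that the bytes Windows-1251 uses for lowercase letters are the ones KOI8-R uses for uppercase letters, and vice versa, so the wrong decoding yields text that is mostly uppercase:

```php
<?php
// Hypothetical heuristic: guess whether Russian text is encoded as
// UTF-8, Windows-1251 or KOI8-R.
function guess_cyrillic_encoding(string $bytes): string
{
    // Single-byte Cyrillic text is almost never well-formed UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    $best = 'Windows-1251';
    $bestScore = PHP_INT_MIN;
    foreach (array('Windows-1251', 'KOI8-R') as $candidate) {
        // Decode under this candidate and count letter case.
        $utf8  = mb_convert_encoding($bytes, 'UTF-8', $candidate);
        $lower = (int) preg_match_all('/[а-я]/u', $utf8);
        $upper = (int) preg_match_all('/[А-Я]/u', $utf8);
        // Ordinary prose is mostly lowercase, so the right decoding
        // should have the larger lowercase surplus.
        if ($lower - $upper > $bestScore) {
            $bestScore = $lower - $upper;
            $best = $candidate;
        }
    }
    return $best;
}

// Usage: encode a UTF-8 sample into each charset and guess it back.
$sample = 'привет мир, это проверка кодировки';
echo guess_cyrillic_encoding(mb_convert_encoding($sample, 'Windows-1251', 'UTF-8')), "\n";
echo guess_cyrillic_encoding(mb_convert_encoding($sample, 'KOI8-R', 'UTF-8')), "\n";
echo guess_cyrillic_encoding($sample), "\n";
```

The heuristic can be fooled by text written entirely in capitals, and it only covers these three encodings, so treat the result as a guess.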
Thank you Aleksandras,
I know about the charset meta tag, but sometimes webmasters either omit it, or the declared charset does not match the real encoding.
Your link will help with Russian encodings (thanks for it 🙂), but I am developing Russian and English services, and for the English one I must detect all languages.
As far as I know, several encodings exist not only for Russian but probably for some other languages too (for example, Chinese).
How can these encodings be detected?
I don't know anything about charset detection, but I found an article about it: http://en.wikipedia.org/wiki/Charset_detection
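One simple check that needs no statistics is to test whether the bytes are well-formed in the declared charset and, if not, try a list of fallbacks. Here is a sketch, assuming the mbstring extension, PHP 7.1 or later, a source file saved as UTF-8, and charset names that mbstring recognizes (see mb_list_encodings()). The function name first_valid_charset is hypothetical. This catches mismatches for multibyte encodings such as UTF-8 or Shift_JIS. Single-byte charsets accept almost any byte sequence, so telling those apart still needs a statistical approach like the ones the article describes:

```php
<?php
// Hypothetical helper: return the first charset under which the bytes are
// well-formed, trying the declared charset first, or null if none fits.
// Assumes every name is one that mbstring knows.
function first_valid_charset(string $bytes, string $declared, array $fallbacks): ?string
{
    foreach (array_merge(array($declared), $fallbacks) as $charset) {
        if (mb_check_encoding($bytes, $charset)) {
            return $charset;
        }
    }
    return null;
}

// Usage: Japanese text stored as Shift_JIS but wrongly declared as UTF-8.
$sjis = mb_convert_encoding('日本語のテキスト', 'SJIS', 'UTF-8');
var_dump(first_valid_charset($sjis, 'UTF-8', array('SJIS')));
var_dump(first_valid_charset('plain ascii', 'UTF-8', array('SJIS')));
```

The order of the fallback list matters, because the same bytes can be well-formed in more than one charset. The first call returns SJIS, since the bytes are not valid UTF-8. The second returns UTF-8, since plain ASCII is valid in the declared charset.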