Text language detection with PHP

The Google Language API used to be an option for language detection, but it is now a paid service. As an alternative, I found the Text_LanguageDetect PEAR package, which detects the language of a text and supports 52 languages. Here is an example that detects Lithuanian text and prints the list of supported languages:

<?php
header('Content-Type: text/plain; charset=utf-8');

require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();

//example text for language detection
$text = 'O mergina, vienplaukė, palaidomis kasomis, atlapu kaklu, smulkiu, bet tvirtu žingsniu ėjo toliau, artyn prie ežero.';

//Detects the closeness of a sample of text to the known languages
$result = $l->detect($text, 4);
print_r($result);

//Returns the distribution of unicode blocks in a given utf8 string
$blocks = $l->detectUnicodeBlocks($text, true);
print_r($blocks);

//language name
$l->setNameMode(0);
echo $l->detectSimple($text)."\n";

//ISO 639-1 two-letter language code
$l->setNameMode(2);
echo $l->detectSimple($text)."\n";

//ISO 639-2 three-letter language code
$l->setNameMode(3);
echo $l->detectSimple($text)."\n";

//Supported languages list
$l->setNameMode(0);
echo "Supported languages:\n";
$langs = $l->getLanguages();
sort($langs);
print_r($langs);

//Total number of supported languages
echo count($langs);

Output:

Array
(
    [lithuanian] => 0.24584192439863
    [latvian] => 0.19567010309278
    [estonian] => 0.11316151202749
    [dutch] => 0.11240549828179
)
Array
(
    [Basic Latin] => 89
    [Latin Extended-A] => 4
)
lithuanian
lt
lit
Supported languages:
Array
(
    [0] => albanian
    [1] => arabic
    [2] => azeri
    [3] => bengali
    [4] => bulgarian
    [5] => cebuano
    [6] => croatian
    [7] => czech
    [8] => danish
    [9] => dutch
    [10] => english
    [11] => estonian
    [12] => farsi
    [13] => finnish
    [14] => french
    [15] => german
    [16] => hausa
    [17] => hawaiian
    [18] => hindi
    [19] => hungarian
    [20] => icelandic
    [21] => indonesian
    [22] => italian
    [23] => kazakh
    [24] => kyrgyz
    [25] => latin
    [26] => latvian
    [27] => lithuanian
    [28] => macedonian
    [29] => mongolian
    [30] => nepali
    [31] => norwegian
    [32] => pashto
    [33] => pidgin
    [34] => polish
    [35] => portuguese
    [36] => romanian
    [37] => russian
    [38] => serbian
    [39] => slovak
    [40] => slovene
    [41] => somali
    [42] => spanish
    [43] => swahili
    [44] => swedish
    [45] => tagalog
    [46] => turkish
    [47] => ukrainian
    [48] => urdu
    [49] => uzbek
    [50] => vietnamese
    [51] => welsh
)
52
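
The scores returned by detect() can be close for related languages (Lithuanian 0.246 and Latvian 0.196 in the output above), so it may be worth accepting the top match only when it is clearly ahead. The following is a minimal sketch of that idea, not part of Text_LanguageDetect: pickLanguage() is a hypothetical helper, the two thresholds are arbitrary illustrative values, and the code assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not part of the PEAR package): return the best-scoring
// language only if its score is high enough and clearly ahead of the runner-up.
// $scores has the shape returned by detect(): language => similarity score.
function pickLanguage(array $scores, float $minScore = 0.2, float $minMargin = 0.03): ?string
{
    if (count($scores) === 0) {
        return null;
    }
    arsort($scores);                    // highest score first, keys preserved
    $names  = array_keys($scores);
    $values = array_values($scores);
    $best   = $values[0];
    $second = $values[1] ?? 0.0;        // a single candidate has no runner-up
    if ($best < $minScore || ($best - $second) < $minMargin) {
        return null;                    // too weak or too ambiguous to trust
    }
    return $names[0];
}

// Scores copied from the output above
$scores = array(
    'lithuanian' => 0.24584192439863,
    'latvian'    => 0.19567010309278,
    'estonian'   => 0.11316151202749,
    'dutch'      => 0.11240549828179,
);
var_dump(pickLanguage($scores));
```

With these thresholds the Lithuanian sample is accepted, while a result such as 0.21 against 0.20 would be rejected as ambiguous. Suitable thresholds depend on the text length and the languages involved, so they should be tuned on real data.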

The next example detects the language of a web page:

<?php
header('Content-Type: text/plain; charset=utf-8');

require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();

mb_internal_encoding("UTF-8");

//example content page
$url = "http://lt.wikipedia.org/wiki/Kalba";
$page = file_get_contents($url);

//parse page charset
preg_match('/<meta[^>]+charset=[\'"]*([a-z0-9\-]+)[\'"]*/i', $page, $a);
print_r($a);

if(!$a){
	$charset = "UTF-8";
}else{
	$charset = strtoupper($a[1]);
}

//remove whitespace, html tags and javascript from page content
$search = array('#<script[^>]*?>.*?</script>#si',	// Strip out javascript
		'#<style[^>]*?>.*?</style>#si',				// Strip style tags (no U modifier: it would make .*? greedy)
		'#<[\/\!]*?[^<>]*?>#si',					// Strip out HTML tags
		'#<![\s\S]*?--[ \t\n\r]*>#',				// Strip multi-line comments including CDATA
		'#\s\s+#'									// Strip whitespace
);
$content = preg_replace($search, '', $page);

//convert to utf-8 encoding if necessary (before cutting, so that mb_substr counts utf-8 characters)
if($charset != "UTF-8"){
	$content = iconv($charset, "UTF-8", $content);
}

//The first 200 characters of text content should be enough for language detection
$content = mb_substr($content, 0, 200);

//Output content
echo $content."\n";

//ISO 639-1 two-letter language code
$l->setNameMode(2);
echo $l->detectSimple($content)."\n";

//closeness of the content to the known languages
$result = $l->detect($content, 4);
print_r($result);

//distribution of unicode blocks
$blocks = $l->detectUnicodeBlocks($content, true);
print_r($blocks);

Output:

Array
(
    [0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"
    [1] => UTF-8
)
Kalba – VikipedijaKalbaStraipsnis iš Vikipedijos, laisvosios enciklopedijos.Peršokti į: navigaciją,paieškąVikisritis: KalbosKalba – lingvistinių ženklų sistema.
Svarbiausia kalbos paskirtis – būti žmo
lt
Array
(
    [lt] => 0.23644295302013
    [lv] => 0.14548098434004
    [et] => 0.14234899328859
    [la] => 0.12302013422819
)
Array
(
    [Basic Latin] => 161
    [General Punctuation] => 3
    [Latin Extended-A] => 11
)
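
The example above trusts the charset from the meta tag and falls back to UTF-8 when no tag is found. A page can declare no charset or declare a wrong one, so a cheap sanity check before calling iconv may help. The sketch below is only one possible approach: resolveCharset() is a hypothetical helper, the WINDOWS-1252 last-resort value is an assumption, and the code needs the mbstring extension and PHP 7.1 or later. It does not identify unknown single-byte encodings such as windows-1251 versus koi8-r; that requires analyzing the text itself.

```php
<?php
// Hypothetical helper: choose the source charset to pass to iconv().
// $bytes is the raw page content, $declared the charset parsed from the
// meta tag (or null when none was found).
function resolveCharset(string $bytes, ?string $declared): string
{
    // Text in a single-byte encoding rarely forms valid UTF-8 by accident,
    // so a successful check is a strong signal that the content is UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Not valid UTF-8: use the declared charset unless it claims UTF-8.
    if ($declared !== null && $declared !== '' && strtoupper($declared) !== 'UTF-8') {
        return strtoupper($declared);
    }
    // Assumed last-resort fallback; replace with whatever suits your pages.
    return 'WINDOWS-1252';
}

var_dump(resolveCharset("Kalba \xC5\xBEmon\xC4\x97ms", null));    // valid UTF-8 bytes
var_dump(resolveCharset("\xE6\xE8", 'windows-1257'));              // invalid UTF-8, declared charset
var_dump(resolveCharset("\xE6\xE8", 'utf-8'));                     // invalid UTF-8, wrong declaration
```

In the second example this could replace the if/else that sets $charset, for instance $charset = resolveCharset($page, $a ? $a[1] : null);.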



5 Responses to Text language detection with php

  1. thanks for this code !!

  2. Igor

    Hello,
    Is there a way to detect the charset encoding (not only the language)?
    The problem is that Russian text, for example, may come in several encodings (windows-1251, koi8-r, utf-8, ...).
    So we need to know the exact charset encoding in order to convert it with iconv.

    Thank you.

    • Hello,
      I just extract the charset from the meta tag; see line 14 in the second example. If the page declares no charset, it is necessary to analyze the text itself. For example, I found this function for detecting the encoding of Cyrillic text: http://forum.dklab.ru/viewtopic.php?t=37830

      • Igor

        Thank you Aleksandras,

        I know about the charset meta tag, but sometimes webmasters either do not include it, or the declared charset does not match the real encoding.
        Your link will help with Russian encodings (thanks for it 🙂 ), but I am developing Russian and English services, and for the English one I must detect all languages.

        As far as I know, several encodings may exist not only for Russian but also for some other languages (e.g. Chinese).
        How can these encodings be detected?
