{"id":519,"date":"2012-02-17T11:04:21","date_gmt":"2012-02-17T09:04:21","guid":{"rendered":"https:\/\/polyetilen.lt\/?p=519"},"modified":"2023-05-02T10:09:51","modified_gmt":"2023-05-02T07:09:51","slug":"text-language-detection-with-php","status":"publish","type":"post","link":"https:\/\/polyetilen.lt\/en\/text-language-detection-with-php","title":{"rendered":"Text language detection with PHP"},"content":{"rendered":"<p>Google's Language API used to handle language detection, but it is now a paid service. I found an alternative in the <a href=\"http:\/\/pear.php.net\/package\/Text_LanguageDetect\">Text_LanguageDetect PEAR package<\/a>, which supports 52 languages. Here is an example that detects the language of a Lithuanian text and lists the supported languages:<\/p>\n<pre class=\"brush: php; title: ; notranslate\" title=\"\">\r\n&lt;?php\r\nheader('Content-Type: text\/plain; charset=utf-8');\r\n\r\nrequire_once 'Text\/LanguageDetect.php';\r\n$l = new Text_LanguageDetect();\r\n\r\n\/\/example text for language detection\r\n$text = 'O mergina, vienplauk\u0117, palaidomis kasomis, atlapu kaklu, smulkiu, bet tvirtu \u017eingsniu \u0117jo toliau, artyn prie e\u017eero.';\r\n\r\n\/\/Closeness of the text sample to the four best-matching known languages\r\n$result = $l-&gt;detect($text, 4);\r\nprint_r($result);\r\n\r\n\/\/Distribution of Unicode blocks in the given UTF-8 string\r\n$blocks = $l-&gt;detectUnicodeBlocks($text, true);\r\nprint_r($blocks);\r\n\r\n\/\/language name\r\n$l-&gt;setNameMode(0);\r\necho $l-&gt;detectSimple($text).&quot;\\n&quot;;\r\n\r\n\/\/ISO 639-1 two-letter language code\r\n$l-&gt;setNameMode(2);\r\necho $l-&gt;detectSimple($text).&quot;\\n&quot;;\r\n\r\n\/\/ISO 639-2 three-letter language code\r\n$l-&gt;setNameMode(3);\r\necho $l-&gt;detectSimple($text).&quot;\\n&quot;;\r\n\r\n\/\/Supported languages list\r\n$l-&gt;setNameMode(0);\r\necho &quot;Supported languages:\\n&quot;;\r\n$langs = 
$l-&gt;getLanguages();\r\nsort($langs);\r\nprint_r($langs);\r\n\r\n\/\/Total amount of supported languages\r\necho count($langs);\r\n<\/pre>\n<p>Output:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nArray\r\n(\r\n    &#x5B;lithuanian] =&gt; 0.24584192439863\r\n    &#x5B;latvian] =&gt; 0.19567010309278\r\n    &#x5B;estonian] =&gt; 0.11316151202749\r\n    &#x5B;dutch] =&gt; 0.11240549828179\r\n)\r\nArray\r\n(\r\n    &#x5B;Basic Latin] =&gt; 89\r\n    &#x5B;Latin Extended-A] =&gt; 4\r\n)\r\nlithuanian\r\nlt\r\nlit\r\nSupported languages:\r\nArray\r\n(\r\n    &#x5B;0] =&gt; albanian\r\n    &#x5B;1] =&gt; arabic\r\n    &#x5B;2] =&gt; azeri\r\n    &#x5B;3] =&gt; bengali\r\n    &#x5B;4] =&gt; bulgarian\r\n    &#x5B;5] =&gt; cebuano\r\n    &#x5B;6] =&gt; croatian\r\n    &#x5B;7] =&gt; czech\r\n    &#x5B;8] =&gt; danish\r\n    &#x5B;9] =&gt; dutch\r\n    &#x5B;10] =&gt; english\r\n    &#x5B;11] =&gt; estonian\r\n    &#x5B;12] =&gt; farsi\r\n    &#x5B;13] =&gt; finnish\r\n    &#x5B;14] =&gt; french\r\n    &#x5B;15] =&gt; german\r\n    &#x5B;16] =&gt; hausa\r\n    &#x5B;17] =&gt; hawaiian\r\n    &#x5B;18] =&gt; hindi\r\n    &#x5B;19] =&gt; hungarian\r\n    &#x5B;20] =&gt; icelandic\r\n    &#x5B;21] =&gt; indonesian\r\n    &#x5B;22] =&gt; italian\r\n    &#x5B;23] =&gt; kazakh\r\n    &#x5B;24] =&gt; kyrgyz\r\n    &#x5B;25] =&gt; latin\r\n    &#x5B;26] =&gt; latvian\r\n    &#x5B;27] =&gt; lithuanian\r\n    &#x5B;28] =&gt; macedonian\r\n    &#x5B;29] =&gt; mongolian\r\n    &#x5B;30] =&gt; nepali\r\n    &#x5B;31] =&gt; norwegian\r\n    &#x5B;32] =&gt; pashto\r\n    &#x5B;33] =&gt; pidgin\r\n    &#x5B;34] =&gt; polish\r\n    &#x5B;35] =&gt; portuguese\r\n    &#x5B;36] =&gt; romanian\r\n    &#x5B;37] =&gt; russian\r\n    &#x5B;38] =&gt; serbian\r\n    &#x5B;39] =&gt; slovak\r\n    &#x5B;40] =&gt; slovene\r\n    &#x5B;41] =&gt; somali\r\n    &#x5B;42] =&gt; spanish\r\n    &#x5B;43] =&gt; swahili\r\n    &#x5B;44] =&gt; swedish\r\n    &#x5B;45] =&gt; 
tagalog\r\n    &#x5B;46] =&gt; turkish\r\n    &#x5B;47] =&gt; ukrainian\r\n    &#x5B;48] =&gt; urdu\r\n    &#x5B;49] =&gt; uzbek\r\n    &#x5B;50] =&gt; vietnamese\r\n    &#x5B;51] =&gt; welsh\r\n)\r\n52\r\n<\/pre>\n<p>The second example detects the language of a web page:<\/p>\n<pre class=\"brush: php; title: ; notranslate\" title=\"\">\r\n&lt;?php\r\nheader('Content-Type: text\/plain; charset=utf-8');\r\n\r\nrequire_once 'Text\/LanguageDetect.php';\r\n$l = new Text_LanguageDetect();\r\n\r\nmb_internal_encoding(&quot;UTF-8&quot;);\r\n\r\n\/\/example content page\r\n$url = &quot;http:\/\/lt.wikipedia.org\/wiki\/Kalba&quot;;\r\n$page = file_get_contents($url);\r\n\r\n\/\/parse page charset\r\npreg_match('\/&lt;meta&#x5B;^&gt;]+charset=&#x5B;\\'&quot;]*(&#x5B;a-z0-9\\-]+)&#x5B;\\'&quot;]*\/i', $page, $a);\r\nprint_r($a);\r\n\r\n\/\/fall back to UTF-8 when the page declares no charset\r\nif(!$a){\r\n\t$charset = &quot;UTF-8&quot;;\r\n}else{\r\n\t$charset = strtoupper($a&#x5B;1]);\r\n}\r\n\r\n\/\/remove whitespace, html tags and javascript from page content\r\n$search = array('#&lt;script&#x5B;^&gt;]*?&gt;.*?&lt;\/script&gt;#si',\t\/\/ Strip out javascript\r\n\t\t'#&lt;style&#x5B;^&gt;]*?&gt;.*?&lt;\/style&gt;#si',\t\t\t\/\/ Strip style tags (no U flag: it would make .*? greedy)\r\n\t\t'#&lt;&#x5B;\\\/\\!]*?&#x5B;^&lt;&gt;]*?&gt;#si',\t\t\t\t\t\/\/ Strip out HTML tags\r\n\t\t'#&lt;!&#x5B;\\s\\S]*?--&#x5B; \\t\\n\\r]*&gt;#',\t\t\t\t\/\/ Strip multi-line comments including CDATA\r\n\t\t'#\\s\\s+#'\t\t\t\t\t\t\t\t\t\/\/ Strip whitespace\r\n);\r\n$content = preg_replace($search, '', $page);\r\n\r\n\/\/convert to utf-8 encoding if necessary, before cutting by characters\r\nif($charset != &quot;UTF-8&quot;){\r\n\t$content = iconv($charset, &quot;UTF-8&quot;, $content);\r\n}\r\n\r\n\/\/The first 200 characters of text content should be enough for language detection\r\n$content = mb_substr($content, 0, 200);\r\n\r\n\/\/Output content\r\necho $content.&quot;\\n&quot;;\r\n\r\n\/\/ISO 639-1 two-letter language code\r\n$l-&gt;setNameMode(2);\r\necho $l-&gt;detectSimple($content).&quot;\\n&quot;;\r\n\r\n\/\/closeness to the four best-matching languages\r\n$result 
= $l-&gt;detect($content, 4);\r\nprint_r($result);\r\n\r\n\/\/distribution of unicode blocks\r\n$blocks = $l-&gt;detectUnicodeBlocks($content, true);\r\nprint_r($blocks);\r\n<\/pre>\n<p>Output:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nArray\r\n(\r\n    &#x5B;0] =&gt; &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text\/html; charset=UTF-8&quot;\r\n    &#x5B;1] =&gt; UTF-8\r\n)\r\nKalba \u2013 VikipedijaKalbaStraipsnis i\u0161 Vikipedijos, laisvosios enciklopedijos.Per\u0161okti \u012f: navigacij\u0105,paie\u0161k\u0105Vikisritis: KalbosKalba \u2013 lingvistini\u0173 \u017eenkl\u0173 sistema.\r\nSvarbiausia kalbos paskirtis \u2013 b\u016bti \u017emo\r\nlt\r\nArray\r\n(\r\n    &#x5B;lt] =&gt; 0.23644295302013\r\n    &#x5B;lv] =&gt; 0.14548098434004\r\n    &#x5B;et] =&gt; 0.14234899328859\r\n    &#x5B;la] =&gt; 0.12302013422819\r\n)\r\nArray\r\n(\r\n    &#x5B;Basic Latin] =&gt; 161\r\n    &#x5B;General Punctuation] =&gt; 3\r\n    &#x5B;Latin Extended-A] =&gt; 11\r\n)\r\n<\/pre>\n<p>Download the <a href=\"https:\/\/polyetilen.lt\/wp-content\/uploads\/2012\/02\/detect_language.7z\">text language detection source code<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Google's Language API used to handle language detection, but it is now a paid service. I found an alternative in the Text_LanguageDetect PEAR package, which supports 52 languages. 
Here is lithuanian &hellip; <a href=\"https:\/\/polyetilen.lt\/en\/text-language-detection-with-php\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_locale":"en_US","_original_post":"http:\/\/polyetilen.lt\/?p=262","footnotes":""},"categories":[8],"tags":[87,34],"class_list":["post-519","post","type-post","status-publish","format-standard","hentry","category-programavimas","tag-kalba","tag-php","en-US"],"_links":{"self":[{"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/posts\/519","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/comments?post=519"}],"version-history":[{"count":2,"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/posts\/519\/revisions"}],"predecessor-version":[{"id":521,"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/posts\/519\/revisions\/521"}],"wp:attachment":[{"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/media?parent=519"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/categories?post=519"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/polyetilen.lt\/wp-json\/wp\/v2\/tags?post=519"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}