Conference paper

Offering language based services on social media by identifying user's preferred language(s) from romanized text


With the growth of multilingual content and multilingual users on the web, it is prudent to offer personalized services and ads to users based on their language profile (i.e., the list of languages that a user is conversant with). Identifying the language profile of a user is often non-trivial because (i) users often do not specify all the languages known to them when signing up for an online service, and (ii) users of many languages (especially Indian languages) largely use Latin/Roman script to write content in their native language. As a result, script alone cannot tell a machine which language a given comment is written in. This situation presents an opportunity to offer the following language-based services for romanized content: (i) hide romanized comments that belong to a language not known to the user, (ii) translate romanized comments that belong to a language not known to the user, (iii) transliterate romanized comments that belong to a language known to the user, and (iv) show language-based ads by identifying the languages known to a user from the romanized comments that the user has written, read, or liked. We first use a simple bootstrapping-based semi-supervised algorithm to identify the language of a romanized comment. We then apply this algorithm to all the comments written, read, or liked by a user to build a language profile of the user, and propose that this profile can be used to offer the services mentioned above.
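The two-step pipeline described above (bootstrapped language identification, then profile building) can be illustrated with a minimal sketch. The abstract does not specify the features, classifier, or thresholds, so everything here is an assumption made for illustration: character trigrams with add-one smoothing stand in for the classifier, and the names `NgramModel`, `bootstrap`, `language_profile`, `min_margin`, and `min_share` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased, space-padded comment."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Per-language trigram counts (an assumed stand-in classifier)."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def add(self, text, lang):
        self.counts[lang].update(char_ngrams(text))

    def predict(self, text):
        """Return (best language, average per-trigram log-probability margin)."""
        if not self.counts:
            raise ValueError("model has no labeled data")
        grams = char_ngrams(text)
        vocab = len({g for c in self.counts.values() for g in c})
        scores = {}
        for lang, c in self.counts.items():
            total = sum(c.values())
            # Add-one smoothing so unseen trigrams do not zero out a language.
            scores[lang] = sum(
                math.log((c[g] + 1) / (total + vocab)) for g in grams
            )
        ranked = sorted(scores, key=scores.get, reverse=True)
        if len(ranked) == 1:
            return ranked[0], float("inf")
        gap = scores[ranked[0]] - scores[ranked[1]]
        return ranked[0], gap / max(len(grams), 1)


def bootstrap(seed, unlabeled, rounds=3, min_margin=0.1):
    """Self-training: start from a few labeled comments, then repeatedly
    absorb unlabeled comments the current model labels confidently."""
    model = NgramModel()
    for text, lang in seed:
        model.add(text, lang)
    pool = list(unlabeled)
    for _ in range(rounds):
        confident, remaining = [], []
        for text in pool:
            lang, margin = model.predict(text)
            (confident if margin >= min_margin else remaining).append((text, lang))
        if not confident:
            break  # nothing new was learned in this round
        for text, lang in confident:
            model.add(text, lang)
        pool = [text for text, _ in remaining]
    return model


def language_profile(model, comments, min_share=0.2):
    """Share of a user's comments per language, dropping rare languages."""
    votes = Counter(model.predict(c)[0] for c in comments)
    total = sum(votes.values())
    if total == 0:
        return {}
    return {
        lang: n / total for lang, n in votes.items() if n / total >= min_share
    }
```

In this sketch, a service would call `bootstrap` once on a small labeled seed plus a large pool of unlabeled romanized comments, then call `language_profile` on the comments a given user wrote, read, or liked. The `min_share` cutoff is one possible way to keep a single misclassified comment from adding a language to the profile.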
