Difference between revisions of "User talk:青子守歌/ログ10"

Line 83:
:In English we delimit words by spaces, which as far as I can tell is not a good strategy for Japanese. For Japanese, our strategy is to treat each character as a word. If you have a different suggestion, we will do our best to implement it. Even with my very limited understanding of Japanese, I am aware that it is more customary for words to be pairs or triples of kanji. The generated list consists of kanji that statistically appear in reverted edits but not in regular edits. For this we use a TF-IDF approach. Some English curse phrases are made up of two or more words: "God Damn", "Fuck You" and "Fuck Off" would be three examples. The words "God", "You" and "Off" would not normally be considered curse words, so our statistical approach would not treat them as such, whereas it would treat "Damn" and "Fuck" as curse words. Likewise, we are trying to identify the kanji that commonly appear in Japanese curse words, even if they are not used exclusively in curses.
:There are also words that are reverted in articles but not on talk pages. In English this would include words like "hello" or "hahaha". Which kanji would be informal in this way?
:The idea here is to let the machine learning algorithm decide what to do with these words. Our approach relies on more features than just these word lists.
:--<small> [[User:とある白い猫|とある白い猫]]</small> <sup>[[User talk:とある白い猫|ちぃ?]]</sup> 14 November 2015 (Sat) 11:40 (UTC)
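
The comment above describes treating each character as a word and scoring characters with TF-IDF. The following is a minimal sketch of that idea, not the actual implementation, which the comment does not spell out. The function names `char_ngrams` and `revert_scores`, the exact weighting (term frequency within reverted edits times inverse document frequency over all edits) and the placeholder strings are all assumptions made for illustration. Setting `n` to 2 or 3 would give the pairs or triples of kanji mentioned as an alternative.

```python
import math
from collections import Counter


def char_ngrams(text, n=1):
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def revert_scores(reverted, regular, n=1):
    """Score n-grams seen in reverted edits with a TF-IDF-style weight.

    Assumed weighting (hypothetical, for illustration only):
      tf  = share of all n-grams in reverted edits that are this n-gram
      idf = log(number of edits / number of edits containing the n-gram)
    N-grams that occur in every edit get a score of 0.
    """
    docs = [char_ngrams(text, n) for text in reverted + regular]
    total_docs = len(docs)

    # Document frequency: in how many edits does each n-gram occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    # Term frequency, counted over reverted edits only.
    tf = Counter()
    for text in reverted:
        tf.update(char_ngrams(text, n))
    total_terms = sum(tf.values())

    return {
        gram: (count / total_terms) * math.log(total_docs / df[gram])
        for gram, count in tf.items()
    }


# Made-up placeholder edits; the characters carry no meaning here.
reverted_edits = ["甲乙", "甲丙"]
regular_edits = ["丙丁", "丁丁丙"]

scores = revert_scores(reverted_edits, regular_edits)
for gram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(gram, round(score, 3))
```

In this toy data, 丙 occurs in both reverted and regular edits, so it scores lower than 甲, which occurs only in reverted edits. As the comment notes, such scores would be only one of several features passed to the machine learning algorithm.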