R - XML of natural language corpus into dataframe


I am handling an XML file in R using the XML package. My final goal is to create a dataframe containing the following information.

    luwpos  luwdictionaryform  luwlemma  orthographictranscription  phonetictranscription  plainorthographictranscription  devoiced  moraid  toneclass  moraid
    動詞    ダイスル           題する    題し                       ダイシ                 題し                            1         3       accent     1

luwpos, luwdictionaryform, and luwlemma are attributes of the luw node. orthographictranscription, phonetictranscription, and plainorthographictranscription are attributes of suw, a daughter of luw. devoiced is an attribute of the phone node, a descendant of suw. The first moraid is an attribute of the mora node, the grandmother of that phone. toneclass is an attribute of the xjtobilabeltone node, a descendant of phone. The second moraid belongs to the closest mora ancestor of the xjtobilabeltone that has toneclass="accent". Not all phone nodes contain the devoiced attribute; in that case, I don't need the first moraid. When xjtobilabeltone does not contain toneclass="accent", I don't need the second moraid either.

So far, I have the following:

    library(XML)

    doc <- xmlInternalTreeParse(file = "a01f0122.xml")  # opens the file
    luw <- xpathSApply(doc, "//luw", xmlAttrs)          # extracts the attributes of each luw node
    df <- data.frame(Reduce(rbind, luw))                # creates the dataframe

It gave me the following output.

       luwid luwpos isnewline lineid luwdictionaryform     luwlemma luwmiscposinfo1
    19     2   名詞         1    002          ホンジツ         本日               2
    20     3   名詞         1    003    ハッピョウシャ       発表者               3
    21     4   助詞         0    003                ノ           の          格助詞
    22     5   名詞         1    004              ××××           ××        固有名詞
    23     6   名詞         1    005        キュウヨウ         急用               6
    24     7   助詞         0    005      ニツキマシテ につきまして          格助詞
    25     1   名詞         1    001          ケッセキ         欠席               1
    26     2 助動詞         0    001      デゴザイマス でございます          連用形
    27     3   助詞         0    001                テ           て        接続助詞
    28     4   名詞         1    002            カワリ       代わり               4
    29     5   助詞         0    002                ニ           に          格助詞
    30     6 代名詞         1    003          ワタクシ           私               6

It contains some of the information I want, but I don't know how to get at the descendants of luw.

    <?xml version="1.0" encoding="utf-8"?>
    <talk talkid="a01f0122" speakerid="463" speakerbirthplace="神奈川県" speakerbirthgeneration="70to74" speakersex="女">
      <talkcomment>
        <comment commentstrings="講演id:a01f0122"/>
        <comment commentstrings=""/>
        <comment commentstrings=""/>
      </talkcomment>
      <ipu ipuid="0001" ipustarttime="00000.312" ipuendtime="00001.973" channel="l">
        <luw luwid="9" luwpos="動詞" isnewline="1" lineid="006" luwdictionaryform="ダイスル" luwlemma="題する" luwconjugatetype="サ行変格" luwconjugateform="連用形">
          <suw suwid="1" columnid="001" suwdictionaryform="ダイスル" suwlemma="題する" suwconjugateform="連用形" suwconjugatetype="サ行変格" suwconjugateform2="連用形" suwconjugatetype2="サ行変格" suwpos="動詞" orthographictranscription="題し" phonetictranscription="ダイシ" plainorthographictranscription="題し" apid="7" dep_bunsetsuunitid="6" dep_modifieebunsetsuunitid="7">
            <transsuw transsuwid="1">
              <mora moraentity="ダ" moraid="1" perceivedacc="1">
                <phoneme phonemeentity="d" phonemeid="1">
                  <phone phoneid="1" phoneentity="scls" phoneclass="others" phonestarttime="6.188682" phoneendtime="6.19458"/>
                  <phone phoneid="2" phoneentity="d" phoneclass="consonant" phonestarttime="6.19458" phoneendtime="6.207031"/>
                </phoneme>
                <phoneme phonemeentity="a" phonemeid="2">
                  <phone phoneid="1" phoneentity="a" phoneclass="vowel" phonestarttime="6.207031" phoneendtime="6.317124">
                    <xjtobilabeltone time="6.212447" f0="209.865" toneclass="ibt">%l</xjtobilabeltone>
                    <xjtobilabeltone time="6.275146" f0="195.496" toneclass="accent">a</xjtobilabeltone>
                  </phone>
                </phoneme>
              </mora>
              <mora moraentity="イ" moraid="2">
                <phoneme phonemeentity="i" phonemeid="1">
                  <phone phoneid="1" phoneentity="i" phoneclass="vowel" phonestarttime="6.317124" phoneendtime="6.361029"/>
                </phoneme>
              </mora>
              <mora moraentity="シ" moraid="3">
                <phoneme phonemeentity="sj" phonemeid="1">
                  <phone phoneid="1" phoneentity="sj" phoneclass="consonant" phonestarttime="6.361029" phoneendtime="6.406245" endtimeuncertain="1"/>
                </phoneme>
                <phoneme phonemeentity="i" phonemeid="2">
                  <phone phoneid="1" phoneentity="i" phoneclass="vowel" devoiced="1" phonestarttime="6.406245" phoneendtime="6.451461" starttimeuncertain="1">
                    <xjtobilabelword time="6.451461" perceivedaccpos="1">daisji</xjtobilabelword>
                    <xjtobilabelbreak time="6.451461">1</xjtobilabelbreak>
                  </phone>
                </phoneme>
              </mora>
            </transsuw>
          </suw>
        </luw>
        <luw luwid="10" luwpos="助詞" isnewline="0" lineid="006" luwdictionaryform="テ" luwlemma="て" luwmiscposinfo1="接続助詞">
          <suw suwid="1" columnid="005" suwdictionaryform="テ" suwlemma="て" suwmiscposinfo1="接続助詞" suwpos="助詞" orthographictranscription="て" phonetictranscription="テ" plainorthographictranscription="て" apid="7">
            <transsuw transsuwid="1">
              <mora moraentity="テ" moraid="1">
                <phoneme phonemeentity="t" phonemeid="1">
                  <phone phoneid="1" phoneentity="scls" phoneclass="others" phonestarttime="6.451461" phoneendtime="6.484228">
                    <xjtobilabeltone time="6.451887" toneclass="ltbpm" f0uncertain="1">l%</xjtobilabeltone>
                  </phone>
                  <phone phoneid="2" phoneentity="t" phoneclass="consonant" phonestarttime="6.484228" phoneendtime="6.497334"/>
                </phoneme>
                <phoneme phonemeentity="e" phonemeid="2">
                  <phone phoneid="1" phoneentity="e" phoneclass="vowel" phonestarttime="6.497334" phoneendtime="6.565485">
                    <xjtobilabeltone time="6.536170" f0="245.046" toneclass="pointer">ph</xjtobilabeltone>
                    <xjtobilabelword time="6.565485" perceivedaccpos="0">te</xjtobilabelword>
                    <xjtobilabelbreak time="6.565485">1</xjtobilabelbreak>
                  </phone>
                </phoneme>
              </mora>
            </transsuw>
          </suw>
        </luw>
      </ipu>
    </talk>

(This is not a solution, just guidance on one way to proceed. Pasting code in comments is less than optimal on SO.)

You should consider going the list manipulation route rather than the XML paths route.

    # convert the XML doc into a list of nested lists
    doc.list <- xmlToList(doc)
    # inspect
    str(doc.list)

    # pull out the (first) luw nested list to make it easier to process
    luw.list <- doc.list$ipu$luw
    # inspect
    str(luw.list)

    # look at the attributes
    str(luw.list$.attrs)

    # inspect the suw node
    str(luw.list$suw)

Once you get a feel for the structure, you should be able to use the various *apply or dplyr functions to extract what you need.
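As a rough sketch of where the list route can lead (not a tested solution for the full corpus), the code below walks the nested list and builds one row per suw. The inline XML is a cut-down stand-in for the real file, and the helper names (`attrs_of`, `kids`, `first_or_na`, `suw_row`) and the column names `devoiced_moraid` and `accent_moraid` are made up for illustration. It also assumes that keeping only the first devoiced phone and the first accent tone per suw is what you want.

```r
library(XML)

# Cut-down stand-in for the corpus file: same element and attribute names as
# the sample in the question, with most attributes left out.
xml_text <- '<talk talkid="a01f0122"><ipu ipuid="0001">
  <luw luwid="9" luwpos="動詞" luwdictionaryform="ダイスル" luwlemma="題する">
    <suw suwid="1" orthographictranscription="題し" phonetictranscription="ダイシ" plainorthographictranscription="題し">
      <transsuw transsuwid="1">
        <mora moraentity="ダ" moraid="1">
          <phoneme phonemeid="2">
            <phone phoneid="1" phoneentity="a">
              <xjtobilabeltone toneclass="ibt">%l</xjtobilabeltone>
              <xjtobilabeltone toneclass="accent">a</xjtobilabeltone>
            </phone>
          </phoneme>
        </mora>
        <mora moraentity="シ" moraid="3">
          <phoneme phonemeid="2"><phone phoneid="1" phoneentity="i" devoiced="1"/></phoneme>
        </mora>
      </transsuw>
    </suw>
  </luw>
  <luw luwid="10" luwpos="助詞" luwdictionaryform="テ" luwlemma="て">
    <suw suwid="1" orthographictranscription="て" phonetictranscription="テ" plainorthographictranscription="て">
      <transsuw transsuwid="1">
        <mora moraentity="テ" moraid="1">
          <phoneme phonemeid="2">
            <phone phoneid="1" phoneentity="e"><xjtobilabeltone toneclass="pointer">ph</xjtobilabeltone></phone>
          </phoneme>
        </mora>
      </transsuw>
    </suw>
  </luw>
</ipu></talk>'

doc <- xmlParse(xml_text, asText = TRUE)
doc.list <- xmlToList(doc)

# xmlToList() stores attributes in `.attrs` when an element has children and
# returns the bare attribute vector when it has none; handle both shapes.
attrs_of <- function(x) if (is.list(x)) x$.attrs else x
# All children with a given tag name (`$name` would return only the first one).
kids <- function(x, name) if (is.list(x)) x[names(x) == name] else list()
first_or_na <- function(x) if (length(x) > 0) x[[1]] else NA_character_

luw_cols <- c("luwpos", "luwdictionaryform", "luwlemma")
suw_cols <- c("orthographictranscription", "phonetictranscription",
              "plainorthographictranscription")

# Build one row for a suw, remembering which mora each match was found under.
suw_row <- function(luw, suw) {
  devoiced <- devoiced_mora <- accent_mora <- character(0)
  for (transsuw in kids(suw, "transsuw")) {
    for (mora in kids(transsuw, "mora")) {
      id <- unname(attrs_of(mora)["moraid"])
      for (phoneme in kids(mora, "phoneme")) {
        for (phone in kids(phoneme, "phone")) {
          pa <- attrs_of(phone)
          if ("devoiced" %in% names(pa)) {
            devoiced <- c(devoiced, pa[["devoiced"]])
            devoiced_mora <- c(devoiced_mora, id)
          }
          for (tone in kids(phone, "xjtobilabeltone")) {
            if ("accent" %in% attrs_of(tone)["toneclass"]) {
              accent_mora <- c(accent_mora, id)
            }
          }
        }
      }
    }
  }
  c(attrs_of(luw)[luw_cols], attrs_of(suw)[suw_cols],
    first_or_na(devoiced), first_or_na(devoiced_mora),
    if (length(accent_mora) > 0) "accent" else NA_character_,
    first_or_na(accent_mora))
}

rows <- list()
for (ipu in kids(doc.list, "ipu")) {
  for (luw in kids(ipu, "luw")) {
    for (suw in kids(luw, "suw")) {
      rows[[length(rows) + 1]] <- suw_row(luw, suw)
    }
  }
}

m <- do.call(rbind, rows)
colnames(m) <- c(luw_cols, suw_cols,
                 "devoiced", "devoiced_moraid", "toneclass", "accent_moraid")
df <- as.data.frame(m, stringsAsFactors = FALSE)
print(df)
```

The explicit loops could just as well be written with lapply; they are spelled out here so the nesting mirrors the XML. On the real file you would build `doc` with `xmlInternalTreeParse(file = "a01f0122.xml")` as in the question instead of parsing the inline string.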

