Wednesday, June 26, 2013

Code for converting Tamil Virtual University's Kambaramayanam into e-book format

The Tamil epic Kambaramayanam is not available in e-book format. It is, however, available on the Tamil Virtual University (TVU) site as HTML pages. A friend of mine, Iyyappan, who wanted to study Kambaramayanam, thought it would be helpful if it were available as an e-book, preferably one readable on mobile. He asked me to lend a helping hand.

I downloaded the HTML pages from the TVU site using the Python Requests library. After downloading came the big issue: the pages are not well organized, with no ids or structure to hang a conversion on. So I first tried to convert them directly, gave up, and instead just cleaned up the HTML with BeautifulSoup and combined all the pages. That put the content into e-book form, but it was not good for reading.
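
The post does not show the download script itself, so here is a minimal sketch of that step with Requests. The URL scheme is a hypothetical placeholder (only the 796-page count comes from the cleanup script); the real TVU page addresses would go in `BASE_URL`.

```python
# Hypothetical URL scheme; the real TVU addresses are not shown in the post.
BASE_URL = 'http://www.tamilvu.org/hypothetical/kamban/page%d.html'

def page_url(n):
    # Build the URL for page n (assumed numbering scheme).
    return BASE_URL % n

def download_pages(first, last):
    # Fetch each page and save it locally as filename<n>.html.
    import requests  # imported here so the URL helper works without the dependency
    for n in range(first, last + 1):
        resp = requests.get(page_url(n))
        resp.raise_for_status()
        with open('filename%d.html' % n, 'wb') as f:
            f.write(resp.content)

# download_pages(1, 796)
```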

Here is the code for removing the extra HTML elements:

from BeautifulSoup import BeautifulSoup

# NOTE: the literal HTML in the print statements was eaten by the blog's
# formatting; the wrapper tags below are reconstructed placeholders.
print '<html>'
print '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
print '<body>'

for i in range(1,796+1):
    inputfilename='filename'+str(i)+'.html'   # str(i): i is an int
    data=open(inputfilename)                  # the pages are already downloaded locally
    soup = BeautifulSoup(data)
    data=soup.prettify()
    soup = BeautifulSoup(data)                # re-parse the prettified markup
    # strip navigation links and table headers
    for t in soup.findAll(attrs={'class':'link'}):
        t.extract()
    for t in soup.findAll(attrs={'class':'thead'}):
        t.extract()
    # strip rules, forms, the head section and form inputs
    for name in ('hr','form','head','input'):
        for h in soup.findAll(name):
            h.extract()
    print soup.prettify()

print '</body>'
print '</html>'
Next, I needed a way to format this content. Rather than wrestle with the content inside HTML tags, I decided to extract it as plain text and process that instead.

So I removed the remaining unwanted text, such as page numbers and headings, and extracted the plain text with the following code:

from BeautifulSoup import BeautifulSoup

inputfilename='file.html'
data=open(inputfilename)          # local file, not a URL
soup = BeautifulSoup(data)
data=soup.prettify()
soup = BeautifulSoup(data)
# drop page numbers and sub headings
for t in soup.findAll(attrs={'class':'pno'}):
    t.extract()
for t in soup.findAll(attrs={'class':'subhead'}):
    t.extract()
# print every remaining text node, stripped of tabs and surrounding space
for s in soup(text=True):
    s=s.strip().replace('\t','')
    print s
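
For reference, the same strip-and-extract step with today's bs4 API (the script above uses the older BeautifulSoup 3 API). The inline sample document is invented for illustration; the `pno` and `subhead` class names are the ones from the script:

```python
from bs4 import BeautifulSoup

# Small invented sample standing in for one of the downloaded pages.
html = '''<html><body>
<span class="pno">page 12</span>
<p class="subhead">some heading</p>
<p>poem line</p>
</body></html>'''
soup = BeautifulSoup(html, 'html.parser')
for cls in ('pno', 'subhead'):          # page numbers and sub headings
    for tag in soup.find_all(attrs={'class': cls}):
        tag.extract()
print(soup.get_text(strip=True))        # only the poem text survives
```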

Then I found a pattern in the text to drive the conversion: the poem numbers. If the poems start at number 3000, the next poem is at 3001, and so on. The explanation (urai) of each poem comes right after the poem itself, so it is easy to find too, and removing the newlines and formatting the poem is straightforward. In hindsight this could probably have been done in the HTML as well, but I didn't spot the pattern there; I may try that on the remaining chapters.
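
As a minimal illustration of the numbering pattern (this is not the script I used, which appears below, just a self-contained sketch with an invented sample), poems can be split by scanning for the next expected number:

```python
def split_poems(lines, start):
    # Walk the lines; each time the next expected poem number appears,
    # start a new record, and attach following lines to the current poem.
    poems = []
    current = None
    expected = start
    for line in lines:
        stripped = line.strip()
        if stripped.replace('.', '') == str(expected):
            if current is not None:
                poems.append(current)
            current = {'number': expected, 'text': []}
            expected += 1
        elif current is not None and stripped:
            current['text'].append(stripped)
    if current is not None:
        poems.append(current)
    return poems

# Invented sample: two consecutive poems numbered from 3000.
sample = ['3000.', 'line one of poem', 'line two', '3001.', 'next poem']
for p in split_poems(sample, 3000):
    print(p['number'], ' / '.join(p['text']))
```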

Here is the code:

import re

A=open('file.txt')
E=A.readlines()
A.close()

startcount=6060     # number of the first poem in this file
i=0
isurai=True         # has the urai of the previous poem been seen?
p = re.compile(r'\d')
for line in range(0,len(E)):
    if p.match(E[line].replace('.','')) and E[line].replace('.','').find(str(startcount))!=-1:
        # found the next poem number
        print E[line],
        if not isurai:
            break   # previous poem had no explanation; stop for a manual check
        # poem lines come in pairs of physical lines
        for i in range(1,10,2):
            if E[line+1+i]=='\n':
                break
            if E[line+2+i].startswith('    '):
                print E[line+1+i].replace('    ','').replace('\n','')+E[line+2+i].replace('    ',''),
            else:
                print E[line+i+1].replace('    ','').replace('\n','')
                print E[line+i+2].replace('    ','').replace('\n','')
        if i<4:
            break   # too few lines for a poem; stop for a manual check
        startcount=startcount+1
        isurai=False
        print '\n',
    else:
        printline=''
        if i==9 or i==5:
            # collect the urai: everything up to the next poem number
            isurai=True
            for j in range(i,150):
                if p.match(E[line+j].replace('.','')) and E[line+j].replace('.','').find(str(startcount))!=-1:
                    break
                printline=printline+' '+E[line+j].replace('\n',' ').replace(' ','')
            print printline.strip(),'\n\n',
            i=0

After this, I had to manually check for missing poems and content, and verify the poem numbers. Because the text was typed by hand and not proofread, the poem numbers are not always in order; I had to fix them manually and re-run the script each time.
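
That manual check can be partly automated. A hypothetical helper (not part of my original scripts) that reports every expected poem number that never appears in the file:

```python
import re

def missing_numbers(lines, first, last):
    # Collect every line that is just a poem number, then report
    # which numbers in [first, last] never showed up.
    found = set()
    for line in lines:
        m = re.match(r'(\d+)\.?\s*$', line.strip())
        if m:
            found.add(int(m.group(1)))
    return [n for n in range(first, last + 1) if n not in found]

# Invented sample: 6061 is missing from the sequence.
sample = ['6060.', 'poem text', '6062.', 'more text']
print(missing_numbers(sample, 6060, 6062))
```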

Another piece of manual work was separating the sub-headings from the explanations. Next comes the code to convert the text to HTML, along with a table of contents:

import re

A=open('file.txt')
E=A.readlines()
A.close()
startcount=4740

# NOTE: the literal HTML in this script's print statements was eaten by the
# blog's formatting; the tags and id names below are reconstructed
# placeholders, not the original markup (only the 'padalam_sub_' anchor
# prefix survives in the original).
print """<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>"""
co=[]
i=0
for e in range(0,len(E)):
    if E[e].find('\t')!=-1 and E[e].find('(')==-1:
        # heading line
        if E[e].find('.')!=-1:
            # numbered padalam heading, with its subtitle two lines down
            print '<h2 id="padalam_'+str(len(co))+'">'
            print E[e]
            print '</h2>'
            print '<h3>'
            print E[e+2]
            print '</h3>'
            cc='<a href="#padalam_'+str(len(co))+'">'+E[e].strip()+'</a>'
            co.append(cc)
            continue
        else:
            # sub heading
            print '<h3 id="padalam_sub_'+str(len(co))+'">'
            print E[e]
            print '</h3>'
            cc='<a href="#padalam_sub_'+str(len(co))+'">'+E[e].strip()+'</a>'
            co.append(cc)
            continue
    else:
        p = re.compile(r'\d')
        if p.match(E[e].replace('.','')) and E[e].replace('.','').find(str(startcount))!=-1:
            # poem number found: print the poem lines until a blank line
            print '<p class="poem">'
            print '<b>'+E[e].strip()+'</b><br>'
            for i in range(1,20):
                if E[e+i]!='\n':
                    print E[e+i]+'<br>'
                else:
                    break
            print '</p>'
            # sanity check: stop for manual inspection if the line after
            # the poem does not look like an explanation
            if E[e+i+1].find('\t')==-1 and len(E[e+i+1])<250:
                print len(E[e+i+1])
                break
            # the urai follows the poem
            print '<p class="urai">'
            print E[e+i+1]
            print '</p>'
            i=0
            startcount=startcount+1
# table of contents, printed at the bottom and moved to the top by hand later
print '<hr>'
for cc in co:
    if len(cc)>28:
        if cc.find('padalam_sub_')!=-1:
            print '&nbsp;&nbsp;&nbsp;',cc,'<br>'
        else:
            print '<br>',cc,'<br>'
print """
</body>
</html>"""

The final piece of manual work is to move the TOC from the bottom to the top: just cut and paste. All done.

Python libraries used:

Requests: HTTP for Humans

Beautiful Soup

You can find the formatted HTML in my public Dropbox folder:

https://www.dropbox.com/sh/yy9lq619z299fp5/zqZuKIF74H

Download the files whose names contain _formatted:

https://www.dropbox.com/sh/yy9lq619z299fp5/dAuP0o-KGF/arayanya_formatted.html
https://www.dropbox.com/sh/yy9lq619z299fp5/RkzrW3EiE3/ayodhya_formatted.html
https://www.dropbox.com/sh/yy9lq619z299fp5/hzZvSIdNzn/kikinada_formatted.html
https://www.dropbox.com/sh/yy9lq619z299fp5/aD_91lnW90/sundra_formatted.html


Of the seven chapters, only four are converted so far. The remaining three still need to be done; I hope to finish that work in the coming week.
