ASP Developing for Academic and Business Processes

Monday Jan 25th 1999 by ServerWatch Staff

Utilizing ASP in Systems Development Case: Develop a Research Library Website from Scanned Input

T.Mallard




This ASP solution to a common problem demonstrates the relative ease of developing a working system this way. The problem is to take archived academic publications, scan and OCR them, and edit the scanned data into text files as the first electronic form of the data.


An Example of Input Text.

This is processed by ASP to create text files, database tables and html files for online viewing. There are many ways of dealing with historic data, but most require indexing that translates the written page numbers into hyperlinks, ideally some keywords for searching, some way to tell whether all the records were processed, validation of the data to the appropriate level, and creation of the html.

Converting Text Into Delimited Lists.


The first step is hand-editing the scanned data: removing scan errors and preparing the data for automation by inserting separation characters where needed. The data originates in original academic publications that are out of print and set in small serif type, which is the hardest to OCR anyway; in this case the printing quality was also bad. After editing there are several types of files: general index, author index, article, and bibliography. These are the basic files input to the ASP process.
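To show why the separation characters matter, here is a minimal sketch of splitting one edited line into fields. The function name ParseIndexLine, the "|" separator, the topic|subtopic|pages order and the sample line are all assumptions for illustration; the article does not say which separator was actually used.

```vbscript
' Hypothetical helper (not part of the original script): split one
' hand-edited index line into three fields. The "|" separator and the
' topic|subtopic|pages order are assumptions.
Function ParseIndexLine(aline)
    Dim parts, fields(2), i
    parts = Split(aline, "|")
    For i = 0 To 2
        If i <= UBound(parts) Then
            fields(i) = Trim(parts(i))
        Else
            fields(i) = ""      ' missing trailing fields become empty
        End If
    Next
    ParseIndexLine = fields
End Function

' Made-up sample line, for illustration only.
Dim rec
rec = ParseIndexLine("Mammoth | Alaska | 12-14")
' rec(0) = "Mammoth", rec(1) = "Alaska", rec(2) = "12-14"
```

A line with a missing field still yields three entries, so a later import step can rely on a fixed field count.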

Starting the flow...

<%@ language='vbscript' %>
<% response.buffer = true %>
<!-- #include virtual="/asp/adovbs.inc" -->
<% server.scripttimeout = 240 %>

This begins the server process in VBScript, turns on response buffering so the page is processed completely before the client gets a response, includes the standard file of ADO constant names for VBScript, and sets the timeout because the run takes about two minutes. Since I don't output anything to the client, they see a blank screen when it finishes. Instead, a file of totals is written after the content files are created. There are a bunch of variables, declared next...

The variables...
dim input, output, htmlOut, author, editor, pubdate, topic, source, locale, annotate, errs, errs2
dim conn1, conn2, conn3, rsInput, txtOut, rsTemp, pagename, metatopics, secstart, minstart
dim metasOut

Then the first filename is set. This file collects the individual topics that appear on each page; the list is created on the first pass, along with the original html files, and I also create an alphabetic navigation table after the first run. These runs are 10-15 files each and will be updated infrequently; for periodicals you'd automate this process. The script does build interactive links at run time; this image is saved in recordsets and then written as the final output. The logic for this follows the basic creation sequence.

metasOut = "D:\Webshare\wwwroot\asp\csfa\csfa_metas.txt"
' open the Access database through its ODBC driver
set conn2 = server.createobject("ADODB.Connection")
conn2.open "DRIVER={Microsoft Access Driver (*.mdb)}; DBQ=D:\Webshare\wwwroot\asp\csfa\biblio_01.mdb"
if not isobject(conn2) then
  errs2 = "unable to create access object"
else
  errs2 = "access object created"
end if
dim allfields, sql_01, sql_02
sql_01 = "SELECT * from v3_86_index"
set rsInput = conn2.execute(sql_01)

rsInput.movefirst
' the FileSystemObject writes the output files
set conn1 = server.createobject("Scripting.FileSystemObject")
if not isobject(conn1) then
  errs = "unable to create object"
else
  errs = "file object created"
end if

ADO reaches the database through the Microsoft Access ODBC driver named in the connection string (by way of its OLE DB provider for ODBC), and the processes are all on one server. The tables were created by hand for this; that can be automated with initialization scripts, which just takes a bit longer to code. All fields use the default 255-character text datatype. In development work like this project, the initial phase is discovery: I tried a single-row architecture but quickly abandoned it in favor of relational tables. With larger projects you would automate the table creation process. The hookup to SQL Server uses a similar connection string that includes a username and password, and I also develop for Oracle in the same manner on this machine. Above, the code has created the connections, defined a basic select-all SQL statement, and has the first record ready for processing into a page.
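As an illustration of what an initialization script for the tables might look like, here is a minimal sketch that builds an Access CREATE TABLE statement with every column as 255-character text. The function name BuildCreateTableSql is hypothetical, and the three field names are only the ones the page code reads (topic, subtopic, pages); the real table may have held more.

```vbscript
' Hypothetical helper (not part of the original script): build an Access
' CREATE TABLE statement in which every column is a 255-character text
' field, matching the hand-made tables described above.
Function BuildCreateTableSql(tableName, fieldNames)
    Dim i, cols
    cols = ""
    For i = 0 To UBound(fieldNames)
        If i > 0 Then cols = cols & ", "
        cols = cols & fieldNames(i) & " TEXT(255)"
    Next
    BuildCreateTableSql = "CREATE TABLE " & tableName & " (" & cols & ")"
End Function

Dim sqlCreate
sqlCreate = BuildCreateTableSql("v3_86_index", Array("topic", "subtopic", "pages"))
' sqlCreate now holds:
'   CREATE TABLE v3_86_index (topic TEXT(255), subtopic TEXT(255), pages TEXT(255))
' With the connection opened as above, it could be run once with:
'   conn2.execute sqlCreate
```

Keeping the statement builder separate from the connection makes it easy to print and check the SQL before running it against the database.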





With the connections open...



dim paging, aline, series, sNext, sPrev

while not rsInput.EOF
series = series + 1

The variable 'series' is a page counter, which begins with this while loop. Below is the code to create filenames and Next-Previous links.

for paging = 1 to 80
if paging = 1 then
  ' first record of a page: build the three-digit file name and the links
  sNext = series + 1
  sPrev = series - 1
  if series < 100 then
    if series < 10 then
      htmlOut = "D:\Webshare\wwwroot\asp\csfa\csfa_00" & series & ".html"
      if sNext = 10 then
        htmlNext = "csfa_010.html"
      else
        htmlNext = "csfa_00" & sNext & ".html"
      end if
      htmlPrev = "csfa_00" & sPrev & ".html"
    else
      htmlOut = "D:\Webshare\wwwroot\asp\csfa\csfa_0" & series & ".html"
      if sNext = 100 then
        htmlNext = "csfa_100.html"
      else
        htmlNext = "csfa_0" & sNext & ".html"
      end if
      if sPrev = 9 then
        htmlPrev = "csfa_009.html"
      else
        htmlPrev = "csfa_0" & sPrev & ".html"
      end if
    end if
  else
    htmlOut = "D:\Webshare\wwwroot\asp\csfa\csfa_" & series & ".html"
    htmlNext = "csfa_" & sNext & ".html"
    if sPrev = 99 then
      htmlPrev = "csfa_099.html"
    else
      htmlPrev = "csfa_" & sPrev & ".html"
    end if
  end if

Getting the Buttons Right.
The section of code above deals mainly with getting the automatic paging correct. Because I use three-digit automatic naming as the records get processed, leading zeros must be filled in for both the filename and the Next-Previous buttons, so it's a lot of code for a simple idea. The results are buttons that work and files that sort sequentially by filename. Next is the head portion of the page, which will contain two columns of 40 entries each.
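The leading-zero handling can also be collapsed into one small helper. This is a minimal standalone sketch, not the code used in the project; the function name PadPage is hypothetical, and it assumes the same three-digit csfa_NNN.html naming.

```vbscript
' Hypothetical helper (not part of the original script): left-pad a page
' number with zeros to three digits, e.g. 7 -> "007", 42 -> "042".
Function PadPage(n)
    If n > 999 Then
        PadPage = CStr(n)               ' more than three digits: leave as-is
    Else
        PadPage = Right("000" & n, 3)
    End If
End Function

' Build the file name and the Next/Previous links for one page.
Dim pageNo, pageFile, nextFile, prevFile
pageNo = 9
pageFile = "csfa_" & PadPage(pageNo) & ".html"       ' csfa_009.html
nextFile = "csfa_" & PadPage(pageNo + 1) & ".html"   ' csfa_010.html
prevFile = "csfa_" & PadPage(pageNo - 1) & ".html"   ' csfa_008.html
```

Because the padding lives in one place, the 9-to-10 and 99-to-100 boundaries need no special cases.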

Begin outputting the html page to a file...




set output = conn1.opentextfile(htmlOut, 8, True)
output.writeline("<!-- DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 3.2 Final//EN'-->")
output.writeline("<html>")
output.writeline("<head>")
output.writeline("<title>General Index - Volume 3, 1986 - Current Research in the Pleistocene</title>")
output.writeline("<meta name='Publisher' content='Center for the First Americans'>")
output.writeline("<meta name='Publisher-email' content='info@mallard.design.com'>")
output.writeline("<meta name='Keywords' content='*'>")
output.writeline("<meta name='Description' content='Current Research in the Pleistocene - Volume 3, 1986'>")
output.writeline("<meta name='Identifier-URL' content='http://www.>")
output.writeline("<meta name='Content-Language' content='en-US '>")
output.writeline("<meta name='Coverage' content='Worldwide'>")
output.writeline("<meta name='Date-created-yyyymmdd' content='19990115'>")
output.writeline("<meta name='Date-Revised-yyyymmdd' content='*'>")
output.writeline("<script language=JavaScript>")
output.writeline("<!-- ")
output.writeline("//-->")
output.writeline("</script>")
output.writeline("<style type='text/css'>")
output.writeline("<!--")
output.writeline(".roll {")
output.writeline("font-family:Arial;")
output.writeline("font-size:14pt;")
output.writeline("text-decoration:none;")
output.writeline("color:black;")
output.writeline("}")
output.writeline("-->")
output.writeline("</style>")
output.writeline("</head>")
output.writeline("<body bgcolor=ffffff link=0000ff vlink=8e2323 alink=00009c>")
output.writeline("<basefont face='verdana,arial,helvetica' size=2 color=000000>")
output.writeline("<font size=4 color=800000 face='Arial,Helvetica,Verdana'>Current Research in the Pleistocene<br>")
output.writeline("Volume 3, 1986</font><br>")
output.writeline("<font size=3 color=800000 face='Arial,Helvetica,Verdana'><b>General Index</b></font><br>")
output.writeline("<font size=2 color=800000 face='Arial,Helvetica,Verdana'><b>Page " & series & "</b></font><br>")
output.writeline("<table><tr><td>")
end if

Next is the heart of the data loop. When the line count reaches 41 it writes the markup that starts the second column, and the standard path outputs one line per record. It also adds the topic to the meta tag list on the way through the loop. As with most coding, the end conditions are what take all the work. Each field is trimmed as it's used, and the string is output. With dynamic content, a browser won't show anything in a table until the table is closed, but here the table is being written to a file, so that is not important.

if paging = 41 then
  ' start the second column, then fall through so this record is still written
  output.writeline("</td><td>")
end if
if trim(rsInput.Fields("topic")) <> "" then
  metatopics = metatopics & trim(rsInput.Fields("topic")) & ";"
end if
aline = trim(rsInput.Fields("topic")) & "  " & trim(rsInput.Fields("subtopic")) & "  "
if trim(rsInput.Fields("pages")) <> "" then
  aline = aline & "-- " & trim(rsInput.Fields("pages")) & "<br>"
else
  aline = aline & "<br>"
end if
output.writeline(aline)
aline = ""
rsInput.movenext

And then finishing the page after it runs out of input...

if rsInput.EOF then
  paging = 81
    output.writeline("</td></tr></table>")
    output.writeline("<p><center><table cellpadding=4 cellspacing=2 width='10%'><tr><td align=center bgcolor=silver><a href='" & htmlPrev & "'><font face='Comic Sans MS' size='4'>Previous</font></a></td><td align=center bgcolor=silver><a href='csfa_001.html'><font face='Comic Sans MS' size='4'>Start</font></a></td>")
    output.writeline("</tr></table></center><p>")
    output.writeline("<font size=1>Dynamic Content Resources by<br><a href='http://www.mallard-design.com/'>Mallard Design Company</a></font>")
    output.writeline("</body>")
    output.writeline("</html>")
    output.close
end if

next

At this point, the page is full and the meta tag list is ready to output to a file for adding to the page on the second pass. This was a conscious choice; in other cases it would be better to store the page and add the list as final output. In this case a lot of the entries contain commas, which need replacing before a topic can be used as a meta tag keyword. It's simpler here to open the finished file, globally change the entries into meta tag format, and then paste them into the page.
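For comparison, the comma replacement could also be scripted instead of done by hand. The following is a minimal sketch under stated assumptions: the function name TopicsToKeywords is hypothetical, the sample topics are made up, and replacing an embedded comma with a space is only one possible choice.

```vbscript
' Hypothetical helper (not part of the original script): turn the
' ";"-separated topic string built in the loop into a comma-separated
' keyword list. Commas inside a topic are replaced with spaces first so
' they do not split one keyword into two.
Function TopicsToKeywords(topicList)
    Dim parts, i, item, result
    result = ""
    parts = Split(topicList, ";")
    For i = 0 To UBound(parts)
        item = Trim(Replace(parts(i), ",", " "))
        ' collapse the double space left behind by ", "
        Do While InStr(item, "  ") > 0
            item = Replace(item, "  ", " ")
        Loop
        If item <> "" Then
            If result <> "" Then result = result & ", "
            result = result & item
        End If
    Next
    TopicsToKeywords = result
End Function

' Made-up sample topics, for illustration only.
Dim keywords
keywords = TopicsToKeywords("Bison, extinct;Clovis points;")
' keywords = "Bison extinct, Clovis points"
```

The trailing semicolon written by the loop produces an empty last item, which the function skips.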

Once the page is full, the meta tag data is ready to save. This code opens the file and writes out the string that was built up during page construction.



set conn3 = server.createobject("Scripting.FileSystemObject")
if not isobject(conn3) then
  errs = "unable to create object"
else
  errs = "file object created"
end if
set txtOut = conn3.OpenTextFile(metasOut, 8, True)
txtOut.writeline(metatopics)
txtOut.close
metatopics = ""

Then a check for end-of-file before a normal finish to the page with the Next-Previous buttons...

if not rsInput.EOF then
output.writeline("</td></tr></table>")
if series > 1 then
output.writeline("<p><center><table cellpadding=4 cellspacing=2 width='10%'><tr><td align=center bgcolor=silver><a href='" & htmlPrev & "'><font face='Comic Sans MS' size='4'>Previous</font></a></td><td align=center bgcolor=silver><a href='"& htmlNext & "'><font face='Comic Sans MS' size='4'>Next</font></a></td>")
else
output.writeline("<p><center><table cellpadding=4 cellspacing=2 width='10%'><tr><td align=center bgcolor=silver><a href='" & htmlNext & "'><font face='Comic Sans MS' size='4'>Next</font></a></td>")
end if
output.writeline("</tr></table></center><p>")
output.writeline("<font size=1>Dynamic Content Resources by<br><a href='http://www.mallard-design.com/'>Mallard Design Company</a></font>")
output.writeline("</body>")
output.writeline("</html>")
output.close
end if
wend

So the paging loop is done and all the pages are written (an example page is in the image at right). The next step is the closing code that lets me know what happened: a totals page, really just a log, saved to a file but output in html.

A General Index Page of Output HTML.



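' count the records with a query; the loop counters do not hold a usable total
totlines = conn2.execute("SELECT COUNT(*) FROM v3_86_index").Fields(0).Value
' series was incremented once for every page written
totpages = series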
htmlOut = "D:\Webshare\wwwroot\asp\csfa\csfa_totals.html"
set output = conn1.opentextfile(htmlOut, 8, True)
output.writeline("")
output.writeline("<!-- DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 3.2 Final//EN'-->")
output.writeline("<html>")
output.writeline("<head>")
output.writeline("<title>Current Research in the Pleistocene - Volume 3, 1986</title>")
output.writeline("</head>")
output.writeline("<body bgcolor=ffffff link=0000ff vlink=8e2323 alink=00009c>")
output.writeline("<basefont face='verdana,arial,helvetica' size=2 color=000000>")
output.writeline("<font size=4 color=800000 face='Arial,Helvetica,Verdana'>Current Research in the Pleistocene Volume 3, 1986</font><br>")
output.writeline("<font size=1>Records Processed: " & totlines & "</font><br>")
output.writeline("<font size=1>Pages Processed: " & totpages & "</font><br>")
output.writeline("<font size=1>Date: " & now() & " Seattle, WA</font><br>")
output.writeline("<font size=1>Dynamic Content Resources by<br><a href='http://www.mallard-design.com/'>Mallard Design Company</a></font>")
output.writeline("</body>")
output.writeline("</html>")
output.close
set rsInput = nothing
%>

The Code Rearranged for Authors.
This code is then modified to create the author index by changing the output and field names, with separately developed pages for the bibliography. The content itself is plain html with few images; as content is added the tables are updated, and subsequent builds will have the new information linked in. If the content had already been in electronic format, the project would be finished in five days for the amount of work. The editing is almost as slow as manual entry, so it's the bottleneck on this particular job.

Of course this is just the publishing end of the process, yet with the information established in databases, future content manipulation is easy. Editing field separators into the lists, so they can be imported into spreadsheets and databases as fields, is a key issue for this type of work. Parsing fields with software is not as reliable as a human operator in this case, because there are too many variables in the content, so using that expensive time to establish the data structure is as important as its quality. Since quality is very important to academic content, this system really helps to organize and validate the data as it gets processed, and yet it is pretty easy to develop.

The next page has the entire code. The displayed code was tested by copying and pasting it into a text editor, saving it as an asp file, and running it.





Remember to change the file locations to your system drives and resources. This coding runs fine with SQL/Access/Oracle...

<%@ language='vbscript' %>
<% response.buffer = true %>
<!-- #include virtual="/asp/adovbs.inc" -->
<% server.scripttimeout = 240 %>
<%
dim input, output, htmlOut, author, editor, pubdate, topic, source, locale, annotate, errs, errs2
dim conn1, conn2, conn3, rsInput, txtOut, rsTemp, pagename, metatopics, secstart, minstart
dim metasOut
metasOut = "D:\Webshare\wwwroot\asp\csfa\csfa_metas.txt"
set conn2 = server.createobject("ADODB.Connection")
  conn2.open "DRIVER={Microsoft Access Driver (*.mdb)}; DBQ=D:\Webshare\wwwroot\asp\csfa\biblio_01.mdb"
  if not isobject(conn2)then
    errs2 = "unable to create access object"
    else errs2 = "access object created"
    end if
dim allfields, sql_01, sql_02
  sql_01 = sql_01 & "SELECT * from v3_86_index"
set rsInput = conn2.execute(sql_01)

rsInput.movefirst
  set conn1 = server.createobject("Scripting.FileSystemObject")
  if not isobject(conn1) then
    errs = "unable to create object"
  else errs = "file object created"
    end if
dim paging, aline, series, sNext, sPrev

while not rsInput.EOF
series = series + 1

for paging = 1 to 80
if paging = 1 then
  ' first record of a page: build the three-digit file name and the links
  sNext = series + 1
  sPrev = series - 1
  if series < 100 then
    if series < 10 then
      htmlOut = "D:\Webshare\wwwroot\asp\csfa\csfa_00" & series & ".html"
      if sNext = 10 then
        htmlNext = "csfa_010.html"
      else
        htmlNext = "csfa_00" & sNext & ".html"
      end if
      htmlPrev = "csfa_00" & sPrev & ".html"
    else
      htmlOut = "D:\Webshare\wwwroot\asp\csfa\csfa_0" & series & ".html"
      if sNext = 100 then
        htmlNext = "csfa_100.html"
      else
        htmlNext = "csfa_0" & sNext & ".html"
      end if
      if sPrev = 9 then
        htmlPrev = "csfa_009.html"
      else
        htmlPrev = "csfa_0" & sPrev & ".html"
      end if
    end if
  else
    htmlOut = "D:\Webshare\wwwroot\asp\csfa\csfa_" & series & ".html"
    htmlNext = "csfa_" & sNext & ".html"
    if sPrev = 99 then
      htmlPrev = "csfa_099.html"
    else
      htmlPrev = "csfa_" & sPrev & ".html"
    end if
  end if
set output = conn1.opentextfile(htmlOut, 8, True)
output.writeline("<!-- DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 3.2 Final//EN'-->")
output.writeline("<html>")
output.writeline("<head>")
output.writeline("<title>General Index - Volume 3, 1986 - Current Research in the Pleistocene</title>")
output.writeline("<meta name='Publisher' content='Center for the First Americans'>")
output.writeline("<meta name='Publisher-email' content='info@mallard.design.com'>")
output.writeline("<meta name='Keywords' content='*'>")
output.writeline("<meta name='Description' content='Current Research in the Pleistocene - Volume 3, 1986'>")
output.writeline("<meta name='Identifier-URL' content='http://www.>")
output.writeline("<meta name='Content-Language' content='en-US '>")
output.writeline("<meta name='Coverage' content='Worldwide'>")
output.writeline("<meta name='Date-created-yyyymmdd' content='19990115'>")
output.writeline("<meta name='Date-Revised-yyyymmdd' content='*'>")
output.writeline("<script language=JavaScript>")
output.writeline("<!-- ")
output.writeline("//-->")
output.writeline("</script>")
output.writeline("<style type='text/css'>")
output.writeline("<!--")
output.writeline(".roll {")
output.writeline("font-family:Arial;")
output.writeline("font-size:14pt;")
output.writeline("text-decoration:none;")
output.writeline("color:black;")
output.writeline("}")
output.writeline("-->")
output.writeline("</style>")
output.writeline("</head>")
output.writeline("<body bgcolor=ffffff link=0000ff vlink=8e2323 alink=00009c>")
output.writeline("<basefont face='verdana,arial,helvetica' size=2 color=000000>")
output.writeline("<font size=4 color=800000 face='Arial,Helvetica,Verdana'>Current Research in the Pleistocene<br>")
output.writeline("Volume 3, 1986</font><br>")
output.writeline("<font size=3 color=800000 face='Arial,Helvetica,Verdana'><b>General Index</b></font><br>")
output.writeline("<font size=2 color=800000 face='Arial,Helvetica,Verdana'><b>Page " & series & "</b></font><br>")
output.writeline("<table><tr><td>")
end if

if paging = 41 then
  ' start the second column, then fall through so this record is still written
  output.writeline("</td><td>")
end if
if trim(rsInput.Fields("topic")) <> "" then
  metatopics = metatopics & trim(rsInput.Fields("topic")) & ";"
end if
aline = trim(rsInput.Fields("topic")) & "  " & trim(rsInput.Fields("subtopic")) & "  "
if trim(rsInput.Fields("pages")) <> "" then
  aline = aline & "-- " & trim(rsInput.Fields("pages")) & "<br>"
else
  aline = aline & "<br>"
end if
output.writeline(aline)
aline = ""
rsInput.movenext
if rsInput.EOF then
  paging = 81
    output.writeline("</td></tr></table>")
    output.writeline("<p><center><table cellpadding=4 cellspacing=2 width='10%'><tr><td align=center bgcolor=silver><a href='" & htmlPrev & "'><font face='Comic Sans MS' size='4'>Previous</font></a></td><td align=center bgcolor=silver><a href='csfa_001.html'><font face='Comic Sans MS' size='4'>Start</font></a></td>")
    output.writeline("</tr></table></center><p>")
    output.writeline("<font size=1>Dynamic Content Resources by<br><a href='http://www.mallard-design.com/'>Mallard Design Company</a></font>")
    output.writeline("</body>")
    output.writeline("</html>")
    output.close
end if

next
set conn3 = server.createobject("Scripting.FileSystemObject")
if not isobject(conn3) then
  errs = "unable to create object"
else
  errs = "file object created"
end if
set txtOut = conn3.OpenTextFile(metasOut, 8, True)
txtOut.writeline(metatopics)
txtOut.close
metatopics = ""
if not rsInput.EOF then
output.writeline("</td></tr></table>")
if series > 1 then
output.writeline("<p><center><table cellpadding=4 cellspacing=2 width='10%'><tr><td align=center bgcolor=silver><a href='" & htmlPrev & "'><font face='Comic Sans MS' size='4'>Previous</font></a></td><td align=center bgcolor=silver><a href='" & htmlNext & "'><font face='Comic Sans MS' size='4'>Next</font></a></td>")
else
output.writeline("<p><center><table cellpadding=4 cellspacing=2 width='10%'><tr><td align=center bgcolor=silver><a href='" & htmlNext & "'><font face='Comic Sans MS' size='4'>Next</font></a></td>")
end if
output.writeline("</tr></table></center><p>")
output.writeline("<font size=1>Dynamic Content Resources by<br><a href='http://www.mallard-design.com/'>Mallard Design Company</a></font>")
output.writeline("</body>")
output.writeline("</html>")
output.close
end if
wend

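' count the records with a query; the loop counters do not hold a usable total
totlines = conn2.execute("SELECT COUNT(*) FROM v3_86_index").Fields(0).Value
' series was incremented once for every page written
totpages = series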
htmlOut = "D:\Webshare\wwwroot\asp\csfa\csfa_totals.html"
set output = conn1.opentextfile(htmlOut, 8, True)
output.writeline("")
output.writeline("<!-- DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 3.2 Final//EN'-->")
output.writeline("<html>")
output.writeline("<head>")
output.writeline("<title>Current Research in the Pleistocene - Volume 3, 1986</title>")
output.writeline("</head>")
output.writeline("<body bgcolor=ffffff link=0000ff vlink=8e2323 alink=00009c>")
output.writeline("<basefont face='verdana,arial,helvetica' size=2 color=000000>")
output.writeline("<font size=4 color=800000 face='Arial,Helvetica,Verdana'>Current Research in the Pleistocene Volume 3, 1986</font><br>")
output.writeline("<font size=1>Records Processed: " & totlines & "</font><br>")
output.writeline("<font size=1>Pages Processed: " & totpages & "</font><br>")
output.writeline("<font size=1>Date: " & now() & " Seattle, WA</font><br>")
output.writeline("<font size=1>Dynamic Content Resources by<br><a href='http://www.mallard-design.com/'>Mallard Design Company</a></font>")
output.writeline("</body>")
output.writeline("</html>")
output.close
set rsInput = nothing
%>
