#1
MMS |
Observer
8/20/2001 11:27:10 AM |
I don't know how far along this search engine is, so please keep me informed. So far it looks just like metacrawler!
What about NLU and stuff like that?
If you need any graphics, just tell me, I'll make some. I can do it in Meson style too.
(What I'm saying is that the results page looks too much like metacrawler; could you change it from text to pictures?) --
Wish you a lot of theta brainwaves! |
#2
MMS |
Upstart
8/23/2001 9:41:55 AM |
In Reply to #1 I've tried out the MMS and it does seem quite like metacrawler. What are the advantages of MMS? Or are the advantages at a backend level?
-Tim [upstart.nu] |
#3
MMS |
Observer
8/23/2001 10:27:39 AM |
In Reply to #2 For now the only advantage is that it's ours and we can thus do anything with it we want.
The point is that we might integrate it with our other software, such as IOS, any NLU and/or voice recognition. --
Wish you a lot of theta brainwaves! |
#4
MMS |
Observer
8/27/2001 8:00:49 AM |
Neil Nelson wrote: I need to figure out how to automatically get the items from the return search engine HTMLs and expect that some pattern recognition methods may do the trick.
I am just about ready to test the new return HTML extraction method. I am very hopeful for this new design.
Good news, I think I have solved the above-mentioned problem. Here is my work (CAPITALS represent names):
--- Begin quote ---
1. Break the html into KEYS with VALUES. KEYS should be named A, B, C, D, etc. VALUES should be the hypertext within. KEYS should be determined by html tags (regarding text formatting) or table rows.
2. These KEYS should be fed into PATTERNIZER. PATTERNIZER returns one COMBINATION OF KEYS (called "return keys"). The COMBINATION OF KEYS is added to a DATABASE.
* Example * "ABCDCDCDEF" will return "CD". * Example *
3. The html is then scanned for KEYS and VALUES again, and if a KEY matches any of the KEYS from the COMBINATION OF KEYS, its VALUE is copied to the DATABASE together with its KEY'S name.
* Example * KEYS: "ABCDCDCDEF" VALUES: "0123456789" entries = "CD + 23 45 67" * Example *
4. The PAGE BUILDER then constructs a "Results" page using the DATABASE.
* Example * 2: 3 4: 5 6: 7 * Example *
---- End quote ----
I'm not saying it's perfect, but it's a working solution. Oh, yes, and to the others: PATTERNIZER is a program I just made; e-mail me if you want a copy. It's the developer version, but the source code is ready, equipped with comments, and can easily be scavenged for useful code for a PRO version. --
Wish you a lot of theta brainwaves! |
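The four steps in #4 can be illustrated with a small C++ sketch. This is not the PATTERNIZER source, which is not shown in the thread; it assumes that the "return keys" are the unit whose back-to-back repetition covers the most of the key string, which is one rule that reproduces the "ABCDCDCDEF" example. The names find_return_keys and collect_entries are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for PATTERNIZER (step 2): return the unit whose
// consecutive repetition (at least two repeats) covers the most characters.
std::string find_return_keys(const std::string& keys) {
    std::string best;
    std::size_t best_cover = 0;
    const std::size_t n = keys.size();
    for (std::size_t len = 1; len * 2 <= n; ++len) {
        for (std::size_t start = 0; start + 2 * len <= n; ++start) {
            std::size_t repeats = 1;
            // Count how many times the unit at `start` repeats back to back.
            while (start + (repeats + 1) * len <= n &&
                   keys.compare(start + repeats * len, len, keys, start, len) == 0) {
                ++repeats;
            }
            const std::size_t cover = repeats * len;
            if (repeats >= 2 && cover > best_cover) {
                best_cover = cover;
                best = keys.substr(start, len);
            }
        }
    }
    return best;  // empty when nothing repeats
}

// Step 3: keep every (key, value) pair whose key occurs in the return keys.
std::vector<std::pair<char, std::string>> collect_entries(
        const std::string& keys,
        const std::vector<std::string>& values,
        const std::string& return_keys) {
    assert(keys.size() == values.size());  // one value per key
    std::vector<std::pair<char, std::string>> entries;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (return_keys.find(keys[i]) != std::string::npos) {
            entries.emplace_back(keys[i], values[i]);
        }
    }
    return entries;
}
```

With the keys "ABCDCDCDEF" and the values "0" through "9", find_return_keys yields "CD" and collect_entries keeps the values 2 through 7 in order, matching the "23 45 67" example. On ties the sketch prefers the shorter, earlier unit; the real program may choose differently.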
#5
MMS |
Neil
8/27/2001 9:35:36 AM |
In Reply to #2 Tim, Thank you for looking at the MMS. The new version should be ready shortly. |
#10
MMS |
Neil
8/27/2001 9:48:05 AM |
In Reply to #9 That was 320 characters, which seems to be more than a previous message and means that the mere length is not the problem. Perhaps it has something to do with the characters present, for example periods, that may cause the receiving software to fail at some point. It might be the use of periods or it might be the automatic line wrapping. So let us see. |
#11
MMS |
Neil
8/27/2001 9:51:04 AM |
In Reply to #10 That was 320 characters, which seems to be more than a previous message and means that the mere length is not the problem. Perhaps it has something to do with the characters present, for example periods, that may cause the receiving software to fail at some point. It might be the use of periods or it might be the automatic line wrapping. So let us see.. But let us see if a double return for a paragraph break causes a problem. And since that prior paragraph break could not be posted, that appears to be the problem. |
#12
MMS |
Neil
8/27/2001 9:55:22 AM |
In Reply to #11 Let us see if a single press of the enter key causes the problem. As I just entered. Yes, it is any use of the enter key in Netscape on Linux that causes the post not to be sent. This may be related to the way Unix/Linux codes the enter key as a single character--I believe asc(10)--while MS codes the enter key as two characters--I believe asc(13),asc(10). But in any case as long as I do not use the enter key I should be able to post. |
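The line-ending difference described in #12 (a single line feed, character 10, on Unix/Linux versus carriage return plus line feed, characters 13 and 10, on MS systems) is usually handled by normalizing text before it is processed. The following is a minimal sketch of that idea; normalize_newlines is a hypothetical helper, not part of the forum software or the MMS.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert CR LF (13,10) and bare CR (13) into a single LF (10).
std::string normalize_newlines(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            out += '\n';
            // Skip the LF half of a CR LF pair so it is not counted twice.
            if (i + 1 < text.size() && text[i + 1] == '\n') {
                ++i;
            }
        } else {
            out += text[i];
        }
    }
    return out;
}
```

A receiving program that applies something like this to every submitted message sees the same text whichever platform the browser ran on.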
#13
MMS |
Neil
8/27/2001 9:59:21 AM |
In Reply to #12 Also the action of the screen is that it gives me a security alert when the message is able to be posted--and the usage of security alerts may be particular to the way I have my Netscape set up--, and does not give me a security alert when it cannot be posted. This suggests that it may be a problem on the Netscape/Linux side as against a problem at the MesonAI side. But since Ryan has Netscape/Linux, he may be able to confirm that there is a newline issue and may have an idea. But at least I am posting to the forums though not with great style. Regards, Neil Nelson |
#14
MMS |
Neil
8/27/2001 10:08:17 AM |
In Reply to #4 Jure, thank you for describing your Patternizer. I have just completed an initial test on the new item identification method, and it seems to work fairly well. But your method should come in handy. What we need to figure out is how to make our various bits of code integrate with each other easily, so that I can just take your source code or a compiled library and use it with my programs with little effort. Regards, Neil Nelson |
#15
MMS |
Observer
8/27/2001 12:28:14 PM |
In Reply to #14 Simple, I send you the engine source, you read it, I explain if anything is unclear and then you recode it into your code or I translate it to Quick BASIC and use a BASIC to C translator, then you just copy the code.
Everything is prepared. --
Wish you a lot of theta brainwaves! |
#16
MMS |
Neil
8/27/2001 1:13:13 PM |
In Reply to #15 Dear Jure, Though it may be an imposition, I think we should be dealing with the C code and would very much appreciate seeing the C code result. I should not delay the MMS, but I could soon take your C code and compile it on Linux and see how it and the sequence of going from Basic on MS to C on Linux works. If it does work then that may be an interim solution. Regards, Neil Nelson |
#17
MMS |
Observer
8/27/2001 1:45:03 PM |
In Reply to #16 Great.
I have just downloaded the BCX translator and worked through its rather non-existent how-to-use help files. I now only have to make sure that the Visual BASIC code is compatible with BCX, which is actually a language of its own: a cross between Quick, Power and Visual BASIC.
As far as I see it now, it creates a *.c file that has something to do with MS Visual C++.
I'll keep you informed about the process. --
Wish you a lot of theta brainwaves! |
#18
MMS |
Observer
8/27/2001 1:48:18 PM |
In Reply to #17 Anyway, I saw what C code the translator produces. Even if it doesn't turn out to be useful, it did something good: now I feel compelled to learn C. It's too simple to be true.
Ryan, are you with me? =] --
Wish you a lot of theta brainwaves! |
#19
MMS |
Observer
8/27/2001 2:27:30 PM |
Hey, Ryan! Care to update the information on the main site regarding the MMS? It's temporarily off now; it doesn't work at the moment.
Thanx. --
Wish you a lot of theta brainwaves! |
#20
MMS |
Meson Cyborg
8/27/2001 3:18:12 PM |
In Reply to #13 If you are under Linux and using Netscape - Konqueror works just like IE - I don't think the enter key works, but I don't see how it would result in a message not being posted. In any case, you can just add an HTML line break tag (<br>).
Thanks, bye.
Also, I have deleted those messages that were just long numbers because they made this thread look like garbage and hard to read.
-Ryan Morris (M) |
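The workaround suggested in #20, typing a line break tag instead of pressing enter, could also be applied automatically to message text before it is displayed. A minimal sketch follows; newlines_to_br is a hypothetical helper and nothing in the thread says the forum software works this way.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replace each line ending (CR LF, LF, or bare CR) with an HTML <br> tag.
std::string newlines_to_br(const std::string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\r' || c == '\n') {
            // Treat CR LF as one line ending, not two.
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
                ++i;
            }
            out += "<br>";
        } else {
            out += c;
        }
    }
    return out;
}
```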
#21
MMS |
Meson Cyborg
8/27/2001 3:19:56 PM |
In Reply to #19 Jure, what are you talking about - "temporarily off"? What would you like to add to the main page?
-Ryan Morris (M) |
#22
MMS |
Observer
8/28/2001 4:45:22 AM |
In Reply to #21 Well is it working? Last time I checked it didn't. Well there is little for visitors to "check out" then, eh? --
Wish you a lot of theta brainwaves! |
#23
MMS |
Observer
8/28/2001 9:53:10 AM |
In Reply to #17 Good news: I successfully translated the program into Quick BASIC source (it took me a few extra subroutines, because Quick BASIC does not have listboxes, which are vital to the program) and translated it into that "C" via BCX. I've used an option I didn't know of before that saves the source "in UNIX format"; I don't know what that's supposed to mean, so inform me if you notice a difference, Neil. --
Wish you a lot of theta brainwaves! |
#24
MMS |
Observer
8/28/2001 11:29:23 AM |
In Reply to #23 Oops! Forgot to fix something. "The program" is the patternizer engine that we were considering using with the MMS to provide automated pattern learning from search engine output (MMS input). --
Wish you a lot of theta brainwaves! |
#25
MMS |
Neil
8/28/2001 12:51:24 PM |
In Reply to #24 When my computer is up there is an MMS page saying that it is temporarily not working, but it should be working shortly. I was up till after 2am rewriting the return HTML item selection routine. The first one used a recursive method (a subroutine that calls itself), but that just got too complicated. I am now writing an HTML maintenance program to easily enter in another 10 search engines and provide a way to quickly get at the search engine detail after a more automated method of item extraction is identified. I quickly unzipped the Patternizer_C.zip but have not tried to run it since I am working full-out to get the last MMS work done. What we may now need is to have a way of either providing the C++ subroutines I have to Jure that will connect to the Patternizer or some method for Jure to be able to assemble the code on my computer. That is, if I understand Jure's procedure correctly, he is coding in VB and then translating to C, but then he does not likely have the ability to integrate C code modules and compile them. What is required is a small subsystem utilizing some of my routines in conjunction with Jure's routine that identifies possible search sites and then identifies the item extraction pattern. I could easily provide Jure with the basic routines that would help him assemble that subsystem. Jure, if you are interested, can you work in C on your computer or will we need to set up some remote communication for my computer? Regards, Neil Nelson |
#26
MMS |
Observer
8/29/2001 5:55:53 AM |
In Reply to #25 Software should not be a problem, I just installed MS-VC++ on my computer, however my knowledge of C++ would be. I'd need tons of comments in the code you provide, Neil.
Thanx. --
Wish you a lot of theta brainwaves! |
#27
MMS |
Neil
8/29/2001 10:28:48 AM |
In Reply to #26 Very good Jure. I do need to see if I can get your code running to in part show the essential process of having code blocks transferable from computer to computer and MS to Linux.
The primary objective then is to take the HTML return from an arbitrary search engine and then secondarily from an arbitrary source that uses a list format with links and get three items:
(1) The link or anchor that allows the user to click on the name and be sent to the link,
(2) The name of the link that appears after the beginning anchor tag 'a ...' (and enclosed in angle brackets but those will not appear here because of the automatic HTML translation) and before the ending anchor tag '/a', and
(3) Any description that commonly appears after the link name (2).
I have just sent you by email some basic routines for parsing and writing the tag structure from an HTML file. The objectives would be to:
(1) Use the Google.HTML file as input and get the two files Google.tree and Google.ankr as output.
(2) Then you would want to run the procedure against an HTML you get from using one of the search engines.
(3) Get back to me so that I can send you my current item extraction routine using the already sent code.
(4) Integrate your Patternizer code with my item extraction routine and test it against search engine HTML pages that you can save or copy.
You will likely have a number of questions and issues, and I should remark that I just pasted the jureparse.cc routine together from pieces of the MMS and have not attempted to compile it. I am now to the point of doing miscellaneous cleanup on the new version of the MMS. Regards, Neil Nelson |
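The three items named in #27 (link, link name, and trailing description) can be pulled from a results page with a simple scan for anchor tags. The sketch below is not the MMS extraction routine, which is not shown in the thread. It assumes lowercase tags and double-quoted href attributes, and the names ResultItem, strip_tags and extract_items are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ResultItem {
    std::string link;         // (1) target of the anchor
    std::string name;         // (2) text between the anchor tags
    std::string description;  // (3) text that follows the anchor
};

// Drop everything inside <...> and trim surrounding whitespace.
std::string strip_tags(const std::string& html) {
    std::string text;
    bool in_tag = false;
    for (char c : html) {
        if (c == '<') in_tag = true;
        else if (c == '>') in_tag = false;
        else if (!in_tag) text += c;
    }
    const std::size_t first = text.find_first_not_of(" \t\r\n");
    if (first == std::string::npos) return "";
    const std::size_t last = text.find_last_not_of(" \t\r\n");
    return text.substr(first, last - first + 1);
}

// Assumes lowercase tags and double-quoted href values (a simplification).
std::vector<ResultItem> extract_items(const std::string& html) {
    const std::size_t npos = std::string::npos;
    std::vector<ResultItem> items;
    std::size_t pos = 0;
    while ((pos = html.find("<a ", pos)) != npos) {
        const std::size_t tag_end = html.find('>', pos);
        if (tag_end == npos) break;
        const std::size_t close = html.find("</a>", tag_end);
        if (close == npos) break;

        ResultItem item;
        const std::size_t href = html.find("href=\"", pos);
        if (href != npos && href < tag_end) {
            const std::size_t start = href + 6;  // length of href="
            const std::size_t quote = html.find('"', start);
            if (quote != npos && quote < tag_end) {
                item.link = html.substr(start, quote - start);
            }
        }
        item.name = strip_tags(html.substr(tag_end + 1, close - tag_end - 1));

        // The description is whatever text sits between this anchor and the next.
        const std::size_t after = close + 4;  // length of </a>
        const std::size_t next = html.find("<a ", after);
        item.description =
            strip_tags(html.substr(after, next == npos ? npos : next - after));

        items.push_back(item);
        pos = after;
    }
    return items;
}
```

A real results page also carries navigation and advertising links, which is where a pattern step such as the Patternizer would be needed to tell result anchors from the rest.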
#28
MMS |
Observer
8/29/2001 1:58:43 PM |
In Reply to #27 I think it would be good if, from now on, you send attachments ZIPped. This is because my Windows Netscape doesn't allow me to manually export files that are text-encoded, and the automatic method screws up all the formatting.
I don't know anything about C so far, except the things that are similar to those in Visual BASIC (that's quite a lot, because I use Visual C++). It would be very nice for me if most of your program's capabilities were in functions with comprehensible names, and possibly with their descriptions in comments. This should be common practice in general (in ALL languages) if we're going to do any team programming.
I would need info on what a particular function does because, as said, I cannot read C code as I would read VB. For my task I would need to know how to:
1. Open a URL and read the HTML
2. Separate the HTML syntax into Tags and Text in tags (text in links are the links themselves)
3. Remove all Tags except the text formatting tags (size, style and colors) and links.
4. Build a String of characters of Tags in a sequence that is identical to the sequence on the source page, where each character represents one Tag.
5. Feed the String into Patternizer
6. Get the Return sequence out of Patternizer
7. Store the Return sequence in a database together with the search engine name (e.g.: www.altavista.com)
8. Remove from the HTML the Tags, and their Text, whose assigned characters cannot be found in the Return sequence
9. Assign new text formatting Tags to existing Text, where all Text with the same old formatting gets the same new formatting.
10. Add a new header and footer to the results so far
If there are functions that do any part of this, names please. As I see it now, I will have to optimize my Patternizer in order to remove the annoying string and character aspect altogether. There will be arrays of Tags in place of Strings of characters. I'm upgrading it now.
I'll still need to know what is already done. I'll then make Patternizer do most of the work and code it in VB, then translate that into C and try to at least help you integrate it all together and fix the translator bugs.
Sorry for the thinking-while-typing, but it proved useful. C'ya! --
Wish you a lot of theta brainwaves! |
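Step 4 of the list in #28, building a string in which each character stands for one tag, amounts to a small lookup table from tag name to letter. The following C++ sketch shows one way to do it. The function name encode_tags and the rule of assigning letters in order of first appearance are assumptions for illustration, not something taken from the Patternizer or the MMS.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Assign 'A', 'B', 'C', ... to tag names in order of first appearance and
// return the page's tag sequence encoded as one character per tag.
std::string encode_tags(const std::vector<std::string>& tags,
                        std::map<std::string, char>& table) {
    std::string encoded;
    for (const std::string& tag : tags) {
        auto it = table.find(tag);
        if (it == table.end()) {
            assert(table.size() < 26);  // this sketch only handles 'A'..'Z'
            const char key = static_cast<char>('A' + table.size());
            it = table.emplace(tag, key).first;
        }
        encoded += it->second;
    }
    return encoded;
}
```

Passing the same table to every page from one search engine keeps the letters consistent between pages. The 26-letter limit is the kind of restriction that moving to arrays of tags, as planned in the post, would remove.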
#29
MMS |
Neil
8/29/2001 6:04:26 PM |
In Reply to #28 If you look at the Google.tree file you will see that it has identified all the HTML tags, with each line either identifying a tag or the beginning of a text line that is given a null tag.
You do not need to open a URL and read an HTML file at this time since you only need to work on examples from particular engines or a small group of engines. I have always tested against examples I have copied to disc because it is easier, faster, and the HTML remains exactly the same each time.
You appear to be saying that you need to have each tag represented by a different character, which then requires you to build a tag name vs. character translator/lookup. Once you get the get_html_tags routine running it will extract tags of whatever name, and you can then accumulate together several files and sort on tag and in that way identify what tags need to be in the table. You will likely need to modify the write_html_tree routine to not print the leading periods that indicate how deep the tag is nested in order to sort and extract the tag names properly.
But for the moment you only need to study the code in the jureparse.cc file starting with the routine called main or at the line: int main(). This is where the program starts.

    int main()
    {
        // This line declares a string variable file_name and assigns `Google' to that variable.
        string file_name = "Google";

        // This line takes the result of get_res_test above in the file and assigns it to res which
        // is declared global. The purpose of get_res_test is to make a memory file or character
        // string from the disk file before processing it. Normally the MMS uses a memory
        // character string as the return from the search engine and there is no disk file.
        // It may be easier to just read the disk file directly and that is what the last routine
        // in jureparse.cc does. And then you could take code out of get_res_test to make that
        // happen. Once you get into how to open and read disk files, it should be easy to follow.
        res = get_res_test(file_name); // Load file into res

        // This declares an integer variable and assigns 0 or NULL to it. This null character is
        // then appended to the file_name for use by the C++ file opening routine
        // `ihtml.open(ifile_name.c_str());' which is above in jureparse.cc.
        int end_null = 0;

        // The following two string variables and assignments assemble the output file names.
        // You will want to change the directory location to one you will be using. It should be
        // possible to remove the initial quoted string and just start with `file_name' to have
        // the output go to the directory you run the program in.
        // *** put in your own output location directories here.
        string tree_file_name = "/home/n_nelson/web_search/nnget/prime_result_trees/"
            + file_name + ".tree" + char(end_null);
        string anchor_file_name = "/home/n_nelson/web_search/nnget/prime_result_trees/"
            + file_name + ".ankr" + char(end_null);

        // This declares a variable that is defined in htmlparse.h. The purpose of this variable is
        // to allow easy handling of both the tag tree and anchor linked list obtained from
        // get_html_tags.
        Html_Tag_Classes html_tag_classes;

        // This is how comments are displayed to the screen when the program is running.
        // The flush line is used for the Apache web server so that when a program aborts I can
        // get as much output as possible. Otherwise messages in the buffer are lost.
        cout << "before get_html_tags \n";
        cout.flush();

        // This line runs the get_html_tags routine that extracts the tags and anchor tags.
        html_tag_classes = get_html_tags(html_tag_classes);

        // Runtime display as previously.
        cout << "write_anchor_tags html_tag_classes.top_anchor_tag= "
             << html_tag_classes.top_anchor_tag << " \n";
        cout.flush();

        // This line calls the routine that writes the Google.ankr file containing the anchor
        // tags and related detail including link name according to their sequence in the file.
        write_anchor_tags(anchor_file_name, html_tag_classes.top_anchor_tag);

        cout << "write_html_tree html_tag_classes.top_tree_tag= "
             << html_tag_classes.top_tree_tag << " \n";
        cout.flush();

        // This line calls the routine that writes the Google.tree file.
        write_html_tree(tree_file_name, html_tag_classes.top_tree_tag);

        // The following lines clear allocated memory.
        delete_html_tag_tree(html_tag_classes.top_tree_tag);
        delete_anchor_tags(html_tag_classes.top_anchor_tag);
        delete [] res;
        return 0;
    }

The get_res_test routine above the main routine just opens the input disk file Google.html and copies it into a string variable and then copies that string variable to a char type variable, which is the kind returned by the search engine and makes on-line and off-line testing easier.
But you only need to do items 3 through 6 above at this time. I am using a different detection method based on anchor tags at the moment without any pattern matching, but can see that a combination of the new method and the pattern matching method will be better. The point being that when you get to the point of applying the Patternizer, you may be able to use my current anchor method to improve the overall results. Regards, Neil Nelson |
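The suggestion in #29 to sort on tag and so find which tags belong in the lookup table can be sketched as below. The exact layout of the Google.tree file is not shown in the thread, so this assumes one tag per line, prefixed by periods for nesting depth and optionally followed by attributes; collect_tag_names is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Collect the distinct tag names from tree text, one entry per line,
// ignoring the leading periods that mark nesting depth (assumed format).
std::set<std::string> collect_tag_names(const std::string& tree_text) {
    std::set<std::string> names;  // std::set keeps the names sorted and unique
    std::istringstream lines(tree_text);
    std::string line;
    while (std::getline(lines, line)) {
        const std::size_t start = line.find_first_not_of('.');
        if (start == std::string::npos) continue;  // empty or periods only
        // Keep only the first word so attributes do not split one tag into many.
        const std::size_t end = line.find_first_of(" \t\r", start);
        if (end == start) continue;  // no tag name at this position
        names.insert(line.substr(
            start, end == std::string::npos ? std::string::npos : end - start));
    }
    return names;
}
```

Running this over the concatenated tree output of several saved result pages would give the sorted list of tag names from which the tag-to-character table can be built.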
#30
MMS |
Neil
8/29/2001 7:11:01 PM |
In Reply to #29 And then after thinking about it a bit more .... Likely the first thing to do is to copy into another C++ program the following lines from jureparse.cc: all the lines from the top down to, and not including, the line beginning the get_res_test subroutine. Append to that copy the lines just referenced for the main routine. A main routine is required for all C programs as the first routine called from the OS when the program is executed. And then comment out all the subroutine calls using two forward slashes `//'. Those would be get_res_test, get_html_tags, write_anchor_tags, write_html_tree, and the three following lines beginning with `delete'. This should then allow that first main routine to compile and execute, showing the display lines. And then we can add the get_res_test routine to that new program and remove the comment for that call and see if it compiles. Then we can run the program to see if both routines work together. Then we need to see if the htmlparse.cc program will compile. The routines from this file will be linked later, and the compile method will likely (as done on Linux) need to reflect that. And that gets us quite far along. Regards, Neil Nelson |
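The first routine to be re-enabled in the plan above is get_res_test, described in #29 as copying a disk file into a string and then into a char buffer. The real routine is not shown in the thread; the sketch below only illustrates that idea with hypothetical helpers load_file and to_char_buffer.

```cpp
#include <cassert>
#include <cstring>
#include <fstream>
#include <sstream>
#include <string>

// Read a whole file into `contents`; returns false if it cannot be opened.
bool load_file(const std::string& path, std::string& contents) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return false;
    std::ostringstream buffer;
    buffer << in.rdbuf();
    contents = buffer.str();
    return true;
}

// Copy the string into a new[]-allocated, NUL-terminated char buffer.
// The caller owns the buffer and must release it with delete [].
char* to_char_buffer(const std::string& contents) {
    char* res = new char[contents.size() + 1];
    std::memcpy(res, contents.c_str(), contents.size() + 1);
    return res;
}
```

Checking the return value of load_file before going further makes a wrong directory path show up immediately, which helps when the same code is moved between the MS and Linux machines.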
#31
MMS |
Observer
8/30/2001 5:27:43 AM |
In Reply to #16 I have found a new BASIC to C translator. The difference is that this one is made for UNIX and translates real Quick BASIC to real C.
Here is the download link: ftp://darkstar.irb.hr/pub/qb2c/qb2c-3.40.tgz
And the manual link: http://faust.irb.hr/~stipy/qb2c/manual.txt
--
Wish you a lot of theta brainwaves! |