I’ve done some more calculations and found that for every 30 minutes of processing, we spend only about 3% of the time (less than 1 minute) reading content from files and roughly 97% of the time (about 29 minutes) writing data to the database. Taking 1/30 of the total time spent reading and writing (1,187,328 seconds) gives 39,577.6 seconds spent reading. Dividing the 253,545,925 relationships parsed so far by that figure puts the actual reading speed at roughly 6,406.29 relationships per second.
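The arithmetic above can be reproduced with a short sketch. It assumes the read share is exactly 1 minute out of every 30 (the fraction behind the 39,577.6 seconds figure); the function name `read_speed` is made up for illustration and is not part of TransluSense.

```python
def read_speed(total_seconds, total_relationships, read_fraction):
    """Estimate read throughput if only `read_fraction` of the time is spent reading."""
    read_seconds = total_seconds * read_fraction
    return read_seconds, total_relationships / read_seconds


TOTAL_SECONDS = 1_187_328          # total seconds of parsing so far
TOTAL_RELATIONSHIPS = 253_545_925  # total relationships parsed so far

# Assumption: reading takes 1 minute out of every 30.
read_seconds, per_second = read_speed(TOTAL_SECONDS, TOTAL_RELATIONSHIPS, 1 / 30)
print(f"{read_seconds:,.1f} seconds reading, {per_second:,.2f} relationships per second")
```

Changing `read_fraction` shows how sensitive the estimate is: a slightly smaller read share implies a noticeably faster reader.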
The fact that we are writing to the database 97% of the time suggests that we should try to improve write performance so we can take advantage of the reading speed. Another important fact is that the speed of writing data to the database has not changed since the beginning of this instance, which means that database size isn’t (yet) affecting performance. Ideas that I have so far for increasing performance are to (1) use RAID 0, or (2) use a cluster of databases. Another option might be to tune the database configuration. I’m not sure how much that would help, although it could increase performance significantly. I’m no DBA, so I can’t say for sure.
The following are slightly more optimistic figures regarding the speed at which we are parsing data:
| unique relationships: | 66,024,755 |
| total relationships parsed: | 253,545,925 |
| relationships per second: | 213.54 |
| relationships per minute: | 12,812.60 |
| relationships per hour: | 768,755.84 |
| unique rel’s per second: | 55.61 |
| unique rel’s per minute: | 3,336.47 |
| unique rel’s per hour: | 200,188.25 |
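For anyone who wants to check these rates, here is a minimal sketch of how they follow from the raw counts. It assumes the elapsed time is the 1,187,328 seconds of parsing reported on this blog; the helper name `rates` is hypothetical.

```python
def rates(count, elapsed_seconds):
    """Return throughput per second, minute, and hour for `count` items."""
    per_second = count / elapsed_seconds
    return {
        "per_second": per_second,
        "per_minute": per_second * 60,
        "per_hour": per_second * 3600,
    }


ELAPSED_SECONDS = 1_187_328  # assumption: total seconds of parsing so far

total = rates(253_545_925, ELAPSED_SECONDS)   # all relationships parsed
unique = rates(66_024_755, ELAPSED_SECONDS)   # unique relationships only

for label, result in (("relationships", total), ("unique rel's", unique)):
    for unit, value in result.items():
        print(f"{label} {unit}: {value:,.2f}")
```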
Here are some sobering figures regarding the parsing of Wikipedia. Such a large corpus of text and so much more to go!
| total seconds of parsing: | 1,187,328 |
| total files parsed: | 31 |
| total files: | 564 |
| files remaining: | 533 |
| percent completed: | 5.50% |
| elapsed minutes: | 19788.8000 |
| files per minute: | 0.0016 |
| minutes per file: | 638.3484 |
| minutes till complete: | 340239.6903 |
| elapsed hours: | 329.8133 |
| files per hour: | 0.0940 |
| hours per file: | 10.6391 |
| hours till complete: | 5670.6615 |
| elapsed days: | 13.7422 |
| files per day: | 2.2558 |
| days per file: | 0.4433 |
| days till complete: | 236.2776 |
| weeks till complete: | 33.7539 |
| months till complete (30-day months): | 7.8759 |
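The projection in this table is a straight-line extrapolation, which can be sketched as follows. It assumes every remaining file takes the same average time as the 31 files parsed so far and that a month is 30 days; the function name `projection` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def projection(elapsed_seconds, files_done, files_total):
    """Linear estimate: each remaining file takes the average time seen so far."""
    files_remaining = files_total - files_done
    seconds_per_file = elapsed_seconds / files_done
    days_remaining = seconds_per_file * files_remaining / SECONDS_PER_DAY
    return {
        "files_remaining": files_remaining,
        "percent_completed": 100 * files_done / files_total,
        "hours_per_file": seconds_per_file / 3600,
        "days_till_complete": days_remaining,
        "weeks_till_complete": days_remaining / 7,
        "months_till_complete": days_remaining / 30,  # assumes 30-day months
    }


estimate = projection(1_187_328, 31, 564)
for name, value in estimate.items():
    print(f"{name}: {value:,.4f}")
```

Because the estimate is linear, it will be off if later files are larger or if database write speed degrades as the data grows.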
Over the past few years I’ve tried to find out whether “coherence” is an appropriate word to use when describing part of TransluSense’s objective. Wikipedia does have an article on linguistic coherence, and it seems to back my perception of what coherence means with respect to language and word usage. As one of TransluSense’s objectives is to build a systematic algorithm for “gauging coherence”, the definition of “coherence” must be carefully presented. The Wikipedia article (I think) does it some justice. Simply put: “Coherence in linguistics is what makes a text semantically meaningful.”
For the layperson, I’ve considered a text to be “coherent” if a native speaker of a language would agree that the text “made sense”. Once the native speaker is unable to understand the meaning of a sentence, it is no longer “coherent.” Furthermore, my belief is that there are varying levels of “coherence”: one text can “barely make sense” while another is much clearer. This is clearly a topic that needs a lot of research, both to arrive at a specific definition of coherence and to determine how gaugeable it in fact is.
For those of you who are visiting this site for the first time, here is a definition of TransluSense:
TransluSense is a platform of software and information that provides up-to-date information on language usage to third-party applications. The concept is that these third-party applications require more than the traditional built-in grammar tools. They would benefit from obtaining data on the latest language usage patterns that supersede grammar rules and focus on colloquial language usage and actual word order patterns according to a specific context. The applications of this type of tool range from forensic linguistics to improving grammar tools in any word processor.
The software platform has been built, but it is currently processing data and building its “knowledge base” by reading large corpora. When the parsing is complete, extensive testing will be performed to determine how well the currently established algorithms work.
Welcome to the TransluSense blog. I’ve created this blog to publicly share status updates on my R&D regarding TransluSense and related material. I’m a working professional, so I devote part of my hobby time to working on TransluSense. Hopefully I can get a significant amount of blogging done so that I can log what goes through my head.
