Feasibility of a personal knowledge management system based on statistical analysis and data mining

We start from the premise that natural language is too complex for simple computer programs to mine effectively. If we could find a way to simplify the language, it would be easier to design computer-assisted knowledge management programs. The proposed approach is to reduce natural language to sequences of predefined keywords. The translation from natural language to a list of keywords is done by systematically applying a set of regular expressions: when a pattern matches, it yields a keyword.
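As a rough illustration of this translation step, here is a minimal Python sketch. The rule table, the keyword names (INCREASE, CAUSE, PRICE, DEMAND) and the function `propose_keywords` are all made up for the example; a real rule set would be built up by the user over time.

```python
import re

# Hypothetical rules: each pairs a regular expression with the keyword it yields.
RULES = [
    (re.compile(r"\b(?:increas(?:e|es|ed|ing)|grow(?:s|th|ing)?)\b", re.I), "INCREASE"),
    (re.compile(r"\b(?:causes?|leads? to|results? in)\b", re.I), "CAUSE"),
    (re.compile(r"\b(?:prices?|costs?)\b", re.I), "PRICE"),
    (re.compile(r"\bdemand\b", re.I), "DEMAND"),
]

def propose_keywords(text):
    """Return the keywords whose pattern matches, ordered by first match position."""
    hits = []
    for pattern, keyword in RULES:
        match = pattern.search(text)
        if match:
            hits.append((match.start(), keyword))
    return [keyword for _, keyword in sorted(hits)]

print(propose_keywords("Growing demand leads to higher prices."))
# ['INCREASE', 'DEMAND', 'CAUSE', 'PRICE']
```

The output is only a list of proposals: the user still decides which keywords to keep and how to order them.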

The keywords obtained are then put in sequence to re-express the main ideas of the text using only the predefined keywords. This achieves the translation from complex natural language to a keyword-based language. Because user intervention is required, the system should be seen as a personal knowledge management system: it encodes the knowledge you acquire through reading in a way that lets you mine it easily later.
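To give an idea of what mining such encoded knowledge could look like, the sketch below stores each note as a keyword sequence and searches for a given run of keywords. The note labels, the keyword sequences and the helper names are hypothetical, and a plain dictionary is only one possible storage choice.

```python
# Hypothetical encoded notes: source label -> keyword sequence chosen by the user.
notes = {
    "article-1": ["DEMAND", "INCREASE", "CAUSE", "PRICE", "INCREASE"],
    "article-2": ["SUPPLY", "INCREASE", "CAUSE", "PRICE", "DECREASE"],
    "article-3": ["PRICE", "INCREASE", "CAUSE", "DEMAND", "DECREASE"],
}

def contains_sequence(sequence, query):
    """True if `query` appears as a contiguous run inside `sequence`."""
    n = len(query)
    return any(sequence[i:i + n] == query for i in range(len(sequence) - n + 1))

def find_notes(notes, query):
    """Return the labels of all notes whose keyword sequence contains `query`."""
    return sorted(label for label, seq in notes.items() if contains_sequence(seq, query))

print(find_notes(notes, ["CAUSE", "PRICE"]))
# ['article-1', 'article-2']
```

Because the vocabulary is closed, an exact list comparison is enough here; no linguistic processing is needed at query time.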

Question: How do we define an optimal set of keywords, patterns, and translation rules?

This is the proposed approach:

1)      When a new text arrives, apply all the existing patterns and rules to it and collect the proposed keywords. Keep the ones that are in line with the ideas in the text and reject those you deem irrelevant.

2)      Try to express the important ideas in the text using the keywords obtained. If some important ideas cannot be expressed, additional keywords are needed. Define new keywords, with their corresponding patterns and rules, and use them to express those ideas.

3)      Systematically collect usage statistics for the keywords and patterns: which keywords were proposed, accepted, or rejected; which keywords tend to appear together; which are mutually exclusive; which are used frequently; which are used rarely; and so on.

4)      Analyzing the statistics will reveal sub-optimal situations. For example, if two keywords have the same meaning, they will tend to occur together all the time, and you may consider merging them into one keyword. Likewise, if a keyword is rarely used, you may consider removing or replacing it. More complex statistical analysis or data mining techniques will be defined for better optimization.

5)      The process will progressively lead to an optimal set of keywords and patterns. From that point on, the knowledge encoded in this keyword language can be mined with simple programs.
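Steps 3 and 4 can be sketched with simple counters. The Python example below is only one possible reading of the approach: it assumes each processed text is recorded as a pair of keyword sets (proposed, accepted), and it uses the Jaccard overlap of accepted uses as a stand-in for "tend to occur together". The function names, the 0.8 threshold and the sample data are assumptions made for the illustration.

```python
from collections import Counter
from itertools import combinations

def collect_stats(sessions):
    """sessions: list of (proposed, accepted) keyword sets, one pair per text."""
    proposed, accepted, pairs = Counter(), Counter(), Counter()
    for prop, acc in sessions:
        proposed.update(prop)                       # how often each keyword was proposed
        accepted.update(acc)                        # how often it was kept by the user
        pairs.update(combinations(sorted(acc), 2))  # co-occurrence of accepted keywords
    return proposed, accepted, pairs

def merge_candidates(accepted, pairs, threshold=0.8):
    """Keyword pairs whose Jaccard overlap of accepted uses reaches the threshold."""
    candidates = []
    for (a, b), both in pairs.items():
        jaccard = both / (accepted[a] + accepted[b] - both)
        if jaccard >= threshold:
            candidates.append((a, b))
    return sorted(candidates)

def rarely_used(proposed, accepted, max_uses=0):
    """Keywords that were proposed but accepted at most `max_uses` times."""
    return sorted(k for k in proposed if accepted[k] <= max_uses)

# Hypothetical history of three processed texts.
sessions = [
    ({"PRICE", "COST", "INCREASE"}, {"PRICE", "COST", "INCREASE"}),
    ({"PRICE", "COST", "DEMAND"}, {"PRICE", "COST"}),
    ({"PRICE", "COST", "INCREASE"}, {"PRICE", "COST"}),
]
proposed, accepted, pairs = collect_stats(sessions)
print(merge_candidates(accepted, pairs))  # [('COST', 'PRICE')]
print(rarely_used(proposed, accepted))    # ['DEMAND']
```

In this toy history, COST and PRICE are always accepted together, which flags them as a possible merge, while DEMAND is proposed but never accepted, which flags it for removal or for a better pattern. The final decision stays with the user.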