Originally posted by walterkatz
reply to post by badmedia
Badmedia, thank you for the quick reply.
That was exactly what I was looking for. Thanks for clarifying it for me.
I have more questions now.
1) This software stores atomic data (actual "cat", "butt", etc.) in tables, right? They are not somehow embedded in vectors?
How people do it will vary. I personally keep the words and their values in a table, which makes lookup quicker and easier than looping through an
entire array/vector for the appropriate key. So, rather than scanning the array/vector for a match on a word, I just search the table and
get the value for it.
If memory permits, I might load the base keys into a hash and then use the keys to get the values (same thing as the above in effect, just faster
because it is in memory rather than a table search).
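To make the hash idea concrete, here is a minimal sketch in Python. The word list and the function names are made up for illustration; in practice the keys would be loaded from the word table.

```python
# Hypothetical vocabulary; in practice this would be loaded from the word table.
vocabulary = ["the", "dog", "killed", "cat"]

# Build the hash once: word -> position in the vector.
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def lookup_by_scan(word):
    """Slow path: walk the whole list until the word is found."""
    for i, candidate in enumerate(vocabulary):
        if candidate == word:
            return i
    return None


def lookup_by_hash(word):
    """Fast path: a single dictionary lookup, no loop over the vocabulary."""
    return word_to_index.get(word)
```

Both functions return the same position for a known word and None for an unknown one. The difference is that the scan gets slower as the vocabulary grows, while the hash lookup stays roughly constant.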
2) If I have 20 billion atomic values and one such vector ("The dog killed the cat", for example), this vector will have 19,999,999,997 zeroes in
it, each for an absent atomic entry. Each such vector has to be parsed entirely to determine whether or not it has all other atomic values. Such
parses will take several hours each. Is this correct? How is this addressed performance-wise?
Well, the English language only has around 1 million words in total counting scientific words, and around 600,000 without them. So that is quite a bit short of
20 billion, even if you include many of the other languages.
But in such a case you would likely cache existing results to speed things up, so that rather than doing the intensive action over and over,
we can just read the cache.
As well, it is pretty simple to create a vector/array that has a bunch of zeros in it. All you do is create an array with the number of entries you need.
Empty (null) slots count as 0, and then you just set the individual values that do exist in their specific slots. I honestly just use a little mod
that someone else created to do this, rather than doing it myself. No point in reinventing the wheel IMO.
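In Python, for example, that could look something like the sketch below. The size and the slot numbers are made up for illustration, and this is plain Python rather than the mod mentioned above.

```python
# Hypothetical vocabulary size; a real one would be far larger.
vocabulary_size = 10

# Create the vector with every slot set to 0.
vector = [0] * vocabulary_size

# Fill in only the slots for words that actually occur.
vector[2] = 1  # e.g. the word in slot 2 occurs once
vector[7] = 3  # e.g. the word in slot 7 occurs three times
```

Everything that was never set stays at 0, so only the words that occur need any work.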
3) Are both kinds of vectors stored: a base vector containing all values, and multiple individual vectors? If I want to find a vector that is close to
another vector, do I have to scan a 20,000,000,000-row atomic table, then find the values and scan 20 billion vectors? How are this astronomical inefficiency
and these space requirements resolved?
Well, when you get databases that are that big, you are going to have to do things behind the scenes to get better efficiency. I've honestly
never had any kind of database even near that size, and when my databases get too big, they are usually some kind of logging, so I just
trim them.
With a number that big, I might just keep a key stored which can be converted into a vector on the fly to save space. It's really
hard to say because I've never been faced with the situation myself.
So, rather than have 20 billion zeros, I might just say place=value, and only store the values that exist.
So if Dog is stored in slot 2 of the vector and has 1 occurrence, I would save just that part, 2=1, etc.
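Here is a minimal sketch of the place=value idea in Python, using a dictionary that holds only the non-zero slots. The function names and the slot numbers (apart from Dog in slot 2) are assumptions for illustration, not anything from a real library.

```python
def to_sparse(dense):
    """Keep only the non-zero slots as {place: value}."""
    return {place: value for place, value in enumerate(dense) if value != 0}


def to_dense(sparse, size):
    """Rebuild the full vector on the fly when it is really needed."""
    dense = [0] * size
    for place, value in sparse.items():
        dense[place] = value
    return dense


def dot(a, b):
    """Compare two sparse vectors by touching only the stored slots."""
    if len(a) > len(b):
        a, b = b, a  # loop over the smaller one
    return sum(value * b.get(place, 0) for place, value in a.items())


# "The dog killed the cat" with made-up slots: the=0, dog=2, killed=5, cat=9.
sentence = {0: 2, 2: 1, 5: 1, 9: 1}
query = {2: 1, 9: 1}  # "dog cat"
```

The stored size depends on how many distinct words the sentence has, not on the size of the vocabulary, and the comparison never has to walk over the zeros.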
But what you have to keep in mind is what you are replacing. That is a huge number of entries, but look at what it is taking the
place of. If you weren't using those numbers, you would be searching the entire database for matches on words. In the end, it should
prove more efficient.
Wish I could give you better answers, but I really don't know since I've never been faced with such problems. I could be way off and there may be
much better ways of doing it. In programming, there are always multiple solutions to a problem.