How places work
The way places work in the counter can be confusing.
This note explains the basics of how they work.
NOTE: This is valid as of Wed Feb 21 11:19:37 2001. Stuff changes without notice!
The user input
The user who registers is given three fields to fill in: Country, State and City.
For most city dwellers, the first and last will be obvious.
The middle one is obvious in some places (like the US), but far from obvious in other places.
The places database
The counter has a database of all places in the world. (hah!)
For each place, it stores:
- A unique number
- A name (used in the matching routine below)
- A longname (used when the place is mentioned)
- A "within" field, showing which other place contains this place.
  Places form a strict hierarchy in the counter, unlike the real world.
- Various info like hostcounts, population, usercounts and so on, which does not
  concern the placement of users.
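As an illustration of the record described above, here is a minimal Python sketch. It is not the counter's actual schema: the class name, the "kind" field, and the choice to store the containing place by name in "within" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Place:
    number: int             # unique number
    name: str               # key used by the matching routine, e.g. "NO::Oslo"
    longname: str           # name shown when the place is mentioned
    within: Optional[str]   # name of the containing place (assumption); None at the root
    kind: str = "place"     # "place" or "alias" (hypothetical field name)


def chain_to_root(places: Dict[str, Place], name: str) -> List[str]:
    """Follow the "within" links upward and return the longnames passed."""
    chain = []
    seen = set()
    current: Optional[str] = name
    while current is not None:
        if current in seen:
            # A strict hierarchy must not contain loops.
            raise ValueError("cycle in 'within' links at " + current)
        seen.add(current)
        place = places[current]
        chain.append(place.longname)
        current = place.within
    return chain


places = {
    "All": Place(1, "All", "All", None),
    "NO": Place(2, "NO", "Norway", "All"),
    "NO::Oslo": Place(3, "NO::Oslo", "Oslo", "NO"),
}
print(chain_to_root(places, "NO::Oslo"))  # ['Oslo', 'Norway', 'All']
```

Because every place has exactly one "within" value, walking upward from any place always ends at the root.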
The matching process
When someone enters a country, city and state, the matching routine forms two names:
- Country:State:City
- Country::City
These are formed by taking the user input, imagining that it is in the ISO 8859-1 charset,
folding it to a reasonable approximation in ASCII, and regularizing spaces and punctuation.
Both are looked up through the alias process.
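The name-forming step can be sketched in Python as follows. This is only an approximation: the function names are made up for the example, and the real folding and punctuation rules are not described in this note, so Unicode decomposition stands in for whatever table the counter uses.

```python
import re
import unicodedata


def fold(raw: bytes) -> str:
    """Fold one user-supplied field to plain ASCII."""
    # Treat the bytes as ISO 8859-1, whatever charset was really used.
    text = raw.decode("latin-1")
    # Split accented letters into base letter plus combining mark, then
    # drop everything that is not ASCII (assumption: the real fold differs).
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Regularize punctuation and runs of whitespace to single spaces.
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", ascii_only)
    return " ".join(cleaned.split())


def candidate_names(country: bytes, state: bytes, city: bytes) -> list:
    """Return the two names to look up: with and without the state."""
    c, s, t = fold(country), fold(state), fold(city)
    return [c + ":" + s + ":" + t, c + "::" + t]


print(candidate_names(b"BR", b"S\xe3o  Paulo", b"S\xe3o Paulo."))
# ['BR:Sao Paulo:Sao Paulo', 'BR::Sao Paulo']
```

When the state field is left empty, both candidates come out identical, so the second lookup costs nothing extra in terms of results.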
The alias process does the following steps:
1. Look up the name.
2. If not found, remove the last colon (:) and everything following, and try again from step 1.
3. If found, and type is alias, use the "within" field of the record to find the
   aliased record. Do step 3 again (and give warning) if it is still an alias.
4. If step 2 removed something, add it back onto the found alias, and start over again from
   step 1.
This allows aliases for countries and states to find cities, for instance.
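The alias process can be sketched in Python like this. It is a reading of the steps above, not a port of the Perl subroutine: the dictionary layout, the function name, the returned source label, and the round limit that guards against alias loops are all assumptions.

```python
import warnings


def resolve(places, name, max_rounds=10):
    """Return (found name, "place" or "alias"), or None if nothing matches."""
    source = "place"
    for _ in range(max_rounds):
        current, removed = name, ""
        # Steps 1 and 2: exact lookup, stripping ":component" until a hit.
        while current not in places:
            head, colon, tail = current.rpartition(":")
            if not colon:
                return None  # nothing left to strip
            removed = colon + tail + removed
            current = head
        record = places[current]
        if record["type"] != "alias":
            return current, source
        # Step 3: follow the alias, warning if it points at another alias.
        source = "alias"
        target = record["within"]
        hops = 0
        while target in places and places[target]["type"] == "alias":
            hops += 1
            if hops > max_rounds:
                return None  # alias loop (guard added for this sketch)
            warnings.warn("alias chain through " + target)
            target = places[target]["within"]
        # Step 4: put back what step 2 removed and start over.
        name = target + removed
    return None


places = {
    "NO": {"type": "place", "within": "All"},
    "NO::Oslo": {"type": "place", "within": "NO"},
    "Norway": {"type": "alias", "within": "NO"},
}
print(resolve(places, "Norway::Oslo"))      # ('NO::Oslo', 'alias')
print(resolve(places, "NO:Akershus:Oslo"))  # ('NO', 'place')
```

In the first call the country alias "Norway" is found only after "::Oslo" has been stripped, and adding it back yields a city record that exists. In the second call nothing below the country matches, so the user ends up placed at the country.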
If both the names formed in the matching routine return something, the name with the most
components (largest number of colons) wins.
If neither returns anything (for instance when the country is misspelled in a unique way),
the email address of the user is checked to see if one can infer a country from that.
The result of the lookup is stored as a name (not a number!) in the "placeid" field of the
person record, and the type of lookup (place, alias or email) is stored in the "placesource"
field of the person record.
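A small Python sketch of the selection rule and the email fallback follows. The function name and arguments are hypothetical, and this note does not say how a country is inferred from an email address; matching the top-level domain against known country names is an assumption, as is letting the three-field name win a tie.

```python
def choose_place(full_result, short_result, email, known_countries):
    """Pick the value for "placeid" and "placesource".

    full_result and short_result are (name, source) pairs from the two
    lookups, or None when a lookup found nothing.
    """
    found = [r for r in (full_result, short_result) if r is not None]
    if found:
        # The name with the most colons wins; max() keeps the first on a tie.
        name, source = max(found, key=lambda r: r[0].count(":"))
        return {"placeid": name, "placesource": source}
    # Fallback (assumption): use the top-level domain as a country name.
    tld = email.rpartition(".")[2].upper()
    if tld in known_countries:
        return {"placeid": tld, "placesource": "email"}
    return None


print(choose_place(("NO", "place"), ("NO::Oslo", "alias"), "a@example.no", {"NO"}))
# {'placeid': 'NO::Oslo', 'placesource': 'alias'}
print(choose_place(None, None, "a@example.no", {"NO"}))
# {'placeid': 'NO', 'placesource': 'email'}
```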
For those interested in source, this is in the "lib/Validate/Places.pm" file, in the "getbyname"
subroutine.
(An aside: one reason the subroutine is so arcane is that it does no searching at all;
every lookup is exact. At the time, this was thought to be a speed advantage...)
The reporting hierarchy
When preparing the list of places and persons within them, the counter strictly follows the
links given by the "Within" fields, starting at the root ("All").
For each place, the places within it are listed in alphabetical order by longname (not name!),
followed by a list of users at this exact place. This will change!
This means that the naming of places does not matter in the published user lists.
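The walk through the hierarchy can be sketched in Python as below. The data layout, the function name and the indentation of the output are assumptions made for the example; only the order (sub-places sorted by longname, then the users at the place itself) comes from the description above. Alias records are assumed to be left out of the input.

```python
def report(places, users, root="All", depth=0):
    """Return the report lines for one place and everything within it."""
    lines = ["  " * depth + places[root]["longname"]]
    # Sub-places are found purely through their "within" field.
    children = [n for n, p in places.items() if p["within"] == root]
    # Sort by longname, not by the name used for matching.
    for child in sorted(children, key=lambda n: places[n]["longname"]):
        lines.extend(report(places, users, child, depth + 1))
    # Users registered at exactly this place come after the sub-places.
    for user in users.get(root, []):
        lines.append("  " * (depth + 1) + "- " + user)
    return lines


places = {
    "All": {"longname": "All", "within": None},
    "CH": {"longname": "Switzerland", "within": "All"},
    "DE": {"longname": "Germany", "within": "All"},
    "DE::Berlin": {"longname": "Berlin", "within": "DE"},
}
users = {"DE::Berlin": ["alice"], "DE": ["bob"]}
print("\n".join(report(places, users)))
```

Sorted by name, "CH" would come before "DE"; sorted by longname, Germany comes before Switzerland, which is the order the sketch produces.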
Points to ponder
These are things to think about. I do not know if they are answers.
When cities are unique within a country, it may be best to name them as "country::city".
This will mean that people are placed in the same city no matter how they spell their state.
If this is the case for all cities in a country, adding states is only a way to control
the listing of places. This can be nice.
It is perfectly possible to have multiple levels of state. However, the matching algorithm
has only a single state field. It does not make sense to have "country:state:state:city"
as the name of a place; the matching algorithm will never construct a name that fits this.
When people live outside a city, they frequently put some small subdivision ("county") as
either their state or their city, essentially at random. It is hard to know what to do here.
The current sad state of charsets on the Web means that some people will enter data that is
in some other charset than ISO 8859-1, like Shift-JIS, KOI-8 or 8859-2. This data will look
quite ugly when "converted" to ASCII under the assumption that it is ISO 8859-1, leading to
aliases like "PL::Lod4" for the Polish city of Lodz (the byte that represents z with acute
in 8859-2 is the fraction one quarter in 8859-1).
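The effect is easy to reproduce with Python's standard codecs. The helper name below is made up for the example, and the counter's own folding to ASCII is not reproduced; the sketch only shows what the bytes look like once they are misread as ISO 8859-1.

```python
def misread_as_latin1(text: str, actual_charset: str) -> str:
    """Encode text in its real charset, then decode it as ISO 8859-1."""
    return text.encode(actual_charset).decode("latin-1")


# The Polish spelling of Lodz: L with stroke, o with acute, d, z with acute.
lodz = "\u0141\u00f3d\u017a"
garbled = misread_as_latin1(lodz, "iso-8859-2")

# Pound sign, o with acute, d, fraction one quarter.
print(garbled == "\u00a3\u00f3d\u00bc")  # True
print([hex(b) for b in lodz.encode("iso-8859-2")])
# ['0xa3', '0xf3', '0x64', '0xbc']
```

Only the o with acute survives, because it sits at the same position in both charsets; the other two non-ASCII letters turn into unrelated symbols before any folding to ASCII takes place.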
Ultimately, the counter will switch to UTF-8 for its internal representation, and try to get
proper charset marking on data coming from users. This may (or may not) help.