MantisBT - Pending Requests
View Issue Details
0000598Pending RequestsCore Infrastructurepublic2010-10-09 10:102016-12-28 00:17
Reportergajdusek 
Assigned Tocaseydk 
PrioritynormalSeveritymajorReproducibilityalways
StatusfeedbackResolutionreopened 
PlatformOSOS Version
Product Version 
Target VersionFixed in Version 
Summary0000598: adodb-mysql does not set default character set for db connection, this can lead in unusable database content
DescriptionHi,
adodb-mysql.inc.php functions _connect() and _pconnect() do not set default character set for database connections nor does web2project code elsewhere.

This results in wrong encoding of non-ascii strings stored in database depending on host configuration - to be specific if system wide mysql default_character_set differs from utf8, i.e. debian's default value is 'latin1'.

Wrong character encoding has at least following critical consequences:

* Sorting of strings with non-ascii charactes does probably not work correctly.
* Strings will be totaly messed up if mysql's default_character_set will be changed or database will be migrated to host with different configuration.

Adding mysql_set_charset('utf8') call to both mentioned functions or setting this later from web2Project code (if adodb modification is not an option) will solve this. This solution will have a side effect of messing up strings already stored in wrong encoding.
TagsNo tags attached.
has duplicate 0000731closed caseydk v2.3 Release (Closed) Support of UTF-8 in Database 
related to 0000587feedback pedroa Pending Requests mysql can't read chinese 
related to 0001227closed caseydk v3.0 Release After the update Incorrect code page of the base 
related to 0001671new  v3.5 Release (Development) utf8 not handled correctly on 3.2->3.3 and mysql -> 5.5 
child of 0000632closed pedroa v2.3 Release (Closed) Print function produces wrong charaters 
Attached Filespatch dbcharset_support.patch (7,252) 1969-12-31 16:00
https://bugs.web2project.net/file_download.php?file_id=136&type=bug

Notes
(0001314)
gajdusek   
2010-10-19 18:19   
(Last edited: 2010-10-19 18:44)
Hi pedroa, I have made a patch for my usage.

Changes summary:
* Added support for $w2Pconfig['dbcharset'] option in config.php.
* If this option is not empty web2project uses specified charset for database connection, otherwise old behavior is preserved (no connection charset setting)
* $w2Pconfig['dbcharset'] = 'utf8' added to config-dist.php so fresh installations will use correct charset.

Patch applies to 25e55c2ac8b7c9c3279ea93c1ab2818f7d8afc63 commit from http://github.com/caseysoftware/web2project.git
You can get the patch from http://github.com/gajdusek/web2project/commit/0848eb0acf2f5f354a08ea655b396bf90b2fc28a too.
If you decide to use this please check it well because I am really web2project beginner.

Scripts (bash and php versions) that generate SQL statements for database strings conversion to utf8 charset are located at http://github.com/gajdusek/web2Project-tools/tree/master/db_convert_to_utf8/

Intented usage:
* update web2project code
* convert database and imediately add $w2Pconfig['dbcharset'] = 'utf8' to config.

PHP version of the conversion script uses database credentials form config.php and scans tables in database for possibly affected columns. If system-wide connection charset is already utf8 script prints advice to change $w2Pconfig['dbcharset'] = 'utf8' and exits. Otherwise SQL statements for conversion are printed with additional information (and warnings) how finish the migration process.

petr

(0001318)
pedroa   
2010-10-26 06:28   
Hi Petr and thanks a lot for your report and patches.

Now lets resume what we want and where we are at, and what more needs to be done to get there.

So... we want utf-8 to be generally applicable to web2Project.

Now, what do we have at the moment...
1) The locales are encoded utf-8
2) The forms for the interfaces are encoded utf-8
3) The HTML headers are set to be encoded utf-8
4) There is loads of code inside we2bProject that specifically use php mb_ functions to not break multibyte characters and therefore utf-8 compatible.
5) Our sql scripts, specifically install/sql/mysql/006_special_characters.sql converts the tables to uft-8 encoding

Having said this, is there a need to keep enforcing it, or having the tables defined as "SET utf8 COLLATE utf8_general_ci" is enough for the mysql engine itself behave?

What I means is... I don't really think this is an issue considering that the tables themselves are already set to be utf-8 compatible from within mysql itself.

Thing is I don't want to be introducing mysql specific code inside web2Project even more than we already have, because when we are to add Oracle or SQL server or Firebird or PostgreSQL support in web2Project we are going to face more headaches.

What do you think?

Cheers,

Pedro A.


 
(0001319)
pedroa   
2010-10-26 06:32   
Oh by the way,

Have you seen http://bugs.web2project.net/view.php?id=587

What do you think about it?
I can live with adding a line of code (for mysql only) :)

Cheers,

Pedro A.
(0001328)
gajdusek   
2010-10-26 17:17   
(Last edited: 2010-10-26 18:47)
Hi Pedro,

> Having said this, is there a need to keep enforcing it, or having the tables
> defined as "SET utf8 COLLATE utf8_general_ci" is enough for the mysql engine
> itself behave?

We need to set database connection charset for several reasons. I will try to explain it later in my post.

> 5) Our sql scripts, specifically install/sql/mysql/006_special_characters.sql
> converts the tables to utf-8 encoding

I will skip consequences of this update script and keep focus on current state.
We have all columns marked as utf8 in database already. But we don't know db connection charset because we don't set it. Instead, system-wide setting is used.

My system-wide setting for db connection is latin1. Replace 'latin1' with "user's system-wide database connection charset setting" further in text. Imagine situation that web2project inserts or updates some db record:

1) Web2project sends utf8 string to db server
2) DB server saw it as latin1 (connection charset)
3) Character set of the column is utf8, so db server converts that string (that is in utf8 already) from latin1 to utf8. There are no problems visible in web2project because of reverse process when querying db. But there are 'hidden' problems:
  a) Strings are unreadable in database.
  b) Sorting probably does not work correctly.
  c) If system-wide connection charset will be changed (i.e. by system
     administrator or by server update - btw default MSSQL connection charset was
     changed from utf8 to latin1 in debian unstable this week - or setting
     connection charset in w2p directly) all goes wrong. See next step.
4) If we set db connection charset to utf8 now, we will receive damaged strings from db server.

So before connection charset will be changed (either by setting it by web2project or by system-wide setting change) user must convert database content:

1) convert affected column from utf8 back to latin1 charset
i.e.
ALTER TABLE `users` CHANGE `user_username` `user_username` varchar(255) CHARACTER SET `latin1`;

2) change this column charset to utf8 but make sure no conversion will occur.
i.e.
ALTER TABLE `users` CHANGE `user_username` `user_username` varbinary(255);
ALTER TABLE `users` CHANGE `user_username` `user_username` varchar(255) CHARACTER SET `utf8`;
The reason this works is that there is no conversion when you convert to or from BINARY columns. [1]

3) Repeat 1) and 2) for every affected column in database.
4) Rebuild indexes of converted tables.
5) Set connection charset to utf-8.

This five conversion steps must be done at once and be atomic (web2project should be down for maintenance).

So this is why it is not sufficient to simply change database columns charsets (as it is now) nor forcibly set charset of database connection. Users should convert their databases before and only then (but immediately) change connection charset. I design my patch in this manner. It does nothing until users leave $w2Pconfig['dbcharset'] unset. Scripts generating SQL statements for conversion are available at http://github.com/gajdusek/web2Project-tools/tree/master/db_convert_to_utf8/. Conversion script can not be easily and safely integrated into the web2project update code. I tried it, but it involves more code changes and is probably not reliable (connection timeout).

And now regarding http://bugs.web2project.net/view.php?id=587, I have not seen it, thanks, i will take a look...

Ok, that sets connection charset for mysql via mysql_query("SET NAMES 'utf8'"); in includes/db_adodb.php : db_connect() function.

I see these problems:

1) SET NAMES sql query is discouraged in PHP documentation [2]
2) mysql specific thing
3) We cannot force db connection charset to utf-8, because users must first convert (or check) their tables.
4) There are more functions connecting to database directly: classes/w2p/Core/UpgradeManager.class.php and install/manager.class.php

Please take a look at patch I sent on 2010-10-19 or http://github.com/gajdusek/web2project/commit/0848eb0acf2f5f354a08ea655b396bf90b2fc28a. It solves all these points:

1) Using set_mysql_charset() function as the preferred way to set connection charset and adding compatible function that uses SET NAMES to includes/backcompat_functions.php, because set_mysql_charset() is implemented only in PHP >= 5.2.3 [2].
This all will probably work only in MySQL >= 5.0.7 or, if you're using MySQL 4, then >= 4.1.13. [3], is this an issue?

2) ADODB has implemented charSet property for setting client charset. It supports i.e. Interbase/Firebird, Oracle, sybase and mysqli (according to documentation) but not mysql. My patch set this for all databases and calls set_mysql_charset extra for mysql database. So my patch is not mysql specific, but i did not test its functionality for other databases then mysql.

3) This is why I added $w2Pconfig['dbcharset'] and set connection charset only if user set this variable to something (i.e. utf8). Only for new installations connection charset will be used by default ($w2Pconfig['dbcharset'] = 'utf8' in includes/config-dist.php)

4) set database connection charset code added to:
includes/db_adodb.php:db_connect()
classes/w2p/Core/UpgradeManager.class.php
install/manager.class.php

I did not found any other code connecting to DB directly.

[1] http://codex.wordpress.org/Converting_Database_Character_Sets
[2] http://cz.php.net/manual/en/function.mysql-set-charset.php
[3] http://cz.php.net/manual/en/function.mysql-set-charset.php#80997

PS: sorry for many updates of this post, just correcting errors in my English

Cheers,
petr

(0001329)
gajdusek   
2010-10-26 18:24   
(Last edited: 2010-10-26 18:59)
Attached patch does not (hopefully) affect anything until you leave $w2Pconfig['dbcharset'] unset or do not fresh web2project install.

If you set $w2Pconfig['dbcharset'] to utf8 (or other charset) some requirements must fulfilled:

1) Database columns charset (utf8) and encoding of strings stored in them must match (satisfied on fresh installs or if system-wide database connection charset was already set to utf-8, otherwise database conversion must precede).

2) Code must be always succesfull in setting new connection charset before retrieving or updating data in database (or system-wide db connection charset must match $w2Pconfig['dbcharset'] i.e. utf8).

Ad 2)

a) This cannot be satisfied with mysql server < 5.0.7, so minimal requirements must be adjusted. There is mysql 5.0 requirement in the wiki [1], this must be raised to 5.0.7. Older 5.0.x versions do not support 'SET NAMES' query. And/or make advice to ensure that site-wide mysql connection charset is set to utf8, if applicable.

b) Make sure connection charset setting works well for other databases than mysql. i.e. check support in ADODB or insert custom code as for mysql. I have written no conversion scripts for other databases neither (I use only mysql). This does not seem to be problem now because web2project code appears totally unaware about other databases. There should be no problem with fresh installs using other database servers if ADODB can set connection charset via charSet property.

Converting of tables with existing data should be described in some article on the wiki. I just want to repeat: until user will not set $w2Pconfig['dbcharset'] variable or make fresh install nothing will happen. Finally, I can write some article about converting database content, only if somebody could correct my horrible English.

That reminds me: Sorry for my English!

Cheers,
petr

[1] http://wiki.web2project.net/index.php?title=Installation#Database_Server

(0001330)
pedroa   
2010-10-27 08:18   
Hi Petr,

First of all thanks a lot for all this research, because it is very well done.

You touched some sensitive points there and are:
1) The install/upgrade scripts are not prefix compatible, though the code itself is.
2) We need to handle/fix the connection encoding, because as you mentioned the DB engine we support right now is dumb and needs to be educated.
3) Should we worry about systems where their supervisors do not know (or don't care) if their encodings are correctly set (and match) system wide.

About the 1st, well it took some effort to make this thing prefix compatible and we need to sort something to make it that way. Could be string replacement on the install/upgrade scripts or whatever, and by that I mean another way of doing it.

Now nr 2, I believe we must find a way to set the encoding all right, but I prefer a one line of code approach, depending on DB engine, though for now we only have to worry about mysql as you correctly said.
I see that some Dbs react to $db->charSet before the Connect is done on ADODB, I see that mysql reacts to $db->EXECUTE("set names 'utf8'"); after the Connect.
And yes we go and simply change the wiki to the mysql version necessarily.
I don't see why we should add a w2PConfig for this, because utf-8 is the defacto encoding and that's it, period.

As to 3, too bad if they don't care, if they don't care, why should we?
But I appreciate your effort into creating a conversion procedure, but why should we need a conversion procedure when all you needed to do was to make sure everything matched?

It would bee very nice to have this conversation in the wiki, under its own page, to document this whole utf-8 thing.

As to the code itself, all that I see that would change would be to add that utf-8 setting after the connection.

Truth is, right now, if people are connecting against a badly set server, the data retrieved from their servers is all screwed up, right?
So changing it right away would be better than waiting for converting, right?

Like I said we need this "discussion" to have more visibility, also because it carries solutions for already affected systems.

Again thank you very much for the research, code and work, I'll get back to you with what we will implement to force the utf-8 connection.

Cheers,

Pedro A.
(0001331)
gajdusek   
2010-10-27 20:09   
Hi,

> 1) The install/upgrade scripts are not prefix compatible, though the code itself is.

Sounds nice that the code is ready.

>About the 1st, well it took some effort to make this thing prefix compatible
>and we need to sort something to make it that way. Could be string replacement
>on the install/upgrade scripts or whatever, and by that I mean another way of doing it.

It will be nice to make some more abstract mechanism for replacing strings in SQL updates. Maybe even make some layer for SQL scripts generators?

>2) We need to handle/fix the connection encoding, because as you mentioned
> the DB engine we support right now is dumb and needs to be educated.

I was wrong about ADODB library. We should switch to 'mysqli', then setting $db->charSet before connection itself will work for mysql too.

>3) Should we worry about systems where their supervisors do not know (or don't
>care) if their encodings are correctly set (and match) system wide.

This is not about system settings, the encodings mismatch is a web2project issue. Keep on reading.

> Now nr 2, I believe we must find a way to set the encoding all right, but I
> prefer a one line of code approach, depending on DB engine, though for now
> we only have to worry about mysql as you correctly said.
> I see that some Dbs react to $db->charSet before the Connect is done on ADODB,
> I see that mysql reacts to $db->EXECUTE("set names 'utf8'"); after the Connect.

Mysqli will react to $db->charSet before the Connect is done on ADODB too.

> I don't see why we should add a w2PConfig for this, because utf-8 is the
> defacto encoding and that's it, period.

Because we could damage data if we simple set $db->charSet('utf8'). This is because we didn't use utf8 for communication with mysql server. Instead we used some 'unknown' character set. It could be utf8, latin1 or something else. We could not know. And we didn't want to know.

And IMO users should be able to use the most proper character set for their usage, it must not be utf8. Another w2PConfig for setting collation is next important thing, without it we will never have correct sorting.

>Truth is, right now, if people are connecting against a badly set server, the
> data retrieved from their servers is all screwed up, right?

No.

> So changing it right away would be better than waiting for converting, right?

IMHO no.

Let me explain a bit more details:

We have database columns character encoding set to utf8 already (006_special_characters.sql) and web2project has no problem with utf8. The missing part is communication between web2project and mysql server.

There are three mysql system variables that define character sets of communication:
1) character_set_client - character set in which statements are sent by the client
2) character_set_connection - to what character set should the server translate a statement after receiving it
3) character_set_results - character set in which the server returns query

All these three are usual set to same value and for particular connection can be changed from PHP code directly (if mysql skip-character-set-client-handshake option is not set).
i.e mysql_set_charset('utf8') or 'set names "utf8"' sql statement set all these three variables to utf8. But mysql_set_charset('utf8') is the preferred way [1]. I will not cover collation here to keep it simple.

Problem is, web2project does not care to tell mysql server in what character set wants to communicate. So now it is mysql's turn.

It will set these variables according to:
1) mysql global or client configuration, if not defined then
2) options passed to mysqld as parameters or
3) compilation options or
4) mysql default - latin1.

Two most common values are utf8 and latin1. Both are entirely proper settings. You cannot say latin1 setting is bad, bad is that web2project does not care to tell mysql server in what charset they will communicate. Web2project is blindly assuming utf8 but MANY servers have latin1 as default setting. Do you know why so many servers have latin1 as default setting? Because there are still so many old/buggy software or sites that do not set connection charset and system-wide change may damage its database content! Yes, web2project is contributing to this unpleasant status.

So now we have two cases:
1) system-wide setting for db connection is utf8 - all is in order RIGHT now.
2) system-wide setting for db connection is latin1 - the retrieved data from server are still correct RIGHT now, but data stored in database are screwed up and sorting (collation) of non-ASCII characters does not work. In fact, sorting will not fully work until we do not change collation of tables/columns to correct language collation, i.e. utf8_czech_ci.

Now,
1) if we migrate (or simply export) our databases to host with another system-wide character set, data will be all screwed up
2) if system administrator decides to change system-wide setting (i.e. from latin1 to utf8) or update system, data will be all screwed up
3) if web2project set connection character set to utf8 right now, data will be all screwed up on systems that have latin1 system-wide.

** SUMMARY **

This is not about badly set servers at all, this is about the bug in web2project. And IMO we should not fix this bug by simple enforcing utf8 connection charset because it may result in data loss if the database content will not be converted before the change.

If you decide, despite my opinion I will try to summarize in further text, to simple force utf8 connection charset, do it, I will not be pushing on you anymore, but please, please, make some BIG FAT warning about that change available to the web2project users.

Simple because it is a bug in web2project.
We shouldn't have been relying on system-wide connection character set because it does not need to be utf8! It can be set i.e. to latin1 and that is entirely proper configuration (forced by history and still existing old or buggy code as currently web2project is), not administrator fault. Web2project connection charset must match the character set of stored data. And we should have ensured this, not system administrators. We did not ensure this in the past and so, I think, we should make everything possible to guarantee database content will not be damaged if we now fix this web2project bug. Otherwise it will be better to do nothing.

There are new hosting servers with utf8 as default mysql charset but there are still ones with latin1. This will not change very soon. I am not web hosting administrator, but if I would be one and would set up a new mysql server I would set utf8 as default charset, but I would probably never change latin1 to utf8 on servers with already existing databases. Administrators of large hosting servers cannot simple change latin1 to utf8 system-wide or migrate sites to new server with utf8 set as default. This will effectively damage sites that do not set db connection charset explicitly. Their databases must be converted first. That is not trivial for sysadmins to achieve, because they must first know what character set for communication with DB has been using by each site. And it is inconvenient for their customers as well, if they should convert it themselves or take a risk their data will be lost. There are many servers with latin1 because of backwards compatibility with old/buggy/legacy projects they are hosting. Consider web2project as one of these. Web2project can be considered as partial reason why so many servers still set latin1 as default charset.

And there are also latin1 as default character set in mysql upstream [2], maybe for same reason.

Please make decision, I am convinced I am right (but who is not?), but I may be wrong (I was wrong many many times already).

After you make decision and changes in the code I will make appropriate article for wiki.

Cheers,
Petr


[1] http://cz.php.net/manual/en/function.mysql-set-charset.php and http://www.mail-archive.com/internals@lists.php.net/msg37958.html
[2] http://dev.mysql.com/doc/refman/5.0/en/charset-applications.html
(0001332)
pedroa   
2010-10-28 02:30   
Hi Petr,

First of all thanks for beating me to a pulp on this one, and please keep doing it because it is being fun. :)

This is yet one more reason I am not too fond of mysql.

If you noticed we have been having a big deal of effort put in this utf8 subject in the past few months.
And the reason this was not being considered such a big thing, was that our primary markets are companies that mainly live with ascii or latin.
But we know that utf8 does not damage those markets and we can make web2project useful to loads more people if we find our way with utf8.

Now having said that, I find it hard to call it a web2Project bug, but rather work in progress.

The way I see it, we should attack this by:
1) Create a wiki page with all this discussion.
2) You have code to make a conversion, right?
Can it detect mysqls server encoding or how the data is encoded and make a conversion?
So lets talk people into making a conversion into utf8, if they require it.
3) Create an option on the system config, a checkbox saying... enforce utf8 data connections to the Database server. With a FAT NOTE pointing to the wiki page, and stating... "IF YOU ACTIVATE THIS OPTION AND HAVEN'T CONVERTED THE DATA ON THE DATABASE TO UTF8, YOUR DATA CHARACTERS WILL MOST PROBABLY END UP SCRAMBLED. YOU HAVE BEEN WARNED!!!
If you are unsure, leave this option unchecked and ask for support on web2Project Forums."
4) If that setting is true:
After line 18 on includes/db_adodb.php we test this setting and then we, depending on the DB engine we do the $db->charSet, or if we are using mysql (not mysqli as you mentioned) we put $db->EXECUTE("set names 'utf8'"); if that config option is enabled.

By default that options is false, or unchecked.

Again, thank you very much for the effort put in this, we appreciate it and you are far from being annoying, don't worry :)

Cheers,

Pedro A.
(0001333)
pedroa   
2010-10-28 03:27   
... Oh and about Collation, sysadmin problem not mine/ours :)

Cheers,

Pedro A.
(0001334)
gajdusek   
2010-10-28 11:30   
(Last edited: 2010-10-28 15:18)
Hi,

I agree mostly.

>2) You have code to make a conversion, right?
>Can it detect mysqls server encoding or how the data is encoded and make a conversion?

It detects system-wide connection charset, if it is already utf8 then no conversion is needed, if it differs from utf8, SQL statements needed to utf8 conversion are printed.

> 3) Create an option on the system config, a checkbox saying... enforce utf8
> data connections to the Database server.

I would not make it available in System admin/System configuration page. It will raise even more problems.

Just place it in config-dist.php set to true (for fresh installations using mysql >= 5.0.7) and advice users to set it to utf8 in config.php after they convert existing database. In the code check if this option is defined, if not consider it to be false. Setting it true to config-dist.php is ok until we are setting utf8 connection charset to mysql only.

Optimal would be adding this option to config.php (with appropriate comment) when updating code and make it true if mysql database is used and system default connection charset is utf8 already or make it false otherwise. But I have found no support for this in current code. And config.php must not be writable. So it is probably bad idea.

> 4) If that setting is true:
> After line 18 on includes/db_adodb.php we test this setting and then we,
> depending on the DB engine we do the $db->charSet, or if we are using mysql
> (not mysqli as you mentioned) we put $db->EXECUTE("set names 'utf8'");
> if that config option is enabled.

I would put (conditional variant of) $db->EXECUTE("set names 'utf8'"); only for mysql (after connection is accomplished), don't bother with other databases right now. $db-charSet will work only for oci8 and ibase drivers. I would wait with implementing this.

This solution is long way to be optimal but it is the most simple I can imagine.

==============

To final robust solution for all databases we should take following research and ideas in consideration:

I check the ADODB code through and must correct my statement about mysqli:
Setting $db->charSet before the connection is working only for 'ibase' and 'oci8' drivers (in ADODB version we use) and 'sybase' (in current upstream ADODB). 'mysqli' and 'postgres7' drivers have SetCharSet() methods implemented, they can be used to set encoding after connection was made. This is probably because mysqli does not allow specify connection charset when making connection. This method is not implemented for mysql - these ADODB drivers' methods are just wrappers to appropriate PHP modules (i.e. mysqli) and mysql PHP module does not have implemented this.

So we can force connection charset to oci8, ibase and sybase (if we update ADDODB to new version) drivers by setting $db->charSet('utf8') before actual connection happens. After connection has accomplished we can change charset for mysqli and postgres7 by $db->SetCharSet('utf8') method call. And finally we can force charset to mysql by mysql_set_charset('utf8',$db->_connectionID) and other databases in their specific way.

Right now I would change this only for mysql as I stated above. BTW we should consider use mysqli (MySQL Improved Extension) rather then mysql which is designed for mysql older then 4.1. See [1].

For mysql we should use either mysql_set_charset('utf8',$db->_connectionID) or execute an appropriate sql statement as you have proposed. Though, mysql_set_charset() is preferred way because it returns false if failed and sets internal mysql->charset properly (I am not sure what it means, but mysql_client_encoding() will not register change otherwise and it has probably something to do with mysql_real_escape_string() ). See link [1] bottom of my previous post. But that function is implemented only in PHP >= 5.2.3. Therefore we could simple make $db->EXECUTE("set names 'utf8'") as you proposed or use preferred mysql_set_charset('utf8',$db->_connectionID) and add this function in backcompat_functions.php. This way it is done in my patch.

Other databases charset should be implemented when we will fully support them in the code so we can make testing.

So, yes , set $db->charSet before connection is made, this will work only for ibase and oci8 drivers but can never hurt. After the connection is initiated call $db->SetCharSet() method if it is implemented. If it is not, set connection encoding in database driver specific way.

[1] http://cz.php.net/manual/en/intro.mysqli.php

Cheers,
petr

(0001335)
gajdusek   
2010-10-28 12:31   
> ... Oh and about Collation, sysadmin problem not mine/ours :)

Nope :) But it is rather off topic to this bugreport. I will create new one :)

Cheers
(0001336)
gajdusek   
2010-10-28 14:51   
(Last edited: 2010-10-28 15:20)
I realize we should not make config option that will force all databases connection charset to utf8. Because if we do it, we must ensure charset setting will immediately work for every database driver we will support in future.

So we must ensure all this thing will apply only to mysql right now:

1) Add 'mysql' case to db_connect() switch block - only here we should set connection character set.

this should be something like

if ( !empty( w2PgetConfig( 'dbconnection_charset' ) ) ) {
    if ( function_exists( 'mysql_set_charset' ) ) {
        mysql_set_charset( 'utf8', $db->_connectionID );
    } else {
        $db->Execute( 'SET NAMES "utf8"' );
    }
}

or better, just:

if ( !empty( w2PgetConfig( 'dbconnection_charset' ) ) ) {
    mysql_set_charset( 'utf8', $db->_connectionID );
}

and add this to includes/backcompat_functions.php:

// can be removed when we switch to PHP >= 5.2.3
if (function_exists('mysql_set_charset') === false) {
    /**
     * Sets the client character set.
     *
     * Note: This function requires MySQL 5.0.7 or later.
     *
     * @see http://www.php.net/mysql-set-charset
     * @param string $charset A valid character set name
     * @param resource $link_identifier The MySQL connection
     * @return TRUE on success or FALSE on failure
     */
    function mysql_set_charset($charset, $link_identifier = null)
    {
        if ($link_identifier == null) {
            return mysql_query('SET NAMES "'.$charset.'"');
        } else {
            return mysql_query('SET NAMES "'.$charset.'"', $link_identifier);
        }
    }
}

Though, a bit refactoring of db_connect function will be appropriate, i.e. moving connection code from switch cases before switch block (that code is repeated in all cases).

Now if somebody adds $w2PConfig['dbconnection_charset'] = true to config.php, utf8 will be used for connection.

As addition to this we should add default (=false) value to config-dist.php. So we can place there a comment.

Better, if we set it to DBCONNECTION_CHARSET string that will be replaced by installer to false or, in case mysql will be chosen and it will be >= 5.0.7 version, to true. So fresh installations will be using utf8.


2) So something like this to config-dist.php

//set this value to true to use UTF8 for database connection, keep it false to use system default
$w2PConfig['dbconnection_charset'] = DBCONNECTION_CHARSET;

along with a FAT NOTE pointing to the wiki page, and stating...
1) KEEP THIS OPTION FALSE IF YOU DO NOT USE MYSQL >= 5.0.7 AS YOUR DATABASE.
2) IF YOU SET THIS TO TRUE AND HAVEN'T CONVERTED THE DATA ON THE DATABASE TO UTF8, YOUR DATA CHARACTERS WILL MOST PROBABLY END UP SCRAMBLED.
YOU HAVE BEEN WARNED!!!

3) Installation script will replace DBCONNECTION_CHARSET with FALSE or with TRUE if mysql database is used and if it is new enough (version check).

That's all.
This way we can add support for other database drivers later with no pain.

There are two more functions connecting directly to DB, please, consider if they must be changed too.

They are in classes/w2p/Core/UpgradeManager.class.php and install/manager.class.php

PS: I did not test that code parts I added into this post.

Cheers,
Petr

(0001375)
caseydk   
2010-11-26 07:50   
Wow, you guys talk a lot. ;)

Some quick points:

1) I've logged the "installer should support table prefixes" as 0000648.

2) For new installs, the utf-8 tables should be easy to support. We can add the mysql version check and enable the checkbox if it's 5.0.7 or above. Someone please file a request on that one with all relevant details.

3) For existing installs, this will be trouble.. having a one-time switch in the System Admin seems like a reasonable option *IF* we can detect if the character problem exists in their database:
* If it's just a setting problem but not data, let's change the setting, no problem.
* If there's a problem with their data, I'd prefer to show the disabled setting box with an explanation. I don't want to take responsibility for breaking their stuff.

4) Updating the minimum PHP version can only happen on Major releases (each June). I'm *guessing* we should go to 5.2.6 or 5.2.10 by that time. In the meantime, using the backcompat_functions is the way to go. Nice catch gajdusek.

5) If *either* of you find MySQL specific commands, etc outside the install/sql/mysql directory, could you mark those with something like:

// TODO: mysql specific

I'd like to have those findable as I'm digging through SQL Server support now.
(0001381)
gajdusek   
2010-11-26 10:36   
Hi, please look at https://github.com/gajdusek/web2project/commits/issue598

It is my patch covering this issue: support of using UTF8 as database connection charset.

**** Short summary ****

It uses new $w2Pconfig['dbconnection_charset'] setting in config.php to determine whether force utf8 database connection.

There is no checkbox on installation time, instead this option is autodetected during fresh install or d2p migration. Detection is performed by the w2p_Core_UpgradeManager::DBConnectionCharsetSafetyCheck() method. This could be safely autodetected on upgrades as well, but I've found no support of changing already existing config.php.

IMHO there is no need for checkbox. If it is safe to use utf8 connection then it should be used, because w2p code assumes it already. There is no explanation why this option was detected to be unsafe (or not implemented for used db type), I have tried implement this but gave up. I remember no more why exactly. Instead, explanation will appear in sysadmin module (this way the already existing installations are covered too), but only if some user action can solve the issue (i.e. converting database). Nothing will be displayed if $w2Pconfig['dbconnection_charset'] is already enabled or mysql version is too old or unsupported database type is used.

*** caveeats ***

- Only mysql is supported right now
  * the patch has no problem if unsupported DB is used
  * support for other database can be easily and safety added anytime later

- Database converting is not implemented (but external script exists)
- Database conversion and utf8 connection enabling must be atomic
- Corresponding wiki pages do not exist yet
(0002253)
caseydk   
2011-10-16 12:59   
Sorry that I didn't see this one get updated.. wow.

Is this patch still applicable?
(0002368)
caseydk   
2011-12-27 23:08   
I'm looking to roll this into the v3.0 release here shortly.
(0003505)
caseydk   
2014-06-21 20:08   
Resolved in v4.0 development:
https://github.com/web2project/web2project/commit/e39e9dc0ee687d0eba4c145e3703e54ea5400495
(0003759)
opto   
2015-09-01 12:21   
unfortunately, in 3.3, the patch for 4 does not work, I have not yet managed to get my old 3.2 db display correctly in 3.3 - all db haractersets etc. seem to be ok.

Issue History
2010-10-09 10:10gajdusekNew Issue
2010-10-19 11:27pedroaStatusnew => assigned
2010-10-19 11:27pedroaAssigned To => pedroa
2010-10-19 18:19gajdusekNote Added: 0001314
2010-10-19 18:20gajdusekFile Added: dbcharset_support.patch
2010-10-19 18:44gajdusekNote Edited: 0001314
2010-10-26 06:28pedroaNote Added: 0001318
2010-10-26 06:28pedroaStatusassigned => feedback
2010-10-26 06:32pedroaNote Added: 0001319
2010-10-26 06:33pedroaRelationship addedrelated to 0000587
2010-10-26 17:17gajdusekNote Added: 0001328
2010-10-26 17:42gajdusekNote Edited: 0001328
2010-10-26 18:24gajdusekNote Added: 0001329
2010-10-26 18:47gajdusekNote Edited: 0001328
2010-10-26 18:59gajdusekNote Edited: 0001329
2010-10-27 08:18pedroaNote Added: 0001330
2010-10-27 20:09gajdusekNote Added: 0001331
2010-10-28 02:30pedroaNote Added: 0001332
2010-10-28 03:27pedroaNote Added: 0001333
2010-10-28 11:30gajdusekNote Added: 0001334
2010-10-28 12:29gajdusekNote Edited: 0001334
2010-10-28 12:31gajdusekNote Added: 0001335
2010-10-28 14:51gajdusekNote Added: 0001336
2010-10-28 14:59gajdusekNote Edited: 0001336
2010-10-28 15:07gajdusekNote Edited: 0001334
2010-10-28 15:12gajdusekNote Edited: 0001334
2010-10-28 15:18gajdusekNote Edited: 0001334
2010-10-28 15:20gajdusekNote Edited: 0001336
2010-11-24 23:35caseydkProjectv2.1 Release (Closed) => v2.2 Release (Closed)
2010-11-25 02:48pedroaRelationship addedchild of 0000632
2010-11-26 07:50caseydkNote Added: 0001375
2010-11-26 10:36gajdusekNote Added: 0001381
2010-12-15 21:55caseydkProjectv2.2 Release (Closed) => v2.3 Release (Closed)
2011-02-23 22:47caseydkRelationship addedparent of 0000731
2011-02-27 15:10caseydkRelationship replacedhas duplicate 0000731
2011-03-22 12:04caseydkProjectv2.3 Release (Closed) => v2.4 Release (Closed)
2011-07-18 12:23caseydkProjectv2.4 Release (Closed) => v3.0 Release
2011-10-16 12:59caseydkNote Added: 0002253
2011-12-27 23:08caseydkNote Added: 0002368
2011-12-27 23:08caseydkStatusfeedback => assigned
2011-12-27 23:08caseydkAssigned Topedroa => caseydk
2012-01-01 23:03caseydkProjectv3.0 Release => Pending Requests
2012-12-18 22:39caseydkProjectPending Requests => v3.0 Release
2012-12-18 22:51caseydkRelationship addedrelated to 0001227
2012-12-22 18:13caseydkProjectv3.0 Release => Pending Requests
2014-01-21 17:56caseydkTarget Version => 4.x
2014-02-27 00:34caseydkProduct Versionv2.1 =>
2014-02-27 00:34caseydkTarget Version4.x =>
2014-04-05 17:24caseydkCategoryGeneral => Core Infrastructure
2014-05-20 19:46caseydkProjectPending Requests => v4.0 Release (Planning)
2014-06-10 22:49caseydkTarget Version => 4.0
2014-06-21 20:08caseydkNote Added: 0003505
2014-06-21 20:08caseydkStatusassigned => resolved
2014-06-21 20:08caseydkFixed in Version => 4.0
2014-06-21 20:08caseydkResolutionopen => fixed
2015-09-01 12:21optoNote Added: 0003759
2015-09-01 12:21optoStatusresolved => feedback
2015-09-01 12:21optoResolutionfixed => reopened
2016-12-26 10:25caseydkProjectv4.0 Release (Planning) => Pending Requests
2016-12-28 00:17caseydkRelationship addedrelated to 0001671