Anonymous Login
2019-07-21 10:23 PDT

View Issue Details Jump to Notes ]
IDProjectCategoryView StatusLast Update
0000598Pending RequestsCore Infrastructurepublic2016-12-28 00:17
Reportergajdusek 
Assigned Tocaseydk 
PrioritynormalSeveritymajorReproducibilityalways
StatusfeedbackResolutionreopened 
Product Version 
Target VersionFixed in Version 
Summary0000598: adodb-mysql does not set default character set for db connection, this can lead in unusable database content
DescriptionHi,
adodb-mysql.inc.php functions _connect() and _pconnect() do not set default character set for database connections nor does web2project code elsewhere.

This results in wrong encoding of non-ascii strings stored in database depending on host configuration - to be specific if system wide mysql default_character_set differs from utf8, i.e. debian's default value is 'latin1'.

Wrong character encoding has at least following critical consequences:

* Sorting of strings with non-ascii charactes does probably not work correctly.
* Strings will be totaly messed up if mysql's default_character_set will be changed or database will be migrated to host with different configuration.

Adding mysql_set_charset('utf8') call to both mentioned functions or setting this later from web2Project code (if adodb modification is not an option) will solve this. This solution will have a side effect of messing up strings already stored in wrong encoding.
TagsNo tags attached.
Attached Files
  • patch file icon dbcharset_support.patch (7,252 bytes) 1969-12-31 16:00 -
    From 0848eb0acf2f5f354a08ea655b396bf90b2fc28a Mon Sep 17 00:00:00 2001
    From: gajdusek <Petr Gajdusek gajdusek.petr@centrum.cz>
    Date: Wed, 20 Oct 2010 00:38:20 +0200
    Subject: [PATCH] Added dbcharset configuration option to config.php determining database
     connection charset.  Defaults to 'utf8' on fresh instalation.  If this
     option is not empty web2Project uses given charset for every database
     connection.  If empty, web2project uses site-wide charset for database
     connection as before these commit.
    
    If your database connection charset is already utf8 you can safely
    change this option to utf8.  Otherwise you must first convert
    all strings stored in database to utf8 and then immediately
    set this option to utf8.  You will end with damaged datase
    if tables will be updated by any web2Project user after you
    convert strings to utf8 and before you set 'dbcharset' option
    to utf8.  Scripts I use for conversion can be obtained from
    http://github.com/gajdusek/web2Project-tools/tree/master/db_convert_to_utf8/
    ---
     classes/w2p/Core/UpgradeManager.class.php |    6 ++++++
     includes/backcompat_functions.php         |   23 ++++++++++++++++++++++-
     includes/config-dist.php                  |    4 ++++
     includes/db_adodb.php                     |   12 +++++++++---
     install/manager.class.php                 |    6 ++++++
     5 files changed, 47 insertions(+), 4 deletions(-)
    
    diff --git a/classes/w2p/Core/UpgradeManager.class.php b/classes/w2p/Core/UpgradeManager.class.php
    index 23537d5..f8b98e6 100644
    --- a/classes/w2p/Core/UpgradeManager.class.php
    +++ b/classes/w2p/Core/UpgradeManager.class.php
    @@ -304,8 +304,14 @@ class w2p_Core_UpgradeManager {
             try {
                 $db = NewADOConnection($this->configOptions['dbtype']);
                 if(!empty($db)) {
    +              /* set client charset for Interbase/Firebird, Oracle, sybase, mysqli */
    +              if (!empty($this->configOptions['dbcharset'])) $db->charSet = $this->configOptions['dbcharset'];
                   $dbConnection = $db->Connect($this->configOptions['dbhost'], $this->configOptions['dbuser'], $this->configOptions['dbpass']);
                   if ($dbConnection) {
    +                /* set client charset for mysql */
    +                if ($this->configOptions['dbtype'] == 'mysql' && !empty($this->configOptions['dbcharset'])) {
    +                    mysql_set_charset($this->configOptions['dbcharset'],$db->_connectionID);
    +                }
                     $existing_db = $db->SelectDB($this->configOptions['dbname']);
                     if (!$existing_db) {
                       $db->_errorMsg = 'This database user does not have rights to the database.';
    diff --git a/includes/backcompat_functions.php b/includes/backcompat_functions.php
    index 343b476..fea1499 100644
    --- a/includes/backcompat_functions.php
    +++ b/includes/backcompat_functions.php
    @@ -159,4 +159,25 @@ if (!function_exists('htmlspecialchars_decode')) {
     	function htmlspecialchars_decode($str) {
     		return strtr($str, array_flip(get_html_translation_table(HTML_SPECIALCHARS)));
     	}
    -}
    \ No newline at end of file
    +}
    +
    +if (function_exists('mysql_set_charset') === false) {
    +    /**
    +     * Sets the client character set.
    +     *
    +     * Note: This function requires MySQL 5.0.7 or later.
    +     *
    +     * @see http://www.php.net/mysql-set-charset
    +     * @param string $charset A valid character set name
    +     * @param resource $link_identifier The MySQL connection
    +     * @return TRUE on success or FALSE on failure
    +     */
    +    function mysql_set_charset($charset, $link_identifier = null)
    +    {
    +        if ($link_identifier == null) {
    +            return mysql_query('SET NAMES "'.$charset.'"');
    +        } else {
    +            return mysql_query('SET NAMES "'.$charset.'"', $link_identifier);
    +        }
    +    }
    +}
    diff --git a/includes/config-dist.php b/includes/config-dist.php
    index 22cefeb..ca37bb2 100644
    --- a/includes/config-dist.php
    +++ b/includes/config-dist.php
    @@ -26,6 +26,10 @@ $w2Pconfig['dbprefix'] = '[DBPREFIX]'; // The prefix used for db table names
     // set this value to true to use persistent database connections
     $w2Pconfig['dbpersist'] = false;
     
    +// set this to the database client character set, 
    +// if empty site-wide configuration will be used
    +$w2Pconfig['dbcharset'] = 'utf8';
    +
     /***************** Configuration for DEVELOPERS use only! ******/
     // Root directory and base_url are automatically set to avoid
     // getting them wrong.  They are not the same.
    diff --git a/includes/db_adodb.php b/includes/db_adodb.php
    index 2f28ade..7ffbb18 100644
    --- a/includes/db_adodb.php
    +++ b/includes/db_adodb.php
    @@ -17,9 +17,12 @@ require_once (W2P_BASE_DIR . '/lib/adodb/adodb.inc.php');
     
     $db = NewADOConnection(w2PgetConfig('dbtype'));
     
    -function db_connect($host = 'localhost', $dbname, $user = 'root', $passwd = '', $persist = false) {
    +function db_connect($host = 'localhost', $dbname, $user = 'root', $passwd = '', $persist = false, $dbcharset = '') {
     	global $db, $ADODB_FETCH_MODE;
     
    +    /* set client charset for Interbase/Firebird, Oracle, sybase, mysqli */ 
    +	if (!empty($dbcharset)) $db->charSet = $dbcharset;  
    +	
     	switch (strtolower(trim(w2PgetConfig('dbtype')))) {
     		case 'oci8':
     		case 'oracle':
    @@ -37,8 +40,11 @@ function db_connect($host = 'localhost', $dbname, $user = 'root', $passwd = '',
     			} else {
     				$db->Connect($host, $user, $passwd, $dbname) or die('FATAL ERROR: Connection to database server failed');
     			}
    +			if (!empty($dbcharset)) {
    +			    mysql_set_charset($dbcharset,$db->_connectionID);
    +			}
     	}
    -
    +	
     	$ADODB_FETCH_MODE = ADODB_FETCH_BOTH;
     }
     
    @@ -179,7 +185,7 @@ function db_dateTime2unix($time) {
     }
     
     // make the connection to the db
    -db_connect(w2PgetConfig('dbhost'), w2PgetConfig('dbname'), w2PgetConfig('dbuser'), w2PgetConfig('dbpass'), w2PgetConfig('dbpersist'));
    +db_connect(w2PgetConfig('dbhost'), w2PgetConfig('dbname'), w2PgetConfig('dbuser'), w2PgetConfig('dbpass'), w2PgetConfig('dbpersist'), w2PgetConfig('dbcharset'));
     
     /*
     * Having successfully established the database connection now,
    diff --git a/install/manager.class.php b/install/manager.class.php
    index 8f36804..2e62386 100644
    --- a/install/manager.class.php
    +++ b/install/manager.class.php
    @@ -295,8 +295,14 @@ private function _getDatabaseVersion($dbConn) {
             try {
                 $db = NewADOConnection($this->configOptions['dbtype']);
                 if(!empty($db)) {
    +              /* set client charset for Interbase/Firebird, Oracle, sybase, mysqli */
    +              if (!empty($this->configOptions['dbcharset'])) $db->charSet = $this->configOptions['dbcharset'];
                   $dbConnection = $db->Connect($this->configOptions['dbhost'], $this->configOptions['dbuser'], $this->configOptions['dbpass']);
                   if ($dbConnection) {
    +                /* set client charset for mysql */
    +                if ($this->configOptions['dbtype'] == 'mysql' && !empty($this->configOptions['dbcharset'])) {
    +                    mysql_set_charset($this->configOptions['dbcharset'],$db->_connectionID);
    +                }
                     $existing_db = $db->SelectDB($this->configOptions['dbname']);
                     if (!$existing_db) {
                       $db->_errorMsg = 'This database user does not have rights to the database.';
    -- 
    1.7.1
    
    
    patch file icon dbcharset_support.patch (7,252 bytes) 1969-12-31 16:00 +

-Relationships
has duplicate 0000731closedcaseydk v2.3 Release (Closed) Support of UTF-8 in Database 
related to 0000587feedbackpedroa Pending Requests mysql can't read chinese 
related to 0001227closedcaseydk v3.0 Release After the update Incorrect code page of the base 
related to 0001671new v3.5 Release (Development) utf8 not handled correctly on 3.2->3.3 and mysql -> 5.5 
child of 0000632closedpedroa v2.3 Release (Closed) Print function produces wrong charaters 
+Relationships

-Notes

~0001314

gajdusek (reporter)

Last edited: 2010-10-19 18:44

Hi pedroa, I have made a patch for my usage.

Changes summary:
* Added support for $w2Pconfig['dbcharset'] option in config.php.
* If this option is not empty web2project uses specified charset for database connection, otherwise old behavior is preserved (no connection charset setting)
* $w2Pconfig['dbcharset'] = 'utf8' added to config-dist.php so fresh installations will use correct charset.

Patch applies to 25e55c2ac8b7c9c3279ea93c1ab2818f7d8afc63 commit from http://github.com/caseysoftware/web2project.git
You can get the patch from http://github.com/gajdusek/web2project/commit/0848eb0acf2f5f354a08ea655b396bf90b2fc28a too.
If you decide to use this please check it well because I am really web2project beginner.

Scripts (bash and php versions) that generate SQL statements for database strings conversion to utf8 charset are located at http://github.com/gajdusek/web2Project-tools/tree/master/db_convert_to_utf8/

Intented usage:
* update web2project code
* convert database and imediately add $w2Pconfig['dbcharset'] = 'utf8' to config.

PHP version of the conversion script uses database credentials form config.php and scans tables in database for possibly affected columns. If system-wide connection charset is already utf8 script prints advice to change $w2Pconfig['dbcharset'] = 'utf8' and exits. Otherwise SQL statements for conversion are printed with additional information (and warnings) how finish the migration process.

petr

~0001318

pedroa (administrator)

Hi Petr and thanks a lot for your report and patches.

Now lets resume what we want and where we are at, and what more needs to be done to get there.

So... we want utf-8 to be generally applicable to web2Project.

Now, what do we have at the moment...
1) The locales are encoded utf-8
2) The forms for the interfaces are encoded utf-8
3) The HTML headers are set to be encoded utf-8
4) There is loads of code inside we2bProject that specifically use php mb_ functions to not break multibyte characters and therefore utf-8 compatible.
5) Our sql scripts, specifically install/sql/mysql/006_special_characters.sql converts the tables to uft-8 encoding

Having said this, is there a need to keep enforcing it, or having the tables defined as "SET utf8 COLLATE utf8_general_ci" is enough for the mysql engine itself behave?

What I means is... I don't really think this is an issue considering that the tables themselves are already set to be utf-8 compatible from within mysql itself.

Thing is I don't want to be introducing mysql specific code inside web2Project even more than we already have, because when we are to add Oracle or SQL server or Firebird or PostgreSQL support in web2Project we are going to face more headaches.

What do you think?

Cheers,

Pedro A.


 

~0001319

pedroa (administrator)

Oh by the way,

Have you seen http://bugs.web2project.net/view.php?id=587

What do you think about it?
I can live with adding a line of code (for mysql only) :)

Cheers,

Pedro A.

~0001328

gajdusek (reporter)

Last edited: 2010-10-26 18:47

Hi Pedro,

> Having said this, is there a need to keep enforcing it, or having the tables
> defined as "SET utf8 COLLATE utf8_general_ci" is enough for the mysql engine
> itself behave?

We need to set database connection charset for several reasons. I will try to explain it later in my post.

> 5) Our sql scripts, specifically install/sql/mysql/006_special_characters.sql
> converts the tables to utf-8 encoding

I will skip consequences of this update script and keep focus on current state.
We have all columns marked as utf8 in database already. But we don't know db connection charset because we don't set it. Instead, system-wide setting is used.

My system-wide setting for db connection is latin1. Replace 'latin1' with "user's system-wide database connection charset setting" further in text. Imagine situation that web2project inserts or updates some db record:

1) Web2project sends utf8 string to db server
2) DB server saw it as latin1 (connection charset)
3) Character set of the column is utf8, so db server converts that string (that is in utf8 already) from latin1 to utf8. There are no problems visible in web2project because of reverse process when querying db. But there are 'hidden' problems:
  a) Strings are unreadable in database.
  b) Sorting probably does not work correctly.
  c) If system-wide connection charset will be changed (i.e. by system
     administrator or by server update - btw default MSSQL connection charset was
     changed from utf8 to latin1 in debian unstable this week - or setting
     connection charset in w2p directly) all goes wrong. See next step.
4) If we set db connection charset to utf8 now, we will receive damaged strings from db server.

So before connection charset will be changed (either by setting it by web2project or by system-wide setting change) user must convert database content:

1) convert affected column from utf8 back to latin1 charset
i.e.
ALTER TABLE `users` CHANGE `user_username` `user_username` varchar(255) CHARACTER SET `latin1`;

2) change this column charset to utf8 but make sure no conversion will occur.
i.e.
ALTER TABLE `users` CHANGE `user_username` `user_username` varbinary(255);
ALTER TABLE `users` CHANGE `user_username` `user_username` varchar(255) CHARACTER SET `utf8`;
The reason this works is that there is no conversion when you convert to or from BINARY columns. [1]

3) Repeat 1) and 2) for every affected column in database.
4) Rebuild indexes of converted tables.
5) Set connection charset to utf-8.

This five conversion steps must be done at once and be atomic (web2project should be down for maintenance).

So this is why it is not sufficient to simply change database columns charsets (as it is now) nor forcibly set charset of database connection. Users should convert their databases before and only then (but immediately) change connection charset. I design my patch in this manner. It does nothing until users leave $w2Pconfig['dbcharset'] unset. Scripts generating SQL statements for conversion are available at http://github.com/gajdusek/web2Project-tools/tree/master/db_convert_to_utf8/. Conversion script can not be easily and safely integrated into the web2project update code. I tried it, but it involves more code changes and is probably not reliable (connection timeout).

And now regarding http://bugs.web2project.net/view.php?id=587, I have not seen it, thanks, i will take a look...

Ok, that sets connection charset for mysql via mysql_query("SET NAMES 'utf8'"); in includes/db_adodb.php : db_connect() function.

I see these problems:

1) SET NAMES sql query is discouraged in PHP documentation [2]
2) mysql specific thing
3) We cannot force db connection charset to utf-8, because users must first convert (or check) their tables.
4) There are more functions connecting to database directly: classes/w2p/Core/UpgradeManager.class.php and install/manager.class.php

Please take a look at patch I sent on 2010-10-19 or http://github.com/gajdusek/web2project/commit/0848eb0acf2f5f354a08ea655b396bf90b2fc28a. It solves all these points:

1) Using set_mysql_charset() function as the preferred way to set connection charset and adding compatible function that uses SET NAMES to includes/backcompat_functions.php, because set_mysql_charset() is implemented only in PHP >= 5.2.3 [2].
This all will probably work only in MySQL >= 5.0.7 or, if you're using MySQL 4, then >= 4.1.13. [3], is this an issue?

2) ADODB has implemented charSet property for setting client charset. It supports i.e. Interbase/Firebird, Oracle, sybase and mysqli (according to documentation) but not mysql. My patch set this for all databases and calls set_mysql_charset extra for mysql database. So my patch is not mysql specific, but i did not test its functionality for other databases then mysql.

3) This is why I added $w2Pconfig['dbcharset'] and set connection charset only if user set this variable to something (i.e. utf8). Only for new installations connection charset will be used by default ($w2Pconfig['dbcharset'] = 'utf8' in includes/config-dist.php)

4) set database connection charset code added to:
includes/db_adodb.php:db_connect()
classes/w2p/Core/UpgradeManager.class.php
install/manager.class.php

I did not found any other code connecting to DB directly.

[1] http://codex.wordpress.org/Converting_Database_Character_Sets
[2] http://cz.php.net/manual/en/function.mysql-set-charset.php
[3] http://cz.php.net/manual/en/function.mysql-set-charset.php#80997

PS: sorry for many updates of this post, just correcting errors in my English

Cheers,
petr

~0001329

gajdusek (reporter)

Last edited: 2010-10-26 18:59

Attached patch does not (hopefully) affect anything until you leave $w2Pconfig['dbcharset'] unset or do not fresh web2project install.

If you set $w2Pconfig['dbcharset'] to utf8 (or other charset) some requirements must fulfilled:

1) Database columns charset (utf8) and encoding of strings stored in them must match (satisfied on fresh installs or if system-wide database connection charset was already set to utf-8, otherwise database conversion must precede).

2) Code must be always succesfull in setting new connection charset before retrieving or updating data in database (or system-wide db connection charset must match $w2Pconfig['dbcharset'] i.e. utf8).

Ad 2)

a) This cannot be satisfied with mysql server < 5.0.7, so minimal requirements must be adjusted. There is mysql 5.0 requirement in the wiki [1], this must be raised to 5.0.7. Older 5.0.x versions do not support 'SET NAMES' query. And/or make advice to ensure that site-wide mysql connection charset is set to utf8, if applicable.

b) Make sure connection charset setting works well for other databases than mysql. i.e. check support in ADODB or insert custom code as for mysql. I have written no conversion scripts for other databases neither (I use only mysql). This does not seem to be problem now because web2project code appears totally unaware about other databases. There should be no problem with fresh installs using other database servers if ADODB can set connection charset via charSet property.

Converting of tables with existing data should be described in some article on the wiki. I just want to repeat: until user will not set $w2Pconfig['dbcharset'] variable or make fresh install nothing will happen. Finally, I can write some article about converting database content, only if somebody could correct my horrible English.

That reminds me: Sorry for my English!

Cheers,
petr

[1] http://wiki.web2project.net/index.php?title=Installation#Database_Server

~0001330

pedroa (administrator)

Hi Petr,

First of all thanks a lot for all this research, because it is very well done.

You touched some sensitive points there and are:
1) The install/upgrade scripts are not prefix compatible, though the code itself is.
2) We need to handle/fix the connection encoding, because as you mentioned the DB engine we support right now is dumb and needs to be educated.
3) Should we worry about systems where their supervisors do not know (or don't care) if their encodings are correctly set (and match) system wide.

About the 1st, well it took some effort to make this thing prefix compatible and we need to sort something to make it that way. Could be string replacement on the install/upgrade scripts or whatever, and by that I mean another way of doing it.

Now nr 2, I believe we must find a way to set the encoding all right, but I prefer a one line of code approach, depending on DB engine, though for now we only have to worry about mysql as you correctly said.
I see that some Dbs react to $db->charSet before the Connect is done on ADODB, I see that mysql reacts to $db->EXECUTE("set names 'utf8'"); after the Connect.
And yes we go and simply change the wiki to the mysql version necessarily.
I don't see why we should add a w2PConfig for this, because utf-8 is the defacto encoding and that's it, period.

As to 3, too bad if they don't care, if they don't care, why should we?
But I appreciate your effort into creating a conversion procedure, but why should we need a conversion procedure when all you needed to do was to make sure everything matched?

It would bee very nice to have this conversation in the wiki, under its own page, to document this whole utf-8 thing.

As to the code itself, all that I see that would change would be to add that utf-8 setting after the connection.

Truth is, right now, if people are connecting against a badly set server, the data retrieved from their servers is all screwed up, right?
So changing it right away would be better than waiting for converting, right?

Like I said we need this "discussion" to have more visibility, also because it carries solutions for already affected systems.

Again thank you very much for the research, code and work, I'll get back to you with what we will implement to force the utf-8 connection.

Cheers,

Pedro A.

~0001331

gajdusek (reporter)

Hi,

> 1) The install/upgrade scripts are not prefix compatible, though the code itself is.

Sounds nice that the code is ready.

>About the 1st, well it took some effort to make this thing prefix compatible
>and we need to sort something to make it that way. Could be string replacement
>on the install/upgrade scripts or whatever, and by that I mean another way of doing it.

It will be nice to make some more abstract mechanism for replacing strings in SQL updates. Maybe even make some layer for SQL scripts generators?

>2) We need to handle/fix the connection encoding, because as you mentioned
> the DB engine we support right now is dumb and needs to be educated.

I was wrong about ADODB library. We should switch to 'mysqli', then setting $db->charSet before connection itself will work for mysql too.

>3) Should we worry about systems where their supervisors do not know (or don't
>care) if their encodings are correctly set (and match) system wide.

This is not about system settings, the encodings mismatch is a web2project issue. Keep on reading.

> Now nr 2, I believe we must find a way to set the encoding all right, but I
> prefer a one line of code approach, depending on DB engine, though for now
> we only have to worry about mysql as you correctly said.
> I see that some Dbs react to $db->charSet before the Connect is done on ADODB,
> I see that mysql reacts to $db->EXECUTE("set names 'utf8'"); after the Connect.

Mysqli will react to $db->charSet before the Connect is done on ADODB too.

> I don't see why we should add a w2PConfig for this, because utf-8 is the
> defacto encoding and that's it, period.

Because we could damage data if we simple set $db->charSet('utf8'). This is because we didn't use utf8 for communication with mysql server. Instead we used some 'unknown' character set. It could be utf8, latin1 or something else. We could not know. And we didn't want to know.

And IMO users should be able to use the most proper character set for their usage, it must not be utf8. Another w2PConfig for setting collation is next important thing, without it we will never have correct sorting.

>Truth is, right now, if people are connecting against a badly set server, the
> data retrieved from their servers is all screwed up, right?

No.

> So changing it right away would be better than waiting for converting, right?

IMHO no.

Let me explain a bit more details:

We have database columns character encoding set to utf8 already (006_special_characters.sql) and web2project has no problem with utf8. The missing part is communication between web2project and mysql server.

There are three mysql system variables that define character sets of communication:
1) character_set_client - character set in which statements are sent by the client
2) character_set_connection - to what character set should the server translate a statement after receiving it
3) character_set_results - character set in which the server returns query

All these three are usual set to same value and for particular connection can be changed from PHP code directly (if mysql skip-character-set-client-handshake option is not set).
i.e mysql_set_charset('utf8') or 'set names "utf8"' sql statement set all these three variables to utf8. But mysql_set_charset('utf8') is the preferred way [1]. I will not cover collation here to keep it simple.

Problem is, web2project does not care to tell mysql server in what character set wants to communicate. So now it is mysql's turn.

It will set these variables according to:
1) mysql global or client configuration, if not defined then
2) options passed to mysqld as parameters or
3) compilation options or
4) mysql default - latin1.

Two most common values are utf8 and latin1. Both are entirely proper settings. You cannot say latin1 setting is bad, bad is that web2project does not care to tell mysql server in what charset they will communicate. Web2project is blindly assuming utf8 but MANY servers have latin1 as default setting. Do you know why so many servers have latin1 as default setting? Because there are still so many old/buggy software or sites that do not set connection charset and system-wide change may damage its database content! Yes, web2project is contributing to this unpleasant status.

So now we have two cases:
1) system-wide setting for db connection is utf8 - all is in order RIGHT now.
2) system-wide setting for db connection is latin1 - the retrieved data from server are still correct RIGHT now, but data stored in database are screwed up and sorting (collation) of non-ASCII characters does not work. In fact, sorting will not fully work until we do not change collation of tables/columns to correct language collation, i.e. utf8_czech_ci.

Now,
1) if we migrate (or simply export) our databases to host with another system-wide character set, data will be all screwed up
2) if system administrator decides to change system-wide setting (i.e. from latin1 to utf8) or update system, data will be all screwed up
3) if web2project set connection character set to utf8 right now, data will be all screwed up on systems that have latin1 system-wide.

** SUMMARY **

This is not about badly set servers at all, this is about the bug in web2project. And IMO we should not fix this bug by simple enforcing utf8 connection charset because it may result in data loss if the database content will not be converted before the change.

If you decide, despite my opinion I will try to summarize in further text, to simple force utf8 connection charset, do it, I will not be pushing on you anymore, but please, please, make some BIG FAT warning about that change available to the web2project users.

Simple because it is a bug in web2project.
We shouldn't have been relying on system-wide connection character set because it does not need to be utf8! It can be set i.e. to latin1 and that is entirely proper configuration (forced by history and still existing old or buggy code as currently web2project is), not administrator fault. Web2project connection charset must match the character set of stored data. And we should have ensured this, not system administrators. We did not ensure this in the past and so, I think, we should make everything possible to guarantee database content will not be damaged if we now fix this web2project bug. Otherwise it will be better to do nothing.

There are new hosting servers with utf8 as default mysql charset but there are still ones with latin1. This will not change very soon. I am not web hosting administrator, but if I would be one and would set up a new mysql server I would set utf8 as default charset, but I would probably never change latin1 to utf8 on servers with already existing databases. Administrators of large hosting servers cannot simple change latin1 to utf8 system-wide or migrate sites to new server with utf8 set as default. This will effectively damage sites that do not set db connection charset explicitly. Their databases must be converted first. That is not trivial for sysadmins to achieve, because they must first know what character set for communication with DB has been using by each site. And it is inconvenient for their customers as well, if they should convert it themselves or take a risk their data will be lost. There are many servers with latin1 because of backwards compatibility with old/buggy/legacy projects they are hosting. Consider web2project as one of these. Web2project can be considered as partial reason why so many servers still set latin1 as default charset.

And there are also latin1 as default character set in mysql upstream [2], maybe for same reason.

Please make decision, I am convinced I am right (but who is not?), but I may be wrong (I was wrong many many times already).

After you make decision and changes in the code I will make appropriate article for wiki.

Cheers,
Petr


[1] http://cz.php.net/manual/en/function.mysql-set-charset.php and http://www.mail-archive.com/internals@lists.php.net/msg37958.html
[2] http://dev.mysql.com/doc/refman/5.0/en/charset-applications.html

~0001332

pedroa (administrator)

Hi Petr,

First of all thanks for beating me to a pulp on this one, and please keep doing it because it is being fun. :)

This is yet one more reason I am not too fond of mysql.

If you noticed we have been having a big deal of effort put in this utf8 subject in the past few months.
And the reason this was not being considered such a big thing, was that our primary markets are companies that mainly live with ascii or latin.
But we know that utf8 does not damage those markets and we can make web2project useful to loads more people if we find our way with utf8.

Now having said that, I find it hard to call it a web2Project bug, but rather work in progress.

The way I see it, we should attack this by:
1) Create a wiki page with all this discussion.
2) You have code to make a conversion, right?
Can it detect mysqls server encoding or how the data is encoded and make a conversion?
So lets talk people into making a conversion into utf8, if they require it.
3) Create an option on the system config, a checkbox saying... enforce utf8 data connections to the Database server. With a FAT NOTE pointing to the wiki page, and stating... "IF YOU ACTIVATE THIS OPTION AND HAVEN'T CONVERTED THE DATA ON THE DATABASE TO UTF8, YOUR DATA CHARACTERS WILL MOST PROBABLY END UP SCRAMBLED. YOU HAVE BEEN WARNED!!!
If you are unsure, leave this option unchecked and ask for support on web2Project Forums."
4) If that setting is true:
After line 18 on includes/db_adodb.php we test this setting and then we, depending on the DB engine we do the $db->charSet, or if we are using mysql (not mysqli as you mentioned) we put $db->EXECUTE("set names 'utf8'"); if that config option is enabled.

By default that options is false, or unchecked.

Again, thank you very much for the effort put in this, we appreciate it and you are far from being annoying, don't worry :)

Cheers,

Pedro A.

~0001333

pedroa (administrator)

... Oh and about Collation, sysadmin problem not mine/ours :)

Cheers,

Pedro A.

~0001334

gajdusek (reporter)

Last edited: 2010-10-28 15:18

Hi,

I agree mostly.

>2) You have code to make a conversion, right?
>Can it detect mysqls server encoding or how the data is encoded and make a conversion?

It detects system-wide connection charset, if it is already utf8 then no conversion is needed, if it differs from utf8, SQL statements needed to utf8 conversion are printed.

> 3) Create an option on the system config, a checkbox saying... enforce utf8
> data connections to the Database server.

I would not make it available in System admin/System configuration page. It will raise even more problems.

Just place it in config-dist.php set to true (for fresh installations using mysql >= 5.0.7) and advice users to set it to utf8 in config.php after they convert existing database. In the code check if this option is defined, if not consider it to be false. Setting it true to config-dist.php is ok until we are setting utf8 connection charset to mysql only.

Optimal would be adding this option to config.php (with appropriate comment) when updating code and make it true if mysql database is used and system default connection charset is utf8 already or make it false otherwise. But I have found no support for this in current code. And config.php must not be writable. So it is probably bad idea.

> 4) If that setting is true:
> After line 18 on includes/db_adodb.php we test this setting and then we,
> depending on the DB engine we do the $db->charSet, or if we are using mysql
> (not mysqli as you mentioned) we put $db->EXECUTE("set names 'utf8'");
> if that config option is enabled.

I would put (conditional variant of) $db->EXECUTE("set names 'utf8'"); only for mysql (after connection is accomplished), don't bother with other databases right now. $db-charSet will work only for oci8 and ibase drivers. I would wait with implementing this.

This solution is long way to be optimal but it is the most simple I can imagine.

==============

To final robust solution for all databases we should take following research and ideas in consideration:

I check the ADODB code through and must correct my statement about mysqli:
Setting $db->charSet before the connection is working only for 'ibase' and 'oci8' drivers (in ADODB version we use) and 'sybase' (in current upstream ADODB). 'mysqli' and 'postgres7' drivers have SetCharSet() methods implemented, they can be used to set encoding after connection was made. This is probably because mysqli does not allow specify connection charset when making connection. This method is not implemented for mysql - these ADODB drivers' methods are just wrappers to appropriate PHP modules (i.e. mysqli) and mysql PHP module does not have implemented this.

So we can force connection charset to oci8, ibase and sybase (if we update ADDODB to new version) drivers by setting $db->charSet('utf8') before actual connection happens. After connection has accomplished we can change charset for mysqli and postgres7 by $db->SetCharSet('utf8') method call. And finally we can force charset to mysql by mysql_set_charset('utf8',$db->_connectionID) and other databases in their specific way.

Right now I would change this only for mysql as I stated above. BTW we should consider use mysqli (MySQL Improved Extension) rather then mysql which is designed for mysql older then 4.1. See [1].

For mysql we should use either mysql_set_charset('utf8',$db->_connectionID) or execute an appropriate sql statement as you have proposed. Though, mysql_set_charset() is preferred way because it returns false if failed and sets internal mysql->charset properly (I am not sure what it means, but mysql_client_encoding() will not register change otherwise and it has probably something to do with mysql_real_escape_string() ). See link [1] bottom of my previous post. But that function is implemented only in PHP >= 5.2.3. Therefore we could simple make $db->EXECUTE("set names 'utf8'") as you proposed or use preferred mysql_set_charset('utf8',$db->_connectionID) and add this function in backcompat_functions.php. This way it is done in my patch.

Other databases charset should be implemented when we will fully support them in the code so we can make testing.

So, yes , set $db->charSet before connection is made, this will work only for ibase and oci8 drivers but can never hurt. After the connection is initiated call $db->SetCharSet() method if it is implemented. If it is not, set connection encoding in database driver specific way.

[1] http://cz.php.net/manual/en/intro.mysqli.php

Cheers,
petr

~0001335

gajdusek (reporter)

> ... Oh and about Collation, sysadmin problem not mine/ours :)

Nope :) But it is rather off topic to this bugreport. I will create new one :)

Cheers

~0001336

gajdusek (reporter)

Last edited: 2010-10-28 15:20

I realize we should not make config option that will force all databases connection charset to utf8. Because if we do it, we must ensure charset setting will immediately work for every database driver we will support in future.

So we must ensure all this thing will apply only to mysql right now:

1) Add 'mysql' case to db_connect() switch block - only here we should set connection character set.

this should be something like

if ( !empty( w2PgetConfig( 'dbconnection_charset' ) ) ) {
    if ( function_exists( 'mysql_set_charset' ) ) {
        mysql_set_charset( 'utf8', $db->_connectionID );
    } else {
        $db->Execute( 'SET NAMES "utf8"' );
    }
}

or better, just:

if ( !empty( w2PgetConfig( 'dbconnection_charset' ) ) ) {
    mysql_set_charset( 'utf8', $db->_connectionID );
}

and add this to includes/backcompat_functions.php:

// can be removed when we switch to PHP >= 5.2.3
if (function_exists('mysql_set_charset') === false) {
    /**
     * Sets the client character set.
     *
     * Note: This function requires MySQL 5.0.7 or later.
     *
     * @see http://www.php.net/mysql-set-charset
     * @param string $charset A valid character set name
     * @param resource $link_identifier The MySQL connection
     * @return TRUE on success or FALSE on failure
     */
    function mysql_set_charset($charset, $link_identifier = null)
    {
        if ($link_identifier == null) {
            return mysql_query('SET NAMES "'.$charset.'"');
        } else {
            return mysql_query('SET NAMES "'.$charset.'"', $link_identifier);
        }
    }
}

Though, a bit refactoring of db_connect function will be appropriate, i.e. moving connection code from switch cases before switch block (that code is repeated in all cases).

Now if somebody adds $w2PConfig['dbconnection_charset'] = true to config.php, utf8 will be used for connection.

As addition to this we should add default (=false) value to config-dist.php. So we can place there a comment.

Better, if we set it to DBCONNECTION_CHARSET string that will be replaced by installer to false or, in case mysql will be chosen and it will be >= 5.0.7 version, to true. So fresh installations will be using utf8.


2) So something like this to config-dist.php

//set this value to true to use UTF8 for database connection, keep it false to use system default
$w2PConfig['dbconnection_charset'] = DBCONNECTION_CHARSET;

along with a FAT NOTE pointing to the wiki page, and stating...
1) KEEP THIS OPTION FALSE IF YOU DO NOT USE MYSQL >= 5.0.7 AS YOUR DATABASE.
2) IF YOU SET THIS TO TRUE AND HAVEN'T CONVERTED THE DATA ON THE DATABASE TO UTF8, YOUR DATA CHARACTERS WILL MOST PROBABLY END UP SCRAMBLED.
YOU HAVE BEEN WARNED!!!

3) Installation script will replace DBCONNECTION_CHARSET with FALSE or with TRUE if mysql database is used and if it is new enough (version check).

That's all.
This way we can add support for other database drivers later with no pain.

There are two more functions connecting directly to DB, please, consider if they must be changed too.

They are in classes/w2p/Core/UpgradeManager.class.php and install/manager.class.php

PS: I did not test that code parts I added into this post.

Cheers,
Petr

~0001375

caseydk (administrator)

Wow, you guys talk a lot. ;)

Some quick points:

1) I've logged the "installer should support table prefixes" as 0000648.

2) For new installs, the utf-8 tables should be easy to support. We can add the mysql version check and enable the checkbox if it's 5.0.7 or above. Someone please file a request on that one with all relevant details.

3) For existing installs, this will be trouble.. having a one-time switch in the System Admin seems like a reasonable option *IF* we can detect if the character problem exists in their database:
* If it's just a setting problem but not data, let's change the setting, no problem.
* If there's a problem with their data, I'd prefer to show the disabled setting box with an explanation. I don't want to take responsibility for breaking their stuff.

4) Updating the minimum PHP version can only happen on Major releases (each June). I'm *guessing* we should go to 5.2.6 or 5.2.10 by that time. In the meantime, using the backcompat_functions is the way to go. Nice catch gajdusek.

5) If *either* of you find MySQL specific commands, etc outside the install/sql/mysql directory, could you mark those with something like:

// TODO: mysql specific

I'd like to have those findable as I'm digging through SQL Server support now.

~0001381

gajdusek (reporter)

Hi, please look at https://github.com/gajdusek/web2project/commits/issue598

It is my patch covering this issue: support of using UTF8 as database connection charset.

**** Short summary ****

It uses new $w2Pconfig['dbconnection_charset'] setting in config.php to determine whether force utf8 database connection.

There is no checkbox on installation time, instead this option is autodetected during fresh install or d2p migration. Detection is performed by the w2p_Core_UpgradeManager::DBConnectionCharsetSafetyCheck() method. This could be safely autodetected on upgrades as well, but I've found no support of changing already existing config.php.

IMHO there is no need for checkbox. If it is safe to use utf8 connection then it should be used, because w2p code assumes it already. There is no explanation why this option was detected to be unsafe (or not implemented for used db type), I have tried implement this but gave up. I remember no more why exactly. Instead, explanation will appear in sysadmin module (this way the already existing installations are covered too), but only if some user action can solve the issue (i.e. converting database). Nothing will be displayed if $w2Pconfig['dbconnection_charset'] is already enabled or mysql version is too old or unsupported database type is used.

*** caveeats ***

- Only mysql is supported right now
  * the patch has no problem if unsupported DB is used
  * support for other database can be easily and safety added anytime later

- Database converting is not implemented (but external script exists)
- Database conversion and utf8 connection enabling must be atomic
- Corresponding wiki pages do not exist yet

~0002253

caseydk (administrator)

Sorry that I didn't see this one get updated.. wow.

Is this patch still applicable?

~0002368

caseydk (administrator)

I'm looking to roll this into the v3.0 release here shortly.

~0003505

caseydk (administrator)

Resolved in v4.0 development:
https://github.com/web2project/web2project/commit/e39e9dc0ee687d0eba4c145e3703e54ea5400495

~0003759

opto (manager)

unfortunately, in 3.3, the patch for 4 does not work, I have not yet managed to get my old 3.2 db display correctly in 3.3 - all db haractersets etc. seem to be ok.
+Notes

-Issue History
Date Modified Username Field Change
2010-10-09 10:10 gajdusek New Issue
2010-10-19 11:27 pedroa Status new => assigned
2010-10-19 11:27 pedroa Assigned To => pedroa
2010-10-19 18:19 gajdusek Note Added: 0001314
2010-10-19 18:20 gajdusek File Added: dbcharset_support.patch
2010-10-19 18:44 gajdusek Note Edited: 0001314
2010-10-26 06:28 pedroa Note Added: 0001318
2010-10-26 06:28 pedroa Status assigned => feedback
2010-10-26 06:32 pedroa Note Added: 0001319
2010-10-26 06:33 pedroa Relationship added related to 0000587
2010-10-26 17:17 gajdusek Note Added: 0001328
2010-10-26 17:42 gajdusek Note Edited: 0001328
2010-10-26 18:24 gajdusek Note Added: 0001329
2010-10-26 18:47 gajdusek Note Edited: 0001328
2010-10-26 18:59 gajdusek Note Edited: 0001329
2010-10-27 08:18 pedroa Note Added: 0001330
2010-10-27 20:09 gajdusek Note Added: 0001331
2010-10-28 02:30 pedroa Note Added: 0001332
2010-10-28 03:27 pedroa Note Added: 0001333
2010-10-28 11:30 gajdusek Note Added: 0001334
2010-10-28 12:29 gajdusek Note Edited: 0001334
2010-10-28 12:31 gajdusek Note Added: 0001335
2010-10-28 14:51 gajdusek Note Added: 0001336
2010-10-28 14:59 gajdusek Note Edited: 0001336
2010-10-28 15:07 gajdusek Note Edited: 0001334
2010-10-28 15:12 gajdusek Note Edited: 0001334
2010-10-28 15:18 gajdusek Note Edited: 0001334
2010-10-28 15:20 gajdusek Note Edited: 0001336
2010-11-24 23:35 caseydk Project v2.1 Release (Closed) => v2.2 Release (Closed)
2010-11-25 02:48 pedroa Relationship added child of 0000632
2010-11-26 07:50 caseydk Note Added: 0001375
2010-11-26 10:36 gajdusek Note Added: 0001381
2010-12-15 21:55 caseydk Project v2.2 Release (Closed) => v2.3 Release (Closed)
2011-02-23 22:47 caseydk Relationship added parent of 0000731
2011-02-27 15:10 caseydk Relationship replaced has duplicate 0000731
2011-03-22 12:04 caseydk Project v2.3 Release (Closed) => v2.4 Release (Closed)
2011-07-18 12:23 caseydk Project v2.4 Release (Closed) => v3.0 Release
2011-10-16 12:59 caseydk Note Added: 0002253
2011-12-27 23:08 caseydk Note Added: 0002368
2011-12-27 23:08 caseydk Status feedback => assigned
2011-12-27 23:08 caseydk Assigned To pedroa => caseydk
2012-01-01 23:03 caseydk Project v3.0 Release => Pending Requests
2012-12-18 22:39 caseydk Project Pending Requests => v3.0 Release
2012-12-18 22:51 caseydk Relationship added related to 0001227
2012-12-22 18:13 caseydk Project v3.0 Release => Pending Requests
2014-01-21 17:56 caseydk Target Version => 4.x
2014-02-27 00:34 caseydk Product Version v2.1 =>
2014-02-27 00:34 caseydk Target Version 4.x =>
2014-04-05 17:24 caseydk Category General => Core Infrastructure
2014-05-20 19:46 caseydk Project Pending Requests => v4.0 Release (Planning)
2014-06-10 22:49 caseydk Target Version => 4.0
2014-06-21 20:08 caseydk Note Added: 0003505
2014-06-21 20:08 caseydk Status assigned => resolved
2014-06-21 20:08 caseydk Fixed in Version => 4.0
2014-06-21 20:08 caseydk Resolution open => fixed
2015-09-01 12:21 opto Note Added: 0003759
2015-09-01 12:21 opto Status resolved => feedback
2015-09-01 12:21 opto Resolution fixed => reopened
2016-12-26 10:25 caseydk Project v4.0 Release (Planning) => Pending Requests
2016-12-28 00:17 caseydk Relationship added related to 0001671
+Issue History