Package home | Report new bug | New search | Development Roadmap Status: Open | Feedback | All | Closed Since Version 0.1.6

Bug #10911 DataSource::CSV does not have a very good CSV parser
Submitted: 2007-05-02 19:10 UTC
From: dgreco3917 Assigned: wiesemann
Status: Closed Package: Structures_DataGrid_DataSource_CSV
PHP Version: 5.2.1 OS: Linux
Roadmaps: (Not assigned)    
Subscription  


 [2007-05-02 19:10 UTC] dgreco3917 (Dave Greco)
Description: ------------ The CSV parsing in DataSource::CSV is weak. All it does is explodes the string on the delimiter (default to comma). This results in fields containing the delimiter being split in two. Programs like Excel will generally output their CSV for fields like that as ,"Field, with comma",. Test script: --------------- Here's a sample CSV line: field1, 2192, "field, with comma", foo This should parse into array('field1', 2192, 'field, with comma', 'foo') instead it parses into array('field1', 2192, '"field', ' with comma"', 'foo');

Comments

 [2007-05-03 20:10 UTC] olivierg (Olivier Guilyardi)
Thank you Dave for your contribution. Indeed, the CSV parser is rather poor. Could you please ensure that your patch is fully working by testing it against the fgetcsv() phpt file: http://cvs.php.net/viewvc.cgi/php-src/ext/standard/tests/file/fgetcsv.phpt?revision=1.2&view=markup ? That would really help us.
 [2007-05-03 20:12 UTC] olivierg (Olivier Guilyardi)
(fixing package and bug type)
 [2007-05-03 20:13 UTC] wiesemann (Mark Wiesemann)
Ups, we were working in parallel on this bug. My (optimized, mostly in regards to the PEAR CS) solution is now in CVS. The csv2arr() function seems to be taken from the user comment section from www.php.net/fgetcsv.
 [2007-05-03 20:13 UTC] wiesemann (Mark Wiesemann)
(back to "bug", I wanted to change this also)
 [2007-05-03 20:14 UTC] wiesemann (Mark Wiesemann)
(and back to package correction; other comments collided, sorry)
 [2007-05-03 20:25 UTC] wiesemann (Mark Wiesemann)
Some tests fail, but only the complicated ones like: 'aaa"aaa","bbb"bbb' The following line fails on the second field because we don't trim(): '"aaa", "bbb"'
 [2007-05-03 20:33 UTC] olivierg (Olivier Guilyardi)
In my own experience, whenever you export 10000 csv rows from Excel, FileMaker, <other_broken_app>, etc.. the complicated ones usually appear. I think we should fix this. Maybe we should let Dave do that, so that he can achieve the goal of his contribution? Dave?
 [2007-05-03 20:36 UTC] wiesemann (Mark Wiesemann)
Yes, Dave, please try to find a solution. Here is the shortened list of failing rows: $list = array( '"""""",', '""""",aaa', '"\\""",aaa', 'aaa,"\\"bbb,ccc', 'aaa"aaa","bbb"bbb', ); '"aaa", "bbb"' shouldn't be fixed (IMO) because that would break backwards compatibility.
 [2007-05-03 21:28 UTC] dgreco3917 (Dave Greco)
Sure will do. Hopefully I can ensure it pasts the tests tomorrow.
 [2007-05-04 14:41 UTC] dgreco3917 (Dave Greco)
The regex is proving to be very difficult. Would you be opposed to this approach? When provided the path to the file, just use fgetcsv() to do the parsing. When provided a string of CSV, write it out to a temp file and use fgetcsv() to do the parsing.
 [2007-05-04 14:57 UTC] olivierg (Olivier Guilyardi)
This is heavy. Reading/writing to the disk is slow. When it's a string, the parsing must be performed in memory. I'm sure you can do it :)
 [2007-05-04 15:13 UTC] wiesemann (Mark Wiesemann)
> I'm sure you can do it :) But you don't need to stick to this regexp idea, of course. There might be better solutions out there. Unfortunately, PHP_Compat hasn't yet a solution for fgetcsv(). For the case of filenames, we could consider using fgetcsv(), but we would need to make sure then that both solutions produce the same output (which be not be possible).
 [2007-05-04 15:31 UTC] wiesemann (Mark Wiesemann)
If we would require PHP 5.1, we could maybe use the memory stream (*) together with fgetcsv(). If we don't find a reasonable and working solution for Dave's bug, we should consider the following: - PHP < 5.1.0: use our old parser - PHP >= 5.1.0: use fgetcsv() (*) http://www.php.net/manual/en/wrappers.php.php
 [2007-05-04 15:51 UTC] olivierg (Olivier Guilyardi)
I'm sure it's possible to make a single implementation for both PHP4 and PHP5. Checking the PHP version from a script is a sort of hack IMO, and the current case doesn't justify it.
 [2007-05-04 15:58 UTC] dgreco3917 (Dave Greco)
I like the idea of using fgetcsv(), simply because it works quite well already and any future modifications to it won't have to be copied over to whatever DataSource::CSV uses. And of course I'm fine with keeping the existing behavior for < 5.1 as I am running 5.2 :) That being said, I am probably just a few tweaks away from some code (regex and regular string manipulation) that will work.
 [2007-05-04 16:01 UTC] olivierg (Olivier Guilyardi)
I vote for the few tweaks :)
 [2007-05-04 18:08 UTC] dgreco3917 (Dave Greco)
See the latest patch I provided. It produces the same output for all but 3 of the tests from fgetcsv. This is certainly good enough for my usage, and a lot better than the previous behavior. The three tests it does not work for are: """"",aaa "\""",aaa aaa,"\"bbb,ccc fgetcsv doesn't split on the comma in the first 2, and doesn't split on the second comma in the third case. My provided patch does. I'm not sure either behavior is more desirable than the other, as the data is largely garbage. I will be using this code in a production environment within the next week or so, which should provide a decent test.
 [2007-05-04 19:56 UTC] wiesemann (Mark Wiesemann)
I haven't looked at the new patch yet, but want to give a comment on the version switch: I can find neither the email nor a place in the SDG code, but we had discussed a PHP 5 switch before. If we can use things that are implemented in PHP itself, this is IMHO good. Normally, PHP functions are well tested, and also important, they are faster than PHP code. In the current case, fgetcsv() has the benefit that it passes all the test cases from the .phpt test. My proposal on using fgetcsv() and streams is only an idea, and I'm not sure whether this works the way I'm thinking of it. But I would say that it would be worth trying it. I'll post a comment on the new patch tomorrow.
 [2007-05-05 13:26 UTC] wiesemann (Mark Wiesemann)
I have tried the php://memory variant. With this, the test completly passes. If I run the original test (fgetcsv.phpt), I get the following rather funny diff: 075- string(7) """,aaa 075+ string(8) """,aaa 080- string(8) "\"",aaa 080+ string(9) "\"",aaa 087- string(10) "\"bbb,ccc 087+ string(11) "\"bbb,ccc The code for the foreach() loop is: $fp = fopen("php://memory", 'r+t'); fputs($fp, $v . "\n"); rewind($fp); var_dump(fgetcsv($fp)); Dave's new patch gave me at first several warnings (line 186 assumpted that $str{0} exists always). After I had fixed that, several line tests still failed here. I asked Google and Google's Code Search for other implementations of CSV parsers, but couldn't find something reasonable and/or usable for us. Maybe the best idea would be to use the PHP version switch and to use fgetcsv() for PHP >= 5.1.0 and either the old implementation or one of Dave's patches for PHP < 5.1.0. A short sentence in the manual could explain the differences and mention the recommendation to use PHP >= 5.1.0.
 [2007-05-07 09:01 UTC] olivierg (Olivier Guilyardi)
With a single php parser implementation, the code would be more readable and maintainable than a dual implementation with a version switch, and it would behave the same in PHP4 and PHP5. It's just cleaner and more coherent IMHO. By the way, I'm not saying version switches are always bad. I just think we must minimize their use as much as possible to make the code easier to read and maintain. So... What about writing our own little stream wrapper: http://fr.php.net/manual/en/function.stream-wrapper-register.php ? That seems to be possible as of PHP 4.3.2. This way we wouldn't need any version switch. Side-note: I uploaded the modified phpt file on CVS. It will now be available by running: pear run-tests -p Structures_DataGrid_DataSource_CSV
 [2007-05-07 10:22 UTC] wiesemann (Mark Wiesemann)
Well, if we find a good working PHP implementation, then yes, I'm also in favor of using PHP code. And with our own stream wrapper this would work for both ways (string and filename). But if we don't find a good implementation in the near future, then we should consider either the version switch, a BC break (=> require PHP 5.1 for the CSV driver), or make a SDG_DS_CSV2 driver that requires PHP 5.1. > Side-note: I uploaded the modified phpt file on CVS. It > will now be available by running: > pear run-tests -p Structures_DataGrid_DataSource_CSV Hmm, did you mean "I will upload ..."? I cannot see a change in CVS.
 [2007-05-07 11:04 UTC] olivierg (Olivier Guilyardi)
> Well, if we find a good working PHP implementation, > then yes, I'm also in favor of using PHP code. And with > our own stream wrapper this would work for both ways > (string and filename). Not sure you caught my point: writing our own stream wrapper would allow to use fgetcsv() (not PHP code) for both strings and files, under both PHP4 and PHP5. This would be exactly like "php://memory", but would work under PHP4. I'm 90% sure this can work. > But if we don't find a good implementation in the near > future, then we should consider either the version switch, > a BC break (=> require PHP > 5.1 for the CSV driver), or > make a SDG_DS_CSV2 driver that requires PHP 5.1. I'm sure we can avoid such version and BC mess. > Hmm, did you mean "I will upload ..."? I cannot see a > change in CVS. It's in a new directory, so you need to run: cvs update -d
 [2007-05-07 11:06 UTC] olivierg (Olivier Guilyardi)
(Mark: about the phpt file, I was talking about the one I uploaded last friday)
 [2007-05-07 11:16 UTC] wiesemann (Mark Wiesemann)
> Not sure you caught my point: writing our own stream > wrapper would allow to use fgetcsv() (not PHP code) > for both strings and files, under both PHP4 and PHP5. I got that. But this only makes sense with a good working PHP implementation of such a CSV parser. The current implementation and Dave's patches are a good start, but not perfect yet. About the .phpt: I wasn't aware that this -p trick already works with the current test. I had used "pear run-tests" and "pear run-tests bug-10911.phpt" which worked.
 [2007-05-07 11:23 UTC] olivierg (Olivier Guilyardi)
Mark, with a wrapper, there is _no_need_ for a PHP parser implementation. What I'm saying is that _all_ the parsing would be done by fgetcsv() for strings and files under PHP4 and PHP5. fgetcsv() would be the only parser. The custom-made stream wrapper would allow to pass a string to fgetcsv() under PHP4 and PHP5, just like "php://memory" which is only available in PHP5.
 [2007-05-07 11:28 UTC] wiesemann (Mark Wiesemann)
Ups, thanks for the correction. I had only scanned through the first sentences of the manual page and got the wrong idea of it. This seems to be the perfect solution then. PHP 4.3.2 is newer than what PEAR currently requires (4.3.0), but we can ignore that. People with 4.3.0 or 4.3.1 maybe have a big (e.g. security) problem anyway.
 [2007-05-07 11:38 UTC] olivierg (Olivier Guilyardi)
Indeed... I just scrolled a bit more on the manual page. It's in the comments... Look at simon at firepages dot org's post. It's exactly what we need ;)
 [2007-05-07 12:56 UTC] wiesemann (Mark Wiesemann)
> Indeed... I just scrolled a bit more on the manual page. > It's in the comments... Look at simon at firepages dot > org's post. It's exactly what we need ;) Ah, very nice. Will you work on the changes to our driver or should I?
 [2007-05-07 14:20 UTC] olivierg (Olivier Guilyardi)
Please, don't forget the author of this request... Dave, do you want to work on the stream wrapper and fgetcsv() integration?
 [2007-05-07 14:26 UTC] dgreco3917 (Dave Greco)
I wouldn't mind doing it, but I won't have the time for the foreseable future (at least the next couple of weeks). I won't feel bad either if you decide someone else will do it. In that case, I'd be happy to test when it is complete.
 [2007-05-07 14:59 UTC] olivierg (Olivier Guilyardi)
Well... Mark, please do fix this bug if you have the time.
 [2007-05-07 20:15 UTC] wiesemann (Mark Wiesemann)
Okay, a first version is now in CVS. The class from the PHP manual needed some changes (usage of $GLOBALS did not make sense) and some workarounds and additions. Passing longer CSV strings directly to fopen() resulted in various different errors; therefore, fwrite() is now used. Reading from files and strings works here on Windows with PHP 5.2.1 and 4.4.4. PHP 4 needs the ugly hack with the filesize() call, while PHP 5 also accepts null as the length value in the fgetcsv() call. More tests are welcome. BTW: As we're using fgetcsv() now, I have added the new option 'enclosure'.
 [2007-05-07 20:47 UTC] olivierg (Olivier Guilyardi)
That looks tricky... Did you try the phpt test? It fails in here.
 [2007-05-07 20:51 UTC] wiesemann (Mark Wiesemann)
> Did you try the phpt test? It fails in here. Yes, just as the email with your comment arrived. ;-) Please get the latest version from CVS. Three test cases fail then (the same that made trouble before here).
 [2007-05-07 22:06 UTC] wiesemann (Mark Wiesemann)
I just remembered a difference between the original fgetcsv() test script and the script in CVS: the linebreak was missing. After adding it, our test file now passes! (Changes are in CVS.)
 [2007-05-08 09:49 UTC] olivierg (Olivier Guilyardi)
Indeed, the test passes under PHP 5.2.0. Great job :) However, it fails under PHP 4.4.4. You shouldn't use stream_get_wrappers() which is PHP5 only.
 [2007-05-08 10:00 UTC] wiesemann (Mark Wiesemann)
Olivier, can you please run the test again on PHP 4 with the latest CVS code? (It still works with PHP 5 here.)
 [2007-05-08 10:57 UTC] olivierg (Olivier Guilyardi)
It works fine under PHP4 :) By the way, I renamed bug10911.phpt to ds_csv_strings.phpt in CVS (ignore my commit message where I say it got renamed to ds_csv_file.phpt, it's a mistake). I also wrote some new tests: - ds_csv_multiline_string.phpt: multi-line string parsing - ds_csv_multiline_string_dos.phpt: multi-line (DOS) string parsing - ds_csv_file.phpt: file parsing These three tests fail on both PHP4 and PHP5. It looks like it relates to some line breaks getting included between quotes. Not sure what to do with it.
 [2007-05-08 14:57 UTC] olivierg (Olivier Guilyardi)
Mark, if you think those extra tests are superfluous, feel free to remove them. I just think we should at least have two tests: a string-based one and a file-based one (similar to the original fgetcsv test). Dave, how's the new driver working for you?
 [2007-05-08 16:16 UTC] wiesemann (Mark Wiesemann)
I'll look tomorrow at the new tests.
 [2007-05-09 11:00 UTC] wiesemann (Mark Wiesemann)
I'm not really sure about these tests and why they fail. The file generated by the file test case has in all cases \\ replaced by \ (not sure whether that's good or not). If I modify the file test to the way the fgetcsv.phpt works (i.e. write each line to the file and read it with, write again and read again, etc.), then I get the same results as with fgetcsv.phpt. Not a 100% passed test, but again the following output in the .diff file: 075- string(7) """,aaa 075+ string(8) """,aaa 080- string(8) "\"",aaa 080+ string(9) "\"",aaa 087- string(10) "\"bbb,ccc 087+ string(11) "\"bbb,ccc That the two multiline tests fail, seems to have a similar reason. I had similar problems when I tested while adding the new stream class: While working with a filename worked without problems, putting the contents of the file into a heredoc ($csv = <<<CSV ... CSV;) didn't work. Conclusions: - ds_csv_file.phpt should IMO be rewritten (only a line per file, read it, var_dump() the result, repeat with the next line) - multiline tests: it would make sense to have some tests, but I have no idea on how we could get them (or our driver) to work
 [2007-05-09 11:05 UTC] wiesemann (Mark Wiesemann)
I have updated my PHP from 5.2.1 to 5.2.2: fgetcsv.phpt still fails with the funny diff, but my modified ds_csv_file.phpt passes now.
 [2007-05-09 11:13 UTC] wiesemann (Mark Wiesemann)
Another update: If a change the fopen mode from 'wt' to 'wb' in fgetcsv.phpt, this test passes, too. A better fix is (according to the fopen() docs) to use still 'wt' but 'rt' in the second fopen() call. It passes then, too. I'm now unsure what the best modes for us would be. Should we use "wt" in ds_csv_file.phpt and "rt" in the CSV driver? (It still passes then.)
 [2007-05-10 12:26 UTC] olivierg (Olivier Guilyardi)
Mark, I'm getting confused about the modified tests that your report to be working and those that don't. Can you please upload you modifed test case(s) as new phpt files? We'll eventually remove those that we don't need.
 [2007-05-10 13:36 UTC] wiesemann (Mark Wiesemann)
I've simply committed the changed ds_csv_file.phpt. It works now like the original test case (fgetcsv.phpt). Putting the lines first together and writing them into a file, and then reading them, seems to fail for the same (unknown) reason than why the multiline test fail. What we could to do to make the multiline test cases work: - put a CSV file into CVS - use file_get_contents() to read this file and let the multiline test work on this string (could also be done for a file with \r\n linebreaks) We could then add another test that uses the file mode of our driver to read the file. BTW: There was a question hidden in my last comments: What's the best fopen() mode to use for our driver and our tests? How about 'rt' and 'wt' which the fopen() docs recommend?
 [2007-05-10 15:47 UTC] olivierg (Olivier Guilyardi)
> I've simply committed the changed ds_csv_file.phpt. > It works now like the original test case (fgetcsv.phpt). This test failed for me under both php 4.4.4 and 5.2.0 The problem came from the $length argument of fgetcsv() which value is cached behind filesize(). I fixed it with clearstatcache(). It's on CVS. The tests now passes on both PHP4 and PHP5. I think the multi-line tests don't really fail. The expected results are wrong. If you take the 'aaa,"\\"bbb,ccc' line, the expected result is : array(2) { [0]=> string(3) "aaa" [1]=> string(10) "\"bbb,ccc " } Here you can see a line break in the second string. This looks right. However, with the multi-line test, the second string gets concatenated with the following line. This is right too. So I propose we remove the multi-line tests, and only keep ds_csv_strings.phpt and ds_csv_file.phpt which both pass. Regarding rt and wt, the fopen() manpage states: "For portability, it is strongly recommended that you always use the 'b' flag when opening files with fopen()." "Again, for portability, it is also strongly recommended that you re-write code that uses or relies upon the 't' mode so that it uses the correct line endings and 'b' mode instead." If I understand correctly, the 't' mode translates writing \n to \r\n and reading \r\n to \n, when running on windows, but has no effect on nix. This is bad for us because it will behave differently on windows and nix. The PHP team of course test their C code under both platforms, but it's generally not our case because we write PHP, not C. We better stick to 'b' for maximum portability. So IMO we should use rb in the driver and wb in the test. I fixed this in CVS. I think we might now close this bug if it's okay with you, after we have removed the unnecessary tests, and once Dave tells us it works ok for him.
 [2007-05-10 15:58 UTC] wiesemann (Mark Wiesemann)
Olivier, thanks for adding the clearstatcache() call. I had also added it, but only in my test directory. :-( I agree on your 'b' mode suggestion; this should be right choice according to your quotes from the manual. And I'm also fine with removing the multiline tests. What you wrote about them makes sense. There is another thing that we should discuss: The new stream class is in no way restricted to CSV. It is very flexible and can be used to pass strings to PHP's file functions. Should we maybe put this class with another name into the core package, so that we will be able to use the stream also with another driver (for reading or writing)?
 [2007-05-10 16:19 UTC] olivierg (Olivier Guilyardi)
I did not notice you created a CSV subdirectory and put the stream class in there. This is heavy IMO. As they do in MDB2, I think there's nothing wrong with putting this class inside the CSV.php file, because it's for internal use only. IMO this class is good for CSV but, for now, nothing makes me think we'll need it for other drivers. For performance, I think our priority should be not to parse it if not needed, and to avoid loading it from an external file. Let's just bundle it at the bottom of CSV.php. If we ever find that we need it for other purposes, we might indeed put it in the core, in the future. Additionally I see two problems with the "csvstream" wrapper name: - This name might collide with users' wrappers - it would be "csv stream" if you called fgetcsv() within the wrapper class. This class is more a "memory" stream IMO, because we do the csv parsing outside of it. So I'd recommend we call it: "structures_datagrid_datasource_csv_memory" I know it's long, but what else?
 [2007-05-10 16:26 UTC] wiesemann (Mark Wiesemann)
I'm fine with the stream name change. It's long, but this way it is very unlikely that it will collide with another stream. About the separate file: I remember a "one class = one file" rule. I can't find it now, though. But I would do it even for classes that we would mark as private in PHP 5. Many PHP 5 packages have even exceptions classes with only a lines of code in separate files.
 [2007-05-12 13:53 UTC] wiesemann (Mark Wiesemann)
I have renamed the class and the protocol (ugly name, c.p. my commit message). My test scripts works with 5.2.2 and 4.4.4, and "pear run-tests" works with 5.2.2. Olivier, can you please run the .phpt tests with 4.x? Dave, can you find some minutes in the next days to test the new version from CVS?
 [2007-05-14 13:11 UTC] olivierg (Olivier Guilyardi)
It works great in here under both php 4.4.4 and 5.2.0. I allowed myself to change the wrapper name to use dashes. It doesn't fail as with underscores, and looks a little prettier IMO, don't you think?
 [2007-05-14 14:01 UTC] wiesemann (Mark Wiesemann)
> I allowed myself to change the wrapper name to use > dashes. It doesn't fail as with underscores, and > looks a little prettier IMO, don't you think? Yes, much better. The camel caps were a quick fix after I got this weird error that shouldn't occur (<= I love these kind of comments, BTW *g*).
 [2007-05-14 21:24 UTC] olivierg (Olivier Guilyardi)
I suppose it's just quick assert() somewhere... You should report a PHP bug. Okay, so I think we can close this bug. Our unit test is rather reliable IMO, it should work for Dave.
 [2007-05-15 08:58 UTC] wiesemann (Mark Wiesemann)
This bug has been fixed in CVS. If this was a documentation problem, the fix will appear on pear.php.net by the end of next Sunday (CET). If this was a problem with the pear.php.net website, the change should be live shortly. Otherwise, the fix will appear in the package's next release. Thank you for the report and for helping us make PEAR better. --- I have also reported the stream bug (was marked "bogus", but I have left another comment.)