Issue 11810 - std.stdio.byLine/readln performance is very bad
Summary: std.stdio.byLine/readln performance is very bad
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: phobos (show other issues)
Version: D2
Hardware: All All
: P2 enhancement
Assignee: No Owner
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-12-24 04:34 UTC by Peter Alexander
Modified: 2016-05-30 15:54 UTC (History)
6 users (show)

See Also:


Attachments
byLineFast with small changes (5.91 KB, application/octet-stream)
2013-12-24 13:11 UTC, bearophile_hugs
Details

Note You need to log in before you can comment on or make changes to this issue.
Description Peter Alexander 2013-12-24 04:34:29 UTC
std.stdio.readln (and hence byLine) use repeated calls to fgetc() to find the new line characters. This is a very inefficient way to read files (lots of per-byte overhead).

I have a version of byLine that reads the files in 4kb chunks and then does the new line search. It is 6 times faster than byLine on my machine on a 10MB file (OSX 10.8.5, x64 2GHz MacBook).
Comment 1 Dejan Lekic 2013-12-24 04:42:03 UTC
It has been discussed on IRC hundreds of times and we all agreed that if developer wants performance (s)he would read page-size chunks. That is why we have byChunk(size_t) in std.stdio, I believe. :)
Comment 2 Peter Alexander 2013-12-24 05:27:14 UTC
(In reply to comment #1)
> It has been discussed on IRC hundreds of times and we all agreed that if
> developer wants performance (s)he would read page-size chunks. That is why we
> have byChunk(size_t) in std.stdio, I believe. :)

OK, but:

1. It's non-trivial to implement byLine on top of byChunk.

2. Why would you want byLine to be slow? I'm not seeing the advantage of keeping byLine as it is. Fixing it doesn't change the API and has no downsides other than requiring a bit extra memory for the buffer.
Comment 3 bearophile_hugs 2013-12-24 10:08:13 UTC
(In reply to comment #1)
> It has been discussed on IRC hundreds of times and we all agreed that if
> developer wants performance (s)he would read page-size chunks. That is why we
> have byChunk(size_t) in std.stdio, I believe. :)

This is not acceptable. byLine is a very commonly used function (far more than byChunk in script-like D programs) and it should be sufficiently fast.
Comment 4 Artem Tarasov 2013-12-24 10:47:16 UTC
+1

There's also this implementation: http://permalink.gmane.org/gmane.comp.lang.d.general/117750
Comment 5 bearophile_hugs 2013-12-24 13:11:31 UTC
Created attachment 1305 [details]
byLineFast with small changes
Comment 6 hsteoh 2013-12-26 14:23:27 UTC
The whole point of byLine is to be a convenient API for user code to read lines from a file. It should not be constrained to using fgetc() just because we can't predict line length in advance. It should be built on top of a buffering mechanism (maybe byChunk) so that it offers good performance to a very commonly-used user operation. I highly recommend Peter Alexander to submit the improved byLine implementation to Phobos.
Comment 7 Nick Treleaven 2014-02-28 08:55:17 UTC
(In reply to comment #5)
> Created an attachment (id=1305) [details]
> byLineFast with small changes

BTW I'm working on porting this to std.stdio.byLine, I'll submit a PR when finished.
Comment 8 Nick Treleaven 2014-04-10 15:51:02 UTC
Here's as far as I got:
https://github.com/ntrel/phobos/commits/byLine-fast

This is faster than standard byLine on my Windows box iff std.utf.validate is not called for each line. To be honest, I'm not sure whether std.stdio.readln is doing UTF validation or not. readln has different implementations so performance of standard byLine varies by platform.

I'm not sure whether this would be accepted as a PR. One issue is that when using stdin you couldn't input just one line at a time anymore.
Comment 9 bearophile_hugs 2014-04-10 16:52:15 UTC
(In reply to Nick Treleaven from comment #8)

> I'm not sure whether this would be accepted as a PR.

There is also a need for benchmarks, comparing it to the Phobos byLine, to the byLineFast that is in attach, and to equivalent Python2.x and Java code.
Comment 10 Andrei Alexandrescu 2015-03-22 07:18:49 UTC
https://github.com/D-Programming-Language/phobos/pull/3089
Comment 11 Andrei Alexandrescu 2015-03-22 07:19:23 UTC
http://forum.dlang.org/thread/melpfi$2scc$1@digitalmars.com