lmi
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [lmi] Code review request


From: Vadim Zeitlin
Subject: Re: [lmi] Code review request
Date: Thu, 24 Sep 2020 19:03:33 +0200

On Wed, 23 Sep 2020 02:57:12 +0000 Greg Chicares <gchicares@sbcglobal.net> 
wrote:

GC> But in light of your response, let me pose a somewhat different question:
GC> Is the historical practice so atrociously bad that it would be wrong
GC> to apply it in new circumstances?

 Hello,

 I wouldn't say that it's atrociously bad, but using doubles for storing
integers and booleans doesn't seem very tidy neither.

GC> So...is my database-string proposal at worst a pity? Or is it so
GC> woeful that, if you were a member of a universal parliament, you
GC> would pass a law making it a crime? Put that way, the question does
GC> seem to answer itself. But maybe this is a gray area, and you're
GC> itching to show me another way that I might recognize as clearly
GC> superior?

 IMO the main benefit of your proposal is that it's incremental. This is an
important advantage and I don't want to undervalue it. But if I really was
writing new code from scratch, I probably wouldn't do it like this.

GC> >  This is probably a very naive question, but why can't we just put the
GC> > .policy files contents into .database files? Is it just because changing
GC> > the code dealing with the .database files to accept strings, i.e. use
GC> > something like "std::variant<bool, int, double, std::string>" instead of
GC> > just "double" in database_entity, so complicated that we can't
GC> > realistically afford to make this change?
GC> 
GC> Yes. It took a lot of thought to design database_entity::operator[]()
GC> and similar code, which IIRC is tightly bound to the idea behind its
GC>   std::vector<double> data_values_;
GC> member--that 'double' is the universal datatype that can hold anything
GC> except a string. I shudder to think of replacing it with std::variant.
GC> I know I can assign
GC>   data_values_[some_index] = (bool) false;
GC> or 3.14f or 720, and it'll all just work. But
GC>   data_values_[some_index] = "Some string.";
GC> just scares me. It don't imagine a trivial one-line change would just work:
GC> -  std::vector<double> data_values_;
GC> +  std::vector<std::variant<whatever>> data_values_;
GC> How much more existing code would have to be changed? What could go wrong?

 It definitely won't be as simple as a one line change, in particular *all*
accesses to data_values_ would need to be changed, which is clearly an
appreciable amount of work. OTOH I don't think anything could really go
wrong with it, i.e. I believe the worst that could happen would be that we
could find bugs in the existing code, due to wrongly interpreting doubles
as integers/booleans or vice versa. If there are no such bugs now, I don't
think we could reasonably introduce them, as this change would improve
(well, add, really) type-safety compared to the current mostly typeless
code.

GC> I haven't lost the ability to introspect, so I recognize that my reaction
GC> is emotional and not logical. If I'm missing something wonderful in this
GC> discussion, it's probably exactly this.

 I do think std::variant<> should be used much more widely than it is,
although I'm not sure it's really so crucial in this particular case. But
generally speaking, part of appeal of the languages such as Haskell or Rust
to me is their support for sum types, i.e. (type-safe, of course, because
everything is type-safe in these languages) unions. C++ is quite unique
among purportedly high-level languages in not having sum types at the
language level and std::variant<> is clearly not as convenient as a
dedicated core feature, but it's the best approximation that we have and
it's still better to use it than relax type-safety completely, IMNSHO.


GC> And indeed one of my early ideas was just to do:
GC>   char* const policy_form_for_sample_product_generic {"UL32768"};
GC>   char* const policy_form_for_sample_product_idaho {"UL32768-ID"};
GC>   char* const policy_form_for_sample_product_texas {"UL-TX-1234-a"};
GC>   ...
GC> and then populate '.database' with addresses:
GC>   int dims[e_number_of_axes] = {1, 1, 1, 1, 1, e_max_dim_state, 1};
GC>   std::vector<int> dims(ptd, ptd + e_number_of_axes);
GC>   std::vector<double> policy_form(e_max_dim_state, 
(double)&policy_form_for_sample_product_generic);
GC>   policy_form[mce_s_ID] = (double)&policy_form_for_sample_product_idaho;
GC>   policy_form[mce_s_TX] = (double)&policy_form_for_sample_product_texas;
GC>   Add({DB_PolicyForm, dims, policy_form});
GC> so that we could retrieve 'PolicyForm' in context thus:
GC>   char const* const policy_form = b->database().query<char 
const*>(DB_PolicyForm);
GC> Isn't that pretty much the same thing?

 Sorry, no, I don't think so. This still uses a level of indirection, i.e.
double is used as an index into a vector. With storing data values directly
there is no double at all, just the string.

GC> > Also, in this particular case it doesn't seem too bad to do the
GC> > translation in the code like this because the strings are elements of
GC> > a small fixed set, i.e. the actual value is an enum, not a string.
GC> 
GC> What does "the actual value" mean? There are two cooperating types here:
GC>  - small fixed sets of short strings, from which one is chosen by
GC>  - an enumerative selector,
GC> so I'd say their combination is actually a compound of those two types.
GC> The selector resides in a '.database' file, and the switch statement
GC> assigns or returns a std::string that might otherwise reside in a
GC> '.policy' file (because it's a string).

 Sorry if I'm repeating this too much, but, just to be clear: I'd like to
get rid of the selector and put the string itself in the .database file
instead.

GC> > GC> Perhaps worse, in 'ledger_invariant_init.cpp', a
GC> > GC> 'PolicyForm' string is specified in each '.policy' file, but sometimes
GC> > GC> it must vary by the "state" axis, so we have this ad-hockery:
GC> > GC> 
GC> > GC>         bool alt_form = 
b->database().query<bool>(DB_UsePolicyFormAlt);
GC> > GC>         PolicyForm = p.datum(alt_form ? "PolicyFormAlternative" : 
"PolicyForm");
GC> > GC> 
GC> > GC> which allows one default 'PolicyForm' and exactly one alternative.
GC> > 
GC> >  Now that I understand how this works, it seems kind of ingenious,
GC> > actually.
GC> 
GC> You speak favorably of what I'm ashamed of. But maybe it's a wry
GC> sort of praise, meaning "Clever bastardization of that field (NOT)!",
GC> and the irony goes right over my head.

 I actually don't know how do I feel about it myself. It's clearly a hack
but it's an ingenious hack, so I can't prevent myself from liking it. I'm
still sane enough to understand that it would be better if it were not
necessary, however.

GC> >  So some strings would still remain in .policy files. What are they and 
how
GC> > are they different from the strings that will be moved? I.e. why do we 
need
GC> > both .policy and .lingo files, instead of just the latter?
GC> 
GC> In the original design, almost all strings used in reports were
GC> fixed (i.e., like any purely static content in today's MST files),
GC> and '.policy' files contained only (1) filenames (for '.database'
GC> files etc., and actuarial-table databases) and (2) a very few
GC> short substitutable strings like the insurance company's
GC> postal address. We'll certainly continue using them for purpose
GC> (1), but maybe that'll become their only purpose.

 It would make sense to me if .policy files contained only the really
constant strings, i.e. strings which don't, and can't, vary at all by any
of the axes.

GC> Everything else is just lingo.

 I.e., by this definition, "lingo" items are those that can (and maybe do)
vary across the axes.

GC> > GC>  - look up those strings by new '.database' keys, which can vary by 
multiple axes
GC> > 
GC> >  This makes sense, but I still don't see why is the extra level of
GC> > indirection needed in the first place, i.e. why can't we just put the
GC> > values in .lingo files directly into the .database files instead of using
GC> > the keys for them there.
GC> 
GC> Maybe tomorrow I'll be able to figure out what std::variant is.

 Sorry, perhaps I should have started by explaining it. Let me try to do
it, very briefly, now: while std::variant is more complicated and has some
special quirks, it's still basically the same thing as the plain old
"union" in C. The only (but significant) difference is that it's type-safe,
i.e. you can't store a double in it and then get back an int -- neither
accidentally nor even intentionally. But from high-level point of view,
it's just a union, i.e. you can put value of any of the types it supports
(as specified by its template arguments) into it and then (safely) get this
value back. You can also query it for the type of the value it supports
and, unlike actual unions, std::variant<> is a first class object in C++,
i.e. you can put it into vectors and other standard containers etc.

GC> Unlike std::optional, it's not nullable, so it's not immediately
GC> anathema to me. And it sounds like it's supposed to have almost
GC> no overhead cost, which is good; but in theory the same is true
GC> of a currency class, yet in practice, maybe not true.

 There are costs to using std::variant<>, in particular iterating over a
vector of variants will not be as efficient as iterating over a vector of
doubles. In the absolutely best case, a variant is still bigger than double
because it must be big enough to accommodate the biggest type in it, e.g.
std::string, which is typically 24 bytes, or 3*sizeof(double), so the data
density is 3 times less, meaning that less of it will fit into the cache.
I don't think it's going to be a problem in this case, however, but it
would definitely be worth measuring this.

GC> And it sounds like we'd need a std::visitor to use a std::variant,
GC> which seems like a design shortcoming.

 Not so much a shortcoming as a workaround for the lack of a "match"
statement at the language level (which might be finally addressed in C++23,
see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1371r3.pdf).
However you don't _need_ to use a visitor, it's just the best way to do it
if you don't know which type you've got inside the variant. If you do know
this, you should just access it directly.


 Anyhow, after spending all this time speaking about std::variant<>, I have
to also mention that I'm not even sure if it's really what we need to use
in database_entity. A union/variant is required when the same variable can
contain values of different types during its lifetime, but I don't think
this is ever the case for any database_entity objects: objects of this
class may have different types, but each of the individual objects is (or
at least seems to be, to me) always of the same type. So perhaps what we
really need is just database_entity<T>, where the existing class would
correspond to T=double case, but we could also have it for the other types?

 Of course, this solution is not "free" neither, as any functions currently
returning, or using, database_entity would need to become templates too.
But to me it would actually make sense to associate the type of the value
with every e_database_key and then ensure that we can only load the element
of the corresponding type from the database at _compile_ time. I could make
this more precise if you think this is worth pursuing, but, again, this
would almost certainly result in even more extensive changes than "just"
replacing "double" with "std::variant<double, ...>".


GC> Code like this, in production today:
GC>   bool alt_form = b->database().query<bool>(DB_UsePolicyFormAlt);
GC>   PolicyForm = p.datum(alt_form ? "PolicyFormAlternative" : "PolicyForm");
GC> is not wanted. Note carefully that "PolicyFormAlternative" and "PolicyForm"
GC> here are not arbitrary strings like "foo" and "bar" that merely suggest
GC> that actual values would differ--rather, they indicate these actual members:
GC>   product_data::PolicyForm;
GC>   product_data::PolicyFormAlternative;
GC> and they're in double quotes because of the MemberSymbolTable thing:
GC>   ascribe("PolicyForm"           , &product_data::PolicyForm           );
GC>   ascribe("PolicyFormAlternative", &product_data::PolicyFormAlternative);
GC> Adding one more alternative there would mean changing lmi code. Adding
GC> such an alternative for another '.product'-file entity would mean
GC> changing lmi code. That's what I want to avoid.
GC> 
GC> Instead, we want everything in tables, in external text files, which
GC> can be changed in a text editor without altering any lmi code.

 Amen. I definitely support this goal.

GC> >  In particular, what would we do if we need to use UL32768-I for the 
states
GC> > starting with the letter "I"? I thought the goal was to be able to add the
GC> > entries for Idaho, Illinois, Indiana and Iowa without changing lmi 
sources,
GC> > but with this approach it seems that we'd need to update them in order to
GC> > add e_policy_form_ID_IL_IN_IA. Is this really how it's supposed to work?
GC> 
GC> In a universe parallel to lmi's lives proprietary code that generates
GC> about a hundred different sets of '.policy', '.database',... '.strata'
GC> files. More often than not, we find it more convenient to change that
GC> proprietary C++ code than to edit the generated files directly,

 OK, I think this is the crucial piece which was obvious to you, but not to
me (even though you did already mention this in the past, and I should have
remembered it -- but I didn't, sorry). This probably explains why it would
be a bad idea to move all this stuff to the .database files, as it's not
"just" the files themselves, but all the code generating them that would
need to be changed.

GC> You ask whether we can add
GC>   > entries for Idaho, Illinois, Indiana and Iowa without changing lmi 
sources
GC> and I intend the answer to be "yes", without changing ***lmi*** sources,
GC> but rather by changing proprietary sources only. (That's an important
GC> distinction:

 Yes, and one which I totally missed, sorry again.

GC> I'd say it's just because I had assumed that the database_entity
GC> implementation was so tightly bound to the notion that it could
GC> hold only ever 'double' data without some vast change. Maybe
GC> std::variant is the magical answer to that

 I don't think it's magical, but then I'm a notorious killjoy and hardly
believe that magic exists at all. It does have advantages, but it would
require some work to implement and I'm not sure if the advantages are worth
it. For me, personally, storing axis-dependent data in the .database files
and providing it as a database_entity still seems like the most
straightforward and the "obviously correct" approach from lmi point of
view, but I don't have any visibility onto the code which generates the
data files, so I don't know if it's as natural when seen from that side.
I'm also not sure if the work needed to change database_entity to support
strings is worth the gain in simplicity/type-safety/whatever that we'd get
from it. I think this should be much more clear to you, so hopefully you
shouldn't have any trouble making the decision about it -- although I'd be
happy to answer any questions you might have, of course.

 Pragmatically speaking, if the schedule of your changes allows it, I think
it might be useful to try to explore changing the existing database
entities to use their "true type", i.e. make some of them bool or int
instead of having everything as double, without adding support for strings
or changing policy/database files at all. This would show how much of the
effort is needed for this and whether it's worth doing it. But, again,
maybe you already know the answer to this question, especially if it's
negative, in which case this would be useless.

 Please let me know if I can help you in any way, but I'm not sure I really
see what can I do, or even say, here.

 Regards,
VZ

Attachment: pgpg1fHreKe6n.pgp
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]